Direct Score Optimization

Related Articles from SNS

Backtranslation Augmented Direct Preference Optimization for Neural Machine Translation

arXiv:2604.25702v2. Abstract: Contemporary neural machine translation (NMT) systems are almost exclusively built by training on supervised parallel data. Despite the tremendous progress achieved, these systems still exhibit persistent translation errors. This paper proposes that a post-training paradigm based on reinforcement learning (RL) can effectively rectify such mistakes.

arXiv CS 8d ago

Beyond Scalar Rewards by Internalizing Reasoning into Score Distributions

arXiv:2606.09076v1. Abstract: Reward models are central to text-to-image post-training, but visual preference is subjective and better represented as a distribution over rubric scores than as a deterministic scalar. Existing scalar, score-token, and pairwise reward models over-compress uncertainty and fine-grained score differences, while reasoning-based generative rewards provide stronger judgments but are costly to deploy and difficult to use as direct optimization...

arXiv CS 1d ago

Single-Line Drawing Generation via Semantics-Driven Optimization

arXiv:2606.01910v1. Abstract: Line drawings are a highly expressive art form that requires the artist to abstract and distill the essence of their subject. We present the first semantics-driven method for automatically generating single-line drawings in vector format, guided either by a text prompt describing the concept or an input image depicting it. Our approach leverages score distillation sampling to optimize the parameters of a uniform rational B-spline (URBS) curve,...

arXiv CS 8d ago

When RLHF Fails: A Mechanistic Taxonomy of Reward Hacking, Collapse, and Evaluator Gaming

arXiv:2606.03238v1. Abstract: Reinforcement learning from human feedback (RLHF) makes large-scale post-training possible by replacing an underspecified human objective with learned and scalable proxies. The same substitution creates a structured failure surface: optimization can raise the learned reward while external quality falls, degrade both proxy and judge scores, reveal proxy under-alignment, or produce evaluator-specific disagreement. We present an empirical...

arXiv CS 7d ago

Macro: Enhancing Multilingual Counterfactual Explanations through Alignment-as-Preference Optimization

Abstract: Self-generated counterfactual explanations (SCEs) are minimally modified inputs (minimality) generated by large language models (LLMs) that flip their own predictions (validity), offering a causally grounded approach to unraveling black-box LLM behavior. Yet extending them beyond English remains challenging: existing methods struggle to produce valid SCEs in non-dominant languages, and a persistent trade-off between validity and minimality undermines...

arXiv CS 5d ago

Gryphon: A Unified Architecture for Semantic-ID Generation and Item-Level Scoring in Industrial Recommendations

arXiv:2606.08604v1. Abstract: Generative retrieval (GR) has become a scalable approach to candidate generation: each item is assigned a short hierarchical token sequence called a Semantic ID (SID), and the next item's SID is decoded autoregressively. A practical limitation is that the decoder's beam search optimizes the likelihood of token sequences, not the relevance of the underlying items. These objectives diverge when sequence likelihood is poorly calibrated due to beam...

arXiv CS 1d ago

Score Broadcast and Decorrelation: A General Framework for Broadcast-Based Credit Assignment

arXiv:2605.30638v1. Abstract: We introduce Score Broadcast and Decorrelation (SBD), a principled framework for broadcast-based credit assignment for general families of differentiable losses. Error broadcast is a biologically plausible alternative to backpropagation that sends output information to hidden layers without weight transport. The Error Broadcast and Decorrelation (EBD) framework, recently introduced for the mean-squared-error (MSE) setting, grounded this...

arXiv CS 9d ago

Enhancing Paraphrase Type Generation: The Impact of DPO and RLHF Evaluated with Human-Ranked Data

arXiv:2506.02018v2. Abstract: Paraphrasing re-expresses meaning to enhance applications like text simplification, machine translation, and question-answering. Specific paraphrase types facilitate accurate semantic analysis and robust language models. However, existing paraphrase-type generation methods often misalign with human preferences due to reliance on automated metrics and limited human-annotated training data, obscuring crucial aspects of semantic fidelity and...

arXiv CS 7d ago

You Can Learn Tokenization End-to-End with Reinforcement Learning

Abstract: Tokenization is a hardcoded compression step which remains in the training pipeline of Large Language Models (LLMs), despite a general trend towards architectures becoming increasingly end-to-end. Prior work has shown promising results at scale in bringing this compression step inside the LLMs' architecture with heuristics to draw token boundaries, and also attempts to learn these token boundaries with straight-through estimates, which treat the problem of...

arXiv CS 8d ago

A Unified Evaluation-Instructed Framework for Query-Dependent Prompt Optimization

arXiv:2511.19829v2. Abstract: Most prompt-optimization methods refine a single static template, making them ineffective in complex and dynamic user scenarios. Existing query-dependent approaches rely on unstable textual feedback or black-box reward models, providing weak and uninterpretable optimization signals. More fundamentally, prompt quality itself lacks a unified, systematic definition, resulting in fragmented and unreliable evaluation signals.

arXiv CS 8d ago