Home › Knowledge Base › Quality for Language Preference in

Quality for Language Preference in

No mentions found

This entity hasn't been tracked yet, or Iris is still building its knowledge base.

Related Articles from SNS

Linguistic Nepotism: Trading-off Quality for Language Preference in Multilingual RAG

arXiv:2509.13930v3 Announce Type: replace Abstract: Multilingual Retrieval-Augmented Generation (mRAG) systems enable language models to answer knowledge-intensive queries with citation-supported responses across languages. Despite their growing use, an open questions is whether the mixture of different document languages impacts generation and citation behavior in unintended ways. To investigate this, we introduce a controlled methodology using model internals to measure language preference...

arXiv CS 1d ago

Differentially Private Preference Data Synthesis for Large Language Model Alignment

arXiv:2605.30808v1 Announce Type: new Abstract: Preference alignment is a crucial post-training step for large language models (LLMs) to ensure their outputs align with human values. However, post-training on real human preference data raises privacy concerns, as these datasets often contain sensitive user prompts and human judgments. To address this, we propose DPPrefSyn, a novel algorithm for generating differentially private (DP) synthetic preference data to enable privacy-preserving...

arXiv CS 9d ago

Calibrated Surprise: An Information-Theoretic Account of Creative Quality

arXiv:2604.26269v2 Announce Type: replace Abstract: In the era of large language models, creative writing quality lacks a computable theoretical anchor. The dominant approaches are rubric scoring -- decomposing holistic aesthetic judgment into sub-scores -- and RLHF preference signals -- replacing quality with group votes. Both bypass the statistical structure of the text itself.

arXiv CS 5d ago

Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases

arXiv:2605.27355v2 Announce Type: replace Abstract: Reinforcement Learning from Human Feedback (RLHF) is the standard method to align Large Language Models (LLMs) with human preferences. In this work, we introduce alignment tampering, a potential vulnerability where the LLM undergoing alignment influences the preference dataset, causing RLHF to amplify undesired behaviors. This arises from core limitations of RLHF: (1) preference datasets are constructed from the LLM's own outputs, allowing...

arXiv CS 9d ago

Alignment-Aware Decoding

Announce Type: replace Abstract: Alignment of large language models remains a central challenge in natural language processing. Preference optimization has emerged as a popular and effective method for improving alignment, typically through training-time or prompt-based interventions. In this paper, we introduce alignment-aware decoding (AAD), a method to enhance model alignment directly at inference.

arXiv CS 7d ago

From "Weak" Signals to Strong Models: Preference Delta Aggregation with LoRA Merging

arXiv:2606.00357v2 Announce Type: replace Abstract: Training strong large language models (LLMs) requires high-quality supervision, which is often scarce. Recent work shows that paired preference data from weak-weaker model pairs (e.g., Qwen3 4B over 1.7B), despite the limited quality of individual responses, can provide an effective supervision signal through relative quality deltas, which we term a "weak" signal. This motivates a key research question: can multiple "weak" signals be...

arXiv CS 2d ago

One Bias After Another: Mechanistic Reward Shaping and Persistent Biases in Language Reward Models

arXiv:2603.03291v2 Announce Type: replace Abstract: Reward Models (RMs) are crucial for online alignment of language models (LMs) with human preferences. However, RM-based preference-tuning is vulnerable to reward hacking, whereby LM policies learn undesirable behaviors from flawed RMs. By systematically measuring biases in five high-quality RMs, including the state-of-the-art, we find that issues persist despite prior work with respect to length, sycophancy, and overconfidence.

arXiv CS 8d ago

Teach a Reward Model to Correct Itself: Reward Guided Adversarial Failure Discovery for Robust Reward Modeling

arXiv:2507.06419v3 Announce Type: replace Abstract: Reward modeling (RM), which captures human preferences to align large language models (LLMs), is increasingly employed in tasks such as model finetuning, response filtering, and ranking. However, due to the inherent complexity of human preferences and the limited coverage of available datasets, reward models often fail under distributional shifts or adversarial perturbations. Existing approaches for identifying such failure modes typically...

arXiv CS 2d ago

TriAlign: Towards Universal Truth Consistency in Personalized LLM Alignment

arXiv:2606.01755v1 Announce Type: new Abstract: Personalized large language models adapt responses to users' preferences and social attributes, but can introduce substantial universal truth inconsistencies across social groups, where some groups systematically receive less accurate responses on objective tasks. Existing alignment methods either ignore personalization or mainly focus on subjective preference alignment, largely overlooking fairness and consistency in universal truths. To...

arXiv CS 8d ago

"I understand your perspective": LLM Persuasion and Sycophancy through the Lens of Communicative Action Theory

arXiv:2606.08076v1 Announce Type: new Abstract: Large Language Models (LLMs) can generate high-quality arguments, yet their ability to engage in nuanced and persuasive communicative actions remains largely unexplored. This work explores the persuasive potential of LLMs through the framework of J\"urgen Habermas' Theory of Communicative Action. It examines whether LLMs express illocutionary intent (i.e., pragmatic functions of language such as conveying knowledge, building trust, or signaling...

arXiv CS 1d ago