Home Knowledge Base 32B

32B

No mentions found

This entity hasn't been tracked yet, or Iris is still building its knowledge base.

Related Articles from SNS

When Built-in Thinking Helps and Hurts: Constraint-Level Error Shifts in Instruction Following

arXiv:2606.09662v1 Announce Type: new Abstract: Large reasoning models (LRMs) often improve math and coding performance, but their effect on instruction following is unclear. We study IFEval with Qwen3 models (1.7B-32B), using same-weights Thinking ON/OFF controls; four Hunyuan models provide directional cross-family support. Aggregate pass-rate changes are small (-0.55 to -3.52 pp), yet 10-20% of prompts switch between pass and fail across modes, suggesting that thinking changes the pattern...

arXiv CS 1d ago

Multimodal Approaches for Visually-Rich Document Type Classification: A Comparative Analysis

arXiv:2606.02162v1 Announce Type: new Abstract: Document type classification in visually rich documents remains challenging, as relevant information is distributed across textual, visual, and layout modalities. To capture this complexity, current approaches rely on diverse multimodal modeling strategies, resulting in heterogeneous architectures that complicate systematic comparison. This variability is also reflected in existing comparative studies, which often rely on heterogeneous...

arXiv CS 8d ago

REAL: Regression-Aware Reinforcement Learning for LLM-as-a-Judge

arXiv:2603.17145v2 Announce Type: replace Abstract: Large language models (LLMs) are increasingly deployed as automated evaluators that assign numeric scores to model outputs, a paradigm known as LLM-as-a-Judge. However, standard Reinforcement Learning (RL) methods typically rely on binary rewards (e.g., 0-1 accuracy), thereby ignoring the ordinal structure inherent in regression tasks; for instance, they fail to recognize that predicting 4 is significantly better than predicting 1 when the...

arXiv CS 9d ago

Extreme Low-Bit Inference in Reasoning Models: Failure Modes and Targeted Recovery

arXiv:2606.02011v1 Announce Type: new Abstract: Large Reasoning Models (LRMs) rely on long reasoning traces, making inference expensive. While low-bit quantization reduces per-token decoding cost, we show that aggressive 2-bit inference can fail to deliver end-to-end speedup because instability in the generation process inflates total token count. Instead of merely lowering answer accuracy, 2-bit quantization often produces much longer traces with repetitive loops, budget exhaustion, delayed...

arXiv CS 8d ago

Decomposing Factual Sycophancy in Language Models: How Size and Instruction Tuning Shape Robustness

arXiv:2606.06306v1 Announce Type: new Abstract: Factual sycophancy occurs when a language model abandons a correct, verifiable answer under social pressure. Because a flip occurs only when pressure toward a false answer exceeds the model's neutral preference for the truth, flip rates conflate two mechanisms: the strength of that baseline preference (truth margin), and how far pressure shifts it (manipulation sensitivity). We decompose factual sycophancy into these channels and use them to...

arXiv CS 5d ago

Causal state binding predicts action control in language agents

arXiv:2605.09692v3 Announce Type: replace Abstract: Autonomous language agents increasingly expose traces, memories, plans and constraints, but existing evaluations rarely test whether these state variables are bound to final actions. We introduce causal state binding, an intervention-coupled evaluation framework that measures whether actions change with the event-specific decisive state while remaining invariant to irrelevant cues. The primary readout is a hidden-target finite-action...

arXiv CS 8d ago

SIRIUS-SQL: Anchoring Multi-Candidate Text-to-SQL in Execution Feedback

arXiv:2606.01246v1 Announce Type: new Abstract: Text-to-SQL on complex schemas is unreliable on a single pass, so recent systems generate multiple SQL candidates and let voting filter out errors. Yet voting alone is not enough, because the multi-candidate recipe has three coupled weaknesses: 1) sampling more from a single generator produces increasingly redundant candidates, 2) existing pipelines apply one generic correction to every non-clean execution result, while runtime errors,...

arXiv CS 8d ago

DyCon: Dynamic Reasoning Control via Evolving Difficulty Modeling

arXiv:2606.07108v2 Announce Type: replace Abstract: Recent advances in Large Reasoning Models (LRMs) demonstrate remarkable performance improvements by iteratively reflecting, exploring, and executing complex tasks, yet suffer from inefficiencies due to redundant reasoning, known as "overthinking". Existing methods to mitigate this issue either rely on static difficulty estimates or require task-specific training, and thus fail to adapt to the dynamic complexity during reasoning. In this...

arXiv CS 1d ago

ReTreVal: Reasoning Tree with Validation and Cross-Problem Memory for Large Language Models

arXiv:2601.02880v2 Announce Type: replace Abstract: Every existing inference-time reasoning framework discards all failure context at problem boundaries, leaving a model solving problem 500 no wiser than it was on problem 1. We present ReTreVal (Reasoning Tree with Validation), a training-free framework that closes this gap through adaptive tree exploration with tool-augmented node refinement, typed-failure backtracking that injects categorized error context into the recovered branch, and a...

arXiv CS 5d ago

MLIPilot: LLM-Driven Auto-Research for Machine-Learned Interatomic Potentials

arXiv:2605.30889v1 Announce Type: new Abstract: Constructing production-quality machine-learned interatomic potentials (MLIPs) requires balancing accuracy, dynamical stability, and computational throughput under constraints that are not captured by a single training loss. We introduce MLIPilot, an auto-research framework in which tool-calling large language models propose hypotheses, edit MLIP training code, launch HPC jobs, and accept or revert changes using a fixed, physically constrained...

arXiv Physics 9d ago