Home › Knowledge Base › MMLU PRO

MMLU PRO

No mentions found

This entity hasn't been tracked yet, or Iris is still building its knowledge base.

Related Articles from SNS

Discourse-Role Labels as Presentation-Time Variables for Context Use in Language Models

arXiv:2606.04109v2 Announce Type: replace Abstract: Context-augmented language model systems often wrap supplied content with labels such as Reference:, Evidence:, Instruction:, Note:, or Example:, but the effect of these labels on reader-model behavior remains underexplored. We introduce a paired fixed-content probe over 500 MMLU-Pro items: each item receives the same misleading answer-bearing assertion under different discourse-role labels, and adoption is measured by whether the model...

arXiv CS 1d ago

Discourse-Role Labels as Presentation-Time Variables for Context Use in Language Models

arXiv:2606.04109v1 Announce Type: new Abstract: Context-augmented language model systems often wrap supplied content with labels such as Reference:, Evidence:, Instruction:, Note:, or Example:, but the effect of these labels on reader-model behavior remains underexplored. We introduce a paired fixed-content probe over 500 MMLU-Pro items: each item receives the same misleading answer-bearing assertion under different discourse-role labels, and adoption is measured by whether the model outputs...

arXiv CS 6d ago

Latent Performance Profiling of Large Language Models

Announce Type: replace Abstract: Large language models (LLMs) frequently achieve impressive scores on standardized benchmarks, yet accuracy alone offers a limited view of their capabilities. Evaluating open-source LLMs through leaderboards faces persistent issues like data contamination, narrow task scope, and weak alignment with real-world reliability. Benchmark-based evaluations such as MMLU PRO, BBH, or IFEval primarily capture what a model outputs on fixed test sets, not how it processes...

arXiv CS 9d ago

When Does Delegation Beat Majority? A Delegation-Based Aggregator for Multi-Sample LLM Inference

arXiv:2606.08098v1 Announce Type: new Abstract: Majority voting over sampled answers is the dominant unsupervised aggregator for multi-sample LLM inference. We show that piping the signals every sample carries into a delegation-based aggregator (Propagational Proxy Voting, PPV) yields an unsupervised consensus rule that beats majority on MMLU-Pro by +1.5 pp overall and +2.24 pp on the non-trivial subset (paired McNemar p ~ 1.0e-14, n = 8,099).

arXiv CS 1d ago

Aryabhata 2: Scaling Reinforcement Learning for Advanced STEM Reasoning

arXiv:2605.28829v2 Announce Type: replace Abstract: Competitive STEM examinations such as JEE and NEET require multi-step symbolic reasoning, precise numerical computation, and deep conceptual understanding across physics, chemistry, and mathematics. Recent large language models perform strongly on common reasoning benchmarks, yet they remain difficult to deploy at scale, where millions of student doubts demand domain-specific, consistently structured problem solving. We introduce Aryabhata...

arXiv CS 6d ago

UR$^2$: Unify RAG and Reasoning through Reinforcement Learning

arXiv:2508.06165v5 Announce Type: replace Abstract: Large Language Models (LLMs) have shown strong capabilities through two complementary paradigms: Retrieval-Augmented Generation (RAG) for knowledge grounding and Reinforcement Learning from Verifiable Rewards (RLVR) for complex reasoning. However, existing attempts to unify these paradigms remain narrow in scope, typically limited to open-domain QA with fixed retrieval settings, which constrains generalization to broader domains. To address...

arXiv CS 7d ago

ReTreVal: Reasoning Tree with Validation and Cross-Problem Memory for Large Language Models

arXiv:2601.02880v3 Announce Type: replace Abstract: Every existing inference-time reasoning framework discards all failure context at problem boundaries, leaving a model solving problem 500 no wiser than it was on problem 1. We present ReTreVal (Reasoning Tree with Validation), a training-free framework that closes this gap through adaptive tree exploration with tool-augmented node refinement, typed-failure backtracking that injects categorized error context into the recovered branch, and a...

arXiv CS 1d ago

More Bang for the Buck: Improving the Inference of Large Language Models at a Fixed Budget using Reset and Discard (ReD)

arXiv:2601.21522v2 Announce Type: replace Abstract: The performance of large language models (LLMs) on verifiable tasks is usually measured by pass@k, the probability of answering a question correctly at least once in k trials. At a fixed budget, a more suitable metric is coverage@cost, the average number of unique questions answered as a function of the total number of attempts. We connect the two metrics and show that the empirically-observed power-law behavior in pass@k leads to a...

arXiv CS 1d ago

Skill-Based Mixture-of-Experts: Adaptive Routing for Heterogeneous Reasoning via Inferred Skills

Announce Type: replace Abstract: Combining existing pre-trained LLMs is a promising approach for diverse reasoning tasks. However, task-level expert selection is often too coarse-grained, since different instances may require different expertise.

arXiv CS 8d ago

ReTreVal: Reasoning Tree with Validation and Cross-Problem Memory for Large Language Models

arXiv:2601.02880v2 Announce Type: replace Abstract: Every existing inference-time reasoning framework discards all failure context at problem boundaries, leaving a model solving problem 500 no wiser than it was on problem 1. We present ReTreVal (Reasoning Tree with Validation), a training-free framework that closes this gap through adaptive tree exploration with tool-augmented node refinement, typed-failure backtracking that injects categorized error context into the recovered branch, and a...

arXiv CS 5d ago