MMLU-Pro, GPQA
No mentions found
This entity hasn't been tracked yet, or Iris is still building its knowledge base.
Related Articles from SNS
Aryabhata 2: Scaling Reinforcement Learning for Advanced STEM Reasoning
arXiv:2605.28829v2 Announce Type: replace Abstract: Competitive STEM examinations such as JEE and NEET require multi-step symbolic reasoning, precise numerical computation, and deep conceptual understanding across physics, chemistry, and mathematics. Recent large language models perform strongly on common reasoning benchmarks, yet they remain difficult to deploy at scale, where millions of student doubts demand domain-specific, consistently structured problem solving. We introduce Aryabhata...
Skill-Based Mixture-of-Experts: Adaptive Routing for Heterogeneous Reasoning via Inferred Skills
Announce Type: replace Abstract: Combining existing pre-trained LLMs is a promising approach for diverse reasoning tasks. However, task-level expert selection is often too coarse-grained, since different instances may require different expertise.
Consistent and Distinctive: LLM Benchmark Efficiency via Maximum Independent Set Prompt Selection on Similarity Graphs
arXiv:2606.01400v1 Announce Type: new Abstract: Evaluating large language models (LLMs) across comprehensive benchmarks is expensive and time-consuming. We propose a graph-based prompt selection framework that models each benchmark as a similarity graph -- nodes are prompts connected if their embedding-space distance falls above a configurable threshold -- and applies Maximum Independent Set (MIS) algorithms to select a maximally diverse, non-redundant subset. We evaluate four MIS solvers...
Launch HN: General Instinct (YC P26) – Frontier models on edge devices
Hey HN, Guanming and Bill here from General Instinct (https://general-instinct.com/).After years of working in robotics, we kept running into the same problem: the best models never fit the hardware we actually had available. The models that performed best were usually designed around datacenter assumptions: large GPUs, lots of memory bandwidth, and reliable network access. But most physical systems have the opposite constraints.
Empirical Characterization of Inference-Time Elicited Probability Transformations in Large Language Models
Announce Type: replace Abstract: Large language models increasingly rely on inference-time procedures such as chain-of-thought reasoning, self-refinement, retrieval augmentation, and verifier-guided revision, yet the structure of elicited probability transformations under these procedures remains poorly understood. We study externally elicited probability assignments over candidate answers and observe recurring approximate log-ratio relationships: \[ \log \tilde q_t(i) = \alpha_t \left( \log...