the Oracle Performance Gap
Related Articles from SNS
Mind the Gap: Disentangling Performance Bottlenecks in Video Instance Segmentation
Announce Type: new Abstract: In Video Instance Segmentation (VIS), classification, segmentation, and tracking objectives are jointly evaluated, but their individual contributions to performance loss remain opaque. We introduce a diagnostic framework that formulates identity and class assignment as an Integer Linear Program (ILP), yielding a model-agnostic oracle that hierarchically isolates each error source. Applied to seven VIS methods spanning online and offline paradigms across...
Rethinking RL Evaluation: Can Benchmarks Truly Reveal Failures of RL Methods?
arXiv:2510.10541v2 Announce Type: replace Abstract: Current benchmarks are inadequate for evaluating progress in reinforcement learning (RL) for large language models (LLMs). Despite recent benchmark gains reported for RL, we find that training on these benchmarks' training sets achieves nearly the same performance as training directly on the test sets, suggesting that the benchmarks cannot reliably separate further progress. To study this phenomenon, we introduce a diagnostic suite and the...
InvEvolve: Evolving White-Box Inventory Policies via Large Language Models with Performance Guarantees
arXiv:2605.00369v4 Announce Type: replace Abstract: We study how large language models can be used to generate inventory policies in online settings with non-stationary demand. Our work is motivated by recent advances in LLM-based evolutionary search, such as AlphaEvolve, which demonstrates strong performance on static and highly structured problems such as mathematical discovery, but is not directly suited to dynamic inventory settings with online updates. We propose InvEvolve, an...
On the Generalization Gap in Self-Evolving Language Model Reasoning
arXiv:2606.01075v2 Announce Type: new Abstract: Recent work suggests that large language models (LLMs) can improve through self-evolution (SE), using supervision signals generated by the model itself. In this work, we ask: under a strict closed-loop setup, where the self-evolution algorithm has access only to an unlabeled prompt set and a base model, how close can internally generated supervision come to oracle-supervised training?
Adaptive Auto-Harness: Sustained Self-Improvement for Agentic System Deployment on Open-Ended Task Streams
arXiv:2606.01770v1 Announce Type: new Abstract: Auto-harness systems such as A-Evolve, GEPA, and Meta-Harness improve LLM agents by optimizing prompts, skills, tools, memories, and supporting infrastructure from execution feedback, but they are typically evaluated on fixed offline benchmarks. Real deployments instead present open-ended task streams: histories grow without a fixed endpoint, heterogeneous tasks require different harnesses, and problem distributions shift over time. These...
SEArch: Optimistic Policy Selection Between Scene Noise and Drift for UAV Radar Search
Announce Type: new Abstract: Unmanned Aerial Vehicles (UAVs) equipped with radar sensors are deployed for target search missions in diverse environments, where targets exhibit characteristic signatures (e.g., respiration micro-motion in human search) detectable through occlusions. A fundamental challenge arises from shifts in radar statistics as the UAV moves through a dynamic and potentially non-stationary environment, rendering any fixed signal-processing strategy suboptimal; yet...
Molecular Embedding-Based Algorithm Selection in Protein-Ligand Docking
arXiv:2512.02328v2 Announce Type: replace-cross Abstract: Selecting an effective docking algorithm is highly context-dependent, and no single method performs reliably across structural, chemical, and protocol regimes. MolAS is a lightweight algorithm-selection model that predicts per-algorithm performance from pretrained protein and ligand embeddings using attentional pooling and a shallow residual decoder. With hundreds to a few thousand labelled complexes, MolAS achieves up to a 15...
AI Job Grief: The Unnamed Psychological Crisis Hitting Tech Workers
In the summer of 2025, an Epic Games layoff cut a worker who was a terminally ill father. According to the most-discussed account of the episode, his family lost his life insurance along with the job.