Weighted Success Rate

No mentions found

This entity hasn't been tracked yet, or Iris is still building its knowledge base.

Related Articles from SNS

Ask When It Pays: Cost-Aware Open-Ended Interaction for Instance Goal Navigation

Announce Type: replace Abstract: Instance Goal Navigation (IGN) requires an embodied agent to find a specific object instance among distractors from an under-specified natural-language description. Such ambiguity often cannot be resolved from perception and language alone, making interaction with an oracle a natural mechanism for disambiguation. Prior interactive methods allow oracle queries but treat lightweight clarification and route-level guidance alike, letting agents boost success rate...

arXiv CS 6d ago

Ask When It Pays: Cost-Aware Open-Ended Interaction for Instance Goal Navigation

Announce Type: new Abstract: Instance Goal Navigation (IGN) requires an embodied agent to find a specific object instance among distractors from an underspecified natural-language description. Such ambiguity often cannot be resolved from perception and language alone, making interaction with an oracle a natural mechanism for disambiguation. Prior interactive methods allow oracle queries but treat lightweight clarification and route-level guidance alike, letting agents boost success rate...

arXiv CS 7d ago

Hardware-aware Low-latency Quantum Compilation with Data-driven Lightweight Error Detection for Early Fault-Tolerant Systems

arXiv:2606.07666v1 Announce Type: cross Abstract: Noisy intermediate-scale quantum (NISQ) processors are entering an early fault-tolerance regime where full quantum error correction carries prohibitive resource costs, yet lightweight error detection can meaningfully improve algorithmic success rates. Existing compilation and error-detection toolchains treat these concerns in isolation, with no principled way to balance detection overhead against success probability under latency constraints....

arXiv CS 1d ago

Domain-Conditioned Safety in Frontier Computer-Using Agents: A 793-Episode Browser Benchmark, a Coding-Domain Cross-Reference, and a Reproducibility Audit of Recent Red-Teaming

arXiv:2606.05233v1 Announce Type: new Abstract: Recent computer-using-agent (CUA) red-teaming papers report prompt-injection attack success rates (ASR) of 42-98%, but these headline numbers cluster on retired models and on the most-vulnerable model in each paper's panel. We ask whether those techniques, reproduced as hand-crafted templates, still work against current frontier CUAs.

arXiv CS 5d ago

Claude Fable 5

Claude Fable 5 and Claude Mythos 5. Today we’re launching Claude Fable 5: a Mythos-class model that we’ve made safe for general use. Fable 5’s capabilities exceed those of any model we’ve ever made generally available.

Hacker News 1d ago

Beyond Goodhart's Law: A Dynamic Benchmark for Evaluating Compliance in Multi-Agent Systems

Announce Type: new Abstract: The rapid evolution of Large Language Models (LLMs) from passive assistants to autonomous, execution-capable agents has introduced critical operational risks. Most current evaluation frameworks neglect procedural compliance, leading to "Machiavellian" behaviors where agents strategically violate safety rules to maximize rewards, a direct manifestation of Goodhart's Law. To address this blind spot, we introduce MAC-Bench, a dynamic, adversarial benchmark...

arXiv CS 1d ago

How back is the U? Can Clemson rebound? Previewing...

Life in the ACC certainly isn't boring. In the past year alone, the conference has produced a long and awkward CFP rankings battle, an irate affiliate member, a thrilling national title game run, the strangest tiebreaker result imaginable, an out-of-nowhere 11-win season, the most disappointing team in the country, an epic pro-to-college face-plant, 18 of the 38 best games of the 2025 season, the No. 1 pick in the NFL draft (indirectly) and the most awkward possible move to nine-game...

ESPN 7d ago

Human-Like Neural Nets by Catapulting

Speculative proposal to create artificial neural nets with human-like performance by high-learning-rate/regularization training of overparameterized NNs to trigger catapulting/grokking. Over-parameterization as a route to true generalization would resolve many outstanding mysteries of artificial versus natural intelligence. There are many mysteries about deep learning and human intelligence, but we could describe the biggest anomaly this way: why are...

Hacker News 3d ago

Automated Root-Cause Subclassification and No-Code Fix Generation for Invalid Bug Reports

arXiv:2605.17561v2 Announce Type: replace Abstract: Issues faced when using software are reported in the form of bug reports. However, many bug reports are invalid, meaning they do not require code changes, and are resolved with a no-code fix. Manually determining the root cause of the invalid bug reports and providing actionable resolutions by the customer support causes a serious waste of resources.

arXiv CS 2d ago

Who Gets Credit or Blame? Attributing Accountability in Modern AI Systems

Announce Type: replace Abstract: Modern AI systems are typically developed through multiple stages (pretraining, fine-tuning rounds, and subsequent adaptation or alignment), where each stage builds on the previous ones and updates the model in distinct ways. This raises a critical question of accountability: when a deployed model succeeds or fails, which stage is responsible, and to what extent? We pose the accountability attribution problem for tracing model behavior back to specific stages...

arXiv CS 9d ago