Benchmarking Autonomous Software

Related Articles from SNS

SW-$A^2$-Bench: Benchmarking Autonomous Software Agent Generation for Agentic Web

Announce Type: replace Abstract: The Agentic Web is emerging as a paradigm in which autonomous software agents interact with online resources and with each other to accomplish user goals. However, the capacity of Agentic Web is still limited by insufficient autonomous software agent population, which has become a crucial challenge for scaling Agentic Web. In order to alleviate this, we study the task of automatically converting existing code repositories into autonomous software agents via...

arXiv CS 2d ago

EvoClaw: Evaluating AI Agents on Continuous Software Evolution

arXiv:2603.13428v2 Announce Type: replace Abstract: With AI agents increasingly deployed as long-running systems, it becomes essential to autonomously construct and continuously evolve customized software to enable interaction within dynamic environments. Yet, existing benchmarks evaluate agents on isolated, one-off coding tasks, neglecting the temporal dependencies and technical debt inherent in real-world software evolution. To bridge this gap, we introduce DeepCommit, an agentic pipeline...

arXiv CS 2d ago

CyberGym-E2E: Scalable Real-World Benchmark for AI Agents' End-to-End Cybersecurity Capabilities

Announce Type: new Abstract: AI has the potential to transform cybersecurity by enabling systems that can autonomously detect, analyze, and remediate software vulnerabilities. However, existing cybersecurity evaluations of AI systems are limited in scale or scope, and fail to capture the end-to-end lifecycle of real-world software vulnerability discovery and remediation. To address this gap, we propose CyberGym-E2E, a large-scale and realistic end-to-end cybersecurity benchmark that...

arXiv CS 6d ago

RAT: RunAnyThing via Fully Automated Environment Configuration

arXiv:2604.23190v2 Announce Type: replace Abstract: Automating repository-level software engineering tasks is a foundational challenge for autonomous code agents, largely due to the difficulty of configuring executable environments. However, manual configuration remains a labor-intensive bottleneck, necessitating a transition toward fully automated environment configuration. Existing approaches often rely on pre-defined artifacts or are restricted to specific programming languages, limiting...

arXiv CS 5d ago

Toward Training Superintelligent Software Agents through Self-Play SWE-RL

arXiv:2512.18552v3 Announce Type: replace Abstract: While current software agents powered by large language models (LLMs) and agentic reinforcement learning (RL) can boost programmer productivity, their training data (e.g., GitHub issues and pull requests) and environments (e.g., pass-to-pass and fail-to-pass tests) heavily depend on human knowledge or curation, posing a fundamental barrier to superintelligence. In this paper, we present Self-play SWE-RL (SSR), a first step toward training...

arXiv CS 7d ago

Sakana AI's Recursive Self-Improvement (RSI) Lab

The Next Paradigm of Artificial Intelligence: As the world enters the era of artificial intelligence, Japan has a unique opportunity to reclaim its position at the frontier of global innovation. However, to achieve global leadership in AI and scientific discovery, we cannot simply stick to the conventional approach of brute-forcing monolithic models. We must leapfrog the current paradigm.

Hacker News 5d ago

The Road Ahead in Autonomous Driving: The KITScenes Multimodal Dataset

arXiv:2606.02956v1 Announce Type: new Abstract: Existing autonomous driving datasets have enabled major progress, but fall short in sensor fidelity, map completeness, or geographic diversity. We present KITScenes Multimodal, a European dataset built around high-fidelity sensors and maps. Our fully synchronized sensor suite combines high-resolution global-shutter cameras, long-range lidar beyond 400m, 4D imaging radar, and redundant GNSS/INS localization.

arXiv CS 7d ago

Claude Fable 5

Claude Fable 5 and Claude Mythos 5: Today we’re launching Claude Fable 5: a Mythos-class model that we’ve made safe for general use. Fable 5’s capabilities exceed those of any model we’ve ever made generally available.

Hacker News 1d ago

AI giant Anthropic files for US IPO as investors bet big on AI future

Anthropic, which operates AI chatbot Claude, did not disclose the size or the terms of the offering. Artificial intelligence (AI) giant Anthropic has confidentially filed for an initial public offering (IPO) in the United States, teeing up what could become a watershed moment for Wall Street’s AI frenzy. The move, announced on Monday, sets up a high-stakes test of whether investor appetite for the AI revolution that has...

Al Jazeera 9d ago

Software Platform for Hybrid Pseudo-Random Sequence Generation and Predictability Analysis Based on LFSR and Mersenne Twister

arXiv:2605.30977v1 Announce Type: cross Abstract: Generating reliable random and pseudo-random sequences is important in many electronic and signal processing systems, such as secure communications, radar, spread-spectrum methods, and autonomous platforms. Although true and quantum random number generators provide stronger unpredictability, classical pseudo-random number generators, including Linear Feedback Shift Registers (LFSRs) and the Mersenne Twister (MT), are still widely used because...

arXiv CS 9d ago