SWE-Bench Pro
Related Articles from SNS
Microsoft's MAI-Code-1-Flash Scores 51% SWE-Bench Pro with Just 5B Active Params
MAI-Code-1-Flash. Features: coding task reasoning; agentic execution; broad programming language support (fluent across programming languages, frameworks, and ecosystems); optimized for GitHub Copilot in VS Code. Performance benchmarks reported: SWE-Bench Pro (coding capabilities), AIME 2026 (math performance), IFBench (instruction following).
Socratic-SWE: Self-Evolving Coding Agents via Trace-Derived Agent Skills
arXiv:2606.07412v1. Abstract: LLM-driven software engineering agents have become a central testbed for real-world language-model capability, yet their training remains limited by the availability of high-quality SWE tasks. Existing synthetic data methods typically create tasks through fixed mutation or bug-injection procedures, making the resulting distributions largely independent of the agent's own weaknesses and training progress. We introduce Socratic-SWE, a closed-loop...
FrontierCode
Introducing FrontierCode: Raising the bar from correctness to quality. Today’s coding benchmarks have established that models can write correct code. But as AI-generated code becomes the dominant path to production, correctness is now table stakes. The question that we should be asking is: can models actually write good code?
Toward Training Superintelligent Software Agents through Self-Play SWE-RL
arXiv:2512.18552v3. Abstract: While current software agents powered by large language models (LLMs) and agentic reinforcement learning (RL) can boost programmer productivity, their training data (e.g., GitHub issues and pull requests) and environments (e.g., pass-to-pass and fail-to-pass tests) heavily depend on human knowledge or curation, posing a fundamental barrier to superintelligence. In this paper, we present Self-play SWE-RL (SSR), a first step toward training...
Retrospective Harness Optimization: Improving LLM Agents via Self-Preference over Trajectory Rollouts
Abstract: AI agents rely on a harness of skills, tools, and workflows to solve complex problems. Continually improving this harness is essential for adapting to new tasks. However, existing optimization methods typically require ground-truth validation sets, yet such labeled data is difficult to acquire in practical deployment settings.
I Put a Datacenter GPU in My Gaming PC for £200
I already had an RTX 4080. Good enough for gaming, not good enough for the models I wanted to run locally. The next step up in GPU land is either to spend a fortune on a card with more VRAM or to find another way.