
Senior SWE-Bench: open-source benchmark that assesses agents as senior engineers

Key Points

We treat agents like senior engineers, so why evaluate them like junior engineers?

Senior engineers build features without over-specified requirements

Senior SWE-Bench feature tasks have realistic instructions that read like natural-language messages rather than over-specified requirements. To evaluate these tasks reliably, we introduce a validation agent that uses expert-designed recipes to write behavioral tests that adapt to submitted solutions.

Senior engineers solve bugs that require runtime investigation from behavioral reports

Senior SWE-Bench bug tasks reflect tricky user reports and focus on investigation, from starting services to debugging subtle runtime issues. They are sourced from PRs that needed significant runtime investigation to solve (e.g., logs, profiling data, reproduction steps).

Senior engineers ship the right code without being told to

Senior SWE-Bench scores tasteful solves by combining runtime correctness tests with several quality metrics based on observed codebase practices. In addition, verifiers and validation can test against load-bearing codebase practices that go unstated in instructions.
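One way to picture how runtime correctness tests and quality metrics could combine into a single verdict is sketched below. This is an illustration only: the metric names, the thresholds, and the rule that every metric must clear its threshold are assumptions made for the example, not the benchmark's published scoring.

```python
from dataclasses import dataclass, field


@dataclass
class Submission:
    """Hypothetical record of one agent submission."""
    tests_passed: bool                      # outcome of the runtime correctness tests
    quality: dict[str, float] = field(default_factory=dict)  # metric name -> score in [0, 1]


# Hypothetical metric names and minimum scores, chosen only for illustration.
QUALITY_THRESHOLDS = {
    "follows_repo_conventions": 0.8,
    "reuses_existing_helpers": 0.5,
}


def is_tasteful_solve(sub: Submission, thresholds: dict[str, float] = QUALITY_THRESHOLDS) -> bool:
    """Return True only if tests pass AND every quality metric meets its threshold."""
    if not sub.tests_passed:
        return False
    # A missing metric is treated as a score of 0.0, so it fails its threshold.
    return all(sub.quality.get(name, 0.0) >= minimum for name, minimum in thresholds.items())


correct_but_sloppy = Submission(True, {"follows_repo_conventions": 0.4, "reuses_existing_helpers": 0.9})
correct_and_clean = Submission(True, {"follows_repo_conventions": 0.9, "reuses_existing_helpers": 0.7})

print(is_tasteful_solve(correct_but_sloppy))  # False
print(is_tasteful_solve(correct_and_clean))   # True
```

Under this assumed rule, a submission that passes every test can still fail to count as a solve, which is the distinction the text draws between correctness and taste.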
Leaderboard

All entries were run with Mini-SWE-Agent.

| # | Model | Effort | Solve rate (pass@1) |
|---|---|---|---|
| 1 | Claude Opus 4.8 | max | 24.0% |
| | Claude Sonnet 5 | max | 19.4% |
| 2 | GPT-5.5 | xhigh | 16.0% |
| 3 | Claude Opus 4.7 | max | 14.1% |
| 4 | GPT-5.4 | xhigh | 14.0% |
| 5 | GLM-5.2 | max | 12.5% |
| 6 | Kimi K2.6 | default | 8.2% |
| 7 | Claude Sonnet 4.6 | high | 8.2% |
| 8 | Gemini 3.1 Pro | high | 6.1% |
| 9 | Gemini 3.5 Flash | medium | 3.0% |

Even the top-performing frontier models fail to complete tasks with senior-level correctness and taste more than 75% of the time.

Tasks

Senior SWE-Bench tasks are sourced from PRs in repos spanning libraries to multi-service applications, authored by engineers with hundreds of commits in their respective repos. We focus on multi-phase, multi-stack feature PRs and bug/performance PRs with significant runtime investigation. For more on task design, read the blog post.

More naturally under-specified instructions

Senior SWE-Bench tasks reflect natural communication with agents, with a median instruction length 31% that of SWE-Bench Pro.

More diverse task scope

Senior SWE-Bench feature tasks can span multiple services, with an average of 11 files touched per feature task.

Longer task horizon

Senior SWE-Bench tasks are designed to be long-horizon, requiring hundreds of steps for even the strongest agents.
Reference-solution SLOC & files are measured identically across all three benchmarks. Instruction length excludes harness boilerplate. Token and step counts for other benchmarks are based on their self-reported metrics.
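The leaderboard's headline claim, that the best models fail more than 75% of the time, follows directly from the pass@1 solve rates. The short check below copies the published percentages and computes each failure rate as 100 minus the solve rate; the variable and function names are illustrative choices, not part of the benchmark.

```python
# Solve rates (pass@1, in percent) copied from the leaderboard.
LEADERBOARD = [
    ("Claude Opus 4.8", "max", 24.0),
    ("Claude Sonnet 5", "max", 19.4),
    ("GPT-5.5", "xhigh", 16.0),
    ("Claude Opus 4.7", "max", 14.1),
    ("GPT-5.4", "xhigh", 14.0),
    ("GLM-5.2", "max", 12.5),
    ("Kimi K2.6", "default", 8.2),
    ("Claude Sonnet 4.6", "high", 8.2),
    ("Gemini 3.1 Pro", "high", 6.1),
    ("Gemini 3.5 Flash", "medium", 3.0),
]


def failure_rate(solve_rate_pct: float) -> float:
    """Percentage of tasks not solved, given a pass@1 solve rate in percent."""
    return round(100.0 - solve_rate_pct, 1)


# The entry with the highest solve rate has the lowest failure rate.
best_model, best_effort, best_rate = max(LEADERBOARD, key=lambda row: row[2])
print(f"{best_model} ({best_effort}): fails {failure_rate(best_rate)}% of tasks")
# prints "Claude Opus 4.8 (max): fails 76.0% of tasks"
```

Since the best entry fails 76.0% of tasks, every other entry in the table fails at a higher rate.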
Originally published by Hacker News.