SWE

No mentions found

This entity hasn't been tracked yet, or Iris is still building its knowledge base.

Related Articles from SNS

SWE-rebench V2: Language-Agnostic SWE Task Collection at Scale

arXiv:2602.23866v2 Announce Type: replace Abstract: Software engineering agents (SWE) are improving rapidly, with recent gains largely driven by reinforcement learning (RL). However, RL training is constrained by the scarcity of large-scale task collections with reproducible execution environments and reliable test suites. Although a growing number of benchmarks have emerged, datasets suitable for training remain limited in scale and diversity or often target a limited set of high-resource...

arXiv CS 8d ago

SWE-MiniSandbox: Container-Free Reinforcement Learning for Building Software Engineering Agents

arXiv:2602.11210v5 Announce Type: replace Abstract: Reinforcement learning (RL) has become a key paradigm for training software engineering (SWE) agents, but existing pipelines typically rely on per-task containers for isolation. At scale, pre-built container images incur substantial storage overhead, slow environment setup, and require container-management privileges. We propose SWE-MiniSandbox, a lightweight, container-free method that enables scalable RL training of SWE agents without...

arXiv CS 8d ago

Socratic-SWE: Self-Evolving Coding Agents via Trace-Derived Agent Skills

arXiv:2606.07412v1 Announce Type: new Abstract: LLM-driven software engineering agents have become a central testbed for real-world language-model capability, yet their training remains limited by the availability of high-quality SWE tasks. Existing synthetic data methods typically create tasks through fixed mutation or bug-injection procedures, making the resulting distributions largely independent of the agent's own weaknesses and training progress. We introduce Socratic-SWE, a closed-loop...

arXiv CS 2d ago

SWE-Explore: Benchmarking How Coding Agents Explore Repositories

arXiv:2606.07297v1 Announce Type: new Abstract: Repository-level coding benchmarks such as SWE-bench have driven a rapid surge in the capabilities of coding agents. Yet they usually treat coding tasks as a holistic, binary prediction problem (e.g., resolved or unresolved), neglecting fine-grained agent capabilities such as repository understanding, context retrieval, code localization, and bug diagnosis. In this paper, we introduce SWE-Explore, a benchmark that isolates the evaluation of...

arXiv CS 2d ago

AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation

arXiv:2605.12925v3 Announce Type: replace Abstract: Evaluation of software engineering (SWE) agents is dominated by a binary signal: whether the final patch passes the tests. This outcome-only view treats a principled solution and a chaotic trial-and-error process as equivalent. We show that this equivalence is empirically false.

arXiv CS 7d ago

Projecting the Emerging Mindset of SWE Agent by Launching a Wild Code Understanding Journey

Announce Type: new Abstract: Software engineering agents (SWE agents) increasingly work through tool-mediated trajectories in real repositories, yet their behavior remains difficult to characterize in concrete, observable terms. These trajectories record tool use, intermediate reasoning, evidence selection, and self-directed stopping, but they do not by themselves explain why particular moves were chosen, what evidence was trusted, or when understanding was judged sufficient. This tension...

arXiv CS 1d ago

Toward Training Superintelligent Software Agents through Self-Play SWE-RL

arXiv:2512.18552v3 Announce Type: replace Abstract: While current software agents powered by large language models (LLMs) and agentic reinforcement learning (RL) can boost programmer productivity, their training data (e.g., GitHub issues and pull requests) and environments (e.g., pass-to-pass and fail-to-pass tests) heavily depend on human knowledge or curation, posing a fundamental barrier to superintelligence. In this paper, we present Self-play SWE-RL (SSR), a first step toward training...

arXiv CS 7d ago

Microsoft's MAI-Code-1-Flash Scores 51% SWE-Bench Pro with Just 5B Active Params

MAI-Code-1-Flash. Features: coding task reasoning; agentic execution; broad programming language support (fluent across programming languages, frameworks, and ecosystems); optimized for GitHub Copilot in VS Code. Performance: SWE-Bench Pro 0 % (coding capabilities); AIME 2026 0 % (math performance); IFBench 0 % (instruction following).

Hacker News 8d ago

SWE-InfraBench: Evaluating Language Models on Cloud Infrastructure Code

arXiv:2606.05249v1 Announce Type: new Abstract: Building infrastructure-as-code (IaC) in cloud computing is a critical task, underpinning the reliability, scalability, and security of modern software systems. Despite the remarkable progress of large language models (LLMs) in software engineering -- demonstrated across many dedicated benchmarks -- their capabilities in developing IaC remain underexplored. Unlike existing IaC benchmarks that predominantly center on declarative paradigms such...

arXiv CS 5d ago

SWE-IF: Aligning Code Evaluation with Human Preference

arXiv:2510.07315v2 Announce Type: replace Abstract: Large Language Models (LLMs) have catalyzed vibe coding, where users leverage LLMs to generate and iteratively refine code through natural language interactions until it passes their vibe check. Vibe check reflects human preference and goes beyond functionality: the solution should feel right, read cleanly, preserve intent, and remain correct. However, current code evaluation remains anchored to pass@k and captures only functional...

arXiv CS 2d ago