Home Knowledge Base Benchmarking Coding Agents

Benchmarking Coding Agents

No mentions found

This entity hasn't been tracked yet, or Iris is still building its knowledge base.

Related Articles from SNS

SWE-Explore: Benchmarking How Coding Agents Explore Repositories

arXiv:2606.07297v1 Announce Type: new Abstract: Repository-level coding benchmarks such as SWE-bench have driven a rapid surge in the capabilities of coding agents. Yet they usually treat coding tasks as a holistic, binary prediction problem (e.g., resolved or unresolved), neglecting fine-grained agent capabilities such as repository understanding, context retrieval, code localization, and bug diagnosis. In this paper, we introduce SWE-Explore, a benchmark that isolates the evaluation of...

arXiv CS 2d ago

Asuka-Bench: Benchmarking Code Agents on Underspecified User Intent and Multi-Round Refinement

Announce Type: new Abstract: Existing code-generation benchmarks score a single mapping from a complete prompt to a one-shot output. However, real web development is different. Users seldom write a full spec at the start; many requirements only become clear once they look at an intermediate result and react to it.

arXiv CS 5d ago

TensorBench: Benchmarking Coding Agents on a Compiler-Based Tensor Framework

arXiv:2606.05570v1 Announce Type: new Abstract: Repository-level coding benchmarks face a trade-off between task difficulty and evaluation reliability: tasks that challenge frontier models often involve large codebases with incomplete test coverage, while human review does not scale. We introduce TensorBench, a benchmark of 199 feature-addition and refactoring tasks on an open-source compiler-based tensor framework that extends PyTorch with first-class support for dense and sparse tensors....

arXiv CS 5d ago

SecureVibeBench: Benchmarking Secure Vibe Coding of AI Agents via Reconstructing Vulnerability-Introducing Scenarios

arXiv:2509.22097v5 Announce Type: replace Abstract: Large language model-powered code agents are rapidly transforming software engineering, yet the security risks of their generated code have become a critical concern. Existing benchmarks have provided valuable insights, but they fail to capture scenarios in which vulnerabilities are actually introduced by human developers, making fair comparisons between humans and agents infeasible.

arXiv CS 1d ago

SABER: Benchmarking Operational Safety of LLM Coding Agents in Stateful Project Workspaces

arXiv:2606.01317v1 Announce Type: new Abstract: Large language models are increasingly deployed as coding agents, shifting safety from individual responses to action sequences. Existing benchmarks, however, primarily assess whether models refuse unsafe prompts, leaving impacts on stateful workspaces largely unexamined.

arXiv CS 8d ago

3DCodeBench: Benchmarking Agentic Procedural 3D Modeling Via Code

arXiv:2606.01057v1 Announce Type: new Abstract: Procedural 3D modeling through code is emerging as a versatile paradigm, offering deterministic, engine-ready, and precisely editable assets that neural 3D generators inherently lack. Authoring such procedural content, however, demands deep expertise in 3D software APIs, parametric design, and code-level geometric reasoning. In this paper, we propose 3DCodeBench, a systematic benchmark for evaluating vision-language model (VLM) agents for...

arXiv CS 8d ago

Handoff Debt: The Rediscovery Cost When Coding Agents Take Over Interrupted Tasks

Announce Type: new Abstract: Coding-agent benchmarks evaluate whether a single uninterrupted agent can resolve a repository issue. Real software work is messier: tasks are interrupted, reassigned, reviewed, and resumed from partial states left by another agent or engineer. We study this missing dimension through \emph{handoff debt}: the rediscovery cost imposed when a predecessor's work is opaque or incomplete.

arXiv CS 7d ago

What Breaks When LLMs Code? Characterizing Operational Safety Failures of Agentic Code Assistants

arXiv:2605.30777v1 Announce Type: new Abstract: Autonomous coding agents built on large language models (LLMs) are rapidly being integrated into development workflows, yet their operational safety properties remain poorly understood beyond evaluations of explicitly malicious inputs. In practice, high-impact failures arise during benign, goal-directed use through environment breakage, fabricated success reports, etc. that current benchmarks do not capture. What categories of operational...

arXiv CS 9d ago

SmellBench: Towards Fine-Grained Evaluation of Code Agents on Refactoring Tasks

Announce Type: new Abstract: Code Agents have achieved remarkable advances in recent years, exhibiting strong capabilities across a wide range of software engineering tasks. However, their misuse often produces bloated and disorganized code that impairing readability, extensibility, and robustness. Despite this risk, existing benchmarks largely evaluate functional correctness rather than long-term maintainability of code agents.

arXiv CS 5d ago