
Code Bench


Related Articles from SNS

Asuka-Bench: Benchmarking Code Agents on Underspecified User Intent and Multi-Round Refinement

Abstract: Existing code-generation benchmarks score a single mapping from a complete prompt to a one-shot output. Real web development is different: users seldom write a full spec at the start, and many requirements only become clear once they look at an intermediate result and react to it.

arXiv CS 5d ago

Microsoft's MAI-Code-1-Flash Scores 51% SWE-Bench Pro with Just 5B Active Params

MAI-Code-1-Flash features: coding task reasoning; agentic execution; and broad programming language support (fluent across programming languages, frameworks, and ecosystems). Optimized for GitHub Copilot in VS Code. Reported performance categories: SWE-Bench Pro (coding capabilities), AIME 2026 (math performance), and IFBench (instruction following).

Hacker News 8d ago

CodeGolf Bench: A Multi-Language Benchmark for Evaluating Concise Code Generation Capabilities of Large Language Models

Abstract: This paper introduces CodeGolf Bench, a benchmark for evaluating the concise code-generation abilities of Large Language Models (LLMs) in 60 programming languages. Based on code golf, a recreational programming competition focused on minimal character or byte solutions, the benchmark provides a distinctive measure of LLMs' ability to produce efficient, concise code. Unlike existing benchmarks limited by fixed problem sets and language coverage, CodeGolf Bench...

arXiv CS 9d ago

SWR-Bench: Assessing LLM Performance in Real-World Code Review Comment Generation

arXiv:2509.01494v2. Abstract: Automated Code Review (ACR) is crucial for software quality, yet existing benchmarks often fail to reflect real-world complexities, hindering the evaluation of modern Large Language Models (LLMs). Current benchmarks frequently focus on fine-grained code units, lack complete project context, and use inadequate evaluation metrics. To address these limitations, we introduce SWRBench, a new benchmark comprising 1000 manually verified Pull...

arXiv CS 2d ago

FEM-Bench: A Structured Scientific Reasoning Benchmark for Evaluating Code-Generating LLMs

arXiv:2512.20732v2. Abstract: As LLMs advance their reasoning capabilities about the physical world, the absence of rigorous benchmarks for evaluating their ability to generate scientifically valid physical models has become a critical gap. Computational mechanics, which develops and applies mathematical models and numerical methods to predict the behavior of physical systems under forces, deformation, and constraints, provides an ideal foundation for structured...

arXiv CS 9d ago

SWE-Explore: Benchmarking How Coding Agents Explore Repositories

arXiv:2606.07297v1. Abstract: Repository-level coding benchmarks such as SWE-bench have driven a rapid surge in the capabilities of coding agents. Yet they usually treat coding tasks as a holistic, binary prediction problem (e.g., resolved or unresolved), neglecting fine-grained agent capabilities such as repository understanding, context retrieval, code localization, and bug diagnosis. In this paper, we introduce SWE-Explore, a benchmark that isolates the evaluation of...

arXiv CS 2d ago

FrontierCode

Introducing FrontierCode: raising the bar from correctness to quality. Today’s coding benchmarks have established that models can write correct code. But as AI-generated code becomes the dominant path to production, correctness is now table stakes. The question that we should be asking is: can models actually write good code?

Hacker News 2d ago

MLEvolve: A Self-Evolving Framework for Automated Machine Learning Algorithm Discovery

arXiv:2606.06473v1. Abstract: Large language model (LLM) agents are increasingly applied to long-horizon tasks such as scientific discovery and machine learning engineering (MLE), where sustained self-evolution becomes a key capability. However, existing MLE agents suffer from inter-branch information isolation, memoryless search, and lack of hierarchical control, which together hinder long-horizon optimization. We present MLEvolve, an LLM-based self-evolving multi-agent...

arXiv CS 5d ago

MOLOT System Card: Malicious Operational Logic Observation Transformer

arXiv:2606.07792v1. Abstract: MOLOT (Malicious Operational Logic Observation Transformer) is a static malicious-code detection system designed for SAST setups where package metadata, maintainer history, and dynamic execution traces may be unavailable or unreliable. The system represents source code as behavior sequences derived from static call graphs and includes an explanation stage that ranks suspicious behavior activities and maps them back to source-code locations.

arXiv CS 1d ago

Lumos-Nexus: Efficient Frequency Bridging with Homogeneous Latent Space for Video Unified Models

arXiv:2605.31603v1. Abstract: Connector-based video unified models have demonstrated strong capability in instruction-grounded video synthesis, but integrating a large high-fidelity generator into the unified training loop is computationally prohibitive, limiting achievable visual quality. We therefore propose Lumos-Nexus, a training-efficient unified video generation framework that facilitates the development of strong reasoning-driven generation capabilities while...

arXiv CS 9d ago