HumanEval
Related Articles from SNS
Honest Lying: Understanding Memory Confabulation in Reflexive Agents
arXiv:2605.29463v2 Announce Type: replace Abstract: Reflexion-style agents rely on self-generated reflections as memory, implicitly assuming that agents can accurately diagnose their own failures. We show that this assumption can fail systematically: across ALFWorld and HumanEval, agents store confident but incorrect interpretations of the task and continue acting on them across trials, even though the environment resets to the correct task each time. We call this failure mode memory...
The Security Budget of Code LLMs: An Information-Theoretic Capacity-Security Bound
arXiv:2606.03308v2 Announce Type: replace Abstract: AI programming assistants make natural-language prompts a software-development interface, so small prompt perturbations become usability and security risks. We study an information-theoretic trade-off for code LLMs between functional capacity, $\mathrm{Cap}=\mathrm{I}(c^*;c_\pi)$, and perturbation retention, $\mathrm{Sec}=\mathrm{I}(c_\pi;\tilde c_\pi)$. Here $\mathrm{Sec}$ is a retention-channel quantity, not a direct measure of exploit success or vulnerable-code generation....
KVarN: Variance-Normalized KV-Cache Quantization Mitigates Error Accumulation in Reasoning Tasks
Announce Type: new Abstract: Test-time scaling is a powerful approach to obtain better reasoning in large language models, but it becomes memory-bottlenecked during long-horizon decoding, as the KV-cache grows. KV-cache quantization can help improve this, but current methods are evaluated under prefill-like settings and errors behave differently under autoregressive decoding. We show that in the latter regime, quantization errors accumulate across timesteps, driven primarily by incorrect...
Scaffold, Not Vocabulary? A Controlled, Two-Tier, Pre-Registered Study of a Popperian Code-Generation Skill
arXiv:2606.06454v1 Announce Type: new Abstract: Large language models increasingly write, review, and judge code, and a fast-growing practice equips them with prompt 'skills' that ask the model to reason like a scientist. A prominent example tells the model to act as a Popperian falsificationist, and such skills are reported to improve generated code.
SecRL-Prune: Structured Reinforcement Learning-Based Pruning of CodeLLMs for Preserving Adversarial Code Mutation
Announce Type: new Abstract: Large code language models (CodeLLMs) can generate and rewrite programs, enabling functionality-preserving code mutation that may be used to create diverse malware variants and evade signature-based detection. A key security question is whether this mutation capability survives model compression, which would make deployment feasible under limited hardware budgets. We propose SecRL-Prune, a structured pruning framework for CodeLLMs that operates on feed-forward...
Voting Protocols as Coordination Mechanisms for Role-Constrained Multi-Agent Tutoring Systems
arXiv:2606.08030v1 Announce Type: new Abstract: Agentic tutoring systems introduce a coordination challenge: multiple agents may propose different but reasonable interventions, yet only one response can be delivered to the learner. In this paper, we study how voting protocols shape cooperation among four role-constrained pedagogical agents responsible for scaffolding, misconception, motivation, and metacognition. We compare four voting protocols -- simple, ranked, cumulative, and approval...
From 0-to-1 to 1-to-N: Reproducible Engineering Evidence for MetaAI Recursive Self-Design
arXiv:2606.09663v1 Announce Type: new Abstract: Recursive self-design refers to AI-assisted modification of the mechanisms by which an AI system is built, evaluated, and improved. This paper treats MetaAI not as a mature paradigm, but as a working term for a human-seeded, AI-expanded development pattern in which the design space itself becomes a target of modification. We propose an operational evidence framework with four criteria: inspectable target system, meta-level modifier,...
FASE: Fast Adaptive Semantic Entropy for Code Quality
arXiv:2606.09800v1 Announce Type: new Abstract: Multi-agent code generation offers a promising paradigm for autonomous software development by simulating the human software engineering lifecycle. However, system reliability remains hindered by LLM hallucinations and error propagation across interacting agents. While semantic entropy provides a principled way to quantify uncertainty without ground-truth answers, current methods often rely on costly LLM-driven equivalence checks.
MAR: Multi-Agent Reflexion Improves Reasoning Abilities in LLMs
Announce Type: replace Abstract: LLMs have shown the capacity to improve their performance on reasoning tasks through reflecting on their mistakes, and acting with these reflections in mind. However, continual reflections of the same LLM onto itself exhibit degeneration of thought, where the LLM continues to repeat the same errors again and again even with the knowledge that it is wrong. To address this problem, we instead introduce multi-agent with multi-persona debaters as the method to...