Unit Test Generation
Related Articles from SNS
Knowledge Matters: Injecting Project and Testing Knowledge into LLM-based Unit Test Generation
arXiv:2511.14224v3 (announce type: replace). Abstract: Automated unit test generation using large language models (LLMs) holds great promise but often struggles with generating tests that are both correct and maintainable in real-world projects. This paper presents KTester, a novel framework that integrates project-specific knowledge and testing domain knowledge to enhance LLM-based test generation. Our approach first extracts project structure and usage knowledge through static analysis, which...
Go-UT-Bench: A Fine-Tuning Dataset for LLM-Based Unit Test Generation in Go
arXiv:2511.10868v2 (announce type: replace). Abstract: Training data imbalance poses a major challenge for code LLMs. Most available data heavily overrepresents raw open-source code while underrepresenting broader software engineering tasks, especially in low-resource languages like Golang. As a result, models excel at code autocompletion but struggle with real-world developer workflows such as unit test generation.
LLM vs. Human Unit Tests: Fault Detection on Real Python Bugs
arXiv:2606.08588v1 (announce type: new). Abstract: Large language models (LLMs) have shown considerable promise for automated unit test generation, yet their practical effectiveness relative to human-written tests remains poorly understood. Existing evaluations commonly rely on coverage-oriented benchmarks that do not assess fault-detection capability directly. We present an empirical comparison of LLM-generated and human-written unit tests across three complementary Python benchmarks: 29 real...
Improving Small Language Models for Code Generation with Reinforcement Learning from Verification Feedback
arXiv:2605.30478v1 (announce type: new). Abstract: Reinforcement learning with verifiable rewards (RLVR) trains language models using programmatically checkable signals such as unit-test outcomes, enabling direct optimization for functional correctness in code generation. We conduct an empirical study of RLVR for Python code generation on the MBPP benchmark using two small models (Qwen3-0.6B and Llama3.2-1B) with LoRA fine-tuning. Across multiple reward formulations such as: unit-test-only...
RadOT-Eval: Auditable Structured-Evidence Transport for Radiology Report Evaluation
arXiv:2606.08769v1 (announce type: new). Abstract: Automatic evaluation is critical for high-stakes text generation, where errors often involve omitted findings, hallucinated content, polarity reversals, location changes, uncertainty mismatches, and temporal-comparison errors rather than low surface similarity alone. Radiology report generation provides a challenging test case because generated reports must preserve structured clinical evidence across sources. We present RadOT-Eval, an...
WES STREETING: 'Not changing Labour would be like sticking two fingers up to public'
'The test for each generation is to pass something better to the next, but kids today are facing a worse future', the former Health Secretary Wes Streeting writes. Voters sent a clear message in this month’s elections. Labour lost big time to nationalists in England, Scotland, and Wales. Ploughing on regardless would be like sticking two fingers in our ears and another two up to the public.
AI, Ashby Engineering, and the future
Since August 2025, more than half of the new code hitting Ashby’s production systems has been AI-generated, yet customer issues remain broadly stable. More AI-written code. We have a blip in March / April every year; these cyclical patterns aren’t relevant to explain here.
Before the Model Learns the Bug: Fuzzing RLVR Verifiers
arXiv:2606.01066v1 (announce type: new). Abstract: Reinforcement learning with verifiable rewards (RLVR) replaces human preference labels with executable reward functions such as math answer checkers, JSON tool-call validators, and code unit-test harnesses. That makes the reward partly a software artifact: if the verifier is wrong, optimization can learn the bug. We study this failure mode with a lightweight verifier-fuzzing framework that generates adversarial completions, compares buggy and...
VASO: Formally Verifiable Self-Evolving Skills for Physical AI Agents
arXiv:2606.05395v1 (announce type: new). Abstract: Reusable robot skills are becoming the basic units through which embodied agents turn open-ended instructions into long-horizon physical behavior. We argue that, while foundation models have collapsed the cost of creating these skills, the cost of trusting them has not. Existing skill-evolution loops refine skills through execution feedback, unit tests, environment reward, or LLM self-critique, but these signals provide only trace-level...
Alpha-RTL: Test-Time Training for RTL Hardware Optimization
arXiv:2606.05253v1 (announce type: new). Abstract: Large language models (LLMs) have shown increasing promise in generating functionally correct register-transfer-level (RTL) hardware designs. Recent systems improve further through EDA-integrated reinforcement learning with syntax, simulation, and PPA rewards, but train a general RTL generator before deployment while test-time approaches search with a frozen policy. We instead perform reinforcement learning at test time, allowing the LLM policy...