Home › Knowledge Base › Language Model Evaluation Harness

Language Model Evaluation Harness

No mentions found

This entity hasn't been tracked yet, or Iris is still building its knowledge base.

Related Articles from SNS

Lessons from the Trenches on Reproducible Evaluation of Language Models

arXiv:2405.14782v3 Announce Type: replace Abstract: Reliable evaluation of language models (LMs) remains an open challenge. Re- searchers and engineers face methodological issues such as the sensitivity of models to evaluation setup, difficulty of proper comparisons across methods, and the lack of reproducibility and transparency. Evaluation difficulties are exacer- bated by the fracturing and siloing of information about conventions and common practices.

arXiv CS 8d ago

Evaluating AI-based Scientific Knowledge Synthesis with Epidemiological Systematic Reviews

arXiv:2603.22327v2 Announce Type: replace Abstract: Systematic literature reviews (SLRs) are a demanding and high-stakes form of scientific knowledge synthesis that remains underspecified as an evaluation setting for large language models (LLMs). We introduce AgentSLR, a large-scale evaluation harness comprising an SLR automation workflow and an expert annotated dataset covering 16,248 articles, designed to test LLM capabilities across the stages of SLRs in epidemiology. Reference...

arXiv CS 2d ago

It's Not Just X. It's Y

It's Y. Against the Quantification of Integrity When the measure of language becomes its target, it ceases to be good language. "It's not x, it's y." Large Language Models gravitate toward this type of construction, called negative parallelism. It has its uses: it sets up a contrast.

Hacker News 9d ago

Gate AI: LLM Security Benchmark Evaluation Methodology and Results

arXiv:2606.02959v1 Announce Type: new Abstract: Published evaluations of prompt-injection and jailbreak detectors for Large Language Models often suffer from two systematic weaknesses: per-dataset threshold tuning and undisclosed operating points. We describe an evaluation harness that addresses both.

arXiv CS 7d ago

Statistical Priors for Implicit Preferences: Decoupling Skill Selection as a Local Harness in Personal Agents

new Abstract: As Large Language Model (LLM) capabilities advance, locally deployed personal agents relying on API-based remote models and external skills have emerged as a novel paradigm. With the rapid expansion of available skills, enabling personal agents to learn and adapt to implicit user preferences becomes a critical challenge. However, local deployment constraints preclude complex centralized selection algorithms, creating an urgent need for a lightweight local preference harness.

arXiv CS 5d ago

Self-Evolving Deep Research via Joint Generation and Evaluation

arXiv:2606.04507v1 Announce Type: new Abstract: Large Language Models (LLMs) have become increasingly adopted in daily applications, with deep research standing out as a particularly important capability. Unlike traditional question-answering (QA) tasks, deep research report generation lacks definitive ground-truth, making reward design inherently unverifiable and limiting effective reinforcement learning. Existing approaches mitigate this challenge with LLM-as-a-judge and query-dependent...

arXiv CS 6d ago

MUSE: A Unified Agentic Harness for MLLMs

Announce Type: new Abstract: Despite rapid progress, multimodal large language models (MLLMs) still fail on tasks that humans solve effortlessly, such as navigating a grid maze from a screenshot or selecting the correct puzzle piece. Rather than retraining the model, we ask a complementary question: how much capability can be elicited from a frozen MLLM purely by improving the execution scaffold around it? We introduce MUSE, a multimodal unified structured execution harness that wraps any...

arXiv CS 7d ago

Rethinking Search as Code Generation

Rethinking Search as Code Generation Evolving search from monolithic services to programmable primitives for the era of agent harnesses. Search is a core primitive for AI systems. Frontier models grow more capable by the month, but they still need access to fresh, accurate, and well-curated knowledge from the wider world.

Hacker News 8d ago

Enginuity: A Dataset and Benchmark for Vision-Language Understanding of Engineering Diagrams

Announce Type: new Abstract: Engineering diagrams pose a distinct challenge for vision-language models: unlike natural images or general documents, they encode information through dense spatial layouts, domain-specific symbols, and cross-references between visual callouts and structured parts tables. Despite their centrality to service, repair, and design workflows, there is no public benchmark for measuring VLM capabilities in this domain; existing datasets primarily focus on flowcharts,...

arXiv CS 7d ago

Sound and Complete Neurosymbolic Reasoning with LLM-Grounded Interpretations

arXiv:2507.09751v3 Announce Type: replace Abstract: Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation, but exhibit problems with logical consistency in their output. How can we harness LLMs' broad-coverage parametric knowledge in formal reasoning despite their inconsistency? We present a method for directly integrating an LLM into the interpretation function of the formal semantics for a paraconsistent logic.

arXiv CS 1d ago