Diagnostic Evaluation Framework

Related Articles from SNS

When Are Multimodal Predictions Biologically Supported? A Diagnostic Evaluation Framework

arXiv:2605.31504v1 Announce Type: new Abstract: Multimodal models in oncology can produce accurate predictions, but accurate prediction does not reveal whether the model has learned biology that is shared across modalities, biology confined to one modality, or spurious correlations that reflect confounders rather than genuine biology. We introduce DECAT, a model-agnostic post-hoc evaluation framework that classifies multimodal representations into four diagnostic scenarios for a given task...

arXiv CS 9d ago

A PMP-inspired Evaluation Framework for Assessing Deep-Learning Earth System Models

arXiv:2604.06567v3 Announce Type: replace Abstract: In recent years, Deep-Learning Earth System Models (DL-ESMs) have emerged as promising, computationally efficient complements to traditional Earth system models. Here, we present an evaluation framework for testing DL-ESMs from an Earth system model-development perspective using standardized diagnostics from the PCMDI Metrics Package (PMP). This framework allows DL-ESMs, including Ai2's ACE2 and Google's NeuralGCM, to be assessed with...

arXiv Physics 2d ago

RISED: A Pre-Deployment Evaluation Framework for High-Stakes AI Decision-Support Systems, with Application to Healthcare

Announce Type: replace Abstract: Clinical decision-support systems are expert systems whose recommendations clinicians act on directly, yet they are usually cleared on one aggregate accuracy number from a held-out test set. That number says nothing about input reliability under encoding shifts, subgroup gaps, threshold sensitivity, or operational feasibility. We present RISED, a pre-deployment evaluation framework operationalising five dimensions (Reliability, Inclusivity, Sensitivity,...

arXiv CS 8d ago

Agent Planning Benchmark: A Diagnostic Framework for Planning Capabilities in LLM Agents

Announce Type: replace Abstract: Planning is central to LLM agents: before acting, an agent must decompose goals, select tools, reason over constraints, and decide when a task is infeasible. Yet existing agent evaluations often report only end-to-end success, making it difficult to determine whether failures stem from planning or execution. We introduce Agent Planning Benchmark (APB), a planning-specific diagnostic benchmark with 4,209 multimodal cases across 22 domains and five settings,...

arXiv CS 2d ago

Agent Planning Benchmark: A Diagnostic Framework for Planning Capabilities in LLM Agents

Announce Type: new Abstract: Planning is central to LLM agents: before acting, an agent must decompose goals, select tools, reason over constraints, and decide when a task is infeasible. Yet existing agent evaluations often report only end-to-end success, making it difficult to determine whether failures stem from planning or execution. We introduce Agent Planning Benchmark (APB), a planning-specific diagnostic benchmark with 4,209 multimodal cases across 22 domains and five...

arXiv CS 6d ago

LexRubric: A Rubric-Guided Diagnostic Benchmark for Open-Ended Legal Tasks

arXiv:2606.09389v1 Announce Type: new Abstract: As large language models (LLMs) are increasingly applied to real-world legal tasks, evaluating the reliability of their open-ended legal responses has become essential. These tasks require context-sensitive answers and allow little room for error, motivating fine-grained and diagnostic evaluation that can identify specific sources of response quality failures.

arXiv CS 1d ago

A unified multi-task framework enables interpretable chest radiograph analysis

arXiv:2606.03417v1 Announce Type: new Abstract: While multimodal deep learning has advanced medical imaging analysis, existing black-box systems may remain confined to isolated tasks, often overlooking the trust-sensitive nature of clinical diagnosis as a multi-task process. We propose IMT-CXR (Interpretable Multi-task Transformer for Chest X-ray Analysis), a framework that emulates radiologists' diagnostic workflow through three evidence-driven stages: 1) Disease...

arXiv CS 7d ago

Mind the Gap: Disentangling Performance Bottlenecks in Video Instance Segmentation

Announce Type: new Abstract: In Video Instance Segmentation (VIS), classification, segmentation, and tracking objectives are jointly evaluated, but their individual contributions to performance loss remain opaque. We introduce a diagnostic framework that formulates identity and class assignment as an Integer Linear Program (ILP), yielding a model-agnostic oracle that hierarchically isolates each error source. Applied to seven VIS methods spanning online and offline paradigms across...

arXiv CS 2d ago

Diagnosing LLM Arbitration Behavior over Pre-evidence Epistemic States in RAG-based Fact-Checking

Announce Type: replace Abstract: In RAG-based fact-checking, LLMs are increasingly used as verifiers to check given claims against retrieved evidence. Their parametric knowledge can induce pre-evidence tendencies that may conflict with the retrieved context, yet existing evaluation frameworks do not characterize such prior-context discrepancy or measure how verifiers arbitrate between parametric and contextual signals. We introduce PAVE (Prior-Aware Verifier Evaluation), a...

arXiv CS 2d ago

Diagnosing LLM Arbitration Behavior over Pre-evidence Epistemic States in RAG-based Fact-Checking

Announce Type: new Abstract: In RAG-based fact-checking, LLMs are increasingly used as verifiers to check given claims against retrieved evidence. Their parametric knowledge can induce pre-evidence tendencies that may conflict with the retrieved context, yet existing evaluation frameworks do not characterize such prior-context discrepancy or measure how verifiers arbitrate between parametric and contextual signals.

arXiv CS 8d ago