medical QA

Related Articles from SNS

Can I Take Another Dose? Evaluating LLM Decision-Making Under Temporal Uncertainty in OTC Dosing QA

arXiv:2606.04262v1. Large language models (LLMs) are increasingly used for everyday health questions, including whether a user can safely take another dose of an over-the-counter (OTC) medication. Yet this common safety-relevant setting remains underexplored in existing medical QA evaluations, where correct answers require tracking dose timing, computing rolling 24-hour intake, following product-label constraints, and handling incomplete medication histories.

arXiv CS 6d ago

When Retrieval Doesn't Help: A Large-Scale Study of Biomedical RAG

arXiv:2606.04127v1. Medical question answering is a high-stakes setting where factual errors can have serious consequences. Retrieval-augmented generation (RAG) is widely viewed as a promising solution, and prior work has reported substantial gains for large medical QA models.

arXiv CS 6d ago

Condition-Gated Reasoning for Context-Dependent Biomedical Question Answering

arXiv:2602.17911v3. Current biomedical question answering (QA) systems often assume that medical knowledge applies uniformly, yet real-world clinical reasoning is inherently conditional: nearly every decision depends on patient-specific factors such as comorbidities and contraindications. Existing benchmarks do not evaluate such conditional reasoning, and retrieval-augmented or graph-based methods lack explicit mechanisms to ensure that retrieved knowledge is...

arXiv CS 1d ago

Fully Open Meditron: An Auditable Pipeline for Clinical LLMs

Clinical decision support systems (CDSS) require scrutable, auditable pipelines that enable rigorous, reproducible validation. Yet current LLM-based CDSS remain largely opaque. Most "open" models are open-weight only, releasing parameters while withholding the data provenance, curation procedures, and generation pipelines that determine model behavior.

arXiv CS 9d ago

The Consistency Illusion: How Multi-Agent Debate Hides Reasoning Misalignment

arXiv:2606.08457v1. Multi-agent LLM systems for medical question answering often treat consensus as a reliability signal: if multiple agents agree on an answer, it is presumed trustworthy. However, answer-level consensus does not entail reasoning-level alignment. We introduce CARA (Cross-Agent Reasoning Alignment), a family of automated metrics that measure whether agents who agree on an answer also agree on the reasoning.

arXiv CS 1d ago

UR$^2$: Unify RAG and Reasoning through Reinforcement Learning

arXiv:2508.06165v5. Large Language Models (LLMs) have shown strong capabilities through two complementary paradigms: Retrieval-Augmented Generation (RAG) for knowledge grounding and Reinforcement Learning from Verifiable Rewards (RLVR) for complex reasoning. However, existing attempts to unify these paradigms remain narrow in scope, typically limited to open-domain QA with fixed retrieval settings, which constrains generalization to broader domains. To address...

arXiv CS 7d ago

Truth, Trust, and Trouble: Medical AI on the Edge

Large Language Models (LLMs) hold significant promise for transforming digital health by enabling automated medical question answering. However, ensuring these models meet critical industry standards for factual accuracy, usefulness, and safety remains a challenge, especially for open-source solutions. We present a rigorous benchmarking framework using a dataset of over 1,000 health questions.

arXiv CS 8d ago

RealDocBench: A Benchmark for Field-Level QA and Layout Understanding on Real-World Regulated Documents

arXiv:2606.07401v1. Document parsing systems are increasingly deployed in high-stakes, regulated workflows such as mortgage underwriting, financial reporting, supply-chain logistics, and clinical records. Yet most public benchmarks evaluate parsers on clean academic layouts or synthetic prose, and report a single OCR or markdown-level similarity score. Such documents and metrics correlate poorly with what downstream agents actually need: the correct value for a...

arXiv CS 2d ago

EHRBench: An Automated and Reliable EHR-based Benchmark for Clinical Decision Making with LLMs

arXiv:2605.30637v1. Clinical decision-making (CDM) is central to real-world clinical workflows, where clinicians infer diagnoses, select treatments, or anticipate future health outcomes under incomplete evidence. LLMs are increasingly used to support these decisions due to strong language capabilities, broad biomedical knowledge, and efficiency, yet the reliability of LLMs on real-world clinical decision tasks remains insufficiently understood. To evaluate CDM...

arXiv CS 9d ago