Cohen kappa

Related Articles from SNS

Agreement Metrics for LLM-as-Judge Evaluation: What to Report and Why

arXiv:2606.00093v1. Validating an LLM judge against human annotations usually means reporting several agreement statistics: accuracy, precision, recall, $F_1$, Cohen's $\kappa$, and one or more rank correlations. A survey of 24 recent LLM-as-judge papers finds metric choice entangled with the judgment scale, tie handling, invalid outputs, and abstention handling, and those choices rarely stated. For binary criteria -- the common case in rubric-based evaluation,...

arXiv Physics 8d ago

The Granularity Gap: A Multi-Dimensional Longitudinal Audit of Sycophancy in Gemini Models

arXiv:2606.05183v1. Large language models are increasingly deployed as high-stakes advisors, yet standard alignment benchmarks treat sycophancy as a binary failure mode. We introduce the Granularity Gap: coarse binary metrics mask substantial social-compliance behaviors where models capitulate to user framing, validate questionable premises, or soften factual corrections without producing overtly false outputs. We evaluate six Gemini variants across generations...

arXiv CS 5d ago

Early Detection of Alzheimer's Disease Using Explainable Machine Learning on Clinical Biomarkers: A Multi-Class Classification Study Using the Alzheimer's Disease Neuroimaging Initiative (ADNI) Dataset

Background: Alzheimer's disease (AD) affects over 55 million people worldwide. Accurate, interpretable detection of normal cognition (NC), mild cognitive impairment (MCI), and AD from routine clinical assessments remains a critical unmet need. Methods: An XGBoost classifier was developed for three-class detection using eight clinical features from the Alzheimer's Disease Neuroimaging Initiative (ADNI): MMSE, CDR Global, CDR Sum of Boxes (CDR-SB), MoCA, FAQ, age,...

arXiv CS 6d ago

An NLP-Driven Framework for Curriculum-Labor Market Alignment: Schema-Constrained LLM Extraction, ESCO-Anchored Semantic Matching, and Multi-Dimensional Gap Quantification

Schema-constrained information extraction from diverse educational and labor-market corpora remains an open challenge in natural language processing because existing pipelines rely primarily on lexical-surface methods that cannot recover implicit competencies, lack grounding in shared taxonomies, and provide no formal measures of extraction reliability or document-level completeness. To address these limitations, this paper proposes a four-stage NLP framework...

arXiv CS 8d ago

CATEKAPPA: An R Shiny Application for Design and Analysis of Consistency Tests Based on the Kappa Statistic for Categorical Responses

arXiv:2606.07062v1. The kappa statistic is the most widely used measure of inter-rater agreement for categorical data. Despite its popularity, applied researchers often encounter two major hurdles: (i) determining the sample size required to achieve a desired level of agreement with given power, and (ii) computing appropriate kappa coefficients with proper interpretation.

arXiv CS 2d ago
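
For readers unfamiliar with the statistic these articles report, two-rater Cohen's kappa is commonly defined as (p_o - p_e) / (1 - p_e), where p_o is the observed fraction of items on which the raters agree and p_e is the agreement expected by chance from each rater's label frequencies. The sketch below is a generic illustration of that definition only; it is not taken from CATEKAPPA or any article listed here, and the function name `cohen_kappa` and the sample labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty set of items")

    n = len(rater_a)

    # Observed agreement: fraction of items given the same label by both raters.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: for each category, multiply the two raters' marginal
    # proportions, then sum over categories.
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)

    if p_e == 1.0:
        # Both raters used a single identical label; kappa is undefined here.
        raise ValueError("kappa is undefined when expected agreement equals 1")

    return (p_o - p_e) / (1 - p_e)


# Hypothetical labels: a human annotator and an LLM judge on four items.
human = ["yes", "yes", "no", "no"]
judge = ["yes", "no", "no", "no"]
kappa = cohen_kappa(human, judge)
```

With these sample labels, observed agreement is 3/4 and chance agreement is 1/2, so kappa works out to 0.5. Weighted variants and multi-rater generalisations are outside the scope of this sketch.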

TensorBench: Benchmarking Coding Agents on a Compiler-Based Tensor Framework

arXiv:2606.05570v1. Repository-level coding benchmarks face a trade-off between task difficulty and evaluation reliability: tasks that challenge frontier models often involve large codebases with incomplete test coverage, while human review does not scale. We introduce TensorBench, a benchmark of 199 feature-addition and refactoring tasks on an open-source compiler-based tensor framework that extends PyTorch with first-class support for dense and sparse tensors....

arXiv CS 5d ago

NextMotionQA: Benchmarking and Judging Human Motion Understanding with Vision-Language Models

arXiv:2606.04773v1. Reliable evaluation of human motion understanding is fundamental to advancing embodied AI, robotics, and animation. However, existing benchmarks suffer from coarse semantic granularity, undifferentiated difficulty, limited annotation quality, and pervasive answer ambiguity, leaving them unable to diagnose where current models fail. To bridge this gap, we introduce NextMotionQA, a comprehensive benchmark that leverages vision-language models...

arXiv CS 6d ago

BADGER: Bridging Agentic and Deterministic Evaluation for Generative Enterprise Reasoning

arXiv:2606.02109v1. Enterprise AI systems that translate natural language into SQL queries and orchestrate multi-step agentic reasoning pipelines require evaluation approaches fundamentally different from academic benchmarks. Spider and BIRD established execution-accuracy protocols; G-Eval and RAGAS advanced LLM-based assessment; and recent work such as Spider 2.0, BEAVER, and BIRD-Interact has begun to address enterprise and agentic dimensions. No single...

arXiv CS 8d ago

The Saturation Trap and the Subjectivity of Intervention Timing: Why Affect-Based Triggers and LLM Judges Fail to Time Interventions on Autonomous Agents

arXiv:2606.04296v1. As autonomous AI agents move from conversational systems to long-horizon software execution, runtime safety layers that decide when to interrupt an agent have become essential. We study this timing problem using a continuous 18-dimensional affective-dynamics engine (HEART) as a diagnostic probe, evaluating four intervention trigger families - absolute state thresholds, composite state-action patterns, regex reasoning-feature extraction, and...

arXiv CS 6d ago

Multi-feature Classification to Improve Colorimetric Loop-Mediated Isothermal Amplification Fidelity

Loop-mediated isothermal amplification (LAMP) is a cost-effective and portable assay technique for performing nucleic acid-based diagnostics in the field whose adoption is hindered by design and reproducibility issues. This is due to a complex primer design process that fine-tunes parameters across 6-8 binding regions. The likelihood of assay success depends on satisfying thermodynamic and secondary structure constraints while maintaining target specificity and avoiding overlaps between...

bioRxiv 2d ago