Home Knowledge Base Fleiss

Fleiss

No mentions found

This entity hasn't been tracked yet, or Iris is still building its knowledge base.

Related Articles from SNS

Agreement Metrics for LLM-as-Judge Evaluation: What to Report and Why

arXiv:2606.00093v1 Announce Type: cross Abstract: Validating an LLM judge against human annotations usually means reporting several agreement statistics: accuracy, precision, recall, $F_1$, Cohen's $\kappa$, and one or more rank correlations. A survey of 24 recent LLM-as-judge papers finds metric choice entangled with the judgment scale, tie handling, invalid outputs, and abstention handling, and those choices rarely stated. For binary criteria -- the common case in rubric-based evaluation,...

arXiv Physics 8d ago

The Granularity Gap: A Multi-Dimensional Longitudinal Audit of Sycophancy in Gemini Models

arXiv:2606.05183v1 Announce Type: new Abstract: Large language models are increasingly deployed as high-stakes advisors, yet standard alignment benchmarks treat sycophancy as a binary failure mode. We introduce the Granularity Gap: coarse binary metrics mask substantial social-compliance behaviors where models capitulate to user framing, validate questionable premises, or soften factual corrections without producing overtly false outputs. We evaluate six Gemini variants across generations...

arXiv CS 5d ago

CATEKAPPA: An R Shiny Application for Design and Analysis of Consistency Tests Based on the Kappa Statistic for Categorical Responses

arXiv:2606.07062v1 Announce Type: cross Abstract: The kappa statistic is the most widely used measure of inter-rater agreement for categorical data. Despite its popularity, applied researchers often encounter two major hurdles: (i) determining the sample size required to achieve a desired level of agreement with given power, and (ii) computing appropriate kappa coefficients with proper interpretation.

arXiv CS 2d ago

Oversight Has a Capacity: Calibrating Agent Guards to a Subjective, Fatiguing Human

Announce Type: new Abstract: As LLM agents begin to take real, irreversible actions (shell commands, file edits, deploys), the standard safety pattern is a human-in-the-loop approval gate: risky actions pause and wait for a person. We argue the gate is the easy part; the hard part is the judgment - which actions to stop - which the field evaluates against two false assumptions: that there is a ground-truth notion of "risky," and that the human reviewer is a perfect, infinitely-available...

arXiv CS 1d ago