the Automatic Evaluation Framework
No mentions found
This entity hasn't been tracked yet, or Iris is still building its knowledge base.
Related Articles from SNS
Rethinking Meeting Effectiveness: A Benchmark and Framework for Temporal Fine-grained Automatic Meeting Effectiveness Evaluation
Announce Type: replace Abstract: Evaluating meeting effectiveness is crucial for improving organizational productivity. Current approaches rely on post-hoc surveys that yield a single coarse-grained score for an entire meeting. The reliance on manual assessment is inherently limited in scalability, cost, and reproducibility.
Knee-xRAI: An Explainable AI Framework for Automatic Kellgren-Lawrence Grading of Knee Osteoarthritis
arXiv:2604.23435v2 Announce Type: replace Abstract: Grading knee osteoarthritis (KOA) on plain radiographs is poorly reproducible across readers. A single-grade disagreement on the Kellgren-Lawrence (KL) scale can alter surgical management or redirect a patient from conservative therapy to intra-articular injection. Meanwhile, deep learning models that outperform human readers often offer no explanation for their decisions.
Meaning in Order, Order in Meaning: Semantic R-precision for Keyphrase Evaluation
Announce Type: new Abstract: Evaluating the quality of automatically generated keyphrases remains a complex challenge. Traditional metrics either rely on exact lexical matching or consider semantic similarity while ignoring prediction ranking, both of which misalign with how humans judge informativeness and relevance. We introduce Semantic R-Precision (SemR-p), a novel evaluation metric that integrates semantic similarity into the rank-aware R-Precision framework.
RadOT-Eval: Auditable Structured-Evidence Transport for Radiology Report Evaluation
arXiv:2606.08769v1 Announce Type: new Abstract: Automatic evaluation is critical for high-stakes text generation, where errors often involve omitted findings, hallucinated content, polarity reversals, location changes, uncertainty mismatches, and temporal-comparison errors rather than low surface similarity alone. Radiology report generation provides a challenging test case because generated reports must preserve structured clinical evidence across sources. We present RadOT-Eval, an...
Preference-Aware Rubric Learning for Personalized Evaluation
Announce Type: new Abstract: As Large Language Models (LLMs) evolve from general-purpose assistants to user-centric agents, personalization has become central to aligning model behavior with individual preferences, making the evaluation of personalized alignment a critical bottleneck. Existing evaluation methods-ranging from automatic metrics to LLM-as-a-judge approaches-fail to capture subjective, user-specific preferences embedded in long-term interaction histories.
A Novel Evaluation Metric for Unsupervised Learning in AIS-Based Maritime Anomaly Detection: MADQI
Announce Type: new Abstract: This paper introduces a new systematic framework for detecting anomalies in maritime Automatic Identification System (AIS) datasets. These anomalies include abnormal vessel behaviours related to speed, position jumps, time gaps, and turn angles. Although unsupervised learning algorithms such as Isolation Forest are widely used for detecting anomalous vessel movements, they often lack systematic and meaningful evaluation measures.
Seeing Through the MiRAGE: Evaluating Multimodal Retrieval Augmented Generation
arXiv:2510.24870v2 Announce Type: replace Abstract: We introduce MiRAGE, an evaluation framework for retrieval-augmented generation (RAG) from multimodal sources. As audiovisual media becomes a prevalent source of information online, it is essential for RAG systems to integrate information from these sources into generation. However, existing evaluations for RAG are text-centric, limiting their applicability to multimodal settings.
Ouvia: A User-centered Framework for Measuring Usability of Speech Translation in Real-World Communication Scenarios
arXiv:2606.06177v1 Announce Type: new Abstract: Speech translation (ST) is increasingly adopted in user applications, yet its evaluation largely focuses on decontextualized testbeds and holistic quality, rather than end users' communication needs. We introduce Ouvia, an evaluation framework for measuring user-perceived usability of speech translation outputs in real-world settings. Ouvia focuses on one-to-one communication: an English speaker needs to convey a request to a Portuguese...
MASF: A Multi-Model Adaptive Selection Framework for Abstractive Text summarization
Announce Type: new Abstract: Automatic text summarization has become increasingly important due to the rapid growth of digital textual information. This paper presents a Multi-Model Adaptive Summarization Framework designed to improve the robustness and quality of abstractive text summarization. Relying on a single model often leads to inconsistent summarization quality across articles with varying structures and topics.
Eliciting Complex Spatial Reasoning in MLLMs through Wide-Baseline Matching
arXiv:2606.03577v1 Announce Type: new Abstract: Wide-baseline matching (WBM) requires integrating geometric understanding, viewpoint changes, fine-grained perception, and occlusion reasoning, making it a challenging testbed for spatial reasoning in multimodal large language models (MLLMs) deployed in physical environments. However, current MLLMs lack systematic evaluation and training frameworks for these capabilities. We introduce ReasonMatch-Bench, a benchmark stratified by viewpoint...