Evaluation Resources
No mentions found
This entity hasn't been tracked yet, or Iris is still building its knowledge base.
Related Articles from SNS
Multilingual Training and Evaluation Resources for Vision-Language Models
arXiv:2604.18347v2 Announce Type: replace Abstract: Vision Language Models (VLMs) achieved rapid progress in the recent years. However, despite their growth, VLMs development is heavily grounded on English, leading to two main limitations: (i) the lack of multilingual and multimodal datasets for training, and (ii) the scarcity of comprehensive evaluation benchmarks across languages. In this work, we address these gaps by introducing a new comprehensive suite of resources for VLMs training...
Toward Third-Party Assurance of AI Systems: Design Requirements, Prototype, and Early Testing
arXiv:2601.22424v2 Announce Type: replace Abstract: As Artificial Intelligence (AI) systems proliferate, the need for systematic, transparent, and actionable processes for evaluating them is growing. While many resources exist to support AI evaluation, they have several limitations. Few address both the process of designing, developing, and deploying an AI system and the outcomes it produces.
Multilingual Idioms in Sentences and Conversations Across High-, Medium-, and Low-Resource Languages
arXiv:2606.02147v1 Announce Type: new Abstract: Idiomatic expressions pose a major challenge for multilingual NLP because their meanings shift between figurative and literal usage, often requiring context for accurate interpretation. Prior work has focused on high-resource languages typically evaluates isolated idiom-meaning questions, overlooking realistic discourse.
ProEval: Proactive Failure Discovery and Efficient Performance Estimation for Generative AI Evaluation
Announce Type: replace Abstract: Evaluating generative AI models is increasingly resource-intensive due to slow inference, expensive raters, and a rapidly growing landscape of models and benchmarks. We propose ProEval, a proactive evaluation framework that leverages transfer learning to efficiently estimate performance and identify failure cases.
Optimal transmission expansion modestly reduces decarbonization costs of U.S. electricity
arXiv:2402.14189v4 Announce Type: replace-cross Abstract: Major government studies and policy reports project that substantial expansion of interregional transmission will be needed to integrate clean energy and ensure reliability in decarbonized power systems. Using the open-source Switch capacity expansion model with detailed representation of existing U.S. generation and transmission infrastructure, solar, wind, and storage resources, and hourly operations, we evaluate the role of...
Revisiting Lexicon Evaluation in Unsupervised Word Discovery
arXiv:2606.06183v1 Announce Type: cross Abstract: Building a lexicon from discovered word-like units is a central goal in zero-resource speech processing. But do our evaluations provide a trustworthy indication of lexicon quality? A common metric, normalized edit distance, averages the phoneme edit distances between discovered units in each cluster.
A Reproducible Universal Dependencies-Style Pipeline for Katharevousa Greek Parliamentary Text
arXiv:2605.22978v2 Announce Type: replace Abstract: Katharevousa Greek remains poorly served by contemporary NLP pipelines despite its importance for legal, administrative, and parliamentary archives. We present a reproducible workflow for building and evaluating a Universal Dependencies-style parsing resource for Katharevousa parliamentary questions from Greece's early post-junta period. The pipeline links OCR-aware reconstruction, schema-constrained LLM-assisted annotation, automatic...
Automated IEP Generation from Traditional Chinese Parent-Teacher Interviews via Corpus-Grounded Feature Diffusion
arXiv:2606.09603v1 Announce Type: new Abstract: Writing Individualized Education Programs (IEPs) is a high-labor, knowledge-intensive document burden; English-language research has demonstrated that generative AI can significantly reduce drafting time, yet automated IEP generation in Traditional Chinese remains virtually unexplored due to domain data scarcity, strict privacy regulations, and the absence of local evaluation benchmarks. We propose a low-resource fine-tuning pipeline centered...
Rethinking Infrastructure Inspection as Image Difference Classification: A Traffic Sign Case Study
Announce Type: new Abstract: Digital twins (DTs) allow the digitalization of road infrastructure inspection, though this is hindered by limited annotated data. This work exploits the relational nature of continuous asset condition monitoring to reformulate image-based defect detection as image difference classification (IDC) to reduce data reliance. This was evaluated in a case study on low-resource traffic sign inspection with different IDC classifiers using a newly-curated, high quality...
Adapting Large Language Models to a Low-Resource Agglutinative Language: A Comparative Study of LoRA and QLoRA for Bashkir
Announce Type: replace Abstract: This paper presents a comparative study of parameter-efficient fine-tuning (PEFT) methods, including LoRA and QLoRA, applied to the task of adapting large language models to the Bashkir language, a low-resource agglutinative language of the Turkic family. Experimental evaluation is conducted on a Bashkir text corpus of 71k documents (46.9M tokens) using models of various architectures: DistilGPT2, GPT-2 (base, medium), Phi-2, Qwen2.5-7B, DeepSeek-7B, and...