Corpus
Related Articles from SNS
BaltiVoice: A Speech Corpus and Fine-tuned Whisper ASR System for the Balti Language
arXiv:2606.03504v1. We present BaltiVoice, a 16.8-hour read-speech corpus for Balti (ISO 639-3: bft), a Tibetic language spoken in Gilgit-Baltistan, Pakistan, with no prior publicly available ASR resources. The corpus contains 10,060 validated utterances in native Nastaliq script, derived from Mozilla Common Voice recordings. We fine-tune OpenAI Whisper-small on this corpus and report a Word Error Rate (WER) of 30.07% on a held-out validation set of 538...
Scaling Conversational Hungarian ASR: The BEA-Dialogue+ Corpus
arXiv:2605.31469v1. Conversational automatic speech recognition in Hungarian is constrained by the limited amount of publicly available dialogue-style training data. The BEA-Dialogue corpus addresses this need, but its strictly speaker-disjoint train/dev/eval split reduces the usable material to only 85 hours. In this paper, we introduce BEA-Dialogue+, an expanded version of the corpus that relaxes the split criterion for experimenters and dialogue partners while...
A French Corpus Annotated for Multiword Expressions with Adverbial Function
arXiv:2606.04828v1. This paper presents a French corpus annotated for multiword expressions (MWEs) with adverbial function. The corpus is designed for research on information retrieval and extraction, as well as on deep and shallow syntactic parsing. We delimit the kinds of MWEs annotated, describe the resources and methods used for the annotation, and briefly comment on the results.
The role of falx cerebri in the selective vulnerability of splenium within the corpus callosum
The corpus callosum is the largest white matter structure connecting the two cerebral hemispheres and is anatomically divided into three major subregions along the anteroposterior axis: the genu, midbody, and splenium. The splenium is frequently affected in traumatic head impacts, yet the biomechanical basis for this selective vulnerability remains poorly understood. Clinical studies have long hypothesized that the falx cerebri contributes to the splenial susceptibility because of its close...
A Komi-Yazva--Russian Parallel Corpus and Evaluation Protocol for Zero- and Few-Shot LLM Translation
We present the first Komi-Yazva--Russian parallel corpus together with an explicit evaluation protocol for studying LLM translation in an endangered, extremely low-resource setting. The dataset contains 457 aligned sentence pairs from 74 narrative texts and is accompanied by documented provenance, sentence-level alignment, and story identifiers that enable leakage-aware evaluation. We use this setup to compare modern large language models on Komi-Yazva-to-Russian...
GenSpan: Generation-Calibrated Motion Span Priors for Multi-Verb Video Corpus Moment Retrieval
Video Corpus Moment Retrieval (VCMR) aims to retrieve both the correct video and its temporal segment corresponding to a natural-language query, a task that is especially challenging for multi-verb queries where temporal action ordering is critical. Existing approaches often rely solely on text or static images and struggle to capture implicit motion dynamics, leading to retrieval errors and temporal misalignment. We propose GenSpan, a generation-calibrated...
Germany: Bavarian lake procession marks centuries-old boat tradition for Corpus Christi
The event, held every year since 1935, marks the Catholic feast of Corpus Christi, observed ten days after Pentecost. Participants dressed in traditional Bavarian clothing joined altar servers and church officials for the journey across Staffelsee to the island of Wörth. A small fleet of boats carried the procession, accompanied by bells, incense and hymns, as the Blessed Sacrament was brought to the island church for Mass.
HKJudge: A Legal Discourse-Annotated Corpus for Interpreting What Courts Find, How They Reason, and What They Rule
Court judgments are central to legal practice and jurisprudence, yet discourse analysis of Hong Kong judgments has received limited attention, owing largely to the absence of expert-annotated corpora. We introduce the Hong Kong Judgment Discourse Dataset (HKJudge), the first sentence-level expert-annotated legal discourse corpus. HKJudge includes criminal judgments across all five levels of HK's court hierarchy, comprising ~290k sentences and ~6.5...
MC-PDD: Masked Corpus-Level Pretraining Data Detection for Black-Box Large Language Models
arXiv:2606.07996v1. Pretraining is fundamental to the development of Large Language Models (LLMs), yet the opacity of pretraining data complicates model analysis and raises ethical, legal, and fairness concerns. Detecting whether specific datasets were used during pretraining is, therefore, critical. Existing state-of-the-art methods typically rely on access to model probability distributions, making them unsuitable for closed-source LLMs that provide only...
Insights Generator: Systematic Corpus-Level Trace Diagnostics for LLM Agents
arXiv:2605.21347v3. Diagnosing failures in LLM agents remains largely manual. Practitioners inspect a small subset of execution traces, form ad-hoc hypotheses, and iterate. This process misses patterns that only emerge across trace populations and does not scale to production corpora where individual traces span tens of thousands of tokens.