Corpus

Related Articles from SNS

BaltiVoice: A Speech Corpus and Fine-tuned Whisper ASR System for the Balti Language

arXiv:2606.03504v1. We present BaltiVoice, a 16.8-hour read-speech corpus for Balti (ISO 639-3: bft), a Tibetic language spoken in Gilgit-Baltistan, Pakistan, with no prior publicly available ASR resources. The corpus contains 10,060 validated utterances in native Nastaliq script, derived from Mozilla Common Voice recordings. We fine-tune OpenAI Whisper-small on this corpus and report a Word Error Rate (WER) of 30.07% on a held-out validation set of 538...

arXiv CS 7d ago

Scaling Conversational Hungarian ASR: The BEA-Dialogue+ Corpus

arXiv:2605.31469v1. Conversational automatic speech recognition in Hungarian is constrained by the limited amount of publicly available dialogue-style training data. The BEA-Dialogue corpus addresses this need, but its strictly speaker-disjoint train/dev/eval split reduces the usable material to only 85 hours. In this paper, we introduce BEA-Dialogue+, an expanded version of the corpus that relaxes the split criterion for experimenters and dialogue partners while...

arXiv CS 9d ago

A French Corpus Annotated for Multiword Expressions with Adverbial Function

arXiv:2606.04828v1. This paper presents a French corpus annotated for multiword expressions (MWEs) with adverbial function. The corpus is designed for research on information retrieval and extraction, as well as on deep and shallow syntactic parsing. We delimit which kinds of MWEs we annotated, describe the resources and methods we used for the annotation, and briefly comment on the results.

arXiv CS 6d ago

The role of falx cerebri in the selective vulnerability of splenium within the corpus callosum

The corpus callosum is the largest white matter structure connecting the two cerebral hemispheres and is anatomically divided into three major subregions along the anteroposterior axis: the genu, midbody, and splenium. The splenium is frequently affected in traumatic head impacts, yet the biomechanical basis for this selective vulnerability remains poorly understood. Clinical studies have long hypothesized that the falx cerebri contributes to the splenial susceptibility because of its close...

bioRxiv 8d ago

A Komi-Yazva–Russian Parallel Corpus and Evaluation Protocol for Zero- and Few-Shot LLM Translation

We present the first Komi-Yazva–Russian parallel corpus together with an explicit evaluation protocol for studying LLM translation in an endangered, extremely low-resource setting. The dataset contains 457 aligned sentence pairs from 74 narrative texts and is accompanied by documented provenance, sentence-level alignment, and story identifiers that enable leakage-aware evaluation. We use this setup to compare modern large language models on Komi-Yazva-to-Russian...

arXiv CS 5d ago

GenSpan: Generation-Calibrated Motion Span Priors for Multi-Verb Video Corpus Moment Retrieval

Video Corpus Moment Retrieval (VCMR) aims to retrieve both the correct video and its temporal segment corresponding to a natural-language query, a task that is especially challenging for multi-verb queries where temporal action ordering is critical. Existing approaches often rely solely on text or static images and struggle to capture implicit motion dynamics, leading to retrieval errors and temporal misalignment. We propose GenSpan, a generation-calibrated...

arXiv CS 6d ago

Germany: Bavarian lake procession marks centuries-old boat tradition for Corpus Christi

The event, held every year since 1935, marks the Catholic feast of Corpus Christi, observed ten days after Pentecost. Participants dressed in traditional Bavarian clothing join altar servers and church officials for the journey across Staffelsee to the island of Wörth. A small fleet of boats carries the procession, accompanied by bells, incense and hymns, as the Blessed Sacrament is brought to the island church for Mass.

Euronews 5d ago

HKJudge: A Legal Discourse-Annotated Corpus for Interpreting What Courts Find, How They Reason, and What They Rule

Court judgments are central to legal practice and jurisprudence, yet discourse analysis of Hong Kong judgments has received limited attention, owing largely to the absence of expert-annotated corpora. We introduce the Hong Kong Judgment Discourse Dataset (HKJudge), the first sentence-level expert-annotated legal discourse corpus. HKJudge includes criminal judgments across all five levels of HK's court hierarchy, comprising ~290k sentences and ~6.5...

arXiv CS 2d ago

MC-PDD: Masked Corpus-Level Pretraining Data Detection for Black-Box Large Language Models

arXiv:2606.07996v1. Pretraining is fundamental to the development of Large Language Models (LLMs), yet the opacity of pretraining data complicates model analysis and raises ethical, legal, and fairness concerns. Detecting whether specific datasets were used during pretraining is, therefore, critical. Existing state-of-the-art methods typically rely on access to model probability distributions, making them unsuitable for closed-source LLMs that provide only...

arXiv CS 1d ago

Insights Generator: Systematic Corpus-Level Trace Diagnostics for LLM Agents

arXiv:2605.21347v3. Diagnosing failures in LLM agents remains largely manual. Practitioners inspect a small subset of execution traces, form ad-hoc hypotheses, and iterate. This process misses patterns that only emerge across trace populations and does not scale to production corpora where individual traces span tens of thousands of tokens.

arXiv CS 2d ago