Lower-Resource Languages


Related Articles from SNS

Multilingual and Cross-Lingual Citation Needed Detection on Wikipedia for Lower-Resource Languages

arXiv:2605.31136v1 Announce Type: new Abstract: In automated fact-checking (AFC), check-worthiness detection identifies claims requiring verification based on domain-specific criteria. On Wikipedia, this task instantiates as Citation Needed Detection (CND), which flags claims lacking supporting citations. However, existing research has largely overlooked lower-resource languages, and recent AFC pipelines rely on large language models (LLMs), which are inaccessible to low-resource organizations.

arXiv CS 9d ago

Efficient ASR Training with Conversations that Never Happened

arXiv:2606.03957v1 Announce Type: new Abstract: Conversational ASR for lower-resource languages and niche domains is limited by the scarcity of domain-matched multi-speaker training data. We propose an augmentation pipeline that generates scenario-level dialogues with participant metadata, maps speaker attributes to TTS voice profiles, and assembles synthesized utterances into speaker-aware simulated conversations. We evaluated five LLM families under single-generator, fixed-budget mixture,...

arXiv CS 7d ago

TukaBench: A Culturally Grounded Jailbreak Benchmark for African Languages

arXiv:2606.01322v1 Announce Type: new Abstract: Safety evaluation of Large Language Models (LLMs) remains heavily English-centric, leaving Low-Resource Languages (LRLs), particularly African ones, critically underexplored. We introduce TUKABENCH, a jailbreak benchmark for seven African languages that extends JailbreakBench (JBB) beyond direct translation through four settings: human translation of JBB prompts, English adaptation to African contexts followed by human translation,...

arXiv CS 8d ago

CHALIS: A Challenge Dataset for Language Identification in Difficult Scenarios

Announce Type: new Abstract: We present CHALIS (Challenging Language Identification Samples), a new benchmark dataset explicitly designed to address difficult cases in language identification: cousin languages and orthographic noise. Our dataset has two parts. First, we collected sentences shared across mutually intelligible language pairs (Czech/Slovak, Spanish/Catalan, Portuguese/Galician, Danish/Norwegian). The second part tests for orthographic noise: we transliterate text across multiple...

arXiv CS 5d ago

Exploring Adversarial Robustness and Safety Alignment in Multilingual Multi-Modal Large Language Models

arXiv:2606.03793v1 Announce Type: new Abstract: Multimodal Large Language Models integrate visual perception into language reasoning, introducing a continuous attack surface susceptible to adversarial attacks. Prior work on MLLM robustness has focused largely on English-centric tasks, leaving multilingual behaviour unexplored. We address this gap through a systematic study of adversarial robustness and multimodal safety across 12 diverse languages, evaluating open-source MLLMs that acquire...

arXiv CS 7d ago

Global PIQA: Evaluating Commonsense Reasoning Across 100+ Languages and Cultures

arXiv:2510.24081v2 Announce Type: replace Abstract: To date, there exist almost no culturally-specific evaluation benchmarks for large language models (LLMs) that cover a large number of languages and cultures. In this paper, we present Global PIQA, a participatory commonsense reasoning benchmark for over 100 languages, constructed by hand by over 350 researchers from over 65 countries around the world. The 141 language varieties in Global PIQA cover five continents, 19 language families,...

arXiv CS 8d ago

Linguistic Nepotism: Trading-off Quality for Language Preference in Multilingual RAG

arXiv:2509.13930v3 Announce Type: replace Abstract: Multilingual Retrieval-Augmented Generation (mRAG) systems enable language models to answer knowledge-intensive queries with citation-supported responses across languages. Despite their growing use, an open question is whether the mixture of different document languages impacts generation and citation behavior in unintended ways. To investigate this, we introduce a controlled methodology using model internals to measure language preference...

arXiv CS 1d ago

Helpful to a Fault: Measuring Illicit Assistance in Multi-Turn, Multilingual LLM Agents

arXiv:2602.16346v4 Announce Type: replace Abstract: LLM-based agents execute real-world workflows via tools and memory. These affordances enable ill-intended adversaries to also use these agents to carry out complex misuse scenarios. Existing agent misuse benchmarks largely test single-prompt instructions, leaving a gap in measuring how agents end up helping with harmful or illegal tasks over multiple turns.

arXiv CS 1d ago