Home Knowledge Base Low-Resource Languages

Low-Resource Languages

No mentions found

This entity hasn't been tracked yet, or Iris is still building its knowledge base.

Related Articles from SNS

Overcoming Decoder Inconsistencies in Whisper for Dravidian and Low-Resource Languages

Announce Type: new Abstract: Multilingual ASR models such as Whisper perform well on high-resource languages but exhibit substantially higher Word Error Rates (WER) for Dravidian languages compared to Indo-Aryan ones. Through linguistic and dataset analysis, we show that Dravidian languages have longer words, higher vocabulary diversity, and lower repetition, resulting in sparse token distributions and frequent character-level substitution errors. Baseline fine-tuning further reveals decoder...

arXiv CS 1d ago

Low-Resource Safety Failures Are Action Failures, Not Representation Failures

arXiv:2606.01196v1 Announce Type: new Abstract: Safety alignment learned in high-resource languages transfers poorly to low-resource languages. Models refuse harmful prompts in English but fail to refuse when the same prompts are translated into Swahili or Burmese. Adaptive steering methods like AdaSteer and CAST inherit this failure cross-lingually.

arXiv CS 8d ago

Reasoning over Grammar: Can Synthetic Linguistic Reasoning Traces Enhance Low-Resource Machine Translation?

arXiv:2606.03782v1 Announce Type: new Abstract: Large language models (LLMs) offer a promising approach to machine translation (MT) for extremely low-resource languages by incorporating linguistic resources through in-context learning. However, LLMs often struggle to apply grammatical information effectively during translation. Inspired by recent progress in chain-of-thought reasoning, we investigate whether low-resource MT can benefit from structured intermediate steps of linguistic...

arXiv CS 7d ago

$M^3$ Scaling Law: Optimizing Multi-Epoch, Multi-Lingual, and Multi-Stage Training for Low-Resource Language Models

arXiv:2410.12325v2 Announce Type: replace Abstract: In this paper, we study a fundamental design problem in pretraining Large Language Models (LLMs) for low-resource language regimes. Existing works adopt multi-epoch, multi-lingual, and multi-stage training to utilize the limited target-language corpus efficiently, but no prior scaling law can compare recipes spanning these approaches under the same compute budget $C$ and target-language corpus size $D_T$, leaving the optimal training setup...

arXiv CS 8d ago

Effective vocabulary expansion of multilingual language models for extremely low-resource languages

arXiv:2602.09388v2 Announce Type: replace Abstract: Multilingual pre-trained language models(mPLMs) offer significant benefits for many low-resource languages. To further expand the range of languages these models can support, many works focus on continued pre-training of these models. However, few works address how to extend mPLMs to low-resource languages that were previously unsupported.

arXiv CS 6d ago

Adapting Large Language Models to a Low-Resource Agglutinative Language: A Comparative Study of LoRA and QLoRA for Bashkir

Announce Type: replace Abstract: This paper presents a comparative study of parameter-efficient fine-tuning (PEFT) methods, including LoRA and QLoRA, applied to the task of adapting large language models to the Bashkir language, a low-resource agglutinative language of the Turkic family. Experimental evaluation is conducted on a Bashkir text corpus of 71k documents (46.9M tokens) using models of various architectures: DistilGPT2, GPT-2 (base, medium), Phi-2, Qwen2.5-7B, DeepSeek-7B, and...

arXiv CS 8d ago

Mining Useful General Data for Low-Resource Domain Adaptation

arXiv:2511.07380v2 Announce Type: replace Abstract: Adapting large language models (LLMs) to low-resource domains remains challenging due to the scarcity of domain-specific data. While in-domain data is limited, there exists a vast amount of general-domain data that shares similar question-answer formats and reasoning patterns with domain tasks. This observation raises an important question: can useful general-domain data be mined to improve low-resource domain adaptation?

arXiv CS 2d ago

OpenBibleTTS: Large-Scale Speech Resources and TTS Models for Low-Resource Languages

arXiv:2606.09553v1 Announce Type: new Abstract: Recent advances in neural text-to-speech (TTS) and multilingual speech generation have substantially improved synthetic speech quality, yet these gains remain unevenly distributed across the world's languages. Existing models are still dominated by a small set of high-resource languages, while many studies of low-resource TTS are simulated on artificially downsampled high-resource corpora that do not reflect the orthographic variation and...

arXiv CS 1d ago

Multilingual Idioms in Sentences and Conversations Across High-, Medium-, and Low-Resource Languages

arXiv:2606.02147v1 Announce Type: new Abstract: Idiomatic expressions pose a major challenge for multilingual NLP because their meanings shift between figurative and literal usage, often requiring context for accurate interpretation. Prior work has focused on high-resource languages typically evaluates isolated idiom-meaning questions, overlooking realistic discourse.

arXiv CS 8d ago

Data Synthesis and Parameter-Efficient Fine-Tuning for Low-Resource NMT: A Case Study on Q'eqchi' Mayan

arXiv:2606.09767v1 Announce Type: new Abstract: Neural machine translation for digitally low-resource Indigenous languages is often hindered by extreme data scarcity, prompting reliance on extractive web-scraping. To ensure data sovereignty, this study introduces a data synthesis methodology to bootstrap NMT models without scraping target-language parallel text. Focusing on Q'eqchi' Mayan, we transformed community-sourced dictionaries into a massive synthetic corpus, utilizing...

arXiv CS 1d ago