Data Selection Method

Related Articles from SNS

PRISM: Self-Pruning Intrinsic Selection Method for Training-Free Multimodal Data Selection

arXiv:2502.12119v4. Visual instruction tuning adapts pre-trained Multimodal Large Language Models (MLLMs) to follow human instructions for real-world applications. However, the rapid growth of these datasets introduces significant redundancy, leading to increased computational costs. Existing methods for selecting instruction data aim to prune this redundancy, but predominantly rely on computationally demanding techniques such as proxy-based inference or...

arXiv CS 9d ago

BLISS: A Lightweight Bilevel Influence Scoring Method for Data Selection in Language Model Pretraining

Effective data selection is essential for pretraining large language models (LLMs), enhancing efficiency and improving generalization to downstream tasks. However, existing approaches often require leveraging external pretrained models, making it difficult to disentangle the effects of data selection from those of the external pretrained models. In addition, they often overlook the long-term impact of selected data if the model is trained to convergence,...

arXiv CS 8d ago

PRISM: Preference-Aware Influence Function Based Data Selection Method for Efficient Fine-Tuning

arXiv:2605.21422v3. As LLMs continue to scale up, improving training efficiency heavily relies on effective data utilization. Data selection mitigates this issue by allocating the limited training budget to high-value examples that optimally facilitate the model's target behavior. Most existing approaches define target behavior via a set of target examples and score candidate training data based on their estimated influence on these samples.

arXiv CS 8d ago
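
The abstract above mentions scoring candidate training data by estimated influence on a set of target examples. As a rough illustration only, and not the method of the paper listed here, the sketch below uses a common first-order approximation: a candidate scores highly when its loss gradient aligns with the gradients of the target examples. The linear model, the squared loss, and all function names (`grad_squared_loss`, `influence_scores`, `select_top_k`) are assumptions made for this example.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]
Example = Tuple[Vector, float]  # (features, label)


def grad_squared_loss(w: Vector, x: Vector, y: float) -> List[float]:
    """Gradient of 0.5 * (w.x - y)^2 with respect to w (assumed toy loss)."""
    residual = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [residual * xi for xi in x]


def dot(a: Vector, b: Vector) -> float:
    return sum(ai * bi for ai, bi in zip(a, b))


def influence_scores(
    w: Vector, candidates: Sequence[Example], targets: Sequence[Example]
) -> List[float]:
    """First-order influence proxy: mean gradient dot product with the targets.

    A positive score means a small gradient step on the candidate would,
    to first order, reduce the average loss on the target examples.
    """
    target_grads = [grad_squared_loss(w, x, y) for x, y in targets]
    scores = []
    for x, y in candidates:
        g = grad_squared_loss(w, x, y)
        scores.append(sum(dot(g, tg) for tg in target_grads) / len(target_grads))
    return scores


def select_top_k(scores: Sequence[float], k: int) -> List[int]:
    """Indices of the k highest-scoring candidates, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


# Hypothetical toy data: three candidates, one target example.
w = [0.0, 0.0]
candidates = [([1.0, 0.0], 1.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], -1.0)]
targets = [([1.0, 0.0], 1.0)]

scores = influence_scores(w, candidates, targets)
selected = select_top_k(scores, k=2)
```

In this toy setup the first candidate matches the target and ranks first, while the third candidate has the opposite label and receives a negative score. Real influence-function methods typically add curvature (inverse-Hessian) terms or aggregate over training checkpoints, which this sketch deliberately omits.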

Unifying and Optimizing Data Values for Selection via Sequential Decision-Making

arXiv:2502.04554v2. Data selection has emerged as a crucial downstream application of data valuation, yet the theoretical foundations for using data values in selection remain underexplored. We reformulate data selection as a sequential decision-making problem where the optimal selection sequence arises from dynamic programming, and data values can be understood as encodings of this optimal sequence. This framework unifies and reinterprets existing methods...

arXiv CS 9d ago

The Long-Term Effects of Data Selection in LLM Fine-Tuning

arXiv:2605.30537v1. Data selection is increasingly used to reduce the cost of large language model (LLM) fine-tuning, with recent methods prioritizing samples by current utility, diversity, quality, or influence. This paper studies a different question: when fine-tuning occurs over multiple stages, can selection strategies that look optimal now make the model less adaptable later? We introduce a long-horizon view of LLM data selection in which a selector is...

arXiv CS 9d ago

MIRA: Mid-training Rubric Anchoring for Source-Aware Data Selection

arXiv:2605.30288v2. Mid-training has become an important stage in modern LLM development, using large-scale curated mixtures to strengthen capabilities before final post-training. Its data selection problem is distinct: the data are optimized under a pretraining-style objective at near-pretraining scale, but are curated toward downstream capabilities and drawn from heterogeneous sources with different formats and training roles. As a result, effective...

arXiv CS 9d ago

Prototype Selection Using Topological Data Analysis

arXiv:2511.04873v2. Prototype selection methods compress a training set, but the existing taxonomy of condensation, edition, hybrid, competence-based, optimization-based, and clustering-based families does not include methods that operate on the multi-scale topological structure of the data. This paper introduces two different persistence-based prototype selector variants, Topological Prototype Selector (TPS) and Boundary-Conscious Topological Prototype...

arXiv CS 8d ago

Once-For-All: A Train-Once and Select-Anytime Framework for Multimodal Instruction Tuning

arXiv:2605.26761v2. Multimodal instruction tuning is the de facto recipe for adapting vision language models (VLMs), yet instruction data are highly redundant, making data selection critical for training efficiency. Existing methods derive selection signals from a specific model or dataset, so whenever the target model or candidate pool changes, the criteria must be recomputed from scratch at substantial cost. To address this, we propose OFA, a data selection...

arXiv CS 5d ago

A Game-Theoretic Decision Framework for Optimal Selection of Coordination Detection Methods in Multi-UAV Fleet Operations

arXiv:2606.02383v1. Detecting coordination among unmanned aerial vehicle (UAV) fleets operating in shared airspace and identifying the route-lead aircraft whose navigation decisions govern fleet behavior presents a fundamental speed-accuracy trade-off: fast methods enable real-time traffic management but sacrifice detection fidelity, while accurate methods may exceed the time budget for actionable airspace deconfliction. This paper presents a game-theoretic...

arXiv CS 8d ago