Data Selection Method
No mentions found
This entity hasn't been tracked yet, or Iris is still building its knowledge base.
Related Articles from SNS
PRISM: Self-Pruning Intrinsic Selection Method for Training-Free Multimodal Data Selection
arXiv:2502.12119v4 Announce Type: replace Abstract: Visual instruction tuning adapts pre-trained Multimodal Large Language Models (MLLMs) to follow human instructions for real-world applications. However, the rapid growth of these datasets introduces significant redundancy, leading to increased computational costs. Existing methods for selecting instruction data aim to prune this redundancy, but predominantly rely on computationally demanding techniques such as proxy-based inference or...
BLISS: A Lightweight Bilevel Influence Scoring Method for Data Selection in Language Model Pretraining
Announce Type: replace Abstract: Effective data selection is essential for pretraining large language models (LLMs), enhancing efficiency and improving generalization to downstream tasks. However, existing approaches often require leveraging external pretrained models, making it difficult to disentangle the effects of data selection from those of the external pretrained models. In addition, they often overlook the long-term impact of selected data if the model is trained to convergence,...
PRISM: Preference-Aware Influence Function Based Data Selection Method for Efficient Fine-Tuning
arXiv:2605.21422v3 Announce Type: replace Abstract: As LLMs continue to scale up, improving training efficiency heavily relies on effective data utilization. Data selection mitigates this issue by allocating the limited training budget to high-value examples that optimally facilitate the model's target behavior. Most existing approaches define target behavior via a set of target examples and score candidate training data based on their estimated influence on these samples.
PRISM: Preference-Aware Influence Function Based Data Selection Method for Efficient Fine-Tuning
Announce Type: replace Abstract: As LLMs continue to scale up, improving training efficiency heavily relies on effective data utilization. Data selection mitigates this issue by allocating the limited training budget to high-value examples that optimally facilitate the model's target behavior. Most existing approaches define target behavior via a set of target examples and score candidate training data based on their estimated influence on these samples.
Unifying and Optimizing Data Values for Selection via Sequential Decision-Making
arXiv:2502.04554v2 Announce Type: replace Abstract: Data selection has emerged as a crucial downstream application of data valuation, yet the theoretical foundations for using data values in selection remain underexplored. We reformulate data selection as a sequential decision-making problem where the optimal selection sequence arises from dynamic programming, and data values can be understood as encodings of this optimal sequence. This framework unifies and reinterprets existing methods...
The Long-Term Effects of Data Selection in LLM Fine-Tuning
arXiv:2605.30537v1 Announce Type: new Abstract: Data selection is increasingly used to reduce the cost of large language model (LLM) fine-tuning, with recent methods prioritizing samples by current utility, diversity, quality, or influence. This paper studies a different question: when fine-tuning occurs over multiple stages, can selection strategies that look optimal now make the model less adaptable later? We introduce a long-horizon view of LLM data selection in which a selector is...
MIRA: Mid-training Rubric Anchoring for Source-Aware Data Selection
arXiv:2605.30288v2 Announce Type: replace Abstract: Mid-training has become an important stage in modern LLM development, using large-scale curated mixtures to strengthen capabilities before final post-training. Its data selection problem is distinct: the data are optimized under a pretraining-style objective at near-pretraining scale, but are curated toward downstream capabilities and drawn from heterogeneous sources with different formats and training roles. As a result, effective...
Prototype Selection Using Topological Data Analysis
arXiv:2511.04873v2 Announce Type: replace-cross Abstract: Prototype selection methods compress a training set, but the existing taxonomy of condensation, edition, hybrid, competence-based, optimization-based, and clustering-based families does not include methods that operate on the multi-scale topological structure of the data. This paper introduces two different persistence-based prototype selector variants, Topological Prototype Selector (TPS) and Boundary-Conscious Topological Prototype...
Once-For-All: A Train-Once and Select-Anytime Framework for Multimodal Instruction Tuning
arXiv:2605.26761v2 Announce Type: replace Abstract: Multimodal instruction tuning is the de facto recipe for adapting vision language models (VLMs), yet instruction data are highly redundant, making data selection critical for training efficiency. Existing methods derive selection signals from a specific model or dataset, so whenever the target model or candidate pool changes, the criteria must be recomputed from scratch at substantial cost. To address this, we propose OFA, a data selection...
A Game-Theoretic Decision Framework for Optimal Selection of Coordination Detection Methods in Multi-UAV Fleet Operations
arXiv:2606.02383v1 Announce Type: new Abstract: Detecting coordination among unmanned aerial vehicle (UAV) fleets operating in shared airspace and identifying the route-lead aircraft whose navigation decisions govern fleet behavior presents a fundamental speed--accuracy trade-off: fast methods enable real-time traffic management but sacrifice detection fidelity, while accurate methods may exceed the time budget for actionable airspace deconfliction. This paper presents a game-theoretic...