Home › Knowledge Base › Pretraining

Pretraining

No mentions found

This entity hasn't been tracked yet, or Iris is still building its knowledge base.

Related Articles from SNS

ActiveMimic: Egocentric Video Pretraining with Active Perception

Announce Type: new Abstract: Egocentric human video offers a scalable alternative to robot data for pretraining, yet models pretrained on such video consistently underperform those pretrained on robot data. We attribute this gap to a missing signal, the active perception behavior in egocentric videos, where humans continuously reposition their viewpoint during manipulation, inducing camera motion that standard pipelines treat as noise. To address this, we present ActiveMimic, a pretraining...

arXiv CS 5d ago

BLISS: A Lightweight Bilevel Influence Scoring Method for Data Selection in Language Model Pretraining

Announce Type: replace Abstract: Effective data selection is essential for pretraining large language models (LLMs), enhancing efficiency and improving generalization to downstream tasks. However, existing approaches often require leveraging external pretrained models, making it difficult to disentangle the effects of data selection from those of the external pretrained models. In addition, they often overlook the long-term impact of selected data if the model is trained to convergence,...

arXiv CS 8d ago

MC-PDD: Masked Corpus-Level Pretraining Data Detection for Black-Box Large Language Models

arXiv:2606.07996v1 Announce Type: new Abstract: Pretraining is fundamental to the development of Large Language Models (LLMs), yet the opacity of pretraining data complicates model analysis and raises ethical, legal, and fairness concerns. Detecting whether specific datasets were used during pretraining is, therefore, critical. Existing state-of-the-art methods typically rely on access to model probability distributions, making them unsuitable for closed-source LLMs that provide only...

arXiv CS 1d ago

Gap-K%: Measuring Top-1 Prediction Gap for Detecting Pretraining Data

Announce Type: replace Abstract: The opacity of massive pretraining corpora in Large Language Models (LLMs) raises significant privacy and copyright concerns, making pretraining data detection a critical challenge. Existing state-of-the-art methods typically rely on token likelihoods, yet they often overlook the gap between the target token and the model's top-1 prediction, as well as local correlations between adjacent tokens. In this work, we propose Gap-K%, a novel pretraining data...

arXiv CS 9d ago

Data-Constrained Language Model Pretraining: Improved Regularization and Scaling Laws

Announce Type: new Abstract: Classical scaling laws for language model pretraining balance model size against training dataset size under a fixed compute budget, assuming abundant data and a single pass over the corpus. As training compute grows faster than the supply of natural language data, pretraining is likely to enter a data-constrained, compute-rich regime where models train for multiple epochs over a finite dataset. We study data-constrained pretraining along two axes, regularization...

arXiv CS 2d ago

Spatial Transcriptomics as Images for Large-Scale Pretraining

arXiv:2603.13432v4 Announce Type: replace Abstract: Spatial Transcriptomics (ST) profiles thousands of gene expression values at discrete spots with precise coordinates on tissue sections, preserving spatial context essential for clinical and pathological studies. With rising sequencing throughput and advancing platforms, the expanding data volumes motivate large-scale ST pretraining. However, the fundamental unit for pretraining, i.e., what constitutes a single training sample, remains...

arXiv CS 6d ago

KletterMix: Climbing Toward High-Quality German Pretraining Data

arXiv:2606.03773v1 Announce Type: new Abstract: High-quality pretraining data is a central ingredient in modern language models, but German-language resources remain far less developed than their English counterparts: they are often smaller, less carefully curated, weakly documented, and rarely validated through controlled training experiments. We introduce KletterMix, a high-quality German corpus for language model pretraining and annealing, designed as a reusable dataset artifact for the...

arXiv CS 7d ago

Structure-Guided Mixed Masked Pretraining and Spatial Continuity Regularization for Printed Circuit Board Defect Detection

Announce Type: new Abstract: Printed circuit board (PCB) defect detection is an essential part of automated optical inspection (AOI); yet it remains challenging in practice because many defects are tiny, low-contrast, and embedded in dense circuit backgrounds. To address these issues, this paper presents a two-phase PCB defect detection framework that combines structure-guided mixed masked pretraining with spatial continuity regularization. In the pretraining stage, we design a sparse...

arXiv CS 7d ago

Epistemic Injustice in Language Models: An Audit of Pretraining Filters and Guardrails

arXiv:2606.05936v1 Announce Type: new Abstract: Modern language models rely on pretraining filters to remove undesirable content from training corpora and inference-time guardrails to suppress undesirable outputs during deployment. In this paper, we examine how these filtering and moderation decisions produce forms of epistemic erasure and reveal tensions both across automated systems and between these systems and human judgment. We audit four pretraining filters and three inference-time...

arXiv CS 5d ago

Speedrunning Tabular Foundation Model Pretraining

arXiv:2606.03681v1 Announce Type: new Abstract: Pretraining cost is a major bottleneck for research on tabular foundation models, slowing the iteration cycle for new architectures, priors, and optimization ideas. Yet the community lacks a simple way to compare and accumulate pretraining speedups.

arXiv CS 7d ago