Pretraining Data Detection
No mentions found
This entity hasn't been tracked yet, or Iris is still building its knowledge base.
Related Articles from SNS
Gap-K%: Measuring Top-1 Prediction Gap for Detecting Pretraining Data
Announce Type: replace Abstract: The opacity of massive pretraining corpora in Large Language Models (LLMs) raises significant privacy and copyright concerns, making pretraining data detection a critical challenge. Existing state-of-the-art methods typically rely on token likelihoods, yet they often overlook the gap between the target token and the model's top-1 prediction, as well as local correlations between adjacent tokens. In this work, we propose Gap-K%, a novel pretraining data...
MC-PDD: Masked Corpus-Level Pretraining Data Detection for Black-Box Large Language Models
arXiv:2606.07996v1 Announce Type: new Abstract: Pretraining is fundamental to the development of Large Language Models (LLMs), yet the opacity of pretraining data complicates model analysis and raises ethical, legal, and fairness concerns. Detecting whether specific datasets were used during pretraining is, therefore, critical. Existing state-of-the-art methods typically rely on access to model probability distributions, making them unsuitable for closed-source LLMs that provide only...
On Revisiting Entropy for Identifying Mislabeled Images
arXiv:2605.31090v1 Announce Type: new Abstract: Mislabeled samples in training datasets severely degrade the performance of deep networks, as overparameterized models tend to memorize erroneous labels. We address this challenge by proposing a novel approach for mislabeled data detection that leverages training dynamics. Our method is grounded in the key observation that correctly labeled samples exhibit consistent entropy decrease during training, while mislabeled samples maintain relatively...
FactoryNet: A Large-Scale Dataset toward Industrial Time-Series Foundation Models
arXiv:2605.09081v4 Announce Type: replace Abstract: We introduce the first universal pretraining corpus for industrial time-series data: 51M datapoints across 23k end-to-end task executions (13.3k real, 9.8k synthetic) on six embodiments, unified by a shared schema that enables robust zero-shot cross-embodiment transfer and highly parameter-efficient anomaly detection. We introduce a novel schema: Setpoint, Effort, Feedback, Context (S-E-F-C) underlying the whole pipeline that maps any...
Human-Like Neural Nets by Catapulting
Human-like Neural Nets by Catapulting Speculative proposal to create artificial neural nets with human-like performance by high-learning-rate/regularization training of overparameterized NNs to trigger catapulting/grokking. Over-parameterization as a route to true generalization would resolve many outstanding mysteries of artificial versus natural intelligence. There are many mysteries about deep learning and human intelligence, but we could describe the biggest anomaly this way: why are...
Clustering Guided Domain-Specific Pretrained Foundation Model Very High-Resolution Arctic Remote Sensing
new Abstract: This study introduces a novel Arctic-focused remote sensing foundation model (RSFM) by combining diversity-aware regional-scale image curation with masked autoencoder (MAE) self-supervised pretraining of a Vision Transformer (ViT) encoder for very-high-spatial-resolution (VHSR) satellite image analysis. Spectral and acquisition-metadata descriptors were used in a scalable affinity-propagation clustering workflow to select approximately 3 million chips from 267 TB of Vantor VHSR...
Clustering Guided Domain-Specific Pretrained Foundation Model for Very High-Resolution Arctic Remote Sensing
arXiv:2605.30467v2 Announce Type: replace Abstract: This study introduces a novel Arctic-focused remote sensing foundation model (RSFM) by combining diversity-aware regional-scale image curation with masked autoencoder (MAE) self-supervised pretraining of a Vision Transformer (ViT) encoder for very-high-spatial-resolution (VHSR) satellite image analysis. Spectral and acquisition-metadata descriptors were used in a scalable affinity-propagation clustering workflow to select approximately 3...
Uncertainty-Aware Intention Prediction for Human-to-Robot Assembly Teleoperation
Announce Type: new Abstract: In assisted teleoperation for human-robot collaboration, accurate intention prediction is critical for enabling timely and reliable robotic assistance during long-horizon manipulation and assembly tasks. These systems require continuous understanding of user behavior to recognize actions, anticipate intentions, and detect mistakes in real time. However, robot teleoperation demonstrations are costly and hardware-limited, whereas human demonstrations are easier to...
'The best solution is to murder him in his sleep': AI can learn violent tendencies from each other despite zero references to violence in training data
'The best solution is to murder him in his sleep': AI can learn violent tendencies from each other despite zero references to violence in training data Scientists found that AI models can inherit a taste for murder (or owls) from other models' training data. Large language models (LLMs) are secretly teaching each other unwanted habits through seemingly benign training data, scientists say. The phenomenon, known as "subliminal learning," occurs when a pretrained "teacher" artificial...
LargeMonitor: Monitoring Online Task-Free Continual Learning via Large Pretrained Models
arXiv:2606.09430v1 Announce Type: new Abstract: Online task-free continual learning (TFCL) requires intelligent agents to sequentially accumulate knowledge from an unbounded, non-stationary data stream under strict single-pass constraints and without any explicit task identifiers. Existing online TFCL paradigms primarily rely on parameter-efficient prompt tuning or dynamic structure expansion driven by training-coupled optimization dynamics, such as empirical loss fluctuations or evolving...