Pretraining Data Detection

Related Articles from SNS

Gap-K%: Measuring Top-1 Prediction Gap for Detecting Pretraining Data

Announce Type: replace Abstract: The opacity of massive pretraining corpora in Large Language Models (LLMs) raises significant privacy and copyright concerns, making pretraining data detection a critical challenge. Existing state-of-the-art methods typically rely on token likelihoods, yet they often overlook the gap between the target token and the model's top-1 prediction, as well as local correlations between adjacent tokens. In this work, we propose Gap-K%, a novel pretraining data...

arXiv CS 9d ago

MC-PDD: Masked Corpus-Level Pretraining Data Detection for Black-Box Large Language Models

arXiv:2606.07996v1 Announce Type: new Abstract: Pretraining is fundamental to the development of Large Language Models (LLMs), yet the opacity of pretraining data complicates model analysis and raises ethical, legal, and fairness concerns. Detecting whether specific datasets were used during pretraining is, therefore, critical. Existing state-of-the-art methods typically rely on access to model probability distributions, making them unsuitable for closed-source LLMs that provide only...

arXiv CS 1d ago

On Revisiting Entropy for Identifying Mislabeled Images

arXiv:2605.31090v1 Announce Type: new Abstract: Mislabeled samples in training datasets severely degrade the performance of deep networks, as overparameterized models tend to memorize erroneous labels. We address this challenge by proposing a novel approach for mislabeled data detection that leverages training dynamics. Our method is grounded in the key observation that correctly labeled samples exhibit consistent entropy decrease during training, while mislabeled samples maintain relatively...

arXiv CS 9d ago

FactoryNet: A Large-Scale Dataset toward Industrial Time-Series Foundation Models

arXiv:2605.09081v4 Announce Type: replace Abstract: We introduce the first universal pretraining corpus for industrial time-series data: 51M datapoints across 23k end-to-end task executions (13.3k real, 9.8k synthetic) on six embodiments, unified by a shared schema that enables robust zero-shot cross-embodiment transfer and highly parameter-efficient anomaly detection. We introduce a novel schema: Setpoint, Effort, Feedback, Context (S-E-F-C) underlying the whole pipeline that maps any...

arXiv CS 6d ago

Human-Like Neural Nets by Catapulting

Speculative proposal to create artificial neural nets with human-like performance by high-learning-rate/regularization training of overparameterized NNs to trigger catapulting/grokking. Over-parameterization as a route to true generalization would resolve many outstanding mysteries of artificial versus natural intelligence. There are many mysteries about deep learning and human intelligence, but we could describe the biggest anomaly this way: why are...

Hacker News 3d ago

Clustering Guided Domain-Specific Pretrained Foundation Model for Very High-Resolution Arctic Remote Sensing

Announce Type: new Abstract: This study introduces a novel Arctic-focused remote sensing foundation model (RSFM) by combining diversity-aware regional-scale image curation with masked autoencoder (MAE) self-supervised pretraining of a Vision Transformer (ViT) encoder for very-high-spatial-resolution (VHSR) satellite image analysis. Spectral and acquisition-metadata descriptors were used in a scalable affinity-propagation clustering workflow to select approximately 3 million chips from 267 TB of Vantor VHSR...

arXiv CS 9d ago

Clustering Guided Domain-Specific Pretrained Foundation Model for Very High-Resolution Arctic Remote Sensing

arXiv:2605.30467v2 Announce Type: replace Abstract: This study introduces a novel Arctic-focused remote sensing foundation model (RSFM) by combining diversity-aware regional-scale image curation with masked autoencoder (MAE) self-supervised pretraining of a Vision Transformer (ViT) encoder for very-high-spatial-resolution (VHSR) satellite image analysis. Spectral and acquisition-metadata descriptors were used in a scalable affinity-propagation clustering workflow to select approximately 3...

arXiv CS 5d ago

Uncertainty-Aware Intention Prediction for Human-to-Robot Assembly Teleoperation

Announce Type: new Abstract: In assisted teleoperation for human-robot collaboration, accurate intention prediction is critical for enabling timely and reliable robotic assistance during long-horizon manipulation and assembly tasks. These systems require continuous understanding of user behavior to recognize actions, anticipate intentions, and detect mistakes in real time. However, robot teleoperation demonstrations are costly and hardware-limited, whereas human demonstrations are easier to...

arXiv CS 1d ago

'The best solution is to murder him in his sleep': AI can learn violent tendencies from each other despite zero references to violence in training data

Scientists found that AI models can inherit a taste for murder (or owls) from other models' training data. Large language models (LLMs) are secretly teaching each other unwanted habits through seemingly benign training data, scientists say. The phenomenon, known as "subliminal learning," occurs when a pretrained "teacher" artificial...

Live Science 5d ago

LargeMonitor: Monitoring Online Task-Free Continual Learning via Large Pretrained Models

arXiv:2606.09430v1 Announce Type: new Abstract: Online task-free continual learning (TFCL) requires intelligent agents to sequentially accumulate knowledge from an unbounded, non-stationary data stream under strict single-pass constraints and without any explicit task identifiers. Existing online TFCL paradigms primarily rely on parameter-efficient prompt tuning or dynamic structure expansion driven by training-coupled optimization dynamics, such as empirical loss fluctuations or evolving...

arXiv CS 1d ago