GRASP: Geometry-aware Residual Alignment for Scalable Pretraining Data Attribution

arXiv CS, Monday 08 June 2026, 04:00 UTC. By Yue Min, Ruining Chen, Yujun Li.


arXiv:2606.06892v1 (announce type: new)

Abstract: Scalable data attribution methods typically assign isolated utility scores to individual training examples. This prevalent additive assumption fundamentally fails to capture critical subset dynamics, including data redundancy and complementary coverage. In this work, we reframe attribution as subset-level counterfactual utility prediction and introduce GRASP, an interaction-aware surrogate. Grounded in a theoretical smoothness lower bound, GRASP explicitly models subset interactions through a quadratic geometric penalty. To achieve pretraining-scale efficiency without relying on hidden oracle tuning, we couple low-dimensional feature sketches with a strictly finite lower-confidence bound selection protocol. Extensive subset-retraining evaluations demonstrate that GRASP decisively outperforms existing scalable baselines. It more than doubles the task-level rank correlation for counterfactual subset fidelity while reducing upfront artifact construction costs by nearly an order of magnitude. Downstream diagnostics further show that this scoring mechanism transfers to language model curation and cross-domain vision selection, establishing a robust foundation for optimizing massive pretraining corpora.
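To make the contrast between additive scoring and an interaction-aware surrogate concrete, here is a minimal sketch in Python. It is not the paper's implementation: the abstract gives no formulas, so the function names (`sketch_features`, `subset_utility`), the Gaussian random projection used as the "feature sketch", the inner-product similarity, and the `penalty` weight are all assumptions chosen for illustration. The lower-confidence bound selection protocol is not covered.

```python
import numpy as np

def sketch_features(features, sketch_dim, seed=0):
    """Hypothetical feature sketch: Gaussian random projection to sketch_dim."""
    rng = np.random.default_rng(seed)
    proj = rng.standard_normal((features.shape[1], sketch_dim)) / np.sqrt(sketch_dim)
    return features @ proj

def subset_utility(scores, sketches, subset, penalty=0.5):
    """Assumed surrogate: additive scores minus a quadratic pairwise penalty.

    The penalty sums inner products between distinct members of the subset,
    so near-duplicate examples reduce the predicted utility of the subset.
    """
    idx = np.asarray(subset, dtype=int)
    z = sketches[idx]
    gram = z @ z.T
    # Sum over ordered pairs i != j; the 0.5 factor counts each pair once.
    interaction = gram.sum() - np.trace(gram)
    return scores[idx].sum() - 0.5 * penalty * interaction

# Toy data: examples 0 and 1 are identical, example 2 is orthogonal to both.
scores = np.array([1.0, 1.0, 1.0])
sketches = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])

redundant = subset_utility(scores, sketches, [0, 1])      # 1.5
complementary = subset_utility(scores, sketches, [0, 2])  # 2.0
```

A purely additive score would rate both pairs at 2.0. Under the assumed quadratic penalty the redundant pair drops to 1.5 while the complementary pair keeps 2.0, which is the kind of subset-level effect the abstract says isolated per-example scores cannot capture.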

Originally published by arXiv CS.

