Technology
A Grammar of Machine Learning Workflows: Rejecting Data Leakage at Call Time
Key Points
arXiv:2603.10742v4 Announce Type: replace Abstract: Data leakage has been identified in 648 published papers across 30 scientific fields. The knowledge to prevent it has existed for over a decade; the problem persists because the tools do not enforce what the textbooks teach. This paper presents a grammar (eight typed primitives connected by a directed acyclic graph with four hard constraints) that makes the most damaging leakage types structurally unrepresentable within the grammar's scope.
arXiv:2603.10742v4 Announce Type: replace
Abstract: Data leakage has been identified in 648 published papers across 30 scientific fields. The knowledge to prevent it has existed for over a decade; the problem persists because the tools do not enforce what the textbooks teach. This paper presents a grammar (eight typed primitives connected by a directed acyclic graph with four hard constraints) that makes the most damaging leakage types structurally unrepresentable within the grammar's scope. The core mechanism is a terminal assessment gate: the first call-time-enforced evaluate/assess boundary documented in the peer-reviewed ML methodology literature (to my knowledge, as of May 2026), backed by a specification precise enough for independent reimplementation. A companion landscape study across 2,047 datasets grounds the constraints in measured effect sizes. Two reference implementations (Python, R) are available.