Home Science Disjoint Generation of Synthetic Data
Science

Disjoint Generation of Synthetic Data

Key Points

arXiv:2507.19700v2 Announce Type: replace Abstract: We propose a new framework for generating tabular synthetic datasets via disjoint generative models. In this paradigm, a dataset is partitioned into disjoint subsets that are supplied to separate instances of generative models. The results are then combined post hoc by a joining operation that works in the absence of common variables/identifiers.

arXiv:2507.19700v2 Announce Type: replace Abstract: We propose a new framework for generating tabular synthetic datasets via disjoint generative models. In this paradigm, a dataset is partitioned into disjoint subsets that are supplied to separate instances of generative models. The results are then combined post hoc by a joining operation that works in the absence of common variables/identifiers. The success of the framework is demonstrated through several case studies and examples on tabular data that help illuminate some of the design choices that one may make. The advantages achieved by the disjoint generation include: i) An observed increase in the empirical measurement of privacy. ii) Increased computational feasibility of certain model types. iii) Ability to generate synthetic data using a mixture of different generative models. Specifically, mixed-model synthesis bridges the gap between privacy and utility performance, providing highly competitive performance on Accuracy and Area Under the Curve for downstream tasks while significantly lowering the empirical re-identification risk.
Disjoint Generation of Synthetic Data arXiv:2507.19700v2 (ORG)
Originally published by arXiv CS Read original →