GEM: Geometric Entropy Mixing for Optimal LLM Data Curation

arXiv CS, Monday 01 June 2026, 04:00 UTC. By Yue Min, Ziyun Qiao, Ruining Chen, Yujun Li

Key Points

arXiv:2605.26121v2. Abstract: LLM pre-training efficacy increasingly depends on data composition rather than sheer volume. Yet optimal mixing is hindered by flaws in how data is categorized: human taxonomies suffer from ontological misalignment, and Euclidean clustering fails to address embedding anisotropy. We introduce GEM (Geometric Entropy Mixing), a framework that reformulates data curation as a variational problem on the hypersphere, augmented with a mixing-balance regularizer. By decoupling the generative prior and optimizing the objective via a provable MM (Minorize-Maximize) algorithm, GEM counteracts cluster collapse and discovers balanced semantic structures invisible to Euclidean heuristics. We employ teacher-student distillation to scale this geometric fidelity to web-scale corpora and introduce the Geometric Influence Score (GIS) for interpretable taxonomy generation. Experiments with 1.1B-parameter models demonstrate that GEM establishes a new state of the art when integrated into mixing strategies like DoReMi and RegMix, improving average downstream accuracy by up to 1.2% and offering a robust coordinate system for predictable data mixing.
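The abstract gives no formulas, so the following is only a rough illustration of two ideas it names: clustering on the unit hypersphere rather than in Euclidean space, and measuring how balanced the resulting mixture is. It uses plain spherical k-means as a stand-in, which is not GEM's variational objective or its MM algorithm. The toy points, the initial centroids, and every function name here are assumptions made for illustration.

```python
import math

def normalize(v):
    """Project a vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(u, v):
    """Dot product; equals cosine similarity when both inputs are unit vectors."""
    return sum(a * b for a, b in zip(u, v))

def mixing_entropy(weights):
    """Shannon entropy (in nats) of cluster proportions; largest when balanced."""
    return -sum(w * math.log(w) for w in weights if w > 0)

def spherical_kmeans(points, centroids, iters=10):
    """Stand-in clustering: assign by cosine similarity, re-project means to the sphere."""
    points = [normalize(p) for p in points]
    centroids = [normalize(c) for c in centroids]
    labels = []
    for _ in range(iters):
        # Assignment step: nearest centroid by direction, ignoring vector length.
        labels = [
            max(range(len(centroids)), key=lambda k: cosine(p, centroids[k]))
            for p in points
        ]
        # Update step: sum the members of each cluster and renormalize.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centroids[k] = normalize([sum(col) for col in zip(*members)])
    return labels, centroids

# Hypothetical 2-D "embeddings": two directions with very different norms.
toy_points = [(1.0, 0.1), (10.0, 0.5), (0.1, 1.0), (0.3, 5.0)]
labels, centroids = spherical_kmeans(toy_points, [(1.0, 0.0), (0.0, 1.0)])

# Cluster proportions play the role of mixing weights in this sketch.
weights = [labels.count(k) / len(labels) for k in range(len(centroids))]
balance = mixing_entropy(weights)
```

Because every point is normalized first, the long vector (10.0, 0.5) groups with (1.0, 0.1) by direction alone, and the entropy of the proportions reaches its two-cluster maximum of ln 2 when the split is even. A balance regularizer in the spirit the abstract describes would reward such high-entropy mixtures, though the paper's exact form is not stated here.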

Originally published by arXiv CS.
