Scaling Pre-training to One Hundred Billion Data for Vision Language Models

arXiv CS, Tuesday 02 June 2026, 04:00 UTC
By Xiao Wang, Ibrahim Alabdulmohsin, Daniel Salz, Zhe Li, Keran Rong, Xiaohua Zhai


arXiv:2502.07617v2

Abstract: We provide an empirical investigation of the potential of pre-training vision-language models on an unprecedented scale: 100 billion examples. We find that model performance tends to saturate at this scale on many common Western-centric classification and retrieval benchmarks, such as COCO Captions. Nevertheless, tasks involving cultural diversity achieve more substantial gains from the 100-billion-scale web data, thanks to its coverage of long-tail concepts. Furthermore, we analyze the model's multilinguality and show gains in low-resource languages as well. In addition, we observe that reducing the size of the pretraining dataset via quality filters, such as CLIP-based filtering, which is typically used to enhance performance, may inadvertently reduce the cultural diversity represented in large-scale datasets. Our results highlight that while traditional benchmarks may not benefit significantly from scaling noisy, raw web data to 100 billion examples, this data scale is vital for building truly inclusive multimodal systems.


Originally published by arXiv CS
