Evaluation of LLMs for Mathematical Formalization in Lean

arXiv CS, Friday 05 June 2026, 04:00 UTC. By Tyson Klingner, Drew Bladek, Escher Crawford, Bohao Chen, Ariel Fu, Kaira Nair, Jarod Alper, Giovanni Inchiostro, Vasily Ilin

Key Points

arXiv:2606.05632v1. Abstract: Within the past few years, the ability of Large Language Models (LLMs) to generate formal mathematical proofs has improved drastically. We compare the effectiveness of various LLMs at producing formal proofs in Lean 4, with the goal of assisting those who want to use LLMs to support their own projects. We use both pass@k and refine@k as the metrics for our comparison and evaluate on subsets of the miniF2F and miniCTX datasets. Our testing shows that, overall, Gemini 3.1 Pro and Claude Opus 4.7 perform best: Gemini 3.1 Pro achieved a 92% success rate on miniF2F via refine@32, whereas Opus 4.7 achieved an 86% success rate on miniCTX via refine@32. When cost is taken into account, NVIDIA Nemotron 3 Super and GPT-OSS 120B were the most efficient, with competitive accuracies and average costs below $0.01 per correct proof.
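The abstract names pass@k and a cost-per-correct-proof figure without saying how either is computed. As a rough illustration only, the sketch below uses the widely used unbiased pass@k estimator (n samples per problem, c of them correct) and assumes that cost per correct proof means total spend divided by the number of proofs that check. Both function names and that cost definition are assumptions made here, not details taken from the paper, and refine@k is left out because its exact protocol is not described in the abstract.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which are correct.

    Uses the standard unbiased estimator 1 - C(n - c, k) / C(n, k).
    Whether the paper uses this estimator is an assumption.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k incorrect samples: every size-k draw contains a correct one.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def cost_per_correct_proof(total_cost_usd: float, num_correct: int) -> float:
    """Hypothetical cost metric: total spend divided by the number of correct proofs."""
    if num_correct < 0:
        raise ValueError("num_correct must be non-negative")
    if num_correct == 0:
        return float("inf")  # nothing was proved, so the cost is unbounded
    return total_cost_usd / num_correct
```

For example, with 4 samples of which 1 is correct, pass_at_k(4, 1, 2) evaluates to 0.5, since half of the two-sample draws include the correct proof. A benchmark-level score would average this value over all problems.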


Originally published by arXiv CS.
