CATPO: Critique-Augmented Tree Policy Optimization

arXiv CS, Tuesday 09 June 2026, 04:00 UTC. By Ayush Singh, Umang Goyal, Ankur Dahiya


arXiv:2606.08346v1

Abstract: Reinforcement learning with verifiable rewards (RLVR) has become a dominant paradigm for improving the reasoning capabilities of large language models (LLMs). Recent tree-based methods such as TreeRPO extend flat trajectory sampling with tree-structured rollouts to obtain dense, step-level reward signals without a separate process reward model. However, not all trees are equally informative: trees where all leaves succeed, all leaves fail, or the policy already predicts the reward distribution contribute little to gradient updates, wasting compute. We introduce CATPO (Critique-Augmented Tree Policy Optimization), which diagnoses and addresses this waste at the tree level. CATPO first scores each tree via a tree informativeness score, F(T), combining leaf-outcome diversity with policy-reward decorrelation at zero extra compute. For dead-wrong trees where all branches fail, CATPO applies critique-guided healing: it locates the shallowest failure point, generates a natural-language critique, and grafts refined continuations to recover training signal. Finally, an informativeness-weighted loss scales each tree's gradient contribution by its normalized score, concentrating parameter updates on the most informative trees while preserving overall gradient magnitude. Experiments on Qwen2.5-Math-1.5B trained with the MATH dataset show that CATPO achieves 37.5% macro accuracy across four benchmarks (AIME24, MATH-500, OlympiadBench, and MinervaMath), improving over TreeRPO by 1.9% and GRPO by 4.8%.
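The abstract does not give the formula for F(T) or for the weighted loss, so the following is only an illustrative sketch of the idea, not the authors' method. It assumes binary leaf rewards, measures leaf-outcome diversity as a scaled Bernoulli variance, measures policy-reward decorrelation as one minus the clamped Pearson correlation between leaf rewards and leaf log-probabilities, and combines the two by multiplication. All function names, the multiplicative form, and the mean-one normalization are hypothetical choices made for illustration.

```python
from math import sqrt

def leaf_diversity(rewards):
    """Assumed diversity term: 4*p*(1-p), which is 0 when all leaves agree."""
    p = sum(rewards) / len(rewards)
    return 4.0 * p * (1.0 - p)

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input has no variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)

def informativeness(rewards, leaf_logprobs):
    """Hypothetical F(T): diversity times decorrelation (an assumed form)."""
    corr = min(1.0, max(0.0, pearson(rewards, leaf_logprobs)))
    return leaf_diversity(rewards) * (1.0 - corr)

def normalized_weights(scores):
    """Rescale scores to mean 1 so the total gradient scale is unchanged."""
    mean = sum(scores) / len(scores)
    if mean == 0:
        return [1.0] * len(scores)  # assumed fallback: uniform weights
    return [s / mean for s in scores]

def weighted_loss(tree_losses, weights):
    """Average of per-tree losses, each scaled by its weight."""
    return sum(w * l for w, l in zip(weights, tree_losses)) / len(tree_losses)
```

Under these assumptions, a tree whose leaves all succeed or all fail scores 0, as does a tree whose leaf log-probabilities already track the rewards perfectly, while a tree with mixed outcomes that the policy does not predict scores close to 1. Because the weights average to 1 across a batch, low-scoring trees are down-weighted without shrinking the overall update.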


Originally published by arXiv CS.
