HMPO: Hybrid Median-length Policy Optimization for Chain-of-Thought Compression

arXiv CS, Tuesday 02 June 2026, 04:00 UTC
By Minghui Zheng, Hongxu Chen, Huimin Ren, Hongsheng Xin, Xiaoyang Qu, Ze Wang, Shuling Yang, Ziyu Peng, Kaike Zhang, Pan Zhou, Kun Zhan

Key Points

arXiv:2606.01934v1. Large language models achieve remarkable performance via extended chain-of-thought (CoT) reasoning, yet this lengthy process incurs substantial inference overhead. Existing CoT compression methods struggle with inflexible manual length budgets, computationally expensive multi-stage training pipelines, and fragile scalability restricted to small models. We propose HMPO (Hybrid Median-length Policy Optimization), a cost-effective, single-stage reinforcement learning framework. HMPO efficiently compresses CoT via three synergistic components: an adaptive median-based budget derived from successful rollouts to eliminate manual tuning, a cosine-decay token reward for smooth length penalization, and a multiplicative reward formulation that substantially mitigates trivial reward hacking by strictly prioritizing answer correctness. Trained exclusively on mathematical data, HMPO generalizes seamlessly across math, code, science, and instruction-following tasks. Extensive experiments scaling from 9B to 122B parameters across dense and Mixture-of-Experts (MoE) architectures demonstrate that HMPO achieves 19% to 46% token compression with negligible accuracy degradation, all while drastically reducing training costs compared to existing multi-stage baselines.
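
The abstract names the three components but gives no formulas, so the following is only an illustrative sketch of how they might fit together, not the authors' implementation. Everything concrete here is an assumption: the function names, the choice to give full length reward at or below the budget, the decay window ending at twice the budget, and the zero reward when no rollout in the group is correct.

```python
import math
from statistics import median


def median_budget(lengths, correct):
    """Median token length over the correct rollouts (assumed budget rule).

    Returns None when no rollout in the group is correct.
    """
    ok = [n for n, c in zip(lengths, correct) if c]
    return median(ok) if ok else None


def cosine_length_reward(length, budget, floor=0.0):
    """Assumed shape: 1.0 up to the budget, cosine decay to `floor` at 2 * budget."""
    if length <= budget:
        return 1.0
    if length >= 2 * budget:
        return floor
    t = (length - budget) / budget  # fraction of the decay window used, in (0, 1)
    return floor + (1.0 - floor) * 0.5 * (1.0 + math.cos(math.pi * t))


def hybrid_rewards(lengths, correct):
    """Multiply correctness by the length reward, so a wrong answer always scores 0."""
    budget = median_budget(lengths, correct)
    if budget is None:
        return [0.0] * len(lengths)
    return [
        float(c) * cosine_length_reward(n, budget)
        for n, c in zip(lengths, correct)
    ]


# Hypothetical group of four rollouts: token counts and correctness flags.
rewards = hybrid_rewards([100, 200, 300, 50], [True, True, True, False])
```

In this sketch the short but incorrect rollout (50 tokens) receives no reward at all, which illustrates why a multiplicative form resists the trivial hack of emitting very short wrong answers; an additive length bonus would still pay it something.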


Originally published by arXiv CS.
