The Flexibility Trap: Rethinking the Value of Arbitrary Order in Diffusion Language Models

arXiv CS Tuesday 09 June 2026, 04:00 UTC By Zanlin Ni, Shenzhi Wang, Yang Yue, Tianyu Yu, Weilin Zhao, Yeguo Hua, Tianyi Chen, Jun Song, Cheng Yu, Bo Zheng, Gao Huang 1 min read

Key Points

Announce Type: replace Abstract: Diffusion Large Language Models (dLLMs) break the rigid left-to-right constraint of traditional LLMs, enabling token generation in arbitrary orders. Intuitively, this flexibility implies a solution space that strictly supersets the fixed autoregressive trajectory, theoretically unlocking superior reasoning potential. However, in this paper, we find that for general reasoning tasks (e.g., mathematics and coding), arbitrary order generation may in fact limit...

arXiv:2601.15165v4 Announce Type: replace Abstract: Diffusion Large Language Models (dLLMs) break the rigid left-to-right constraint of traditional LLMs, enabling token generation in arbitrary orders. Intuitively, this flexibility implies a solution space that strictly supersets the fixed autoregressive trajectory, theoretically unlocking superior reasoning potential. However, in this paper, we find that for general reasoning tasks (e.g., mathematics and coding), arbitrary order generation may in fact limit the reasoning potential of dLLMs. We observe that dLLMs tend to exploit this order flexibility to bypass high-uncertainty tokens that are crucial for exploration, which can lead to a premature collapse of solution coverage. This observation motivates a rethink of RL approaches for dLLMs, where considerable complexities, such as handling combinatorial trajectories and intractable likelihoods, are often devoted to preserving this flexibility. We show that effective reasoning can be elicited by simply forgoing arbitrary order and applying standard Group Relative Policy Optimization (GRPO) instead. Our approach, JustGRPO, is minimalist yet surprisingly effective (e.g., 89.1% accuracy on GSM8K) while fully retaining the parallel decoding ability of dLLMs. Project page: https://nzl-thu.github.io/the-flexibility-trap

RL (ORG) Group Relative Policy Optimization (ORG) GRPO (ORG)

Originally published by arXiv CS Read original →

The Flexibility Trap: Rethinking the Value of Arbitrary Order in Diffusion Language Models

Related Stories

When 'Island Nemo' went missing, locals suspected foul play

Artificial turf contains 400 chemicals tied to cancer and hormone disruption. But is it unsafe?

Japan’s Retail Investor Army Flocks to SpaceX After IPO Drought

NASA addresses criticism over all-male crew selected for Artemis III test mission