Finite-Time Regret Analysis of Retry-Aware Bandits

arXiv CS Tuesday 02 June 2026, 04:00 UTC By Bingkui Tong, Junpei Komiyama, Soichiro Nishimori, Paavo Parmas 1 min read

Key Points

Announce Type: replace Abstract: We study a stochastic bandit algorithm motivated by retry-aware objectives that value the best outcome among multiple attempts, such as pass@$k$ and max@$k$. Given a posterior over arm values, ReMax chooses a sampling distribution that maximizes the posterior expected maximum reward over $M$ virtual draws. Although this objective was introduced in reinforcement learning as an exploration mechanism under uncertainty, its regret properties in bandit problems...

arXiv:2605.20854v2 Announce Type: replace Abstract: We study a stochastic bandit algorithm motivated by retry-aware objectives that value the best outcome among multiple attempts, such as pass@$k$ and max@$k$. Given a posterior over arm values, ReMax chooses a sampling distribution that maximizes the posterior expected maximum reward over $M$ virtual draws. Although this objective was introduced in reinforcement learning as an exploration mechanism under uncertainty, its regret properties in bandit problems have remained unclear. For Gaussian rewards and the first nontrivial case $M=2$, we characterize the optimal ReMax distribution through an expected-improvement balance condition and prove the first sublinear regret bound for ReMax. Our analysis separates the usual saturation behavior of suboptimal arms from a ReMax-specific underestimation effect, in which the optimal arm may be sampled too rarely after an unfavorable estimate. This explains why ReMax can be more exploitative than Thompson sampling (TS) and why its regret analysis is technically delicate. Experiments support this picture: ReMax often outperforms KL-UCB and Thompson sampling under mild underestimation, while posterior-variance scaling empirically mitigates severe underestimation.

Finite-Time Regret Analysis of Retry-Aware Bandits arXiv:2605.20854v2 (ORG) ReMax (LOCATION) Thompson (ORG) KL-UCB (ORG)

Originally published by arXiv CS Read original →

Finite-Time Regret Analysis of Retry-Aware Bandits

Related Stories

'Worrying' pollution in Cotswolds river - volunteers

See the 'crawling,' ball-shaped robot that rolled around the moon during Japan's historic first landing

The asteroid that wiped out the dinosaurs may have created a vast underground habitat for life that lasted 8 million years

'Doctor Who' Christmas Special is cancelled as Russell T. Davies departs the show: What does this mean for the future of Doctor Who?