Bandit Linear Optimization
No mentions found
This entity hasn't been tracked yet, or Iris is still building its knowledge base.
Related Articles from SNS
Practical and Optimal Algorithm for Linear Contextual Bandits with Rare Parameter Updates
arXiv:2606.00984v1 Announce Type: cross Abstract: We study linear contextual bandits under rare parameter updates: the learner may incorporate reward feedback into its parameter estimate only at a small number of update times, while still observing contexts online and selecting actions sequentially. This viewpoint clarifies a practical distinction that is often blurred in the literature: many "strictly batched" methods additionally restrict within-interval context adaptivity, meaning that...
A Perturbation Approach to Unconstrained Linear Bandits
arXiv:2603.28201v2 Announce Type: replace Abstract: We revisit the standard perturbation-based approach of Abernethy et al. in the context of unconstrained Bandit Linear Optimization (uBLO). We show the surprising result that in the unconstrained setting, this approach effectively reduces Bandit Linear Optimization (BLO) to a standard Online Linear Optimization (OLO) problem.
An Improved Algorithm for Adversarial Linear Contextual Bandits via Reduction
arXiv:2508.11931v3 Announce Type: replace Abstract: We present an oracle-efficient, near-optimal algorithm for linear contextual bandits with adversarial losses and stochastic action sets, only requiring a linear optimization oracle for the action sets in each round. Our approach reduces this setting to misspecification-robust adversarial linear bandits with fixed action sets. Without knowledge of the context distribution or access to a context simulator, the algorithm achieves...
Decision-Focused On-Policy Learning for Contextual Linear Optimization with Partial Feedback
arXiv:2606.01081v1 Announce Type: new Abstract: Decision-focused learning (DFL) trains predictive models by optimizing downstream decision quality rather than standalone prediction accuracy. For contextual linear optimization, most existing DFL methods assume offline data and full observations of the objective cost vector. We develop an on-policy learning method for sequential contextual linear optimization under partial feedback, generalizing the standard bandit feedback setting.
Offline-to-Online Learning in Linear Bandits
Announce Type: new Abstract: We study online learning with an additional offline dataset in the stochastic linear bandit setting. Although this problem arises frequently in practice, the offline-to-online tradeoff remains poorly understood in structured environments. We propose a linear bandit algorithm that balances this tradeoff: it relies on offline data during early rounds, and increasingly favors exploration as the horizon grows.
When and why randomised exploration works (in linear bandits)
arXiv:2502.08870v2 Announce Type: replace Abstract: We provide an approach for the analysis of randomised exploration algorithms like Thompson sampling that does not rely on forced optimism or posterior inflation. With this, we demonstrate that in the $d$-dimensional linear bandit setting, when the action space is smooth and strongly convex, randomised exploration algorithms enjoy an $n$-step regret bound of the order $O(d\sqrt{n} \log(n))$. Notably, this shows for the first time that there...
Batched Stochastic Linear Bandits with 1-Bit Communication Constraints
Announce Type: cross Abstract: We study stochastic linear bandits under a natural combination of batching and communication constraints: the time horizon is partitioned into batches of equal size $B$, and during each batch the learner sends $B$ requested arm pulls to an agent, who then observes the corresponding $B$ rewards and responds with a single bit of feedback to the learner. For each batch, the learner specifies the 1-bit quantization rule the agent uses, which may depend on all...
MASPOB: Bandit-Based Prompt Optimization for Multi-Agent Systems with Graph Neural Networks
arXiv:2603.02630v2 Announce Type: replace Abstract: Large Language Models (LLMs) have achieved great success in many real-world applications, especially the one serving as the cognitive backbone of Multi-Agent Systems (MAS) to orchestrate complex workflows in practice. Since many deployment scenarios preclude MAS workflow modifications and its performance is highly sensitive to the input prompts, prompt optimization emerges as a more natural approach to improve its performance. However,...
Bandits for Efficient Experimentation: Adapting to Control Group, Preferences, and Context Drifts
arXiv:2606.09802v1 Announce Type: new Abstract: We consider a variant of the linear contextual stochastic multi-armed bandits, where the learner must provide recommendations to a group of users, each having its personalized preference vector, and in the presence of context distributions that are drifting over time. Under practitioner-friendly assumptions, we reduce this setting to linear bandit with stationary mean but heteroskedastic and non-stationary noise. We further study the case when...
Annealed Softmax Greedy in Many-Armed Bayesian Bandits
Announce Type: new Abstract: Reinforcement learning with verifiable rewards (RLVR) and group-based policy optimization methods such as GRPO update a stochastic policy by sampling multiple completions per prompt and increasing the policy's probability on those with higher reward, regularized by a KL penalty toward a reference policy. These updates do not include explicit mechanisms that track epistemic uncertainty. This paper studies a stylized explanation for why such uncertainty-agnostic...