Non-Stationary Bandits under Recharging Payoffs: Improved Planning with Sublinear Regret

by   Orestis Papadigenopoulos, et al.

The stochastic multi-armed bandit setting has been recently studied in the non-stationary regime, where the mean payoff of each action is a non-decreasing function of the number of rounds passed since it was last played. This model captures natural behavioral aspects of the users which crucially determine the performance of recommendation platforms, ad placement systems, and more. Even assuming prior knowledge of the mean payoff functions, computing an optimal planning in the above model is NP-hard, while the state-of-the-art is a 1/4-approximation algorithm for the case where at most one arm can be played per round. We first focus on the setting where the mean payoff functions are known. In this setting, we significantly improve the best-known guarantees for the planning problem by developing a polynomial-time (1-1/e)-approximation algorithm (asymptotically and in expectation), based on a novel combination of randomized LP rounding and a time-correlated (interleaved) scheduling method. Furthermore, our algorithm achieves improved guarantees – compared to prior work – for the case where more than one arm can be played at each round. Moving to the bandit setting, when the mean payoff functions are initially unknown, we show how our algorithm can be transformed into a bandit algorithm with sublinear regret.


page 1

page 2

page 3

page 4


A New Look at Dynamic Regret for Non-Stationary Stochastic Bandits

We study the non-stationary stochastic multi-armed bandit problem, where...

Last Switch Dependent Bandits with Monotone Payoff Functions

In a recent work, Laforgue et al. introduce the model of last switch dep...

Contextual Blocking Bandits

We study a novel variant of the multi-armed bandit problem, where at eac...

Non-stationary Bandits and Meta-Learning with a Small Set of Optimal Arms

We study a sequential decision problem where the learner faces a sequenc...

Combinatorial Blocking Bandits with Stochastic Delays

Recent work has considered natural variations of the multi-armed bandit ...

Recurrent Submodular Welfare and Matroid Blocking Bandits

A recent line of research focuses on the study of the stochastic multi-a...

Please sign up or login with your details

Forgot password? Click here to reset