Exploration, Exploitation, and Engagement in Multi-Armed Bandits with Abandonment

by Zixian Yang, et al.

The multi-armed bandit (MAB) is a classic model for understanding the exploration-exploitation trade-off. The traditional MAB model for recommendation systems assumes that the user stays in the system for the entire learning horizon. In new online education platforms such as ALEKS or new video recommendation systems such as TikTok and YouTube Shorts, the amount of time a user spends on the app depends on how engaging the recommended content is. Users may temporarily leave the system if the recommended items fail to engage them. To understand exploration, exploitation, and engagement in these systems, we propose a new model called MAB-A, where "A" stands for abandonment and the abandonment probability depends on the currently recommended item and the user's past experience (called the state). We propose two algorithms, ULCB and KL-ULCB, both of which explore more (being optimistic) when the user likes the previously recommended item and explore less (being pessimistic) when the user does not like it. We prove that both ULCB and KL-ULCB achieve logarithmic regret, O(log K), where K is the number of visits (or episodes). Furthermore, the regret bound under KL-ULCB is asymptotically sharp. We also extend the proposed algorithms to the general-state setting. Simulation results confirm our theoretical analysis and show that the proposed algorithms achieve significantly lower regret than traditional UCB, KL-UCB, and Q-learning-based algorithms.
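The optimistic/pessimistic switch described in the abstract can be illustrated with a minimal sketch. This is not the paper's ULCB algorithm, whose exact indices and state model are not given here. It only assumes a standard Hoeffding-style confidence radius and a hypothetical boolean state `liked_previous`. The function names and the constant `c` are likewise illustrative assumptions.

```python
import math


def confidence_indices(successes, pulls, c=2.0):
    """Return (upper, lower) confidence indices for each arm.

    Assumption: a Hoeffding-style radius sqrt(c * log(t) / n), as in
    textbook UCB. The paper's actual indices may differ.
    """
    total = max(sum(pulls), 2)  # avoid log(1) = 0 and log(0)
    upper, lower = [], []
    for s, n in zip(successes, pulls):
        if n == 0:
            # An untried arm is maximally uncertain in both directions.
            upper.append(float("inf"))
            lower.append(float("-inf"))
            continue
        mean = s / n
        radius = math.sqrt(c * math.log(total) / n)
        upper.append(mean + radius)
        lower.append(mean - radius)
    return upper, lower


def choose_arm(successes, pulls, liked_previous, c=2.0):
    """Pick an arm optimistically or pessimistically.

    liked_previous is a hypothetical two-valued state: True means the
    user liked the last recommendation, so we explore using upper
    bounds; False means we play it safe using lower bounds.
    """
    upper, lower = confidence_indices(successes, pulls, c)
    index = upper if liked_previous else lower
    return max(range(len(index)), key=index.__getitem__)


# Arm 0 is well explored with a high mean; arm 1 is barely explored.
successes, pulls = [8, 1], [10, 2]
optimistic_choice = choose_arm(successes, pulls, liked_previous=True)
pessimistic_choice = choose_arm(successes, pulls, liked_previous=False)
```

In this toy setup the optimistic rule favors the under-explored arm because its wide confidence interval inflates the upper bound, while the pessimistic rule favors the well-explored arm because its lower bound is higher. That matches the intuition of risking exploration only when the user is unlikely to abandon.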




Fiduciary Bandits

Recommendation systems often face exploration-exploitation tradeoffs: th...

Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems

Multi-armed bandit problems are the most basic examples of sequential de...

CONQUER: Confusion Queried Online Bandit Learning

We present a new recommendation setting for picking out two items from a...

Bayesian Exploration with Heterogeneous Agents

It is common in recommendation systems that users both consume and produ...

Tiered Reinforcement Learning: Pessimism in the Face of Uncertainty and Constant Regret

We propose a new learning framework that captures the tiered structure o...

UCBoost: A Boosting Approach to Tame Complexity and Optimality for Stochastic Bandits

In this work, we address the open problem of finding low-complexity near...

The K-Nearest Neighbour UCB algorithm for multi-armed bandits with covariates

In this paper we propose and explore the k-Nearest Neighbour UCB algorit...
