Bounded Memory Adversarial Bandits with Composite Anonymous Delayed Feedback
We study the adversarial bandit problem with composite anonymous delayed feedback. In this setting, the loss of an action is split into d components, which spread over the d consecutive rounds after the action is chosen, and in each round the algorithm observes only the aggregate of the loss components arriving from the latest d rounds. Previous work focuses on the oblivious adversarial setting, while we investigate the harder non-oblivious setting. We show that the non-oblivious setting incurs Ω(T) pseudo regret even when the loss sequence has bounded memory. However, we propose a wrapper algorithm that enjoys o(T) policy regret on many adversarial bandit problems under the assumption that the loss sequence has bounded memory. In particular, for the K-armed bandit and bandit convex optimization, we obtain an 𝒪(T^2/3) policy regret bound. We also prove a matching lower bound for the K-armed bandit. Our lower bound holds even when the loss sequence is oblivious but the delay is non-oblivious; it answers the open problem posed in <cit.>, showing that non-oblivious delay alone suffices to incur Ω̃(T^2/3) regret.
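To make the feedback model concrete, the following is a minimal sketch (all names are illustrative, and the random split of each loss into d components is an assumption for the demo; the paper does not prescribe how the adversary splits losses). At each round the learner observes only the aggregate of components arriving that round, with no attribution to the actions that generated them:

```python
import random

def composite_anonymous_feedback(T, d, loss_fn, policy):
    """Simulate composite anonymous delayed feedback.

    The loss of the action chosen at round t is split into d
    components; component j is incurred at round t + j.  The learner
    observes only the aggregate of components arriving at each round,
    not which past action produced them.
    """
    pending = [0.0] * d  # pending[j]: loss mass arriving j rounds from now
    observed = []
    for t in range(T):
        arm = policy(t)
        total = loss_fn(t, arm)
        # Illustrative adversary: split the loss at d-1 uniform cut points.
        cuts = sorted(random.random() for _ in range(d - 1))
        parts = [total * (b - a) for a, b in zip([0.0] + cuts, cuts + [1.0])]
        for j, p in enumerate(parts):
            pending[j] += p
        observed.append(pending.pop(0))  # anonymous aggregate for round t
        pending.append(0.0)
    return observed
```

With d = 1 the model reduces to standard bandit feedback (each loss is observed in full at its own round); for d > 1 the observation mixes components from up to d different past actions, which is what makes per-action loss estimation hard.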