On Interpolating Experts and Multi-Armed Bandits
Learning with expert advice and multi-armed bandits are two classic online decision problems that differ in how information is observed in each round of the game. We study a family of problems interpolating the two. For a vector $\mathbf{m}=(m_1,\dots,m_K)\in\mathbb{N}^K$, an instance of $\mathbf{m}$-MAB indicates that the arms are partitioned into $K$ groups and the $i$-th group contains $m_i$ arms. Once an arm is pulled, the losses of all arms in the same group are observed. We prove tight minimax regret bounds for $\mathbf{m}$-MAB and design an optimal PAC algorithm for its pure exploration version, $\mathbf{m}$-BAI, where the goal is to identify the arm with minimum loss in as few rounds as possible. We show that the minimax regret of $\mathbf{m}$-MAB is $\Theta\left(\sqrt{T\sum_{k=1}^{K}\log(m_k+1)}\right)$ and the minimum number of pulls for an $(\epsilon,0.05)$-PAC algorithm of $\mathbf{m}$-BAI is $\Theta\left(\frac{1}{\epsilon^2}\cdot\sum_{k=1}^{K}\log(m_k+1)\right)$. Both our upper bounds and lower bounds for $\mathbf{m}$-MAB can be extended to a more general setting, namely the bandit with graph feedback, in terms of the clique cover and related graph parameters. As a consequence, we obtain tight minimax regret bounds for several families of feedback graphs.
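To make the feedback model and the stated rates concrete, the following is a minimal Python sketch. It is an illustration only, not the paper's algorithm: the function names (`group_complexity`, `regret_rate`, `observed_arms`) and the convention of numbering arms consecutively group by group are assumptions introduced here, and the constants hidden by $\Theta(\cdot)$ are omitted.

```python
import math


def group_complexity(m):
    """Return sum_k log(m_k + 1), the quantity appearing in both bounds."""
    return sum(math.log(mk + 1) for mk in m)


def regret_rate(m, T):
    """Return sqrt(T * sum_k log(m_k + 1)); the Theta constant is omitted."""
    return math.sqrt(T * group_complexity(m))


def observed_arms(m, arm):
    """Return the indices of arms whose losses are revealed by pulling `arm`.

    Assumed convention (for illustration): arms are numbered
    0 .. sum(m) - 1, group by group, so group k holds m[k] consecutive arms.
    """
    start = 0
    for mk in m:
        if arm < start + mk:
            return list(range(start, start + mk))
        start += mk
    raise IndexError("arm index out of range")
```

With a single group, $\mathbf{m}=(N)$, every pull reveals all losses and the rate becomes $\sqrt{T\log(N+1)}$, the familiar order for learning with expert advice. With $K$ singleton groups, $\mathbf{m}=(1,\dots,1)$, only the pulled arm is observed and the rate becomes $\sqrt{TK\log 2}$, the familiar $\sqrt{KT}$ order for multi-armed bandits.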