SIC-MMAB: Synchronisation Involves Communication in Multiplayer Multi-Armed Bandits
We consider the stochastic multiplayer multi-armed bandit problem, where several players pull arms simultaneously and a collision occurs if the same arm is pulled by more than one player; this is a standard model of cognitive radio networks. We construct a decentralized algorithm that achieves the same performance as a centralized one, provided that players are synchronized and observe their collisions. We in fact construct a communication protocol between players by deliberately enforcing collisions, allowing them to share their exploration. With weaker feedback, when collisions are not observed, we still maintain some communication between players, but at the cost of an extra multiplicative term in the regret. We also prove that logarithmic growth of the regret is still achievable in the dynamic case, where players are not synchronized with each other, which prevents communication. Finally, we prove that if all players naively follow the celebrated UCB algorithm, the total regret grows linearly.
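To convey the intuition behind communication through enforced collisions, here is a minimal illustrative sketch (not the paper's SIC-MMAB protocol itself): a sender transmits a bit string to a receiver by either colliding with the receiver's arm (bit 1) or pulling a different arm (bit 0), and the receiver decodes each bit from its collision indicator. The function names `pull` and `send_bits` and the arm indices are hypothetical and chosen only for this example; reward dynamics are omitted.

```python
def pull(arm_choices):
    """Simulate one round: for each player, report whether a collision
    occurred on the arm they pulled (rewards are ignored here)."""
    counts = {}
    for arm in arm_choices:
        counts[arm] = counts.get(arm, 0) + 1
    return [counts[arm] > 1 for arm in arm_choices]

def send_bits(bits, receiver_arm, idle_arm):
    """Sender encodes each bit by colliding (1) or not (0) with the receiver,
    who keeps pulling its own arm and decodes the bit from the collision flag."""
    decoded = []
    for bit in bits:
        sender_choice = receiver_arm if bit == 1 else idle_arm
        collisions = pull([sender_choice, receiver_arm])
        decoded.append(1 if collisions[1] else 0)
    return decoded

if __name__ == "__main__":
    message = [1, 0, 1, 1, 0]  # e.g. a quantized empirical mean of an arm
    received = send_bits(message, receiver_arm=1, idle_arm=2)
    assert received == message
    print("receiver decoded:", received)
```

In this spirit, players can exchange quantized statistics of their exploration through the collision channel, which is what allows a decentralized algorithm to mimic a centralized one when players are synchronized and observe collisions.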