Make the Minority Great Again: First-Order Regret Bound for Contextual Bandits

02/09/2018
by   Zeyuan Allen-Zhu, et al.

Regret bounds in online learning compare the player's performance to L^*, the optimal performance in hindsight with a fixed strategy. Typically such bounds scale with the square root of the time horizon T. The more refined concept of a first-order regret bound replaces this with a √(L^*) scaling, which may be much smaller than √T. It is well known that minor variants of standard algorithms satisfy first-order regret bounds in the full-information and multi-armed bandit settings. In a COLT 2017 open problem, Agarwal, Krishnamurthy, Langford, Luo, and Schapire raised the issue that existing techniques do not seem sufficient to obtain first-order regret bounds for the contextual bandit problem. In the present paper, we resolve this open problem by presenting a new strategy based on augmenting the policy space.
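To make the distinction concrete, here is a minimal sketch of the quantities involved, assuming losses ℓ_t ∈ [0,1], observed contexts x_t, and a finite policy class Π (notation not spelled out in the abstract itself):

    % Cumulative loss of a fixed policy \pi over horizon T,
    % and the best achievable loss in hindsight
    L_T(\pi) = \sum_{t=1}^{T} \ell_t\bigl(\pi(x_t)\bigr),
    \qquad
    L^{\star} = \min_{\pi \in \Pi} L_T(\pi)

    % Regret of a player who selects action a_t at round t
    R_T = \sum_{t=1}^{T} \ell_t(a_t) - L^{\star}

    % Zeroth-order vs. first-order scaling of the regret
    R_T = \tilde{O}\bigl(\sqrt{T}\bigr)
    \quad \text{vs.} \quad
    R_T = \tilde{O}\bigl(\sqrt{L^{\star}}\bigr)

Since ℓ_t ∈ [0,1] implies L^* ≤ T, a first-order bound is never worse than a √T bound, and it improves substantially whenever the best policy in hindsight incurs small cumulative loss.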
