Non-stationary Linear Bandits Revisited
In this note, we revisit non-stationary linear bandits, a variant of stochastic linear bandits with a time-varying underlying regression parameter. Existing studies develop various algorithms and claim that they enjoy an O(T^{2/3}(1+P_T)^{1/3}) dynamic regret, where T is the time horizon and P_T is the path-length measuring the fluctuation of the evolving unknown parameter. However, we discover that a serious technical flaw makes this argument ungrounded. We revisit the analysis and present a fix: without modifying the original algorithms, we prove an O(T^{3/4}(1+P_T)^{1/4}) dynamic regret for them, slightly worse than the previously claimed rate. We also establish impossibility results for the key quantity involved in the regret analysis. Note that the above dynamic regret guarantee requires oracle knowledge of the path-length P_T; by further combining the bandit-over-bandit mechanism, we achieve the same guarantee in a parameter-free way.
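For reference, here is a minimal LaTeX sketch of the two quantities named above, written under the standard non-stationary linear bandit setup; the feasible set \mathcal{X}_t, the chosen action X_t, and the per-round parameter \theta_t are standard notation assumed here rather than spelled out in the abstract itself.

\documentclass{article}
\usepackage{amsmath}
\begin{document}
% Assumed setup: at round $t$ the learner picks an action $X_t$ from a
% feasible set $\mathcal{X}_t$, and the mean reward is the inner product
% $\langle X_t, \theta_t \rangle$ with an unknown, time-varying $\theta_t$.
Dynamic regret compares the learner against the per-round optimal action:
\[
  \mathrm{D\text{-}Regret}(T)
  = \sum_{t=1}^{T} \max_{x \in \mathcal{X}_t} \langle x, \theta_t \rangle
  - \sum_{t=1}^{T} \langle X_t, \theta_t \rangle .
\]
The path-length measures the total fluctuation of the parameter sequence:
\[
  P_T = \sum_{t=2}^{T} \lVert \theta_t - \theta_{t-1} \rVert_2 .
\]
\end{document}

With these definitions, a stationary environment has P_T = 0, while larger P_T permits more drift and thus a larger attainable dynamic regret, which is why the bounds above degrade as P_T grows.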