Least Square Value Iteration is Robust Under Locally Bounded Misspecification Error
The success of reinforcement learning relies heavily on function approximation of policies, values, or models, where misspecification (a mismatch between the ground truth and the best function approximator) naturally arises, especially when the ground truth is complex. Because the misspecification error does not vanish even with an infinite number of samples, designing algorithms that are robust under misspecification is of paramount importance. Recently, it was shown that policy-based approaches can be robust even when the policy function approximation suffers a large locally bounded misspecification error, under which the function class may have Ω(1) approximation error at certain states and actions but the error is small on average under a policy-induced state distribution; in contrast, value-based approaches are only known to learn effectively under a globally bounded misspecification error, i.e., when the approximation errors to the value functions have a uniform upper bound over all state-action pairs. It thus remained an open question whether similar robustness can be achieved with value-based approaches. In this paper, we answer this question affirmatively by showing that Least-Squares Value Iteration [Jin et al., 2020], equipped with a carefully designed exploration bonus, achieves robustness under a local misspecification error bound. In particular, we show that the algorithm achieves a regret bound of O(√(d^3KH^4) + dKH^2ζ), where d is the dimension of the linear features, H is the length of an episode, K is the total number of episodes, and ζ is the local bound on the misspecification error. Moreover, we show that the algorithm achieves the same regret bound without knowing ζ, and that it can serve as a robust policy-evaluation oracle to improve the sample complexity of policy-based approaches.
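For context, the sketch below illustrates the standard LSVI-UCB backup of Jin et al. [2020] that the abstract refers to: a ridge-regression fit of the Q-function from linear features plus an elliptical confidence bonus. This is only a minimal illustration under generic assumptions; the helper names (lsvi_ucb_backup, q_value) and the bonus coefficient beta are placeholders, and the paper's misspecification-robust bonus design is not reproduced here.

```python
import numpy as np

def lsvi_ucb_backup(features, rewards, next_values, lam=1.0):
    """One backward-induction step of LSVI at a fixed horizon step h.

    features:    (n, d) array of phi(s_h, a_h) for observed transitions
    rewards:     (n,)   observed rewards r_h
    next_values: (n,)   estimated V_{h+1}(s_{h+1}) for the same transitions
    Returns (w, Lambda): ridge-regression weights and the regularized Gram
    matrix used later for the exploration bonus.
    """
    n, d = features.shape
    Lambda = lam * np.eye(d) + features.T @ features        # regularized Gram matrix
    targets = rewards + next_values                          # regression targets r + V_{h+1}
    w = np.linalg.solve(Lambda, features.T @ targets)        # least-squares value fit
    return w, Lambda

def q_value(phi, w, Lambda, beta=1.0, H=1.0):
    """Optimistic Q estimate: linear fit plus an elliptical-confidence bonus, clipped at H."""
    bonus = beta * np.sqrt(phi @ np.linalg.solve(Lambda, phi))
    return min(phi @ w + bonus, H)
```

In this generic form, the bonus β·√(φᵀΛ⁻¹φ) drives optimism for linear MDPs; the paper's contribution lies in modifying this bonus so that the regret degrades gracefully with the locally bounded misspecification level ζ rather than with a global worst-case error.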