Model-Based Reinforcement Learning Is Minimax-Optimal for Offline Zero-Sum Markov Games

06/08/2022
by Yuling Yan, et al.

This paper makes progress towards learning Nash equilibria in two-player zero-sum Markov games from offline data. Specifically, consider a γ-discounted infinite-horizon Markov game with S states, in which the max-player has A actions and the min-player has B actions. We propose a pessimistic model-based algorithm with Bernstein-style lower confidence bounds, called VI-LCB-Game, that provably finds an ε-approximate Nash equilibrium with a sample complexity no larger than C⋆_clipped S(A+B) / ((1-γ)³ ε²) (up to some log factor). Here, C⋆_clipped is a unilateral clipped concentrability coefficient that reflects the coverage and distribution shift of the available data (vis-à-vis the target data), and the target accuracy ε can be any value within (0, 1/(1-γ)]. Our sample complexity bound strengthens prior art by a factor of min{A, B}, achieving minimax optimality for the entire ε-range. An appealing feature of our result lies in its algorithmic simplicity, which reveals that neither variance reduction nor sample splitting is necessary for achieving sample optimality.
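For intuition, the sketch below shows a generic pessimistic value iteration with Bernstein-style lower confidence bounds on a tabular zero-sum Markov game estimated from offline data, with the per-state matrix game solved by a standard linear program. It is a simplified illustration under stated assumptions, not the paper's exact VI-LCB-Game procedure: the penalty constants, the failure-probability parameter delta, the Q-value clipping, and the names r_hat, P_hat, N, pessimistic_vi, and matrix_game_value are all hypothetical choices made here for the example.

# A minimal, illustrative sketch of pessimistic value iteration with
# Bernstein-style lower confidence bounds for a tabular zero-sum Markov game.
# This is NOT the paper's exact VI-LCB-Game algorithm; the penalty constants,
# the clipping, and the handling of unvisited (s, a, b) triples are simplified
# assumptions for illustration only.
import numpy as np
from scipy.optimize import linprog

def matrix_game_value(M):
    """Value of the zero-sum matrix game M (max-player picks rows)."""
    A, B = M.shape
    # Variables: [x_1, ..., x_A, v]; maximize v  <=>  minimize -v.
    c = np.zeros(A + 1)
    c[-1] = -1.0
    # For every column b:  v - sum_a x_a * M[a, b] <= 0.
    A_ub = np.hstack([-M.T, np.ones((B, 1))])
    b_ub = np.zeros(B)
    # The max-player's mixed strategy x sums to one.
    A_eq = np.append(np.ones(A), 0.0).reshape(1, -1)
    b_eq = np.array([1.0])
    bounds = [(0, None)] * A + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds, method="highs")
    return res.x[-1]

def pessimistic_vi(r_hat, P_hat, N, gamma, delta=0.01, n_iter=200):
    """
    r_hat : (S, A, B) empirical rewards from the offline data
    P_hat : (S, A, B, S) empirical transition kernel
    N     : (S, A, B) visitation counts
    """
    S, A, B, _ = P_hat.shape
    V = np.zeros(S)
    log_term = np.log(S * A * B / delta)
    for _ in range(n_iter):
        EV = P_hat @ V                         # (S, A, B) expected next value
        VarV = P_hat @ (V ** 2) - EV ** 2      # empirical variance of V
        # Bernstein-style lower-confidence penalty (illustrative constants).
        penalty = np.sqrt(2 * np.maximum(VarV, 0.0) * log_term / np.maximum(N, 1)) \
                  + log_term / ((1 - gamma) * np.maximum(N, 1))
        Q = np.clip(r_hat + gamma * EV - penalty, 0.0, 1.0 / (1 - gamma))
        # One value-iteration step: solve a zero-sum matrix game at each state.
        V = np.array([matrix_game_value(Q[s]) for s in range(S)])
    return V

Note that each iteration only rebuilds the pessimistic Q-function and solves S small matrix games, with all updates computed on the same empirical model; this reflects the abstract's point that neither variance reduction nor sample splitting is needed.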
