Model-Based Reinforcement Learning Is Minimax-Optimal for Offline Zero-Sum Markov Games
This paper makes progress towards learning Nash equilibria in two-player zero-sum Markov games from offline data. Specifically, consider a γ-discounted infinite-horizon Markov game with S states, where the max-player has A actions and the min-player has B actions. We propose a pessimistic model-based algorithm with Bernstein-style lower confidence bounds, called VI-LCB-Game, that provably finds an ε-approximate Nash equilibrium with a sample complexity no larger than C^⋆_clipped S(A+B) / ((1-γ)^3 ε^2) (up to some log factor). Here, C^⋆_clipped is a unilateral clipped concentrability coefficient that reflects the coverage and distribution shift of the available data (vis-à-vis the target data), and the target accuracy ε can be any value in (0, 1/(1-γ)]. Our sample complexity bound improves upon prior art by a factor of min{A,B}, achieving minimax optimality over the entire ε-range. An appealing feature of our result lies in its algorithmic simplicity, which reveals that neither variance reduction nor sample splitting is necessary to achieve sample optimality.
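To make the high-level description concrete, the sketch below illustrates the general recipe of pessimistic model-based value iteration for a zero-sum Markov game: build an empirical model from the offline data, subtract a Bernstein-style penalty from the Q-estimates, and solve a matrix game at every state. This is only a minimal illustrative sketch, not the paper's VI-LCB-Game: the penalty constant `c_b`, the handling of unvisited state-action pairs, the number of iterations, and the restriction to the max-player's pessimistic value are all simplifying assumptions, and the function names `vi_lcb_game_sketch` and `matrix_game_value` are hypothetical.

```python
import numpy as np
from scipy.optimize import linprog


def matrix_game_value(M):
    """Value and max-player strategy of the zero-sum matrix game M (A x B),
    where the max-player chooses rows and the min-player chooses columns."""
    A, B = M.shape
    # Variables: x (row mixture, length A) and v (game value); maximize v.
    c = np.concatenate([np.zeros(A), [-1.0]])
    A_ub = np.hstack([-M.T, np.ones((B, 1))])   # v <= x^T M[:, b] for every column b
    b_ub = np.zeros(B)
    A_eq = np.concatenate([np.ones(A), [0.0]])[None, :]  # x sums to one
    b_eq = np.array([1.0])
    bounds = [(0, None)] * A + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1], res.x[:A]


def vi_lcb_game_sketch(N, R, gamma, num_iters=200, c_b=1.0):
    """Pessimistic (lower-confidence-bound) value iteration for the max-player.

    N : (S, A, B, S) array of transition counts from the offline dataset.
    R : (S, A, B) array of empirical mean rewards in [0, 1].
    """
    S, A, B, _ = N.shape
    n = np.maximum(N.sum(axis=-1), 1)            # visit counts, floored at 1
    P_hat = N / n[..., None]                     # empirical transition model
    Vmax = 1.0 / (1.0 - gamma)
    V = np.zeros(S)
    for _ in range(num_iters):
        PV = P_hat @ V                           # (S, A, B) expected next-state value
        var = np.maximum(P_hat @ V**2 - PV**2, 0.0)  # empirical variance of V
        # Bernstein-style penalty; constants are placeholders, not the paper's.
        b = c_b * (np.sqrt(var / n) + Vmax / n)
        Q_low = np.clip(R + gamma * PV - b, 0.0, Vmax)
        V = np.array([matrix_game_value(Q_low[s])[0] for s in range(S)])
    policy = np.array([matrix_game_value(Q_low[s])[1] for s in range(S)])
    return V, policy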