Multi-Agent Off-Policy TD Learning: Finite-Time Analysis with Near-Optimal Sample Complexity and Communication Complexity

by   Ziyi Chen, et al.

The finite-time convergence of off-policy TD learning has been comprehensively studied recently. However, such a type of convergence has not been well established for off-policy TD learning in the multi-agent setting, which covers broader applications and is fundamentally more challenging. This work develops two decentralized TD with correction (TDC) algorithms for multi-agent off-policy TD learning under Markovian sampling. In particular, our algorithms preserve full privacy of the actions, policies and rewards of the agents, and adopt mini-batch sampling to reduce the sampling variance and communication frequency. Under Markovian sampling and linear function approximation, we proved that the finite-time sample complexity of both algorithms for achieving an ϵ-accurate solution is in the order of 𝒪(ϵ^-1lnϵ^-1), matching the near-optimal sample complexity of centralized TD(0) and TDC. Importantly, the communication complexity of our algorithms is in the order of 𝒪(lnϵ^-1), which is significantly lower than the communication complexity 𝒪(ϵ^-1lnϵ^-1) of the existing decentralized TD(0). Experiments corroborate our theoretical findings.


page 1

page 2

page 3

page 4

∙ 09/08/2021

Sample and Communication-Efficient Decentralized Actor-Critic Algorithms with Finite-Time Analysis

Actor-critic (AC) algorithms have been widely adopted in decentralized m...
∙ 04/27/2020

Improving Sample Complexity Bounds for Actor-Critic Algorithms

The actor-critic (AC) algorithm is a popular method to find an optimal p...
∙ 03/15/2012

Rollout Sampling Policy Iteration for Decentralized POMDPs

We present decentralized rollout sampling policy iteration (DecRSPI) - a...
∙ 01/05/2021

Neurosymbolic Transformers for Multi-Agent Communication

We study the problem of inferring communication structures that can solv...
∙ 03/05/2023

PRECISION: Decentralized Constrained Min-Max Learning with Low Communication and Sample Complexities

Recently, min-max optimization problems have received increasing attenti...
∙ 11/10/2020

Sample Complexity Bounds for Two Timescale Value-based Reinforcement Learning Algorithms

Two timescale stochastic approximation (SA) has been widely used in valu...
∙ 12/06/2018

Finite-Sample Analyses for Fully Decentralized Multi-Agent Reinforcement Learning

Despite the increasing interest in multi-agent reinforcement learning (M...

Please sign up or login with your details

Forgot password? Click here to reset