Generalized Munchausen Reinforcement Learning using Tsallis KL Divergence

01/27/2023
by Lingwei Zhu, et al.

Many policy optimization approaches in reinforcement learning incorporate a Kullback-Leibler (KL) divergence to the previous policy, to prevent the policy from changing too quickly. This idea was initially proposed in a seminal paper on Conservative Policy Iteration, with approximations given by algorithms like TRPO and Munchausen Value Iteration (MVI). We continue this line of work by investigating a generalized KL divergence, called the Tsallis KL divergence, which uses the q-logarithm in its definition. The approach is a strict generalization, as q = 1 corresponds to the standard KL divergence; q > 1 provides a range of new options. We characterize the types of policies learned under the Tsallis KL, and motivate when q > 1 could be beneficial. To obtain a practical algorithm that incorporates Tsallis KL regularization, we extend MVI, which is one of the simplest approaches to incorporate KL regularization. We show that this generalized MVI(q) obtains significant improvements over the standard MVI(q = 1) across 35 Atari games.
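
As a concrete illustration of the quantities the abstract refers to, the sketch below computes the q-logarithm, ln_q(x) = (x^(1-q) - 1) / (1 - q) for q ≠ 1 (which recovers ln x as q → 1), and a Tsallis KL divergence between two discrete policies built from it. This is a minimal sketch of one common convention for the Tsallis KL, not necessarily the exact formulation used in the paper; the function names and the example policies are hypothetical and chosen only for illustration.

```python
import numpy as np

def q_log(x, q):
    """q-logarithm: ln_q(x) = (x^(1-q) - 1) / (1 - q); recovers ln(x) as q -> 1."""
    if np.isclose(q, 1.0):
        return np.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

def tsallis_kl(pi, mu, q):
    """One common form of the Tsallis KL divergence between discrete
    distributions pi and mu: E_pi[-ln_q(mu / pi)].
    At q = 1 this reduces to the standard KL divergence KL(pi || mu)."""
    pi, mu = np.asarray(pi, float), np.asarray(mu, float)
    return float(np.sum(pi * -q_log(mu / pi, q)))

# Hypothetical example: two policies over three actions.
pi = [0.7, 0.2, 0.1]
mu = [0.5, 0.3, 0.2]
print(tsallis_kl(pi, mu, q=1.0))  # matches standard KL(pi || mu)
print(tsallis_kl(pi, mu, q=2.0))  # q > 1 weights probability ratios differently
```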

research · 02/11/2021 · Optimization Issues in KL-Constrained Approximate Policy Iteration
Many reinforcement learning algorithms can be seen as versions of approx...

research · 02/16/2023 · Aligning Language Models with Preferences through f-divergence Minimization
Aligning language models with preferences can be posed as approximating ...

research · 05/27/2021 · Optimistic Reinforcement Learning by Forward Kullback-Leibler Divergence Optimization
This paper addresses a new interpretation of reinforcement learning (RL)...

research · 05/01/2015 · Volumetric Bias in Segmentation and Reconstruction: Secrets and Solutions
Many standard optimization methods for segmentation and reconstruction c...

research · 07/17/2021 · Greedification Operators for Policy Optimization: Investigating Forward and Reverse KL Divergences
Approximate Policy Iteration (API) algorithms alternate between (approxi...

research · 12/29/2017 · f-Divergence constrained policy improvement
To ensure stability of learning, state-of-the-art generalized policy ite...

research · 07/16/2021 · Geometric Value Iteration: Dynamic Error-Aware KL Regularization for Reinforcement Learning
The recent booming of entropy-regularized literature reveals that Kullba...
