Mutual-Information Regularization in Markov Decision Processes and Actor-Critic Learning

09/11/2019
by Felix Leibfried, et al.

Cumulative entropy regularization introduces a regulatory signal into the reinforcement learning (RL) problem that encourages policies with high-entropy actions, which is equivalent to enforcing small deviations from a uniform reference marginal policy. This has been shown to improve exploration and robustness and to mitigate value overestimation, and it leads to significant performance gains in both tabular and high-dimensional settings, as demonstrated by algorithms such as soft Q-learning (SQL) and soft actor-critic (SAC). Cumulative entropy regularization has been extended to optimize over the reference marginal policy instead of keeping it fixed, yielding a regularization that minimizes the mutual information between states and actions. While this was initially proposed for Markov Decision Processes (MDPs) in tabular settings, it was recently shown that a similar principle leads to significant improvements over vanilla SQL in high-dimensional domains with discrete actions and function approximators. Here, we follow the motivation of mutual-information regularization from an inference perspective and theoretically analyze the corresponding Bellman operator. Inspired by this Bellman operator, we devise a novel mutual-information regularized actor-critic learning (MIRACLE) algorithm for continuous action spaces that optimizes over the reference marginal policy. We empirically validate MIRACLE in the MuJoCo robotics simulator, where we demonstrate that it can compete with contemporary RL methods. Most notably, it can improve over the model-free state-of-the-art SAC algorithm, which implicitly assumes a fixed reference policy.
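To make the regularizer concrete, the following is a sketch in standard notation rather than the paper's own; the temperature β, the state weighting d^π, and the reference marginal ρ are our labels. Entropy regularization penalizes the KL divergence of the policy from a fixed uniform reference ρ (which equals maximizing entropy up to a constant); mutual-information regularization additionally optimizes ρ, whose optimum is the marginal action distribution, so the expected penalty becomes the mutual information between states and actions:

```latex
% Sketch in our own notation (not verbatim from the paper):
% KL-regularized return, the induced soft Bellman operator for a fixed
% reference marginal rho, and the optimal reference marginal.
\begin{align}
  J(\pi,\rho) &= \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}
      \Big(r(s_t,a_t) - \tfrac{1}{\beta}\,
      \mathrm{KL}\big(\pi(\cdot\mid s_t)\,\big\|\,\rho\big)\Big)\right] \\
  (\mathcal{T}_{\rho} V)(s) &= \tfrac{1}{\beta}\log\sum_{a}\rho(a)\,
      \exp\!\Big(\beta\big(r(s,a) + \gamma\,
      \mathbb{E}_{s'\sim p(\cdot\mid s,a)}[V(s')]\big)\Big) \\
  \rho^{\star}(a) &= \sum_{s} d^{\pi}(s)\,\pi(a\mid s),
  \qquad
  \mathbb{E}_{s\sim d^{\pi}}\big[\mathrm{KL}\big(\pi(\cdot\mid s)\,\big\|\,\rho^{\star}\big)\big]
  = I(S;A)
\end{align}
```

A minimal tabular sketch of the resulting alternating scheme is shown below, assuming a uniform state weighting and a particular update order; the paper's exact scheme may differ:

```python
# Hypothetical tabular sketch of mutual-information regularized value iteration:
# alternate soft value iteration under a reference marginal rho with
# re-fitting rho to the marginal of the induced policy.
import numpy as np

def mi_regularized_value_iteration(R, P, gamma=0.99, beta=5.0, iters=200):
    """R: rewards of shape (S, A); P: transitions of shape (S, A, S), rows sum to 1."""
    S, A = R.shape
    V = np.zeros(S)
    rho = np.full(A, 1.0 / A)          # reference marginal over actions (starts uniform)
    d = np.full(S, 1.0 / S)            # state weighting; uniform for simplicity
    for _ in range(iters):
        Q = R + gamma * (P @ V)                      # (S, A) state-action values
        M = beta * Q + np.log(rho)                   # log of rho(a) * exp(beta * Q)
        m = M.max(axis=1, keepdims=True)
        V = (m[:, 0] + np.log(np.exp(M - m).sum(axis=1))) / beta  # soft Bellman backup
        pi = np.exp(M - beta * V[:, None])           # closed-form optimal policy given rho
        pi /= pi.sum(axis=1, keepdims=True)          # renormalize for numerical safety
        rho = d @ pi                                 # optimal reference = marginal policy
    return V, pi, rho

# Example usage on a random MDP:
# rng = np.random.default_rng(0)
# P = rng.dirichlet(np.ones(5), size=(5, 3))   # (S=5, A=3, S=5)
# R = rng.normal(size=(5, 3))
# V, pi, rho = mi_regularized_value_iteration(R, P)
```

As β grows, the KL penalty vanishes and the backup approaches standard value iteration; as β shrinks, the policy is pulled toward the reference marginal ρ.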

Related research

- 06/02/2020: Diversity Actor-Critic: Sample-Aware Entropy Regularization for Sample-Efficient Exploration. Policy entropy regularization is commonly used for better exploration in...
- 06/04/2017: Actor-Critic for Linearly-Solvable Continuous MDP with Partially Known Dynamics. In many robotic applications, some aspects of the system dynamics can be...
- 03/02/2019: A Unified Framework for Regularized Reinforcement Learning. We propose and study a general framework for regularized Markov decision...
- 06/17/2019: Hierarchical Soft Actor-Critic: Adversarial Exploration via Mutual Information Optimization. We describe a novel extension of soft actor-critics for hierarchical Dee...
- 11/11/2019: Real-Time Reinforcement Learning. Markov Decision Processes (MDPs), the mathematical framework underlying ...
- 11/28/2021: Count-Based Temperature Scheduling for Maximum Entropy Reinforcement Learning. Maximum Entropy Reinforcement Learning (MaxEnt RL) algorithms such as So...
- 10/28/2021: Temporal-Difference Value Estimation via Uncertainty-Guided Soft Updates. Temporal-Difference (TD) learning methods, such as Q-Learning, have prov...
