Extreme Q-Learning: MaxEnt RL without Entropy

01/05/2023
by Divyansh Garg, et al.

Modern Deep Reinforcement Learning (RL) algorithms require estimates of the maximal Q-value, which are difficult to compute in continuous domains with an infinite number of possible actions. In this work, we introduce a new update rule for online and offline RL that directly models the maximal value using Extreme Value Theory (EVT), drawing inspiration from economics. By doing so, we avoid computing Q-values for out-of-distribution actions, which is often a substantial source of error. Our key insight is to introduce an objective that directly estimates the optimal soft-value functions (LogSumExp) in the maximum entropy (MaxEnt) RL setting without needing to sample from a policy. Using EVT, we derive our Extreme Q-Learning framework and, from it, online and, for the first time, offline MaxEnt Q-learning algorithms that do not explicitly require access to a policy or its entropy. Our method obtains consistently strong performance on the D4RL benchmark, outperforming prior works by 10+ points on some tasks, while offering moderate improvements over SAC and TD3 on online DM Control tasks.
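The abstract does not spell out the objective, so the following is only a minimal sketch of how a LogSumExp-style soft value can be estimated by regression on samples alone. It assumes an exponential-linear (Gumbel-regression-style) loss, mean(exp(z) - z - 1) with z = (q - v) / beta, whose minimizer over a scalar v is beta * log(mean(exp(q / beta))). The function names, the scalar setting, and the temperature parameter `beta` are illustrative assumptions, not the paper's code.

```python
import numpy as np


def soft_value_closed_form(q, beta):
    """Return beta * log(mean(exp(q / beta))), computed in a numerically stable way."""
    q = np.asarray(q, dtype=np.float64)
    q_max = q.max()
    return q_max + beta * np.log(np.mean(np.exp((q - q_max) / beta)))


def gumbel_loss(v, q, beta):
    """Assumed exponential-linear loss: mean(exp(z) - z - 1) with z = (q - v) / beta."""
    z = (np.asarray(q, dtype=np.float64) - v) / beta
    return np.mean(np.exp(z) - z - 1.0)


def fit_soft_value(q, beta, steps=200):
    """Minimize gumbel_loss over a scalar v by gradient descent.

    Only the sampled q values are used; no policy is sampled or evaluated.
    """
    q = np.asarray(q, dtype=np.float64)
    v = q.max()  # start at the max so every exp((q - v) / beta) is at most 1
    for _ in range(steps):
        # d/dv of the loss is (1 - mean(exp((q - v) / beta))) / beta
        grad = (1.0 - np.mean(np.exp((q - v) / beta))) / beta
        v -= beta ** 2 * grad  # step size beta^2 matches the curvature bound for v >= optimum
    return v


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q_samples = rng.normal(size=1000)  # stand-in for Q-values of in-dataset actions
    for beta in (1.0, 0.5):
        fitted = fit_soft_value(q_samples, beta)
        closed = soft_value_closed_form(q_samples, beta)
        print(f"beta={beta}: fitted={fitted:.6f} closed_form={closed:.6f}")
```

In this toy setting the fitted value lies between the mean and the maximum of the sampled Q-values and moves toward the maximum as `beta` shrinks, which is the sense in which a regression of this kind can stand in for an explicit maximization over actions.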

research
03/14/2023

Adaptive Policy Learning for Offline-to-Online Reinforcement Learning

Conventional reinforcement learning (RL) needs an environment to collect...
research
06/12/2021

A Minimalist Approach to Offline Reinforcement Learning

Offline reinforcement learning (RL) defines the task of learning from a ...
research
10/20/2020

Iterative Amortized Policy Optimization

Policy networks are a central feature of deep reinforcement learning (RL...
research
12/29/2022

Offline Policy Optimization in RL with Variance Regularizaton

Learning policies from fixed offline datasets is a key challenge to scal...
research
02/28/2023

The In-Sample Softmax for Offline Reinforcement Learning

Reinforcement learning (RL) agents can leverage batches of previously co...
research
02/12/2021

Q-Value Weighted Regression: Reinforcement Learning with Limited Data

Sample efficiency and performance in the offline setting have emerged as...
research
10/28/2020

Learning to Unknot

We introduce natural language processing into the study of knot theory, ...
