The In-Sample Softmax for Offline Reinforcement Learning

02/28/2023
by   Chenjun Xiao, et al.
0

Reinforcement learning (RL) agents can leverage batches of previously collected data to extract a reasonable control policy. An emerging issue in this offline RL setting, however, is that the bootstrapping update underlying many of our methods suffers from insufficient action-coverage: standard max operator may select a maximal action that has not been seen in the dataset. Bootstrapping from these inaccurate values can lead to overestimation and even divergence. There are a growing number of methods that attempt to approximate an in-sample max, that only uses actions well-covered by the dataset. We highlight a simple fact: it is more straightforward to approximate an in-sample softmax using only actions in the dataset. We show that policy iteration based on the in-sample softmax converges, and that for decreasing temperatures it approaches the in-sample max. We derive an In-Sample Actor-Critic (AC), using this in-sample softmax, and show that it is consistently better or comparable to existing offline RL methods, and is also well-suited to fine-tuning.

READ FULL TEXT

page 9

page 19

research
10/02/2021

BRAC+: Improved Behavior Regularized Actor Critic for Offline Reinforcement Learning

Online interactions with the environment to collect data samples for tra...
research
01/30/2023

Importance Weighted Actor-Critic for Optimal Conservative Offline Reinforcement Learning

We propose A-Crab (Actor-Critic Regularized by Average Bellman error), a...
research
06/09/2023

In-Sample Policy Iteration for Offline Reinforcement Learning

Offline reinforcement learning (RL) seeks to derive an effective control...
research
08/16/2021

Implicitly Regularized RL with Implicit Q-Values

The Q-function is a central quantity in many Reinforcement Learning (RL)...
research
12/02/2018

Revisiting the Softmax Bellman Operator: Theoretical Properties and Practical Benefits

The softmax function has been primarily employed in reinforcement learni...
research
07/21/2020

EMaQ: Expected-Max Q-Learning Operator for Simple Yet Effective Offline and Online RL

Off-policy reinforcement learning (RL) holds the promise of sample-effic...
research
01/05/2023

Extreme Q-Learning: MaxEnt RL without Entropy

Modern Deep Reinforcement Learning (RL) algorithms require estimates of ...

Please sign up or login with your details

Forgot password? Click here to reset