Sample-Efficient Reinforcement Learning with Maximum Entropy Mellowmax Episodic Control

11/21/2019
by Marta Sarrico, et al.

Deep networks have enabled reinforcement learning to scale to more complex and challenging domains, but these methods typically require large quantities of training data. An alternative is to use sample-efficient episodic control methods: neuro-inspired algorithms which use non-/semi-parametric models that predict values based on storing and retrieving previously experienced transitions. One way to further improve the sample efficiency of these approaches is to use more principled exploration strategies. In this work, we therefore propose maximum entropy mellowmax episodic control (MEMEC), which samples actions according to a Boltzmann policy with a state-dependent temperature. We demonstrate that MEMEC outperforms other uncertainty- and softmax-based exploration methods on classic reinforcement learning environments and Atari games, achieving both more rapid learning and higher final rewards.
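To make the exploration mechanism concrete, below is a minimal sketch (not the authors' released code) of a maximum entropy mellowmax action-selection step: the mellowmax operator yields a per-state target value, a state-dependent inverse temperature is found by root-finding, and an action is sampled from the resulting Boltzmann distribution. The omega setting, the root-finding bracket, the use of SciPy's brentq solver, and the example Q-values are illustrative assumptions, not details taken from the paper.

```python
import numpy as np
from scipy.optimize import brentq


def mellowmax(q_values: np.ndarray, omega: float) -> float:
    """Mellowmax of the action values: a log-mean-exp scaled by omega."""
    q_max = np.max(q_values)  # shift for numerical stability
    return q_max + np.log(np.mean(np.exp(omega * (q_values - q_max)))) / omega


def max_entropy_mellowmax_policy(q_values: np.ndarray, omega: float) -> np.ndarray:
    """Boltzmann policy whose state-dependent temperature is chosen so that
    the policy's expected action value matches the mellowmax value."""
    advantages = q_values - mellowmax(q_values, omega)

    def residual(beta: float) -> float:
        # The maximum-entropy Boltzmann policy consistent with the mellowmax
        # value satisfies sum_a exp(beta * adv_a) * adv_a = 0.
        return np.dot(np.exp(beta * advantages), advantages)

    beta = brentq(residual, -100.0, 100.0)  # bracket is an illustrative assumption
    logits = beta * advantages
    probs = np.exp(logits - np.max(logits))
    return probs / probs.sum()


# Usage: sample an action from the estimated (e.g. episodic-memory) Q-values
# of a single state; the numbers here are placeholders.
q = np.array([1.0, 1.2, 0.8])
pi = max_entropy_mellowmax_policy(q, omega=5.0)
action = np.random.choice(len(q), p=pi)
```

The state dependence comes from solving for a separate inverse temperature at every state, so exploration stays broad where action values are close and becomes greedier where one action clearly dominates.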
