Learning from demonstrations with SACR2: Soft Actor-Critic with Reward Relabeling

by Jesus Bujalance Martin, et al.

During recent years, deep reinforcement learning (DRL) has made successful incursions into complex decision-making applications such as robotics, autonomous driving or video games. However, a well-known caveat of DRL algorithms is their sample inefficiency: they require huge amounts of data to converge. Off-policy algorithms tend to be more sample-efficient, and can additionally benefit from any off-policy data stored in the replay buffer. Expert demonstrations are a popular source for such data: the agent is exposed to successful states and actions early on, which can accelerate the learning process and improve performance. In the past, multiple ideas have been proposed to make good use of the demonstrations in the buffer, such as pretraining on demonstrations only or minimizing additional cost functions. We carry out a study to evaluate several of these ideas in isolation, to see which of them have the most significant impact. We also present a new method, based on a reward bonus given to demonstrations and successful episodes. First, we give a reward bonus to the transitions coming from demonstrations to encourage the agent to match the demonstrated behaviour. Then, upon collecting a successful episode, we relabel its transitions with the same bonus before adding them to the replay buffer, encouraging the agent to also match its previous successes. The base algorithm for our experiments is the popular Soft Actor-Critic (SAC), a state-of-the-art off-policy algorithm for continuous action spaces. Our experiments focus on robotics, specifically on a reaching task for a robotic arm in simulation. We show that our method SACR2 based on reward relabeling improves the performance on this task, even in the absence of demonstrations.
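The reward-relabeling step described above can be sketched in a few lines: every transition of a demonstration or of a successful agent episode gets a fixed bonus added to its reward before being stored in the replay buffer. This is a minimal illustration, not the authors' code; the buffer class, function names, and the bonus value are assumptions for the sake of the example.

```python
from collections import deque
import random

class ReplayBuffer:
    """Minimal FIFO replay buffer storing (state, action, reward, next_state, done)."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

def relabel_and_store(episode, buffer, success, bonus=1.0):
    """Store an episode, adding a reward bonus to every transition when the
    episode is a demonstration or a successful rollout (the SACR2 idea).

    episode -- list of (state, action, reward, next_state, done) tuples
    success -- True for demonstrations and successful agent episodes
    bonus   -- illustrative value; in practice this is a tuned hyperparameter
    """
    for (s, a, r, s2, d) in episode:
        if success:
            r = r + bonus  # encourage matching demonstrated / previously successful behaviour
        buffer.add((s, a, r, s2, d))

# Usage: a two-step episode that reached the goal gets its rewards relabeled.
buf = ReplayBuffer()
episode = [((0.0,), (0.1,), 0.0, (0.1,), False),
           ((0.1,), (0.2,), 1.0, (0.3,), True)]
relabel_and_store(episode, buf, success=True, bonus=1.0)
```

The SAC update itself is unchanged; relabeling only modifies the rewards of transitions as they enter the buffer, which is why the method plugs into any off-policy algorithm.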




