Maximizing the Total Reward via Reward Tweaking

02/09/2020
by Chen Tessler, et al.

In reinforcement learning, the discount factor γ controls the agent's effective planning horizon. Traditionally, this parameter was considered part of the MDP; however, as deep reinforcement learning algorithms tend to become unstable when the effective planning horizon is long, recent works treat γ as a hyper-parameter. In this work, we focus on the finite-horizon setting and introduce reward tweaking. Reward tweaking learns a surrogate reward function r̃ for the discounted setting that induces a policy which is optimal with respect to the (undiscounted) return of the original finite-horizon task. Theoretically, we show that there exists a surrogate reward that leads to optimality in the original task, and we discuss the robustness of our approach. Additionally, we perform experiments in a high-dimensional continuous control task and show that reward tweaking guides the agent towards better long-horizon returns when it plans over short horizons using the tweaked reward.
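As a rough sketch of the objective described in the abstract (the notation below is mine, not taken from the paper): given a finite-horizon task with horizon T, reward r, and a discount factor γ used for training stability, reward tweaking seeks a surrogate reward r̃ such that a policy that is optimal for the γ-discounted surrogate objective is also optimal for the original undiscounted return:

\pi^{\star}_{\tilde r, \gamma} \in \arg\max_{\pi} \; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{T-1} \gamma^{t}\, \tilde r(s_t, a_t)\right]
\qquad \text{such that} \qquad
\pi^{\star}_{\tilde r, \gamma} \in \arg\max_{\pi} \; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{T-1} r(s_t, a_t)\right].

In other words, the agent still trains with a short effective planning horizon (small γ) on r̃, but the behavior it converges to maximizes the full undiscounted finite-horizon return of the original task.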
