Aligning Agent Policy with Externalities: Reward Design via Bilevel RL

by Souradip Chakraborty, et al.

In reinforcement learning (RL), a reward function is often assumed at the outset of a policy optimization procedure. Learning under such a fixed reward can neglect important policy optimization considerations, such as state space coverage and safety. Moreover, it can fail to account for broader impacts in terms of social welfare, sustainability, or market stability, potentially leading to undesirable emergent behavior and a misaligned policy. To formalize the problem of aligning RL policy optimization with such externalities, we consider a bilevel optimization problem and connect it to a principal-agent framework, where the principal specifies the broader goals and constraints of the system at the upper level and the agent solves a Markov Decision Process (MDP) at the lower level. The upper level learns a suitable reward parametrization corresponding to the broader goals, and the lower level learns the policy for the agent. We propose Principal driven Policy Alignment via Bilevel RL (PPA-BRL), which efficiently aligns the policy of the agent with the principal's goals. We explicitly analyze the dependence of the principal's trajectory on the lower-level policy and prove the convergence of PPA-BRL to the stationary point of the problem. We illustrate the merits of this framework for alignment with several examples spanning energy-efficient manipulation tasks, social welfare-based tax design, and cost-effective robotic navigation.
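
As a reading aid, the bilevel structure described above can be sketched in a generic form. The notation here is assumed for illustration and is not taken from the paper: \nu denotes the reward parameters chosen by the principal, r_{\nu} the parametrized reward, G a hypothetical upper-level objective encoding the principal's goals, and the lower level is written as an infinite-horizon discounted MDP with discount factor \gamma, which is itself an assumption.

```latex
% Generic bilevel sketch (assumed notation, not the paper's exact formulation)
\max_{\nu} \; G\bigl(\nu, \pi^{*}(\nu)\bigr)
\quad \text{s.t.} \quad
\pi^{*}(\nu) \in \arg\max_{\pi} \;
\mathbb{E}_{\pi}\Bigl[\, \sum_{t=0}^{\infty} \gamma^{t}\, r_{\nu}(s_t, a_t) \Bigr]
```

In this sketch the upper level depends on \nu both directly and through the agent's best-response policy \pi^{*}(\nu), which is the dependence the abstract says is analyzed explicitly.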




