Joint Inference of Reward Machines and Policies for Reinforcement Learning

by Zhe Xu et al.

Incorporating high-level knowledge is an effective way to expedite reinforcement learning (RL), especially for complex tasks with sparse rewards. We investigate an RL problem where the high-level knowledge is in the form of reward machines, i.e., a type of Mealy machine that encodes the reward functions. We focus on a setting in which this knowledge is a priori not available to the learning agent. We develop an iterative algorithm that performs joint inference of reward machines and policies for RL (more specifically, q-learning). In each iteration, the algorithm maintains a hypothesis reward machine and a sample of RL episodes. It derives q-functions from the current hypothesis reward machine and performs RL to update the q-functions. While performing RL, the algorithm updates the sample by adding RL episodes along which the obtained rewards are inconsistent with the rewards predicted by the current hypothesis reward machine. In the next iteration, the algorithm infers a new hypothesis reward machine from the updated sample. Based on an equivalence relation we define between the states of reward machines, we transfer the q-functions between the hypothesis reward machines in consecutive iterations. We prove that the proposed algorithm converges almost surely to an optimal policy in the limit if a minimal reward machine can be inferred and the maximal length of each RL episode is sufficiently long. The experiments show that learning high-level knowledge in the form of reward machines can lead to fast convergence to optimal policies in RL, while standard RL methods such as q-learning and hierarchical RL methods fail to converge to optimal policies after a substantial number of training steps in many tasks.
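The iterative loop described above can be sketched in a few lines. The following is a minimal, illustrative Python sketch, not the authors' implementation: the toy environment, the self-loop default for missing transitions, and all names are assumptions. It shows the two ingredients the abstract describes: q-learning over the product of environment state and reward-machine state, and the consistency check that flags counterexample episodes (the re-inference of a new hypothesis machine from counterexamples is not sketched).

```python
import random
from collections import defaultdict

class RewardMachine:
    """Hypothesis reward machine: a Mealy machine whose transitions map
    (state, label) -> (next_state, reward). Missing entries self-loop
    with reward 0 (an assumption of this sketch)."""
    def __init__(self, transitions, initial=0):
        self.transitions = transitions
        self.initial = initial

    def step(self, u, label):
        return self.transitions.get((u, label), (u, 0.0))

def consistent(rm, trace):
    """Check whether the rewards observed along a trace of (label, reward)
    pairs match the rewards the hypothesis machine predicts. Inconsistent
    traces are the counterexamples that trigger re-inference."""
    u = rm.initial
    for label, true_r in trace:
        u, rm_r = rm.step(u, label)
        if rm_r != true_r:
            return False
    return True

def run_episode(rm, env_reset, env_step, q, actions,
                eps=0.2, alpha=0.5, gamma=0.9, max_len=25):
    """One epsilon-greedy q-learning episode over (env state, RM state)
    pairs; returns the (label, reward) trace for the consistency check."""
    s, u = env_reset(), rm.initial
    trace = []
    for _ in range(max_len):
        if random.random() < eps:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda b: q[(u, s, b)])
        s2, label, true_r, done = env_step(s, a)
        u2, rm_r = rm.step(u, label)
        best = 0.0 if done else max(q[(u2, s2, b)] for b in actions)
        # The q-update uses the reward *predicted* by the hypothesis machine.
        q[(u, s, a)] += alpha * (rm_r + gamma * best - q[(u, s, a)])
        trace.append((label, true_r))
        s, u = s2, u2
        if done:
            break
    return trace

# Toy single-state environment (illustrative): action 1 emits label "a"
# with reward 1 and ends the episode; action 0 emits "b" with reward 0.
def env_reset():
    return 0

def env_step(s, a):
    if a == 1:
        return 0, "a", 1.0, True
    return 0, "b", 0.0, False

rm = RewardMachine({(0, "a"): (1, 1.0)})
q = defaultdict(float)
random.seed(0)
trace = run_episode(rm, env_reset, env_step, q, actions=[0, 1])
# In the full algorithm, any trace with consistent(rm, trace) == False
# would be added to the sample and a new hypothesis machine inferred.
```

The consistency check is what drives the iteration: as long as every observed trace agrees with the hypothesis machine, q-learning proceeds unchanged; the first disagreement becomes evidence for inferring a better machine.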




