Distribution Estimation in Discounted MDPs via a Transformation

04/16/2018
by Shuai Ma, et al.

Although the general deterministic reward function in MDPs takes three arguments (current state, action, and next state), it is often simplified to a function of two arguments (current state and action). The former is called a transition-based reward function, whereas the latter is called a state-based reward function. When the objective is a function of the expected cumulative reward only, this simplification works perfectly. However, when the objective is risk-sensitive, e.g., when it depends on the reward distribution, this simplification leads to incorrect values of the objective. This paper studies the estimation of the distribution of the cumulative discounted reward in infinite-horizon MDPs with finite state and action spaces. First, taking the Value-at-Risk (VaR) objective as an example, we illustrate and analyze the error that the above simplification introduces into the reward distribution. Next, we propose a transformation for MDPs that preserves the reward distribution and converts transition-based reward functions into deterministic state-based reward functions. This transformation works whether the transition-based reward function is deterministic or stochastic. Lastly, we show how to estimate the reward distribution after applying the proposed transformation in different settings, provided that the distribution is approximately normal.
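The abstract's transformation can be pictured with the standard state-augmentation idea: record the transition that generated the reward inside the new state, so the reward becomes a function of the (augmented) current state alone. The Python sketch below is a minimal illustration of that idea for a deterministic transition-based reward, together with a VaR estimate under the normal approximation mentioned above. The function names (augment_mdp, var_normal), the array layout of P and r, and the convention that the reward is collected on entering the augmented state are assumptions for illustration; the paper's actual construction may differ (e.g., in how it handles stochastic rewards or the discount timing, which here shifts the return by one factor of gamma).

    import numpy as np
    from scipy.stats import norm


    def augment_mdp(P, r, s0):
        """Sketch: convert a transition-based reward r(s, a, s') into a
        state-based reward by augmenting the state with the last transition.

        P : array (S, A, S), P[s, a, s2] = transition probability
        r : array (S, A, S), r[s, a, s2] = deterministic transition reward
        s0: index of the initial state

        Returns the augmented kernel, a state-based reward vector (reward is
        collected on entering a state), and the augmented initial-state index.
        """
        S, A, _ = P.shape
        # Augmented states: a dummy start state plus every triple (s, a, s2).
        states = [("start", s0)] + [(s, a, s2) for s in range(S)
                                    for a in range(A) for s2 in range(S)]
        index = {st: i for i, st in enumerate(states)}
        n = len(states)

        P_aug = np.zeros((n, A, n))
        r_aug = np.zeros(n)

        for i, st in enumerate(states):
            # The "physical" state we currently occupy.
            cur = s0 if st[0] == "start" else st[2]
            if st[0] != "start":
                # The reward of the generating transition is now state-based.
                r_aug[i] = r[st[0], st[1], st[2]]
            for a in range(A):
                for s2 in range(S):
                    P_aug[i, a, index[(cur, a, s2)]] = P[cur, a, s2]
        return P_aug, r_aug, index[("start", s0)]


    def var_normal(mean, std, alpha=0.05):
        """VaR of the discounted return under a normal approximation:
        the alpha-quantile of N(mean, std**2)."""
        return norm.ppf(alpha, loc=mean, scale=std)

Because each augmented state remembers the triple that produced it, the distribution of the cumulative discounted reward is preserved (up to the one-step timing convention noted above), which is what allows distribution-sensitive objectives such as VaR to be evaluated on the state-based model.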


