Nonparametric General Reinforcement Learning

11/28/2016
by Jan Leike, et al.

Reinforcement learning (RL) problems are often phrased in terms of Markov decision processes (MDPs). In this thesis we go beyond MDPs and consider RL in environments that are non-Markovian, non-ergodic and only partially observable. Our focus is not on practical algorithms, but rather on the fundamental underlying problems: How do we balance exploration and exploitation? How do we explore optimally? When is an agent optimal? We follow the nonparametric realizable paradigm.

We establish negative results on Bayesian RL agents, in particular AIXI. We show that unlucky or adversarial choices of the prior cause the agent to misbehave drastically. Therefore Legg-Hutter intelligence and balanced Pareto optimality, which depend crucially on the choice of the prior, are entirely subjective. Moreover, in the class of all computable environments every policy is Pareto optimal. This undermines all existing optimality properties for AIXI.

However, there are Bayesian approaches to general RL that satisfy objective optimality guarantees: We prove that Thompson sampling is asymptotically optimal in stochastic environments in the sense that its value converges to the value of the optimal policy. We connect asymptotic optimality to regret given a recoverability assumption on the environment that allows the agent to recover from mistakes. Hence Thompson sampling achieves sublinear regret in these environments.

Our results culminate in a formal solution to the grain of truth problem: A Bayesian agent acting in a multi-agent environment learns to predict the other agents' policies if its prior assigns positive probability to them (the prior contains a grain of truth). We construct a large but limit computable class containing a grain of truth and show that agents based on Thompson sampling over this class converge to play Nash equilibria in arbitrary unknown computable multi-agent environments.
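To make the Thompson sampling scheme concrete, below is a minimal sketch in Python, assuming a toy setting rather than the thesis' construction: a uniform prior over a small finite class of candidate environments (two-armed Bernoulli bandits instead of the limit computable class), resampling from the posterior once per epoch, acting optimally for the sampled environment, and updating the posterior with the observed rewards. The names Bandit, thompson_sampling, and epoch_len are illustrative, not from the thesis.

    import random

    class Bandit:
        """One candidate environment: a two-armed bandit with fixed success probabilities."""
        def __init__(self, probs):
            self.probs = probs

        def optimal_arm(self):
            # Optimal policy for a known bandit: always pull the best arm.
            return max(range(len(self.probs)), key=lambda a: self.probs[a])

        def likelihood(self, arm, reward):
            # Probability this environment assigns to the observed reward.
            return self.probs[arm] if reward == 1 else 1.0 - self.probs[arm]

    def thompson_sampling(true_env, env_class, epochs=200, epoch_len=10):
        # Uniform prior; the true environment is assumed to be in env_class,
        # i.e. the prior contains a grain of truth.
        posterior = [1.0 / len(env_class)] * len(env_class)
        for _ in range(epochs):
            # Sample one environment from the posterior ...
            sampled = random.choices(env_class, weights=posterior)[0]
            # ... and follow its optimal policy for one (effective-horizon) epoch.
            arm = sampled.optimal_arm()
            for _ in range(epoch_len):
                reward = 1 if random.random() < true_env.probs[arm] else 0
                # Bayesian update with the observed reward.
                posterior = [w * env.likelihood(arm, reward)
                             for w, env in zip(posterior, env_class)]
                total = sum(posterior)
                posterior = [w / total for w in posterior]
        return posterior

    envs = [Bandit([0.2, 0.8]), Bandit([0.7, 0.3])]
    print(thompson_sampling(true_env=envs[0], env_class=envs))

Because the true environment is in the class, the posterior concentrates on it, so the value of the sampled policy approaches the optimal value; this is the toy analogue of the asymptotic optimality result stated in the abstract, which the thesis proves for general stochastic environments.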

Related research

A Formal Solution to the Grain of Truth Problem (09/16/2016)
A Bayesian agent acting in a multi-agent environment learns to predict t...

Thompson Sampling is Asymptotically Optimal in General Environments (02/25/2016)
We discuss a variant of Thompson sampling for nonparametric reinforcemen...

Achieving Fairness in Multi-Agent Markov Decision Processes Using Reinforcement Learning (06/01/2023)
Fairness plays a crucial role in various multi-agent systems (e.g., comm...

Bad Universal Priors and Notions of Optimality (10/16/2015)
A big open question of algorithmic information theory is the choice of t...

Self-Optimizing and Pareto-Optimal Policies in General Environments based on Bayes-Mixtures (04/17/2002)
The problem of making sequential decisions in unknown probabilistic envi...

On Value Functions and the Agent-Environment Boundary (05/30/2019)
When function approximation is deployed in reinforcement learning (RL), ...

Grounding Aleatoric Uncertainty in Unsupervised Environment Design (07/11/2022)
Adaptive curricula in reinforcement learning (RL) have proven effective ...
