Human-in-the-loop: Provably Efficient Preference-based Reinforcement Learning with General Function Approximation

by Xiaoyu Chen, et al.

We study human-in-the-loop reinforcement learning (RL) with trajectory preferences, where instead of receiving a numeric reward at each step, the agent only receives preferences over trajectory pairs from a human overseer. The goal of the agent is to learn the optimal policy that is most preferred by the human overseer. Despite empirical successes, the theoretical understanding of preference-based RL (PbRL) has been limited to the tabular case. In this paper, we propose the first optimistic model-based algorithm for PbRL with general function approximation, which estimates the model using value-targeted regression and computes exploratory policies by solving an optimistic planning problem. Our algorithm achieves a regret of Õ(poly(dH)√K), where d is a complexity measure of the transition and preference models depending on the Eluder dimension and log-covering numbers, H is the planning horizon, K is the number of episodes, and Õ(·) omits logarithmic terms. Our lower bound indicates that our algorithm is near-optimal when specialized to the linear setting. Furthermore, we extend the PbRL problem by formulating a novel problem called RL with n-wise comparisons, and provide the first sample-efficient algorithm for this new setting. To the best of our knowledge, this is the first theoretical result for PbRL with (general) function approximation.
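The trajectory-preference feedback described above can be illustrated with a small Bradley-Terry-style sketch. This is an assumption for illustration only: the paper's preference model is more general, and the names `preference_prob` and `sample_preference` are hypothetical, not from the paper. The overseer prefers trajectory 1 over trajectory 2 with probability given by a sigmoid of the gap between their latent utilities.

```python
import math
import random

def preference_prob(utility_1: float, utility_2: float) -> float:
    """Bradley-Terry probability that trajectory 1 is preferred
    over trajectory 2, given their latent cumulative utilities.
    (Illustrative assumption; the paper allows more general
    preference models over trajectory pairs.)"""
    return 1.0 / (1.0 + math.exp(-(utility_1 - utility_2)))

def sample_preference(utility_1: float, utility_2: float,
                      rng: random.Random) -> int:
    """Simulate one binary preference signal from the overseer:
    returns 1 if trajectory 1 is preferred, else 0. This is the
    only feedback the agent observes -- no per-step rewards."""
    return 1 if rng.random() < preference_prob(utility_1, utility_2) else 0

# Equal utilities give a 50/50 preference; a large utility gap
# makes trajectory 1 almost surely preferred.
print(preference_prob(1.0, 1.0))   # 0.5
print(preference_prob(5.0, 0.0) > 0.99)
```

The agent only ever sees the binary outcome of `sample_preference`, which is what makes the setting harder than standard RL with observable rewards.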


Provably Efficient Reinforcement Learning via Surprise Bound

Value function approximation is important in modern reinforcement learni...

Efficient RL with Impaired Observability: Learning to Act with Delayed and Missing State Observations

In real-world reinforcement learning (RL) systems, various forms of impa...

Provably Efficient Iterated CVaR Reinforcement Learning with Function Approximation

Risk-sensitive reinforcement learning (RL) aims to optimize policies tha...

Accommodating Picky Customers: Regret Bound and Exploration Complexity for Multi-Objective Reinforcement Learning

In this paper we consider multi-objective reinforcement learning where t...

On the Theory of Reinforcement Learning with Once-per-Episode Feedback

We study a theory of reinforcement learning (RL) in which the learner re...

Bayes-Optimal Entropy Pursuit for Active Choice-Based Preference Learning

We analyze the problem of learning a single user's preferences in an act...

Nearly Optimal Policy Optimization with Stable at Any Time Guarantee

Policy optimization methods are one of the most widely used classes of R...
