Wasserstein Distributionally Robust Policy Evaluation and Learning for Contextual Bandits

by Yi Shen et al.

Off-policy evaluation and learning are concerned with assessing a given policy and learning an optimal policy from offline data, without direct interaction with the environment. Often, the environment in which the data are collected differs from the environment in which the learned policy is applied. To account for the effect of different environments during learning and execution, distributionally robust optimization (DRO) methods have been developed that compute worst-case bounds on policy values, assuming that the distribution of the new environment lies within an uncertainty set. Typically, this uncertainty set is defined by the KL divergence around the empirical distribution computed from the logging dataset. However, the KL uncertainty set fails to encompass distributions with differing support and is unaware of the geometry of the distribution's support. As a result, KL approaches fall short in addressing practical environment mismatches and lead to overfitting to worst-case scenarios. To overcome these limitations, we propose a novel DRO approach that employs the Wasserstein distance instead. While Wasserstein DRO is generally more computationally expensive than KL DRO, we present a regularized method and a practical (biased) stochastic gradient descent method to optimize the policy efficiently. We also provide a theoretical analysis of the finite-sample complexity and iteration complexity of our proposed method. We further validate our approach using a public dataset recorded in a randomized stroke trial.
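To make the worst-case bound concrete, the sketch below evaluates the standard Kantorovich dual of a Wasserstein DRO lower bound on a toy one-dimensional problem: inf over distributions within Wasserstein radius eps of the empirical distribution equals sup over lam >= 0 of [-lam*eps + (1/n) * sum_i min_z (v(z) + lam*|z - x_i|)]. This is a generic illustration of Wasserstein DRO, not the paper's regularized or SGD-based algorithm; the function names, the grid search over lam, and the discrete candidate set are all assumptions made for illustration.

```python
import numpy as np

def wdro_lower_bound(contexts, value_fn, candidates, eps, lam_grid):
    """Worst-case (lower) bound on the mean of value_fn over all
    distributions within Wasserstein-1 radius eps of the empirical
    distribution of `contexts`, via the dual:
        sup_{lam>=0} -lam*eps + (1/n) sum_i min_z [ v(z) + lam*|z - x_i| ].
    The inner minimum is taken over a finite candidate set, and the
    outer supremum over a finite grid of lam values (both illustrative
    simplifications)."""
    vals = np.array([value_fn(z) for z in candidates])  # v(z) per candidate
    best = -np.inf
    for lam in lam_grid:
        inner = np.array([
            np.min(vals + lam * np.abs(candidates - x)) for x in contexts
        ])
        best = max(best, -lam * eps + inner.mean())
    return best

# Toy usage: v(z) = z, logged contexts at 0, 1, 2 (empirical mean 1.0).
contexts = np.array([0.0, 1.0, 2.0])
candidates = np.linspace(-1.0, 3.0, 401)
lam_grid = np.linspace(0.0, 100.0, 201)
b0 = wdro_lower_bound(contexts, lambda z: z, candidates, 0.0, lam_grid)
b1 = wdro_lower_bound(contexts, lambda z: z, candidates, 0.5, lam_grid)
```

With eps = 0 the bound recovers the empirical mean, and enlarging the ball can only lower it (here, shifting mass left by 0.5 gives 0.5), matching the intuition that the Wasserstein set covers distributions with shifted support, which a KL ball around the empirical distribution cannot.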



