Multi-action Offline Policy Learning with Bayesian Optimization
We study an offline multi-action policy learning algorithm based on doubly robust estimators from causal inference, using the class of argmax-linear policies. For general policy classes, we relate the regret bound to a higher-dimensional generalization of the VC dimension and specialize this result to prove optimal regret bounds for the argmax-linear class. We also study several optimization approaches to the non-smooth, non-convex problem associated with the argmax-linear class, including convex relaxation, softmax relaxation, and Bayesian optimization. We find that Bayesian optimization with the Gradient-based Adaptive Stochastic Search (GASS) algorithm consistently outperforms the convex relaxation in terms of policy value and is much faster than the softmax relaxation. Finally, we apply the algorithms to a simulated dataset and the warfarin dosing dataset. On the warfarin dataset, the offline algorithm trained on only a subset of the features achieves state-of-the-art accuracy.
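As a rough illustration of the setup described above (not the paper's implementation), the sketch below shows a doubly robust value estimate for a softmax-relaxed argmax-linear policy; the function name, argument layout, and the use of pre-fitted outcome and propensity models are assumptions made for this example.

```python
import numpy as np

def dr_policy_value(theta, X, actions, rewards, mu_hat, e_hat, tau=0.1):
    """Doubly robust estimate of the value of a softmax-relaxed argmax-linear policy.

    theta   : (d, K) weights; the hard policy picks argmax_a of theta[:, a] @ x
    X       : (n, d) contexts
    actions : (n,) observed action indices in {0, ..., K-1}
    rewards : (n,) observed outcomes
    mu_hat  : (n, K) outcome-model predictions, mu_hat[i, a] ~ E[Y | X_i, a]
    e_hat   : (n,) estimated propensities of the observed actions
    tau     : softmax temperature (tau -> 0 recovers the argmax policy)
    """
    scores = (X @ theta) / tau                       # (n, K) linear scores
    scores -= scores.max(axis=1, keepdims=True)      # numerical stability
    pi = np.exp(scores)
    pi /= pi.sum(axis=1, keepdims=True)              # softmax policy probabilities

    n = X.shape[0]
    obs = np.arange(n)
    dm_term = (pi * mu_hat).sum(axis=1)              # direct-method (outcome model) part
    ips_corr = pi[obs, actions] / e_hat * (rewards - mu_hat[obs, actions])
    return float(np.mean(dm_term + ips_corr))        # doubly robust policy value
```

With an estimator of this form, the policy parameters `theta` can be optimized by any of the approaches mentioned above (convex relaxation, gradient ascent on the softmax relaxation, or Bayesian optimization such as GASS treating the estimate as a black-box objective).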