IST Lunch Bunch

Tuesday, November 6, 2018, 12:00 PM

Off-policy Evaluation and Learning in Theory and in the Wild

Speaker: Yu-Xiang Wang, Computer Science Department, UC Santa Barbara
Location: Annenberg 105

The talk considers the problem of offline policy learning for automated decision systems under the contextual bandit model. The aim is to evaluate the performance of a given policy (a decision algorithm) and to learn a better policy from logged historical data consisting of contexts, actions, rewards, and the probabilities of the actions taken. This generalizes the Average Treatment Effect (ATE) estimation problem and brings an interesting new set of desiderata to consider.
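
To make the logged-data setting concrete, here is a minimal sketch of inverse propensity scoring (IPS), a standard baseline for off-policy evaluation. It is an illustration only, not necessarily the estimator presented in the talk; the function name `ips_value` and the toy numbers are hypothetical.

```python
import numpy as np

def ips_value(rewards, logging_probs, target_probs):
    """Estimate the value of a target policy from logged bandit data.

    rewards[i]       : reward observed for the logged action in round i
    logging_probs[i] : probability the logging policy gave to that action
    target_probs[i]  : probability the target policy gives to that action
    Assumes every logging probability is strictly positive.
    """
    rewards = np.asarray(rewards, dtype=float)
    weights = np.asarray(target_probs, dtype=float) / np.asarray(logging_probs, dtype=float)
    # Reweight each logged reward by how much more (or less) likely the
    # target policy is to take the logged action, then average.
    return float(np.mean(weights * rewards))

# Toy logged data (hypothetical values, for illustration only).
rewards = [1.0, 0.0, 1.0, 1.0]
logging_probs = [0.5, 0.5, 0.25, 0.25]
target_probs = [1.0, 0.0, 0.5, 0.0]
estimate = ips_value(rewards, logging_probs, target_probs)
```

The estimate is unbiased when the logged probabilities are correct and positive, but its variance grows when the target policy favors actions the logging policy rarely took.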

In the first part of the talk, I will compare and contrast off-policy evaluation and ATE estimation and clarify how different assumptions change the corresponding minimax risk in estimating the "causal effect". In addition, I will talk about how one can achieve significantly better finite-sample performance than asymptotically optimal estimators through the SWITCH estimator.
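
As a rough sketch of the idea behind a SWITCH-style estimator: use importance weighting where the weight is small, and fall back on a learned reward model where the weight exceeds a threshold. The code below is written under that understanding and is not taken from the talk. The name `switch_value`, the threshold `tau`, the array layout, and the toy numbers are all assumptions, and it assumes the logging probabilities are known and positive for every action.

```python
import numpy as np

def switch_value(actions, rewards, logging_probs, target_probs, reward_model, tau):
    """Sketch of a SWITCH-style off-policy value estimate.

    actions       : (n,)   index of the logged action in each round
    rewards       : (n,)   observed reward for the logged action
    logging_probs : (n, K) logging-policy probability of every action
    target_probs  : (n, K) target-policy probability of every action
    reward_model  : (n, K) predicted reward for every action (assumed given)
    tau           : importance weights above tau switch to the reward model
    """
    actions = np.asarray(actions)
    rewards = np.asarray(rewards, dtype=float)
    logging_probs = np.asarray(logging_probs, dtype=float)
    target_probs = np.asarray(target_probs, dtype=float)
    reward_model = np.asarray(reward_model, dtype=float)

    weights = target_probs / logging_probs           # (n, K) importance weights
    rows = np.arange(len(actions))
    w_logged = weights[rows, actions]                 # weight of the logged action

    # Importance-weighted term, kept only where the weight is at most tau.
    ips_part = np.where(w_logged <= tau, w_logged * rewards, 0.0)
    # Model-based term over the actions whose weight exceeds tau.
    dm_part = np.sum(np.where(weights > tau, target_probs * reward_model, 0.0), axis=1)
    return float(np.mean(ips_part + dm_part))

# Toy data (hypothetical values, for illustration only).
actions = [0, 1]
rewards = [1.0, 0.0]
logging_probs = [[0.5, 0.5], [0.5, 0.5]]
target_probs = [[1.0, 0.0], [0.5, 0.5]]
reward_model = [[0.8, 0.2], [0.6, 0.4]]

pure_ips = switch_value(actions, rewards, logging_probs, target_probs, reward_model, tau=np.inf)
pure_model = switch_value(actions, rewards, logging_probs, target_probs, reward_model, tau=0.0)
```

With `tau` set to infinity the sketch reduces to plain importance weighting, and with `tau` set to zero it reduces to the purely model-based estimate; intermediate values trade variance against model bias.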

In the second part of the talk, I will discuss off-policy learning in the real world. I will highlight some real-world challenges, including missing logging probabilities, confounding variables (Simpson's paradox), and model misspecification. We will demonstrate that a commonly used naive approach, direct cross-entropy minimization, implicitly optimizes a causal objective without requiring knowledge of the probabilities of the actions taken. We then propose policy imitation, which can be used both as a regularizer and as a test for confounders or model misspecification.

Series: IST Lunch Bunch

Contact: Diane Goodfellow