IST Lunch Bunch
Off-policy Evaluation and Learning in Theory and in the Wild
The talk considers the problem of offline policy learning for automated decision systems under the contextual bandits model, where we aim at evaluating the performance of a given policy (a decision algorithm) and also learning a better policy using logged historical data consisting of context, actions, rewards and probabilities of the actions taken. This is a generalization of the Average Treatment Effect (ATE) estimation problem and has some interesting new set of desiderata to consider.
In the first part of the talk, I will compare and contrast off-policy evaluation and ATE estimation and clarify how different assumptions change the corresponding minimax risk in estimating the ``causal effect''. In addition, I will talk about how one can achieve significantly better finite sample performance than asymptotically optimal estimators through the SWITCH estimator.
In the second part of the talk, I will talk about off-policy learning in the real world. I will highlight some of the real world challenges include: missing logging probability, confounding variables (Simpson's paradox) and model misspecification. We will demonstrate that a commonly-used naive approach of direct cross-entropy minimization is implicitly optimizing a causal objective without requiring us to know the probabilities of taking actions. Then we propose policy imitation, which can be used as a regularization and as a test of whether there are confounders or model-misspecification.
Contact: Diane Goodfellow email@example.com