Learning from logged bandit feedback
Many of the most impactful applications of machine learning are not just about prediction but about putting learning systems in control of selecting the right action at the right time (e.g., search engines, recommender systems, or automated trading platforms). These systems are both producers and users of data: the logs of the selected actions and their outcomes (e.g., derived from clicks, ratings, or revenue) can provide valuable training data for learning the next generation of the system, giving rise to some of the biggest datasets we have collected. Machine learning in these settings is challenging, since the system in operation biases the log data through the actions it selects, and outcomes remain unknown for the actions not taken. Learning methods must therefore reason about how changes to the system will affect future outcomes. We will summarize recent advances in these counterfactual learning techniques and demonstrate how deep neural networks can be trained in these settings (ICLR'18).
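To make the counterfactual reasoning concrete, here is a minimal sketch (my own illustration, not the method from the talk) of the standard inverse propensity scoring (IPS) estimator, which corrects for the logging policy's selection bias: each log entry records the context, the chosen action, the propensity with which the logging policy chose it, and the observed reward, and reweighting by the inverse propensity gives an unbiased estimate of a different policy's expected reward. The function and variable names below are illustrative.

```python
import numpy as np

def ips_estimate(log, target_prob):
    """Estimate the target policy's expected reward from data
    logged under a different (logging) policy, via IPS."""
    # Each entry: (context x, logged action a, propensity p = pi0(a|x), reward r).
    weighted = [target_prob(a, x) / p * r for (x, a, p, r) in log]
    return sum(weighted) / len(weighted)

# Synthetic example: the logging policy picks one of two actions uniformly
# (propensity 0.5); action 1 yields reward 1, action 0 yields reward 0.
# Only the chosen action's reward is ever observed -- the bandit feedback setting.
rng = np.random.default_rng(0)
log = []
for _ in range(10_000):
    x = None                     # context unused in this toy example
    a = int(rng.integers(0, 2))  # logging policy: uniform over {0, 1}
    r = float(a == 1)            # observed reward of the chosen action only
    log.append((x, a, 0.5, r))

# Evaluate a target policy that always plays action 1 (true value = 1.0).
est = ips_estimate(log, lambda a, x: 1.0 if a == 1 else 0.0)
```

The same importance-weighted objective can serve as a training loss, which is the starting point for training neural networks from logged bandit feedback.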
Joint work with Thorsten Joachims and Maarten de Rijke.