DOLCIT Seminar
Deep reinforcement learning (RL) has shown promising results for learning complex sequential decision-making behaviors in a variety of environments. However, most successes have been confined to simulation, and results in real-world applications such as robotics remain limited, largely due to the poor sample efficiency of typical deep RL algorithms. In this talk, I will present methods that improve the sample efficiency of these algorithms, blurring the boundaries among classic model-based RL, off-policy model-free RL, and on-policy model-free RL. The first part of the talk will discuss Q-Prop, a control variate technique for policy gradients that combines on-policy and off-policy learning, covering both its empirical results and its theoretical variance reduction. The second part of the talk focuses on temporal difference models (TDMs), an extension of goal-conditioned value functions that enables model-based planning at multiple temporal resolutions. TDMs generalize traditional predictive models, bridge the gap between model-based and off-policy model-free RL, and empirically lead to substantial improvements in sample efficiency with a vectorized implementation.
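To make the Q-Prop idea more concrete, below is a minimal NumPy sketch of a Q-Prop-style gradient estimator for a fixed-variance linear-Gaussian policy. This is not the implementation from the talk: the toy quadratic critic, the synthetic batch, and the names (q_critic, qprop_gradient, eta) are illustrative assumptions. The sketch only shows the structure of the estimator, which combines the likelihood-ratio policy gradient with the critic's first-order Taylor expansion around the mean action as a control variate, plus the corresponding analytic correction term.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy setup: fixed-variance Gaussian policy a ~ N(theta @ s, sigma^2 I).
    d_state, d_action, sigma = 4, 2, 0.5
    theta = rng.normal(scale=0.1, size=(d_action, d_state))  # policy mean parameters
    W = rng.normal(size=(d_action, d_state))                 # parameters of a toy "off-policy" critic

    def q_critic(s, a):
        # Toy critic Q_w(s, a) = -||a - W s||^2, standing in for a learned off-policy Q-function.
        return -np.sum((a - W @ s) ** 2)

    def dq_da(s, a):
        # Analytic action-gradient of the toy critic.
        return -2.0 * (a - W @ s)

    def qprop_gradient(states, actions, advantages, eta=1.0):
        # Q-Prop-style estimator: likelihood-ratio gradient with the critic's first-order
        # Taylor expansion (around the mean action) as a control variate, plus the
        # analytic, deterministic-policy-style correction term.
        grad = np.zeros_like(theta)
        for s, a, adv in zip(states, actions, advantages):
            mu = theta @ s
            g_q = dq_da(s, mu)                         # dQ/da evaluated at the mean action
            a_bar = g_q @ (a - mu)                     # linearized critic "advantage" (control variate)
            score = np.outer((a - mu) / sigma**2, s)   # grad_theta log pi(a | s)
            grad += score * (adv - eta * a_bar)        # residual Monte Carlo term
            grad += eta * np.outer(g_q, s)             # analytic correction term
        return grad / len(states)

    # Synthetic on-policy batch: states, actions sampled from the policy, and crude advantages.
    states = rng.normal(size=(64, d_state))
    actions = np.array([theta @ s + sigma * rng.normal(size=d_action) for s in states])
    returns = np.array([q_critic(s, a) + 0.1 * rng.normal() for s, a in zip(states, actions)])
    advantages = returns - returns.mean()

    theta = theta + 1e-2 * qprop_gradient(states, actions, advantages)

With eta set to zero the estimator reduces to the ordinary likelihood-ratio policy gradient; with eta set to one the off-policy critic contributes both a variance-reducing control variate and an analytic gradient term, which is the trade-off the talk's first part examines.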
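The TDM part can likewise be sketched with a tabular toy example. The snippet below is an illustration under heavy assumptions, not the method presented in the talk: the 1-D chain environment, the replay buffer, and the variable names are made up for exposition. It shows the core ingredients: a Q-function conditioned on a goal and a remaining horizon tau, a terminal reward equal to the negative distance to the goal when tau reaches zero, and relabeling of goals and horizons so that every off-policy transition can be reused.

    import numpy as np

    rng = np.random.default_rng(0)

    # Minimal tabular TDM-style sketch on a 1-D chain: states 0..N-1, actions move left/right.
    N, moves, max_tau = 10, (-1, +1), 5
    Q = np.zeros((N, len(moves), N, max_tau + 1))   # Q(state, action, goal, remaining horizon)

    def step(s, a_idx):
        # Deterministic chain dynamics, clipped at the ends.
        return int(np.clip(s + moves[a_idx], 0, N - 1))

    # Transitions from a random behavior policy; goals and horizons are relabeled during training,
    # which is what lets every off-policy transition be reused for every (goal, horizon) pair.
    replay = [(s, a, step(s, a))
              for s, a in zip(rng.integers(N, size=2000), rng.integers(len(moves), size=2000))]

    alpha = 0.5
    for _ in range(20):                          # a few sweeps over the buffer
        for s, a_idx, s_next in replay:
            g = rng.integers(N)                  # goal relabeling
            tau = rng.integers(max_tau + 1)      # horizon relabeling
            if tau == 0:
                target = -abs(s_next - g)        # terminal reward: negative distance to the goal
            else:
                target = Q[s_next, :, g, tau - 1].max()
            Q[s, a_idx, g, tau] += alpha * (target - Q[s, a_idx, g, tau])

    # Query: greedy action from state 0 toward goal 9 with 5 steps remaining.
    print(np.argmax(Q[0, :, 9, 5]))

In the work described in the talk, the same horizon-conditioned Bellman backup is applied with function approximation in continuous state and action spaces, and the learned TDM is then used for planning; this tabular version only illustrates the update and the goal/horizon relabeling.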