Entity Abstraction in Visual Model-Based Reinforcement Learning

February 28, 2020, 3:00 pm to 5:00 pm
Michael Chang

Many robotic tasks are specified in terms of objects and their relationships, so a learner that represents a physical scene as distinct objects and relations should enjoy broader generalization and more efficient planning than one that computes a monolithic global encoding of its sensory observations. However, for robots that learn from raw video, grounding object representations in raw pixels without domain-specific supervision is difficult, so much work in end-to-end robotic learning has relied on monolithic global encodings, which require less domain-specific supervision but produce entangled representations that make generalization and planning less straightforward. In particular, a seemingly trivial but important property we desire in our models is that they transfer what they learn about a single object across different contexts: the functions that govern the appearance and dynamics of an object should not be affected by the number or configuration of other objects in the scene. This kind of generalization is hard to enforce with monolithic global encodings, because these encodings generally entangle information about all objects together.

In this talk, I will present a model that combines the advantages of representing objects and relations with the advantages of learning from raw data: it operates on a factorized latent representation that enforces the variable isolation needed to transfer knowledge about a single object across different contexts, yet it learns entirely end-to-end from raw pixels, without any manual specification of what the objects are or what the factors in the representation should correspond to. It learns to ground the latent factors in actual objects in the visual observation by executing an inference procedure on a probabilistic entity-factorized dynamic latent variable model.
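To make the variable-isolation idea concrete, below is a minimal sketch (in NumPy, and not the model from the talk or the paper) of an entity-factorized dynamics step: one update function is shared across all entity slots, and each slot sees the others only through a summed pairwise interaction term, so the rule for a single entity is unaffected by how many other entities are in the scene. The names `pairwise_effect` and `dynamics_step`, the linear weights, and the latent dimension are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

# Sketch of an entity-factorized dynamics step (illustrative, not the paper's model).
# The same function updates every entity slot, and other entities enter only
# through a permutation-invariant sum of pairwise effects, so the per-entity
# update rule does not depend on the number or configuration of other entities.

rng = np.random.default_rng(0)
D = 8                                        # latent dimension per entity (assumed)
W_pair = 0.1 * rng.normal(size=(2 * D, D))   # pairwise interaction weights (illustrative)
W_self = 0.1 * rng.normal(size=(2 * D, D))   # per-entity update weights (illustrative)

def pairwise_effect(z_i, z_j):
    """Effect of entity j on entity i (hypothetical linear-then-tanh interaction)."""
    return np.tanh(np.concatenate([z_i, z_j]) @ W_pair)

def dynamics_step(z):
    """Apply the shared per-entity dynamics to every slot of z, shape (K, D)."""
    K = z.shape[0]
    z_next = np.empty_like(z)
    for i in range(K):
        # Sum interactions from all other entities; the sum is permutation-invariant.
        interaction = np.zeros(D)
        for j in range(K):
            if j != i:
                interaction += pairwise_effect(z[i], z[j])
        z_next[i] = np.tanh(np.concatenate([z[i], interaction]) @ W_self)
    return z_next

# The same learned function handles scenes with different numbers of entities.
print(dynamics_step(rng.normal(size=(3, D))).shape)   # (3, 8)
print(dynamics_step(rng.normal(size=(7, D))).shape)   # (7, 8)
```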

I will show, on both simulated block-stacking problems and real robot interaction video, that many desirable properties for physical modeling and planning, such as temporal continuity, handling of occlusion, unsupervised segmentation, and symbolic grounding, emerge in some form as direct consequences of the model's architecture and the algorithm we use to train it, suggesting a promising direction for better generalization and planning in real-world robot learning.

For more information, please see: https://arxiv.org/abs/1910.12827.

Singleton Auditorium, 46-3002