Yoshua Bengio’s Talk at NeurIPS 2019 (video by Lex Fridman)

0:00 - Introduction

0:56 - The State of Deep Learning

2:20 - System 1 and System 2

4:09 - What’s currently missing in deep learning

8:56 - Out-of-distribution generalization

12:16 - Compositionality

16:08 - Contrast with Symbolic AI

18:11 - Attention and Consciousness

27:07 - Consciousness prior

32:17 - Meta-learning

39:09 - Operating on sets of objects

41:21 - Recap

44:48 - Question: moral implications of building machines that are conscious

46:18 - Question: Integrated information theory

47:17 - Question: Spatial prior

48:28 - Question: Symbolic AI

50:51 - Question: What is a data distribution?

52:05 - Question: measuring progress

52:52 - Question: causality

53:34 - Question: relation and associative memory

Summary

State of Deep Learning

  • Bigger models (a "bigger brain"), more data, faster computers.

System 1 vs System 2

  • System 1: fast, unconscious.

  • System 2: slow, algorithmic, conscious.

  • 1 -> 2: high-level semantic representations, compositionality, causality.

System 2 Roadmap

  • Changes in distribution over time; attention and consciousness.

  • Consciousness prior: sparse factor graph, meta-learning, localized change hypothesis -> causal discovery.

  • Compositional DL architectures: operating on sets of objects and their properties.

Change in Distribution

  • IID generalization: nature does not shuffle the data, and neither should we; shuffling destroys the information about changes in distribution.

  • OOD generalization: a learning agent needs to generalize to distributions other than the one it was trained on, because its environment is non-stationary.

  • Compositionality of architectures helps both IID and OOD generalization.

  • Systematic generalization: dynamically recombining existing concepts as other agents are added and the environment changes, including its boundaries.

Consciousness and Attention

  • Focus on one or a few elements at a time.

  • Content-based soft attention.

  • Attention is like an internal action of the agent.

  • Keys keep track of which object is attended to, independent of ordering, as in Transformer architectures.

  • Formalize and test specific functionalities of consciousness, and study their evolutionary advantage.
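The bullets on content-based soft attention and on keys can be illustrated with a small sketch. This is a generic scaled dot-product attention in plain Python, not code from the talk; the function names and the toy keys, values, and query are assumptions made for illustration. It also checks that reordering the (key, value) pairs does not change the result, which is what "independent of ordering" refers to.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_attention(query, keys, values):
    """Content-based soft attention over a set of (key, value) pairs.

    Each key is scored against the query by a scaled dot product; the
    output is the softmax-weighted average of the values.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, context

# Hypothetical toy data: three objects, each with a key and a value.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [4.0, 0.0]

weights, context = soft_attention(query, keys, values)

# Present the same (key, value) pairs in a different order.
perm = [2, 0, 1]
weights_p, context_p = soft_attention(
    query, [keys[i] for i in perm], [values[i] for i in perm]
)
```

The weights are soft: every object receives some weight, but objects whose keys match the query dominate the output. Because the output is a weighted sum, it depends on the set of pairs and not on the order in which they are listed.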

Consciousness and Language

  • Learn language from environment.

Consciousness Prior: Sparse Factor Graph

  • Two Levels: Large Unconscious State, Tiny Conscious State.

  • Variables manipulable with language.

  • Disentangled factors for these variables; disentangled does not have to mean marginally independent, since the variables are linked through the graph.

  • Prior: the joint distribution between the variables is a sparse factor graph.
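As a rough illustration of what a sparse factor graph means for a joint distribution, the sketch below builds a joint over four binary variables as a product of three small factors, each touching only two variables. The variable names, the `agree` factor, and the chain structure are assumptions chosen for the example, not something specified in the talk.

```python
import itertools

# Hypothetical high-level variables, each binary.
variables = ["A", "B", "C", "D"]

def agree(x, y):
    """Toy pairwise factor: prefers neighbouring variables to be equal."""
    return 2.0 if x == y else 1.0

# Sparse graph: each factor touches only two of the four variables.
factors = [
    (("A", "B"), agree),
    (("B", "C"), agree),
    (("C", "D"), agree),
]

def unnormalized(assignment):
    """Product of all factor values for one full assignment."""
    score = 1.0
    for scope, fn in factors:
        score *= fn(*(assignment[v] for v in scope))
    return score

# Enumerate all 2**4 assignments and normalize to get the joint distribution.
assignments = [
    dict(zip(variables, bits))
    for bits in itertools.product([0, 1], repeat=len(variables))
]
Z = sum(unnormalized(a) for a in assignments)
joint = {
    tuple(a[v] for v in variables): unnormalized(a) / Z for a in assignments
}
```

Here the full joint table has 16 entries, but it is described by three factors of 4 entries each. With many variables and few variables per factor, that gap grows quickly, which is the sense in which sparsity acts as a prior.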


Meta-learning

  • Multiple time scales of learning.

  • Inner-loop optimization and outer-loop optimization.

  • Evolutionary optimization.
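The inner-loop and outer-loop split can be sketched with a toy example. The code below is a MAML-style sketch on one-dimensional quadratic tasks, chosen here only for illustration; the talk does not prescribe this algorithm, and the function names, learning rates, and task targets are all assumptions. The inner loop adapts quickly to a single task; the outer loop slowly tunes the shared initialization so that this adaptation works well on average.

```python
def loss(w, target):
    """Per-task loss: squared distance to the task's target."""
    return (w - target) ** 2

def grad(w, target):
    """Derivative of the per-task loss with respect to w."""
    return 2.0 * (w - target)

def inner_adapt(w0, target, inner_lr):
    """Fast time scale: one gradient step on a single task."""
    return w0 - inner_lr * grad(w0, target)

def outer_gradient(w0, target, inner_lr):
    """Gradient of the post-adaptation loss with respect to w0.

    adapted = w0 - inner_lr * 2 * (w0 - target), so
    d(adapted)/d(w0) = 1 - 2 * inner_lr (chain rule).
    """
    adapted = inner_adapt(w0, target, inner_lr)
    return grad(adapted, target) * (1.0 - 2.0 * inner_lr)

def meta_train(task_targets, w0=0.0, inner_lr=0.1, outer_lr=0.1, steps=200):
    """Slow time scale: update the shared initialization w0 across tasks."""
    for _ in range(steps):
        meta_grad = sum(
            outer_gradient(w0, t, inner_lr) for t in task_targets
        ) / len(task_targets)
        w0 -= outer_lr * meta_grad
    return w0

# Hypothetical family of tasks, each defined by its target value.
task_targets = [1.0, 2.0, 3.0]
w0 = meta_train(task_targets)
```

For these symmetric quadratic tasks the learned initialization settles at the mean of the targets, from which a single inner step moves closer to any one task. An evolutionary outer loop would replace the gradient update in `meta_train` with a search over candidate initializations.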

Causality and Changes in Object Distribution

  • Discovering causal relations between the variables of the factor graph (for example, given a photo of a hand dropping an apple, you want the network to infer that the hand dropped the apple).

  • Operating on sets of objects instead of fixed-size vectors.