Research Paper Overview
How do we get machines to learn and act more like humans and animals, particularly in their ability to reason and plan?
Here’s our review of Yann LeCun’s 2022 technical paper A Path Towards Autonomous Machine Intelligence (LeCun, 2022).

Why is this paper important?
It’s quite rare for a single-author paper to grab so much attention. Yann LeCun is:
- Meta’s Chief AI Scientist
- a Turing Award winner
Meta has continued to develop this idea and has since published:
- I-JEPA, a foundation model for various kinds of image tasks (Assran et al., 2023)
- V-JEPA, a foundation model for video tasks (Bardes et al., 2024)
Despite all the hype around ChatGPT and large language models, today’s AI still falls far short of human-like learning.
Not because it doesn’t work—it clearly does. But compared to how a human child learns about the world, our most advanced AI systems are embarrassingly inefficient. A teenager learns to drive in 20 hours. We’ve poured billions into self-driving cars for over a decade. A 10-year-old can clear a dinner table and load a dishwasher. No robot can reliably do this. An 8-month-old discovers gravity by repeatedly dropping toys from a high chair. GPT-4 has processed more text than a human could read in 22,000 years, yet it doesn’t truly understand that objects fall.
This is Moravec’s Paradox in action.

Computers excel at what humans find hard (chess, calculus, generating fluent text) but fail spectacularly at what we find trivially easy (understanding that objects persist when hidden, navigating a cluttered room, having common sense).
The Core Problem: Why LLMs Can Never Truly Understand
Large Language Models like ChatGPT operate through what LeCun calls “auto-regressive generation”, predicting one token at a time with a fixed amount of computation per token. This creates several critical limitations:
- Exponential Divergence Problem
- Imagine all possible text sequences as a tree.
- Each token you generate has a small probability ε of being wrong.
- The probability of generating n correct tokens in a row is (1-ε)^n, an exponentially decaying function
- You can make ε smaller with more data and larger models, but you can’t fix the exponential decay (a small numerical illustration follows this list)
- No World Model
- LLMs are “purely trained on text” with “no knowledge of the underlying reality.”
- When asked if a vector multiplied by a positive semi-definite matrix can rotate more than 90 degrees, a human visualizes the transformation.
- An LLM has no such mental model—it can only pattern-match against text it’s seen.
- System 1 Thinking
- Using Kahneman’s framework, current AI is stuck in System 1—reactive, immediate responses.
- There’s no System 2—no deliberation, no planning, no reasoning through multiple possibilities.
- LLMs can’t pause to think harder about difficult problems; they use the same computation for easy and hard questions alike.
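
To make the exponential divergence argument concrete, here is a small numerical illustration of the (1-ε)^n formula. The error rates and sequence lengths are made-up values, and treating per-token errors as independent is the simplifying assumption behind the back-of-the-envelope argument, not a property of any particular model.

```python
# Illustrative only: the epsilon values and sequence lengths below are made up,
# and per-token errors are assumed independent, as in the back-of-the-envelope argument.
def p_all_correct(epsilon: float, n_tokens: int) -> float:
    """Probability that every one of n_tokens tokens is correct,
    given an independent per-token error rate epsilon."""
    return (1 - epsilon) ** n_tokens

for epsilon in (0.01, 0.001):
    for n in (100, 1_000, 10_000):
        print(f"epsilon={epsilon}, n={n}: P(all correct) = {p_all_correct(epsilon, n):.3g}")

# Roughly: epsilon=0.01 gives ~0.37 at n=100 but ~4e-05 at n=1,000;
# shrinking epsilon tenfold (to 0.001) only pushes the collapse out to n=10,000.
# A smaller epsilon delays the decay; it never removes it.
```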
JEPA: The Architecture
JEPA (Joint Embedding Predictive Architecture) represents a different way to think about machine intelligence.
Rather than generating outputs directly, JEPA:
- Encodes observations into abstract representations
- Predicts future representations in this latent space
- Uses energy functions instead of probabilities
- Employs regularization instead of contrastive learning
Given two compatible inputs x and y, for example two consecutive video frames or two parts of the same image:
- Encoder_x transforms x into a latent representation s_x
- Encoder_y transforms y into a latent representation s_y
- A Predictor uses s_x (and optionally a latent variable z) to predict ŝ_y
- An Energy function measures the compatibility between s_y and ŝ_y (a minimal sketch of this forward pass follows the list)
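
To make the four components above concrete, here is a minimal, illustrative sketch in PyTorch. The toy dimensions, the small MLP encoders, and the squared-distance energy are placeholder choices of ours; the paper describes the architecture abstractly and does not prescribe these particulars.

```python
import torch
import torch.nn as nn

class ToyJEPA(nn.Module):
    """Illustrative JEPA-style forward pass: encode x and y, predict s_y from s_x
    (plus a latent variable z), and score the pair with an energy function.
    All sizes and sub-networks here are arbitrary toy choices."""
    def __init__(self, x_dim=64, y_dim=64, repr_dim=16, z_dim=4):
        super().__init__()
        self.encoder_x = nn.Sequential(nn.Linear(x_dim, 32), nn.ReLU(), nn.Linear(32, repr_dim))
        self.encoder_y = nn.Sequential(nn.Linear(y_dim, 32), nn.ReLU(), nn.Linear(32, repr_dim))
        self.predictor = nn.Sequential(nn.Linear(repr_dim + z_dim, 32), nn.ReLU(), nn.Linear(32, repr_dim))

    def forward(self, x, y, z):
        s_x = self.encoder_x(x)                                # abstract representation of x
        s_y = self.encoder_y(y)                                # abstract representation of y
        s_y_hat = self.predictor(torch.cat([s_x, z], dim=-1))  # predicted representation of y
        energy = ((s_y - s_y_hat) ** 2).sum(dim=-1)            # low energy = compatible (x, y) pair
        return energy

model = ToyJEPA()
x, y = torch.randn(8, 64), torch.randn(8, 64)   # e.g. two consecutive (flattened) video frames
z = torch.randn(8, 4)                           # latent variable covering what x alone cannot tell us about y
print(model(x, y, z).shape)                     # one energy value per (x, y) pair: torch.Size([8])
```

The point to notice is that the energy is computed between the representations s_y and ŝ_y, never between raw inputs.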

Let’s step back for a minute to understand how this is different. There are two fundamental ways a machine can learn to predict something:
- Prediction in Pixel Space (The Generative Approach)
- This is the more intuitive but deeply flawed method.
- Imagine you want to train an AI to understand videos.
- You show it the first 10 frames of a clip and ask it to generate the 11th frame, pixel by pixel.
- The real world is not perfectly predictable, so a generative model forced to predict the exact pixels of the next frame has two major problems:
- Blurry Predictions: Since it can’t know the single correct future, its best strategy is to predict the average of all possible futures (a tiny numerical illustration follows this list)
- Wasted Effort on Irrelevant Details: The model must predict every single detail, even the useless ones
- Prediction in Representation Space (The JEPA Approach)
- This approach argues that predicting every pixel is unnecessary and counterproductive
- The goal is instead to predict an abstract summary, or representation, of the future
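
To see the “blurry predictions” problem in numbers, here is a tiny hypothetical example. The two “futures” are made-up 1-D arrays standing in for frames where a bright pixel moves either left or right; under a mean-squared-error objective the single best prediction is their average, which matches neither future.

```python
import numpy as np

# Two equally likely futures: a bright pixel ends up on the left or on the right.
future_left  = np.array([1.0, 0.0, 0.0, 0.0])
future_right = np.array([0.0, 0.0, 0.0, 1.0])

def expected_mse(prediction):
    """Expected squared error when each future occurs with probability 0.5."""
    return 0.5 * np.sum((prediction - future_left) ** 2) + \
           0.5 * np.sum((prediction - future_right) ** 2)

blurry_average = (future_left + future_right) / 2   # [0.5, 0, 0, 0.5]: matches neither future
print(expected_mse(future_left))      # 1.0 -> committing to one sharp future is penalized
print(expected_mse(blurry_average))   # 0.5 -> the blurry average "wins" under MSE
```

So a pixel-level generative model trained with this kind of loss is actively rewarded for hedging toward blur.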
Encoder: This is a neural network that takes an input (like a video frame x) and “encodes” it into a compact, abstract representation (s_x). This representation is just a list of numbers (a vector) that captures the essential information while discarding the irrelevant noise.
Predictor: This module takes the representation of the current state (s_x) and predicts the representation of the future state (ŝ_y).
The Goal: The system is trained to make its prediction ŝ_y as close as possible to the actual representation of the future frame (s_y), which is generated by passing the real future frame (y) through the same encoder.
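
Putting the three pieces together, here is a minimal sketch of one training step under some explicit assumptions: a single shared linear encoder, a linear predictor, and squared error between predicted and actual representations. The stop-gradient on the target branch is one common trick, used here for brevity, to discourage the trivial collapse where every input maps to the same representation; the paper itself argues for regularized, non-contrastive objectives, which are omitted from this sketch.

```python
import torch
import torch.nn as nn

# Toy setup: dimensions, data, and the linear encoder/predictor are illustrative choices.
encoder   = nn.Linear(64, 16)   # shared encoder for the current frame x and the future frame y
predictor = nn.Linear(16, 16)   # predicts the representation of y from the representation of x
optimizer = torch.optim.Adam(list(encoder.parameters()) + list(predictor.parameters()), lr=1e-3)

x = torch.randn(8, 64)          # batch of "current" frames (flattened, random stand-in data)
y = torch.randn(8, 64)          # batch of "future" frames

s_x = encoder(x)
with torch.no_grad():           # no gradient through the target branch (one way to avoid collapse)
    s_y = encoder(y)

loss = ((predictor(s_x) - s_y) ** 2).mean()   # pull the predicted representation toward the actual one
loss.backward()
optimizer.step()
optimizer.zero_grad()
print(float(loss))
```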
Presentation Notes by the Author