
World Models: The Architecture Beyond Next-Token Prediction

Why predicting the next token isn't enough for autonomous intelligence. JEPA, World Action Models (WAMs), video-as-world-simulator, and what world models mean for agent planning and robotics.

The Problem With Next-Token Prediction

Autoregressive language models such as GPT-style transformers are trained to predict the next token. This is a remarkably powerful objective: it forces models to learn grammar, facts, reasoning patterns, and world knowledge as intermediate steps. But Yann LeCun argues it has a fundamental ceiling: a next-token predictor has no internal model of how the world works. It predicts language about the world, not the world itself.

A world model is an internal representation of how the environment transitions from state to state. It lets an agent ask: 'If I take this action, what happens next?' — without executing the action.

What Is a World Model?

A world model is a predictive model of environment dynamics. Given the current state and an action, it predicts the next state. This is distinct from a language model in a key way: a language model predicts tokens, a world model predicts state transitions in some abstract space. The distinction matters for planning — an agent with a world model can run imagined rollouts to evaluate actions before taking them.

State representation: the model encodes observations (images, sensor readings, text) into a compact abstract state
Transition function: given current state + action, predict next state in the same abstract space
Reward / value model: estimate the utility of being in a state or taking an action
Policy: select actions based on the world model's predictions — planning by simulation
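
The four components above can be sketched on a toy one-dimensional problem. This is a minimal illustration, not a real architecture: the observation layout, the additive dynamics, the `GOAL` constant, and every function name here are assumptions made for the example.

```python
import numpy as np

GOAL = 5.0  # hypothetical target position

def encode(observation):
    """State representation: keep only the task-relevant part of the observation."""
    position, _sensor_noise = observation
    return np.array([position])

def transition(state, action):
    """Transition function: predicted next state (toy dynamics: action = displacement)."""
    return state + action

def reward(state):
    """Reward model: states closer to GOAL score higher."""
    return -abs(float(state[0]) - GOAL)

def plan(state, candidate_actions, horizon=3):
    """Policy: score each action by an imagined rollout and return the best one."""
    def imagined_return(action):
        s, total = state, 0.0
        for _ in range(horizon):
            s = transition(s, action)  # imagined step; nothing is executed for real
            total += reward(s)
        return total
    return max(candidate_actions, key=imagined_return)

state = encode((0.0, 0.3))
chosen = plan(state, [-1.0, 0.0, 1.0, 2.0])
print(chosen)  # prints 2.0
```

In a learned world model, `encode`, `transition`, and `reward` would be trained networks rather than hand-written rules, but the planning loop keeps the same shape.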

JEPA: Joint Embedding Predictive Architecture

JEPA is LeCun's proposed architecture for world models. The key insight: don't predict in pixel space (too expensive, too noisy). Predict in representation space instead: the abstract embedding of what comes next, not the raw pixels.

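# Conceptual JEPA forward pass (toy NumPy stand-ins for the real networks)
import numpy as np
rng = np.random.default_rng(0)
W_enc, W_pred = rng.normal(size=(8, 4)), rng.normal(size=(6, 4))  # toy weights
def encoder(obs): return np.tanh(obs @ W_enc)  # observation -> embedding
def predictor(s, a): return np.tanh(np.concatenate([s, a]) @ W_pred)  # (embedding, action) -> next embedding
observation_t, observation_t1, action = rng.normal(size=8), rng.normal(size=8), rng.normal(size=2)
x_t = encoder(observation_t)            # encode current state
x_t1_pred = predictor(x_t, action)      # predict NEXT state embedding
x_t1_real = encoder(observation_t1)     # encode actual next state
loss = np.mean((x_t1_pred - x_t1_real) ** 2)  # match in embedding space
# Key: no pixel reconstruction; predict abstract representations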

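I-JEPA (image) and V-JEPA (video) are Meta's implementations. Both are trained by predicting the embeddings of masked image or video regions from the visible context, rather than by conditioning on actions, so they learn rich representations of images and video dynamics without generative reconstruction. Those representations are a foundation for efficient world modeling in visual domains.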

World Action Models (WAMs)

WAMs (2026) extend JEPA to action-conditioned prediction: the model learns to predict how the world changes in response to specific actions. This is the key capability needed for robotic planning — given 'move arm 10cm left', predict the new visual state. WAMs unify the world model (what happens) with the action model (what I should do) into a single learned architecture.
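
To make action-conditioned prediction concrete, here is a minimal sketch that fits a linear transition model from logged (state, action, next state) triples and then queries it with a "move arm 10 cm left" command. It is not a WAM: it works directly on a 2-D position instead of a learned visual embedding, and the synthetic data, the noise-free dynamics, and the names `predict_next` and `W` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical logged experience: 2-D arm position (metres) and commanded offsets.
states = rng.uniform(-1.0, 1.0, size=(200, 2))
actions = rng.uniform(-0.2, 0.2, size=(200, 2))
next_states = states + actions  # toy ground-truth dynamics, noise-free

# Fit next_state ~ [state, action] @ W by ordinary least squares.
inputs = np.hstack([states, actions])
W, *_ = np.linalg.lstsq(inputs, next_states, rcond=None)

def predict_next(state, action):
    """Action-conditioned prediction: what the state becomes if `action` is taken."""
    return np.concatenate([state, action]) @ W

# "Move arm 10 cm left" from position (0.5, 0.2): predicted without moving anything.
pred = predict_next(np.array([0.5, 0.2]), np.array([-0.10, 0.0]))
print(np.round(pred, 3))
```

Because the toy dynamics are exactly linear, the fitted model recovers them and the prediction lands at roughly (0.4, 0.2). Real visual dynamics are nonlinear and high-dimensional, which is why the prediction is done with learned networks in embedding space.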

2026 is widely called the breakthrough year for world models. WAMs are being applied to robotics, game AI, and physical simulation — anywhere an agent needs to plan without exhaustive real-world trial and error.

Video Generation as World Simulation

Sora (OpenAI), Veo (Google), and similar video generation models are being repurposed as world simulators. The insight: a model that generates physically plausible video of 'what happens when X' is implicitly encoding world dynamics. Researchers are using these as 'video world models' for training robotic policies — the robot learns from simulated video without physical trials.

Genie (Google DeepMind, 2024): learned a world model from internet video, enabling interactive 2D environments from a single image
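Sora as simulator: OpenAI's technical report on Sora is titled 'Video generation models as world simulators' and argues that scaling video generation is a promising path toward general-purpose simulators of the physical world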
Practical limit: video world models encode appearance dynamics well but struggle with precise physical constraints (object permanence, exact forces)

Why This Matters for Agents

Current LLM-based agents plan by reasoning in language ('I should do X, then Y'). This works for tasks where the state space is language. For tasks grounded in physical or visual state, such as robotics, computer use, and real-world navigation, language-space planning is brittle. World models let agents mentally simulate the outcome of actions before executing them, which can substantially improve planning in non-language domains.
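
The sketch below illustrates mental simulation on a 3x3 grid with one wall: the agent scores every four-step action sequence inside its model and only then commits to one. The grid layout, the `model_step` function (a hand-written stand-in for a learned transition model), and the exhaustive search are all assumptions chosen to keep the example small; practical planners sample or optimize over sequences instead of enumerating them.

```python
from itertools import product

MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
SIZE, WALL, GOAL = 3, (1, 0), (2, 0)  # hypothetical grid: wall sits between start and goal

def model_step(state, action):
    """Stand-in for a learned model: predicted next cell; blocked moves change nothing."""
    dx, dy = MOVES[action]
    nxt = (state[0] + dx, state[1] + dy)
    in_bounds = 0 <= nxt[0] < SIZE and 0 <= nxt[1] < SIZE
    return nxt if in_bounds and nxt != WALL else state

def imagine(state, action_sequence):
    """Roll the model forward over a whole sequence without touching the real world."""
    for action in action_sequence:
        state = model_step(state, action)
    return state

def distance_to_goal(state):
    return abs(state[0] - GOAL[0]) + abs(state[1] - GOAL[1])

def plan(start, horizon=4):
    """Pick the sequence whose imagined end state is closest to the goal."""
    candidates = product(MOVES, repeat=horizon)
    return min(candidates, key=lambda seq: distance_to_goal(imagine(start, seq)))

best_plan = plan((0, 0))
final_state = imagine((0, 0), best_plan)
print(best_plan, final_state)
```

The planner finds the detour around the wall (up, right, right, down) by evaluating 256 imagined sequences, while the agent itself would execute only the four chosen moves.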

What's Not Solved

World models remain largely unproven at the scale where autoregressive language models dominate. JEPA-based models are compelling, but they haven't matched large language models on standard language and reasoning benchmarks. The contrast is in the training objective rather than the network: I-JEPA and V-JEPA themselves use Vision Transformer backbones. The gap between 'learns video dynamics' and 'useful for planning in complex environments' is still large. World models are among the most intellectually compelling frontiers in AI, and among the most uncertain.
