François Chollet: What LLMs Can't Do

Chollet built ARC-AGI to measure what LLMs systematically fail at. His 'On the Measure of Intelligence' paper is the sharpest critique of benchmark-based AI progress claims.

Who He Is

François Chollet created Keras and spent close to a decade as an engineer at Google. He is best known in the AI community for building the ARC-AGI benchmark (originally the Abstraction and Reasoning Corpus), the most consequential challenge dataset for measuring genuine reasoning in AI systems, and for his paper 'On the Measure of Intelligence,' which remains the sharpest philosophical framework for thinking about what AI systems actually do.

Core Thesis

On Chollet's account, LLMs are extremely sophisticated interpolation engines. They are not generally intelligent, and scaling alone cannot make them so. Measuring progress therefore requires measuring generalisation to truly novel tasks.

Key Themes

ARC-AGI, generalisation as the test: tasks that require applying known primitives in new configurations, which current LLMs consistently fail without extensive test-time compute
Intelligence ≠ skill: a model trained on a narrow distribution is not intelligent; it is skilled at that distribution
The program synthesis view: human intelligence is more like program synthesis (composing rules) than pattern interpolation
System 2 thinking in AI: deliberate, stepwise reasoning is different from fast pattern retrieval, and current transformers do mostly the latter
Benchmark gaming: when a benchmark becomes a training target, it stops measuring what it was designed to measure
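
To make the contrast between composing rules and retrieving patterns concrete, here is a minimal sketch in Python. It is a toy and not Chollet's method or any ARC Prize entry: the four grid primitives, the brute-force search, and the rotation task are all invented for illustration. The only thing borrowed from ARC-AGI is the task layout, a few "train" input/output grid pairs plus a held-out "test" input.

```python
from itertools import product

def to_grid(rows):
    """Freeze a list-of-lists grid into hashable tuples."""
    return tuple(tuple(r) for r in rows)

# Hypothetical primitives, chosen only for this illustration.
def identity(g):
    return g

def flip_h(g):
    return tuple(row[::-1] for row in g)   # mirror left-right

def flip_v(g):
    return g[::-1]                         # mirror top-bottom

def transpose(g):
    return tuple(zip(*g))                  # swap rows and columns

PRIMITIVES = {"identity": identity, "flip_h": flip_h,
              "flip_v": flip_v, "transpose": transpose}

def run(program, g):
    """Apply a sequence of primitive names to a grid, left to right."""
    for name in program:
        g = PRIMITIVES[name](g)
    return g

def synthesize(train_pairs, max_len=3):
    """Return the shortest primitive sequence that fits every train pair."""
    for length in range(1, max_len + 1):
        for program in product(PRIMITIVES, repeat=length):
            if all(run(program, x) == y for x, y in train_pairs):
                return program
    return None

def memorize(train_pairs):
    """Lookup-table baseline: only answers inputs it has already seen."""
    table = dict(train_pairs)
    return lambda g: table.get(g)

# Invented toy task in the ARC-style layout: rotate the grid 90 degrees clockwise.
task = {
    "train": [
        {"input": [[1, 2], [3, 4]], "output": [[3, 1], [4, 2]]},
        {"input": [[5, 0, 0], [0, 0, 0]], "output": [[0, 5], [0, 0], [0, 0]]},
    ],
    "test": [{"input": [[0, 7], [0, 0]], "output": [[0, 0], [0, 7]]}],
}

train_pairs = [(to_grid(p["input"]), to_grid(p["output"])) for p in task["train"]]
test_input = to_grid(task["test"][0]["input"])
expected = to_grid(task["test"][0]["output"])

program = synthesize(train_pairs)
print("program:", program)
print("synthesis solves test:", run(program, test_input) == expected)
print("lookup table on test:", memorize(train_pairs)(test_input))
```

No single primitive fits the two demonstrations, so the search has to compose two of them. The composed program then transfers to the unseen test grid, while the lookup table has nothing to return. Real ARC-AGI tasks need a far richer primitive set, and the search space grows exponentially with program length, which is one way to read the dependence on test-time compute noted above.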

Essential Reading

Resource	Format	Why It Matters
On the Measure of Intelligence (2019)	arXiv paper	The philosophical foundation: defines intelligence as skill-acquisition efficiency, not task performance.
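ARC-AGI benchmark	GitHub / arcprize.org	The practical test of the thesis: a public evaluation set of 400 tasks that humans solve easily and most AI systems still fail.
ARC Prize 2024 results	Blog post	Where the frontier stands: o3 in its high-compute configuration scored 87.5% on the semi-private evaluation set, at an estimated cost of more than $1,000 per task. Covers what that result does and does not show.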
The implausibility of intelligence explosion	Blog post	Chollet's case against recursive self-improvement narratives.
francois.chollet.work	Blog	Ongoing reflections on AI capabilities and limitations.

What to Question

Chollet's critique is the most rigorous available — but ARC-AGI measures a specific type of generalisation (visual pattern induction). Many practical AI applications don't need human-level generalisation; they need reliable performance on a narrow distribution. His framework is essential for understanding limitations but can understate practical utility.
