François Chollet: What LLMs Can't Do
Chollet built ARC-AGI to measure what LLMs systematically fail at. His 'On the Measure of Intelligence' paper is the sharpest critique of benchmark-based AI progress claims.
Who He Is
François Chollet created Keras and spent nearly a decade at Google. He is best known in the AI community for building the ARC-AGI benchmark, the most consequential challenge dataset for measuring genuine reasoning in AI systems, and for his paper 'On the Measure of Intelligence,' which remains the sharpest philosophical framework for thinking about what AI systems actually do.
Core Thesis
LLMs are extremely sophisticated interpolation engines. They are not — and cannot become — general intelligence through scaling alone. Measuring progress requires measuring generalisation to truly novel tasks.
Key Themes
- ARC-AGI, generalisation as the test: tasks that require applying known primitives in new configurations, which current LLMs consistently fail unless given extensive test-time compute
- Intelligence ≠ skill: a model trained on a narrow distribution is not intelligent; it is skilled at that distribution
- The program synthesis view: human intelligence is more like program synthesis (composing rules) than pattern interpolation
- System 2 thinking in AI: deliberate, stepwise reasoning is different from fast pattern retrieval, and current transformers do mostly the latter
- Benchmark gaming: when a benchmark becomes a training target, it stops measuring what it was designed to measure
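To make the contrast between skill and program synthesis concrete, here is a minimal Python sketch. It is a toy illustration under stated assumptions, not Chollet's method and not a real ARC-AGI solver: the three grid primitives, the function names (`synthesize`, `memorize`, `run`), and the rotation task are all hypothetical. A memorizer can only replay training pairs, while a brute-force search over compositions of primitives recovers a rule that transfers to an unseen grid.

```python
from itertools import product

# Hypothetical primitive set, chosen for illustration only.
def flip_h(grid):
    """Mirror each row left to right."""
    return tuple(row[::-1] for row in grid)

def flip_v(grid):
    """Reverse the order of the rows."""
    return grid[::-1]

def transpose(grid):
    """Swap rows and columns."""
    return tuple(zip(*grid))

PRIMITIVES = {"flip_h": flip_h, "flip_v": flip_v, "transpose": transpose}

def run(program, grid):
    """Apply a sequence of primitive names to a grid, left to right."""
    for name in program:
        grid = PRIMITIVES[name](grid)
    return grid

def synthesize(train_pairs, max_len=3):
    """Return the first shortest primitive sequence that fits every training pair."""
    for length in range(1, max_len + 1):
        for program in product(PRIMITIVES, repeat=length):
            if all(run(program, x) == y for x, y in train_pairs):
                return program
    return None  # no program of length <= max_len explains the examples

def memorize(train_pairs):
    """Baseline 'skill': a lookup table that only knows the inputs it has seen."""
    table = dict(train_pairs)
    return lambda grid: table.get(grid)

# Toy task (an assumption for this sketch): rotate the grid 90 degrees clockwise.
TRAIN_PAIRS = [
    (((1, 2), (3, 4)), ((3, 1), (4, 2))),
    (((1, 2, 3), (4, 5, 6)), ((4, 1), (5, 2), (6, 3))),
]
TEST_INPUT = ((5, 6), (7, 8))

if __name__ == "__main__":
    program = synthesize(TRAIN_PAIRS)
    print("program found:", program)
    print("synthesized answer:", run(program, TEST_INPUT))
    print("memorizer answer:", memorize(TRAIN_PAIRS)(TEST_INPUT))
```

No single primitive explains the training pairs, so the search has to compose two of them. The memorizer returns `None` on the test grid because it has never seen that input. Exhaustive search only works here because the primitive set is tiny; the number of candidate programs grows exponentially with program length, which is one reason test-time compute matters so much on the real benchmark.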
Essential Reading
| Resource | Format | Why It Matters |
|---|---|---|
| On the Measure of Intelligence (2019) | arXiv paper | The philosophical foundation: defines intelligence as skill-acquisition efficiency, not task performance. |
| ARC-AGI benchmark | GitHub / arcprize.org | The practical test of the thesis: a public evaluation set of 400 tasks that humans solve easily and most AI systems still fail. |
| ARC Prize 2024 results | Blog post | Where the frontier stands: o3 in its high-compute configuration scored 87.5% on the semi-private evaluation set, at more than $1,000 per task, and what that result means. |
| The impossibility of intelligence explosion | Blog post | Chollet's case against recursive self-improvement narratives. |
| francois.chollet.work | Blog | Ongoing reflections on AI capabilities and limitations. |
What to Question
Chollet's critique is the most rigorous available — but ARC-AGI measures a specific type of generalisation (visual pattern induction). Many practical AI applications don't need human-level generalisation; they need reliable performance on a narrow distribution. His framework is essential for understanding limitations but can understate practical utility.