
Toolformer: How Language Models Learned to Use APIs on Their Own

Meta AI's 2023 paper, in which a model bootstraps its own tool-use training data from unlabeled text. It is a precursor to function calling, and it shows how selective tool use actually works.

Teaching a model to use tools traditionally required labelled data: here's a question, here's the API call that answers it, here's the result. Collecting this at scale was expensive — and the model only knew the tools it had seen in training.

In February 2023, Timo Schick and colleagues at Meta AI published "Toolformer: Language Models Can Teach Themselves to Use Tools". The proposal: let the model annotate its own training corpus with API calls at positions where they would be useful, then fine-tune on the calls that turn out to help. The result is self-supervised tool-use training that needs only a handful of hand-written demonstrations per tool, with no large-scale human labelling.
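To make the annotation idea concrete, here is a minimal sketch of text carrying an inline API call, plus a helper that executes the call and splices the result back in. The bracketed `[Tool(args) → result]` form mirrors the examples shown in the paper; the `TOOLS` registry, the toy `calculator`, and `execute_calls` are hypothetical names for illustration, not the authors' code.

```python
import operator
import re

# Toy calculator supporting one binary operation, e.g. "400 / 1400".
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}

def calculator(expression: str) -> str:
    left, op, right = expression.split()
    return f"{OPS[op](float(left), float(right)):.2f}"

# Hypothetical registry mapping tool names to Python callables.
TOOLS = {"Calculator": calculator}

# Matches an unexecuted call such as [Calculator(400 / 1400)].
CALL_PATTERN = re.compile(r"\[(\w+)\((.*?)\)\]")

def execute_calls(text: str) -> str:
    """Run every inline call and rewrite it as [Tool(args) → result]."""
    def fill(match: re.Match) -> str:
        name, args = match.group(1), match.group(2)
        return f"[{name}({args}) → {TOOLS[name](args)}]"
    return CALL_PATTERN.sub(fill, text)

annotated = "Out of 1400 participants, 400 [Calculator(400 / 1400)] passed the test."
print(execute_calls(annotated))
```

In this sketch the call sits exactly where the result is useful for predicting the next tokens, which is the property the training pipeline selects for.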

How Toolformer bootstraps tool annotations

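The pipeline has three stages before fine-tuning: the model proposes candidate API calls at positions in plain text, each call is executed, and the calls are filtered. The filtering step is the key innovation: a call is kept only if inserting the call together with its result lowers the model's loss on the following tokens, by at least a set threshold, compared with making no call or inserting the call without its result. This self-supervised signal selects the tool calls that measurably improve the model's predictions, and the model is then fine-tuned on the text with the surviving calls.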
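The filter can be sketched as a comparison of three losses measured on the tokens that follow a candidate call. In the paper these losses come from the language model itself; here the per-token losses are made-up numbers, and `weighted_loss`, `keep_call`, the decay factor, and the threshold value are all illustrative assumptions.

```python
from typing import Sequence

def weighted_loss(token_losses: Sequence[float], decay: float = 0.8) -> float:
    """Weighted average of per-token losses; nearer tokens count more.

    The geometric decay is an assumption for this sketch. Expects a
    non-empty sequence.
    """
    weights = [decay ** i for i in range(len(token_losses))]
    total = sum(weights)
    return sum(w / total * loss for w, loss in zip(weights, token_losses))

def keep_call(loss_no_call: float, loss_call_only: float,
              loss_call_with_result: float, threshold: float = 0.5) -> bool:
    """Keep the call only if seeing its result beats both baselines by `threshold`."""
    baseline = min(loss_no_call, loss_call_only)
    return baseline - loss_call_with_result >= threshold

# Made-up per-token losses for the text after a candidate call position.
no_call = weighted_loss([2.0, 1.5, 1.0])       # plain text, no call inserted
call_only = weighted_loss([2.1, 1.5, 1.0])     # call inserted, result withheld
helpful = weighted_loss([0.3, 0.5, 0.9])       # call plus a useful result
unhelpful = weighted_loss([1.9, 1.5, 1.0])     # call plus an irrelevant result

print(keep_call(no_call, call_only, helpful))    # True
print(keep_call(no_call, call_only, unhelpful))  # False
```

Taking the minimum of the two baselines means a call is not rewarded merely because the call text itself happens to help; the gain has to come from the returned result.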

What Toolformer learned

The model learned when tools are more reliable than its own knowledge — using the calculator for arithmetic, Wikipedia for factual retrieval, the calendar for date calculations. Tool use was selective: not every mention of a topic triggered a call, only when the model's own generation would likely be wrong.

Toolformer vs. modern function calling

| Aspect | Toolformer | GPT-4 Function Calling |
| --- | --- | --- |
| Tool specification | Hard-coded in training | JSON schema at inference time |
| New tools | Requires retraining | Add to system prompt, no retraining |
| Flexibility | Fixed tool set | Arbitrary tools, dynamic schemas |
| Reliability for known tools | Trained, highly reliable | Depends on schema quality and model reasoning |
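For contrast, a tool in a function-calling setup is described by a JSON schema supplied at inference time. The sketch below follows the general shape of OpenAI-style tool definitions; the `get_weather` tool, its parameters, the stubbed return value, and the `dispatch` helper are hypothetical, and no API request is made.

```python
import json

# Tool description the model would see at inference time (hypothetical tool).
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current temperature for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
            },
            "required": ["city"],
        },
    },
}

def get_weather(city: str) -> dict:
    # Stub: a real handler would query a weather service.
    return {"city": city, "temperature_c": 21}

# Adding a tool means adding a schema and a handler; no model retraining.
HANDLERS = {"get_weather": get_weather}

def dispatch(name: str, arguments_json: str) -> str:
    """Run the handler the model asked for; arguments arrive as a JSON string."""
    args = json.loads(arguments_json)
    return json.dumps(HANDLERS[name](**args))

print(dispatch("get_weather", '{"city": "Paris"}'))
```

The trade-off in the table shows up here: the model has never been trained on `get_weather`, so how well it fills in the arguments depends on how clearly the schema and description are written.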

The principle that matters for builders

Only route to a tool when the model's own generation is likely unreliable for that task type. A model that calls a tool every time a topic is mentioned adds latency and cost without improving quality. Tool use should be selective and purposeful — Toolformer's perplexity filter is the right mental model.
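One way to apply this in practice, sketched under loose assumptions, is to measure per task type whether a tool improves answers on a small evaluation set and to route only where the gain clears a margin. The `EvalResult` type, the `min_gain` margin, and the accuracy figures below are placeholders for illustration, not measurements.

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    task_type: str
    accuracy_without_tool: float
    accuracy_with_tool: float

def tool_worthwhile(result: EvalResult, min_gain: float = 0.05) -> bool:
    """Route to the tool only if it improves accuracy by at least `min_gain`."""
    return result.accuracy_with_tool - result.accuracy_without_tool >= min_gain

# Placeholder numbers; replace with results from your own evaluation set.
results = [
    EvalResult("arithmetic", accuracy_without_tool=0.62, accuracy_with_tool=0.97),
    EvalResult("small_talk", accuracy_without_tool=0.95, accuracy_with_tool=0.95),
]

routing_table = {r.task_type: tool_worthwhile(r) for r in results}
print(routing_table)  # {'arithmetic': True, 'small_talk': False}
```

This is an application-level analogue of the training-time filter: the tool earns its latency and cost only where it measurably helps.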
