Toolformer: How Language Models Learned to Use APIs on Their Own
Meta AI's 2023 paper where a model bootstraps its own tool-use training from unlabeled text. The precursor to function calling — and what it reveals about how tool use actually works.
Teaching a model to use tools traditionally required labelled data: here's a question, here's the API call that answers it, here's the result. Collecting this at scale was expensive — and the model only knew the tools it had seen in training.
In February 2023, Timo Schick and colleagues at Meta AI published 'Toolformer: Language Models Can Teach Themselves to Use Tools'. The proposal: let the model annotate its own training corpus with API calls at positions where they'd be useful — then fine-tune on the successful ones. Self-supervised tool-use training, no human labelling required.
How Toolformer bootstraps tool annotations
- Start with an unlabelled text corpus and a small set of APIs (in the paper: a calculator, a question-answering system, Wikipedia search, machine translation, and a calendar)
- For each text, sample a few positions where the model itself assigns a high probability to starting an API call
- Generate candidate API calls using few-shot prompting
- Execute the API calls and insert results into the text
- Keep only calls where inserting the call and its result lowers the model's loss on the tokens that follow, compared with making no call or making the call without its result. In other words, the tool genuinely helped
- Fine-tune the model on this self-annotated dataset
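The steps above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `Candidate`, `filter_calls`, `toy_loss`, and the toy `calculator` are hypothetical names. The loss is a crude stand-in for a language model's loss (it only counts numeric tokens in the continuation that the context has not yet shown), and `->` stands in for the arrow the paper puts between a call and its result.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Candidate:
    position: int   # token index where the call would be inserted
    tool: str       # name of the API, e.g. "Calculator"
    argument: str   # input string passed to the API


def calculator(expr: str) -> str:
    # Toy tool (assumption): handles only "a/b" and rounds to two decimals.
    a, b = expr.split("/")
    return f"{float(a) / float(b):.2f}"


def toy_loss(context: List[str], continuation: List[str]) -> float:
    # Stand-in for a model loss (assumption): one unit of loss for every
    # numeric token in the continuation that does not appear in the context.
    seen = " ".join(context)
    return float(sum(1 for tok in continuation if tok.isdigit() and tok not in seen))


def filter_calls(
    tokens: List[str],
    candidates: List[Candidate],
    tools: Dict[str, Callable[[str], str]],
    loss_fn: Callable[[List[str], List[str]], float],
    threshold: float,
) -> List[Tuple[Candidate, str]]:
    """Keep candidates whose executed result lowers the loss on the following tokens."""
    kept = []
    for cand in candidates:
        prefix, continuation = tokens[:cand.position], tokens[cand.position:]
        result = tools[cand.tool](cand.argument)
        call_only = f"[{cand.tool}({cand.argument})]"
        call_with_result = f"[{cand.tool}({cand.argument}) -> {result}]"
        loss_plain = loss_fn(prefix, continuation)
        loss_call = loss_fn(prefix + [call_only], continuation)
        loss_full = loss_fn(prefix + [call_with_result], continuation)
        # The result must beat both "no call" and "call without result".
        if min(loss_plain, loss_call) - loss_full >= threshold:
            kept.append((cand, result))
    return kept


tokens = "The total is 400 out of 1400 , or 29 % .".split()
candidates = [
    Candidate(position=9, tool="Calculator", argument="400/1400"),  # just before "29"
    Candidate(position=3, tool="Calculator", argument="1/1"),       # irrelevant call
]
kept = filter_calls(tokens, candidates, {"Calculator": calculator}, toy_loss, threshold=0.5)
for cand, result in kept:
    print(cand.position, cand.argument, result)
```

In this toy run the call placed just before "29" survives because its result makes that token predictable, while the irrelevant call is dropped. With a real model, `loss_fn` would be replaced by the model's own loss over the continuation tokens.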
The filtering step is the key innovation. Because the signal is the model's own prediction loss on text it already has, usefulness is judged without human labels: a call survives only if its result makes the tokens that follow easier to predict.
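The criterion can be stated more precisely. The paper phrases it as a weighted cross-entropy loss over the tokens after the call position rather than perplexity as such. In roughly the paper's notation, with $c_i$ a candidate call at position $i$, $r_i$ its result, $e(\cdot)$ the text encoding of a call, $\varepsilon$ the empty sequence, and $w$ a set of weights that decay with distance from the call:

```latex
L_i(\mathbf{z}) = -\sum_{j=i}^{n} w_{j-i} \, \log p_M\!\left(x_j \mid \mathbf{z},\, x_{1:j-1}\right)

L_i^{+} = L_i\!\left(e(c_i, r_i)\right), \qquad
L_i^{-} = \min\!\left(L_i(\varepsilon),\; L_i\!\left(e(c_i, \varepsilon)\right)\right)

\text{keep } c_i \quad \text{if} \quad L_i^{-} - L_i^{+} \ge \tau_f
```

Here $L_i^{+}$ is the loss when the model sees both the call and its result, $L_i^{-}$ is the better of "no call at all" and "call without a result", and $\tau_f$ is a filtering threshold. Lower loss corresponds to lower perplexity, so the informal description and the formula agree.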
What Toolformer learned
The fine-tuned model learned to call tools where they beat its own knowledge: the calculator for arithmetic, Wikipedia search for factual retrieval, the calendar for date calculations. Tool use was selective. A mention of a topic did not automatically trigger a call; calls appeared mainly where the model's own continuation was likely to be wrong.
Toolformer vs. modern function calling
| Aspect | Toolformer | GPT-4 Function Calling |
|---|---|---|
| Tool specification | Hard-coded in training | JSON schema at inference time |
| New tools | Requires retraining | Pass a new schema in the request; no retraining |
| Flexibility | Fixed tool set | Arbitrary tools, dynamic schemas |
| Reliability for known tools | Call behaviour learned in fine-tuning | Depends on schema quality and model reasoning |
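To make the first row of the table concrete, here is the same calculator tool in both styles. This is an illustrative sketch under assumptions: the regular expression is a hypothetical parser for a Toolformer-like inline call (with `->` standing in for the paper's arrow), and the schema shows only the general JSON Schema shape that function-calling APIs use. Exact field names vary by provider.

```python
import json
import re

# Toolformer style: the call is plain text inside the sequence itself.
# Hypothetical pattern for "[Tool(argument) -> result]".
INLINE_CALL = re.compile(r"\[(\w+)\((.*?)\)(?:\s*->\s*(.*?))?\]")

text = "400 of 1400 participants (or [Calculator(400 / 1400) -> 0.29] 29%) passed."
match = INLINE_CALL.search(text)
tool_name, argument, result = match.groups()
print(tool_name, argument, result)

# Function-calling style: the tool is described by a schema supplied at
# inference time, and the model replies with structured arguments.
calculator_schema = {
    "name": "calculator",
    "description": "Divide one number by another and return the quotient.",
    "parameters": {
        "type": "object",
        "properties": {
            "numerator": {"type": "number"},
            "denominator": {"type": "number"},
        },
        "required": ["numerator", "denominator"],
    },
}

# Example of what a model reply might look like (made up for illustration).
model_output = '{"name": "calculator", "arguments": {"numerator": 400, "denominator": 1400}}'
call = json.loads(model_output)

# Minimal check that every required argument is present.
missing = [
    key for key in calculator_schema["parameters"]["required"]
    if key not in call["arguments"]
]
print(call["name"], call["arguments"], missing)
```

In the inline style the tool's name and calling convention exist only in the training data, which is why a new tool means retraining. In the schema style the description travels with each request, so swapping tools is a configuration change.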
The principle that matters for builders
Only route to a tool when the model's own generation is likely unreliable for that task type. A model that calls a tool every time a topic is mentioned adds latency and cost without improving quality. Tool use should be selective and purposeful — Toolformer's perplexity filter is the right mental model.
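One way to apply this principle at inference time is a confidence gate in front of the tool. The sketch below is an assumption-laden illustration rather than anything from the paper: `route`, `draft_fn`, `confidence_fn`, and `tool_fn` are hypothetical names, and the stub confidence score simply treats questions containing digits as unreliable. In practice the score might come from token log-probabilities or a task classifier.

```python
from typing import Callable, Tuple


def route(
    query: str,
    draft_fn: Callable[[str], str],
    confidence_fn: Callable[[str, str], float],
    tool_fn: Callable[[str], str],
    min_confidence: float = 0.8,
) -> Tuple[str, bool]:
    """Return (answer, used_tool); call the tool only when confidence is low."""
    draft = draft_fn(query)
    if confidence_fn(query, draft) >= min_confidence:
        return draft, False          # the model's own answer is trusted
    return tool_fn(query), True      # pay the latency and cost of the tool


# Stubs for illustration only.
def draft_fn(query: str) -> str:
    return "Paris" if "capital" in query else "about 0.3"


def confidence_fn(query: str, draft: str) -> float:
    # Assumption: arithmetic-looking queries are where the model is unreliable.
    return 0.2 if any(ch.isdigit() for ch in query) else 0.95


def tool_fn(query: str) -> str:
    return "0.29"  # pretend calculator result


arith = route("What is 400 / 1400?", draft_fn, confidence_fn, tool_fn)
fact = route("What is the capital of France?", draft_fn, confidence_fn, tool_fn)
print(arith, fact)
```

The gate plays the same role as the training-time filter: the tool is consulted only where it is expected to improve on the model's own output.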