Toolformer: How Language Models Learned to Use APIs on Their Own

Meta AI's 2023 paper in which a model bootstraps its own tool-use training data from unlabelled text: a precursor to function calling, and a look at how tool use actually works.

Teaching a model to use tools traditionally required labelled data: here's a question, here's the API call that answers it, here's the result. Collecting this at scale was expensive — and the model only knew the tools it had seen in training.

In February 2023, Timo Schick and colleagues at Meta AI published 'Toolformer: Language Models Can Teach Themselves to Use Tools'. The proposal: let the model annotate its own training corpus with API calls at positions where they'd be useful — then fine-tune on the successful ones. Self-supervised tool-use training, no human labelling required.

How Toolformer bootstraps tool annotations

Start with an unlabelled text corpus and a small set of APIs (in the paper: a question-answering system, a calculator, Wikipedia search, machine translation, and a calendar)
For each text, sample positions where the model itself assigns high probability to starting an API call
Generate candidate API calls at those positions using few-shot prompting
Execute the API calls and insert each call and its result into the text
Keep only calls where including the result lowers the model's loss on the tokens that follow by at least a threshold, compared with no call or a call without its result
Fine-tune the model on this self-annotated dataset
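
The execute-and-insert steps above can be sketched in a few lines. This is a minimal illustration, not the paper's code: the `ApiCall` class, the `linearize` and `annotate` helpers, and the `toy_calculator` tool are hypothetical names, and the bracketed `[Tool(args) -> result]` text format is modelled on the inline notation the paper describes.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass(frozen=True)
class ApiCall:
    """A candidate tool call (hypothetical container for this sketch)."""
    tool: str
    argument: str


def linearize(call: ApiCall, result: Optional[str] = None) -> str:
    """Render a call, with or without its result, as inline text."""
    body = f"{call.tool}({call.argument})"
    return f"[{body}]" if result is None else f"[{body} -> {result}]"


def annotate(text: str, position: int, call: ApiCall,
             tools: Dict[str, Callable[[str], str]]) -> str:
    """Execute the call and splice it, with its result, into the text."""
    result = tools[call.tool](call.argument)
    return text[:position] + linearize(call, result) + " " + text[position:]


def toy_calculator(expression: str) -> str:
    """Stand-in tool that only handles 'a / b' (an assumption of this sketch)."""
    left, right = expression.split("/")
    return f"{float(left) / float(right):.2f}"


text = "Out of 1400 participants, 400 (or 29%) passed the test."
call = ApiCall("Calculator", "400 / 1400")
print(annotate(text, text.index("29%"), call, {"Calculator": toy_calculator}))
# Out of 1400 participants, 400 (or [Calculator(400 / 1400) -> 0.29] 29%) passed the test.
```

Because the call and its result live in the text itself, the annotated corpus can be used for ordinary language-model fine-tuning with no change to the training objective.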

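The filtering step is the key innovation. A call is kept only if seeing its result makes the following tokens easier for the model to predict than they are with no call, or with the call but no result. This self-supervised signal selects tool calls that measurably improve the model's predictions, so no human has to judge which calls are useful.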
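
A rough sketch of that filter, under stated assumptions: `score` is a hypothetical function returning the model's log-probability for each continuation token given a prefix, the linearly decaying token weights and the threshold `tau` are illustrative values, and `toy_score` is a fake scorer standing in for a real language model so the example runs on its own.

```python
from typing import Callable, List, Sequence

# Hypothetical interface: log-probabilities of each continuation token given a prefix.
ScoreFn = Callable[[str, Sequence[str]], List[float]]


def weighted_loss(logprobs: Sequence[float], decay: float = 0.2) -> float:
    """Weighted negative log-likelihood; nearer tokens count more (assumed weighting)."""
    raw = [max(0.0, 1.0 - decay * t) for t in range(len(logprobs))]
    total = sum(raw) or 1.0
    return -sum((w / total) * lp for w, lp in zip(raw, logprobs))


def keep_call(score: ScoreFn, prefix: str, continuation: Sequence[str],
              call: str, result: str, tau: float = 1.0) -> bool:
    """Keep the call only if its result lowers the loss by at least tau."""
    plain = weighted_loss(score(prefix, continuation))
    call_only = weighted_loss(score(f"{prefix}[{call}] ", continuation))
    with_result = weighted_loss(score(f"{prefix}[{call} -> {result}] ", continuation))
    # Compare against the better of the two baselines that lack the result.
    return min(plain, call_only) - with_result >= tau


def toy_score(prefix: str, continuation: Sequence[str]) -> List[float]:
    """Fake scorer: the continuation is easy to predict only if 0.29 is in the prefix."""
    return [-0.1 if "0.29" in prefix else -3.0 for _ in continuation]


prefix = "Out of 1400 participants, 400 (or "
continuation = ["29", "%", ")"]
print(keep_call(toy_score, prefix, continuation, "Calculator(400 / 1400)", "0.29"))  # True
print(keep_call(toy_score, prefix, continuation, "Calculator(1 / 4)", "0.25"))       # False
```

Comparing against the call without its result matters: it rules out calls that help only because the call text itself hints at the answer.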

What Toolformer learned

The model learned when tools are more reliable than its own knowledge — using the calculator for arithmetic, Wikipedia for factual retrieval, the calendar for date calculations. Tool use was selective: not every mention of a topic triggered a call, only when the model's own generation would likely be wrong.

Toolformer vs. modern function calling

Aspect	Toolformer	GPT-4 Function Calling
Tool specification	Hard-coded in training	JSON schema at inference time
New tools	Requires new annotated data and fine-tuning	Pass a new schema with the request; no retraining
Flexibility	Fixed tool set	Arbitrary tools, dynamic schemas
Reliability for known tools	Learned in fine-tuning; consistent on the trained tools	Depends on schema quality and model reasoning
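
To make the "JSON schema at inference time" row concrete, here is a sketch of a tool declared as data rather than learned in training. The dictionary follows the JSON Schema style that function-calling APIs commonly use; exact field names vary by provider, and `calculator_tool` and `validate_arguments` are hypothetical names for this illustration.

```python
import json

# A tool described as data. Field layout is an assumption modelled on
# common function-calling APIs; check your provider's reference.
calculator_tool = {
    "type": "function",
    "function": {
        "name": "calculator",
        "description": "Evaluate a basic arithmetic expression.",
        "parameters": {
            "type": "object",
            "properties": {
                "expression": {
                    "type": "string",
                    "description": "Expression to evaluate, e.g. '400 / 1400'.",
                },
            },
            "required": ["expression"],
        },
    },
}


def validate_arguments(tool: dict, raw_arguments: str) -> dict:
    """Minimal check of model-produced arguments against the declared schema."""
    args = json.loads(raw_arguments)
    schema = tool["function"]["parameters"]
    missing = [key for key in schema["required"] if key not in args]
    unknown = [key for key in args if key not in schema["properties"]]
    if missing or unknown:
        raise ValueError(f"missing={missing} unknown={unknown}")
    return args


print(validate_arguments(calculator_tool, '{"expression": "400 / 1400"}'))
```

Swapping in a different tool means editing this dictionary, which is the flexibility the table contrasts with Toolformer's fixed, trained-in tool set. The validation step reflects the other side of that trade: the arguments are generated from a description, so they are worth checking before execution.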

The principle that matters for builders

Only route to a tool when the model's own generation is likely unreliable for that task type. A model that calls a tool every time a topic is mentioned adds latency and cost without improving quality. Tool use should be selective and purposeful — Toolformer's perplexity filter is the right mental model.
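
One way to apply this principle in an application is a simple gate in front of the tool call. The sketch below is an illustration only: the task types, the `route` function, and especially the accuracy numbers are made-up placeholders, standing in for measurements you would take on your own evaluation set.

```python
from typing import Dict, Optional

# PLACEHOLDER numbers, not real results: accuracy of the model answering
# alone on each task type, as you might measure on your own eval set.
MODEL_ONLY_ACCURACY: Dict[str, float] = {
    "arithmetic": 0.55,
    "date_math": 0.60,
    "small_talk": 0.98,
}

# Hypothetical mapping from task type to the tool that could handle it.
TOOL_FOR_TASK: Dict[str, str] = {
    "arithmetic": "calculator",
    "date_math": "calendar",
}


def route(task_type: str, min_accuracy: float = 0.9) -> Optional[str]:
    """Return a tool name only when the model alone is likely unreliable."""
    tool = TOOL_FOR_TASK.get(task_type)
    if tool is None:
        return None  # no tool exists for this task type
    # Unmeasured task types default to 0.0, i.e. treated as unreliable.
    accuracy = MODEL_ONLY_ACCURACY.get(task_type, 0.0)
    return tool if accuracy < min_accuracy else None


print(route("arithmetic"))   # calculator
print(route("small_talk"))   # None
```

The threshold plays the same role as the filter's threshold in Toolformer: it sets how much a tool must be expected to help before its latency and cost are worth paying.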
