Structured Outputs from LLMs: JSON Mode, Function Calling, and Tool Use
How to make LLMs output reliable, parseable data. JSON mode, OpenAI function calling, Pydantic validation, and when structured outputs break.
Getting an LLM to return valid JSON consistently is one of the most common production challenges. The model knows the format — it just doesn't always follow it. Structured outputs are the set of techniques that make reliable machine-readable output possible.
Why free-form text fails in production
An LLM generating natural language text might return: "Sure! Here is the JSON you requested: {\"name\": \"Alice\"}" — with preamble text that breaks JSON.parse(). Or it might use single quotes instead of double quotes. Or omit required fields. Or invent fields that don't exist in your schema. Any of these crashes your pipeline.
At 10K requests/day, even a 0.5% malformed output rate means 50 crashes per day. Free-form text output is not acceptable for any production pipeline that needs to parse the response.
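These failures are easy to reproduce with Python's standard json module. The following is a minimal sketch; the sample strings are made-up model outputs, and try_parse and failures_per_day are illustrative helper names rather than part of any library.

```python
import json

# Hypothetical raw model outputs, one per failure mode described above.
SAMPLES = {
    "preamble": 'Sure! Here is the JSON you requested: {"name": "Alice"}',
    "single_quotes": "{'name': 'Alice'}",
    "clean": '{"name": "Alice"}',
}

def try_parse(text):
    """Return the parsed object, or None if the text is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None

results = {label: try_parse(text) for label, text in SAMPLES.items()}

def failures_per_day(requests_per_day, malformed_rate):
    """Expected number of unparseable responses per day."""
    return requests_per_day * malformed_rate
```

Only the clean sample parses. Missing or invented fields are harder to spot because they parse without error and only fail later, when your code reads a key that is not there.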
Approach 1: JSON mode
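Most major APIs offer a JSON mode that constrains the model to emit only syntactically valid JSON. The model still chooses the structure, and the output will parse as long as generation finishes normally. A response cut off by the token limit (a finish_reason of "length" in the OpenAI API) can still end mid-object.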
import json
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    response_format={"type": "json_object"},
    messages=[
        {"role": "system", "content": "Return a JSON object with keys: name, age, role"},
        {"role": "user", "content": "Extract info: Alice is a 30-year-old engineer"}
    ]
)
# Parses as JSON unless the response was truncated; the keys are still up to the model
data = json.loads(response.choices[0].message.content)
Always describe the expected schema in your system prompt when using JSON mode. The mode guarantees valid JSON but not the right keys. Specify every field name and type explicitly.
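Because JSON mode leaves the keys to the model, it helps to check the shape yourself before using the data. Here is a minimal sketch with the standard library only; EXPECTED_FIELDS and check_shape are hypothetical names, and the field list is assumed from the example prompt above.

```python
import json

# Assumed schema for the example prompt: field name -> expected Python type.
EXPECTED_FIELDS = {"name": str, "age": int, "role": str}

def check_shape(raw_json, expected=EXPECTED_FIELDS):
    """Parse JSON-mode output and return a list of shape problems (empty if OK)."""
    data = json.loads(raw_json)
    if not isinstance(data, dict):
        return ["top-level value is not an object"]
    problems = []
    for key, expected_type in expected.items():
        if key not in data:
            problems.append(f"missing key: {key}")
        # bool is a subclass of int in Python, so reject it explicitly
        # (none of the expected fields here are booleans).
        elif isinstance(data[key], bool) or not isinstance(data[key], expected_type):
            problems.append(f"wrong type for {key}: {type(data[key]).__name__}")
    for key in sorted(data.keys() - expected.keys()):
        problems.append(f"unexpected key: {key}")
    return problems
```

An empty list means the response has exactly the expected keys with the expected types. Anything else can be logged or fed back to the model.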
Approach 2: Structured outputs with schema
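OpenAI's structured outputs feature lets you pass a JSON Schema that the model's output is constrained to match. Anthropic's tool use accepts a JSON Schema in the same way, but unless the provider documents strict enforcement for your model and settings, treat the schema as strong guidance and validate the result. Schema enforcement is stronger than JSON mode: not only is the output valid JSON, it also matches your schema, including required fields and types.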
from pydantic import BaseModel
from openai import OpenAI

class PersonExtraction(BaseModel):
    name: str
    age: int
    role: str
    confidence: float  # 0-1

client = OpenAI()
response = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": "Alice is a 30-year-old senior engineer"}
    ],
    response_format=PersonExtraction,
)
message = response.choices[0].message
if message.refusal:  # the model declined, so nothing was parsed
    raise RuntimeError(message.refusal)
person = message.parsed
print(person.name, person.age)  # typed attributes instead of dict lookups
Approach 3: Function calling / tool use
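Function calling was the original structured output mechanism. You define a function signature with a JSON Schema, and the model decides when to call it and with what arguments. The arguments usually follow your schema closely, but they are only guaranteed to match it when the provider's strict mode is enabled (for example, strict: true on OpenAI function definitions). Otherwise, validate them before use.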
import anthropic

client = anthropic.Anthropic()
tools = [{
    "name": "extract_person",
    "description": "Extract person information from text",
    "input_schema": {
        "type": "object",
        "properties": {
            "name": {"type": "string"},
            "age": {"type": "integer"},
            "role": {"type": "string"},
            "seniority": {"type": "string", "enum": ["junior", "mid", "senior", "staff"]}
        },
        "required": ["name", "age", "role"]
    }
}]
response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=1024,
    tools=tools,
    # Force the tool call; otherwise the model may answer in plain text
    tool_choice={"type": "tool", "name": "extract_person"},
    messages=[{"role": "user", "content": "Alice is a 30yr senior engineer"}]
)
# Pick the tool_use block instead of assuming it is first in the content list
tool_use = next(block for block in response.content if block.type == "tool_use")
tool_input = tool_use.input  # dict of arguments; validate before trusting it
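In an Anthropic response, content is a list that can hold text blocks alongside tool_use blocks, so reading a fixed position is fragile. The helper below is one possible sketch of a defensive lookup. The name first_tool_input is hypothetical, and the SimpleNamespace objects are stand-ins that mimic only the type, name, and input attributes of real SDK blocks.

```python
from types import SimpleNamespace

def first_tool_input(content_blocks, tool_name):
    """Return the input dict of the first tool_use block for tool_name, else None."""
    for block in content_blocks:
        if getattr(block, "type", None) == "tool_use" and block.name == tool_name:
            return block.input
    return None

# Stand-in for response.content: a text block followed by a tool_use block.
fake_content = [
    SimpleNamespace(type="text", text="I'll extract that for you."),
    SimpleNamespace(
        type="tool_use",
        name="extract_person",
        input={"name": "Alice", "age": 30, "role": "engineer"},
    ),
]
tool_input = first_tool_input(fake_content, "extract_person")
```

Returning None when no tool call is present lets the caller decide whether to retry, fall back, or surface the model's text reply.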
Approach 4: Constrained decoding
Libraries like Outlines and Guidance constrain the model's token sampling at decode time so that only tokens valid under a grammar, regex, or JSON Schema can be emitted. This is the strongest guarantee: tokens that would break the format are masked out before sampling, so the model cannot produce structurally invalid output.
import outlines
from pydantic import BaseModel

class Person(BaseModel):
    name: str
    age: int

# Written against the pre-1.0 Outlines API; check the docs for your installed version
model = outlines.models.transformers("mistralai/Mistral-7B-v0.1")
generator = outlines.generate.json(model, Person)
result = generator("Extract: Alice is 30 years old")
# result is a Person instance whose structure was enforced during decoding
Constrained decoding is the gold standard for open-source/self-hosted models. For API models, structured outputs with schema validation is equivalent in practice — the provider enforces the schema server-side.
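To make the mechanism concrete, here is a toy sketch of token masking. It is not how Outlines or Guidance are implemented. The vocabulary, the hand-written state machine, and the fake_scores function are all invented for illustration, standing in for a real tokenizer, a compiled grammar, and model logits.

```python
import json

# Toy vocabulary; a real tokenizer has tens of thousands of entries.
VOCAB = ["Sure", "!", '{"age": ', "3", "0", "}"]

def allowed_tokens(state):
    """Tiny hand-written state machine for the format {"age": <digits>}."""
    if state == "start":
        return {'{"age": '}
    if state == "first_digit":
        return {"3", "0"}
    if state == "more_digits":
        return {"3", "0", "}"}
    return set()  # "done": nothing further is allowed

def next_state(state, token):
    if state == "start":
        return "first_digit"
    if token == "}":
        return "done"
    return "more_digits"

def fake_scores(step):
    """Stand-in for model logits: this 'model' prefers chatty preamble."""
    preferences = [
        {"Sure": 5.0, '{"age": ': 1.0},
        {"!": 4.0, "3": 3.0, "0": 1.0},
        {"Sure": 3.0, "0": 2.0, "}": 1.5},
        {"!": 2.5, "}": 2.0, "0": 0.5},
    ]
    return preferences[step] if step < len(preferences) else {}

def constrained_greedy_decode(max_steps=10):
    state, output = "start", []
    for step in range(max_steps):
        allowed = allowed_tokens(state)
        if not allowed:
            break
        scores = fake_scores(step)
        # Mask: keep only tokens the grammar allows, then pick the best one.
        candidates = [token for token in VOCAB if token in allowed]
        token = max(candidates, key=lambda t: scores.get(t, float("-inf")))
        output.append(token)
        state = next_state(state, token)
    return "".join(output)

result = json.loads(constrained_greedy_decode())
```

At every step the highest-scoring token overall is preamble or punctuation, yet the decoded string still parses, because those tokens are never candidates. The constraint controls structure only. The values inside it remain the model's choice.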
Approach 5: Retry with validation
For legacy setups or models without native structured output support, the fallback is: parse the output, catch validation errors, and retry with the error message included in the next prompt. This is brittle but workable at low volume.
import json, re
from jsonschema import validate, ValidationError

def get_structured(prompt, schema, max_retries=3):
    # llm() is a placeholder for your own client call: messages in, text out
    messages = [{"role": "user", "content": prompt}]
    for attempt in range(max_retries):
        response = llm(messages)
        try:
            # Strip markdown code fences if present
            text = re.sub(r'```(?:json)?\n?|\n?```', '', response).strip()
            data = json.loads(text)
            validate(data, schema)
            return data
        except (json.JSONDecodeError, ValidationError) as e:
            # Feed the error back so the next attempt can correct it
            messages.append({"role": "assistant", "content": response})
            messages.append({"role": "user", "content": f"Invalid output: {e}. Please fix and return only valid JSON."})
    raise ValueError(f"Failed after {max_retries} attempts")
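Stripping code fences with a regex does not help when the model wraps the JSON in prose. One alternative, sketched here with the standard library only, is to scan for the first position where a JSON object decodes cleanly. The function name extract_first_json_object is hypothetical.

```python
import json

def extract_first_json_object(text):
    """Return the first JSON object embedded in text, or None if there is none."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one value starting at `start` and ignores what follows.
            obj, _end = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            # This "{" did not begin valid JSON; try the next one.
            start = text.find("{", start + 1)
    return None
```

This tolerates preamble, trailing commentary, and code fences alike. It does not repair invalid JSON such as single-quoted strings, so the retry loop is still needed for those cases.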
Which approach to use
| Approach | Guarantee | Latency hit | Best for |
|---|---|---|---|
| Free-form + regex | None | None | Prototypes only |
| JSON mode | Valid JSON | ~0% | Simple extraction, any structure |
| Structured outputs | Schema match | ~0% | Production API pipelines |
| Function calling | Schema match | ~0% | Agentic workflows, tool invocation |
| Constrained decoding | Grammar exact | 5–15% | Self-hosted, high-stakes outputs |
| Retry with validation | Eventually | High | Legacy fallback only |
Production failure modes — and how to catch them
Even with structured outputs enabled, production pipelines break. These are the failure modes that get past your JSON parser and into your application logic:
| Failure mode | Example | Catch with |
|---|---|---|
| Null fields | Model omits an optional field — downstream code throws KeyError | Default values in Pydantic model or explicit None checks |
| Out-of-range values | confidence: 1.7 (should be 0–1) | Pydantic validator: @field_validator |
| Enum violation | role: 'contractor' when schema only allows employee or manager | Literal type in Pydantic or enum field |
| Hallucinated keys | Model adds extra fields not in schema | Pydantic's model_config = {'extra': 'forbid'} |
| Empty arrays | items: [] when at least 1 was required | min_length constraint on list field |
| Semantic invalidity | start_date > end_date (structurally valid, logically wrong) | Cross-field validator in Pydantic |
from pydantic import BaseModel, field_validator, model_validator
from typing import Literal
from datetime import date

class ProjectExtraction(BaseModel):
    model_config = {"extra": "forbid"}  # reject unknown fields

    name: str
    status: Literal["active", "archived", "draft"]
    priority: int  # 1-5
    start_date: date
    end_date: date | None = None
    tags: list[str] = []  # optional, defaults to empty

    @field_validator("priority")
    @classmethod
    def priority_range(cls, v):
        if not 1 <= v <= 5:
            raise ValueError(f"Priority must be 1-5, got {v}")
        return v

    @model_validator(mode="after")
    def end_after_start(self):
        if self.end_date and self.end_date < self.start_date:
            raise ValueError("end_date must be >= start_date")
        return self
Streaming structured outputs
For long structured outputs, waiting for the full JSON object before showing anything creates a poor UX. Both OpenAI and Anthropic support streaming, but a partially streamed JSON document cannot be read by a standard parser until the last token arrives. The baseline pattern: stream raw tokens, accumulate them, and parse the complete object at the end. For partial display, use a streaming JSON parser (ijson in Python) to extract completed fields as they arrive.
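The accumulate-then-parse pattern can be sketched with the standard library. The chunks below are hypothetical, standing in for text deltas from a streaming API, and consume_stream is an illustrative name. The per-chunk parse attempt is there only to show when the buffer becomes valid; it re-parses the whole buffer each time and would be wasteful on long outputs.

```python
import json

def consume_stream(chunks):
    """Accumulate streamed text chunks and parse once the stream ends.

    Also records the index of the chunk after which the buffer first parsed.
    """
    buffer = ""
    first_valid_at = None
    for index, chunk in enumerate(chunks):
        buffer += chunk
        if first_valid_at is None:
            try:
                json.loads(buffer)
                first_valid_at = index
            except json.JSONDecodeError:
                pass  # still incomplete; keep accumulating
    return json.loads(buffer), first_valid_at

# Hypothetical chunks, split at arbitrary points as a streaming API might.
chunks = ['{"na', 'me": "Al', 'ice", "a', 'ge": 3', '0}']
person, first_valid_at = consume_stream(chunks)
```

The buffer only becomes parseable on the final chunk, which is why partial display needs an incremental parser rather than repeated json.loads calls.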
For very long structured outputs (e.g., a detailed report with 20 fields), consider breaking them into multiple API calls with smaller schemas. One call per section, one schema per call. This avoids token budget issues and makes partial streaming much simpler.
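Splitting by section might look like the sketch below. Everything here is illustrative: SECTION_SCHEMAS lists assumed required keys per section, and fake_llm returns canned JSON in place of a real API call.

```python
import json

# Hypothetical per-section schemas: section name -> required keys.
SECTION_SCHEMAS = {
    "summary": ["title", "abstract"],
    "risks": ["items"],
}

def fake_llm(section, document):
    """Stand-in for a real API call that returns JSON text for one section."""
    canned = {
        "summary": '{"title": "Q3 report", "abstract": "Revenue grew."}',
        "risks": '{"items": ["churn", "latency"]}',
    }
    return canned[section]

def build_report(document, call_llm=fake_llm):
    """Make one call per section, validate each small result, then merge."""
    report = {}
    for section, required_keys in SECTION_SCHEMAS.items():
        data = json.loads(call_llm(section, document))
        missing = [key for key in required_keys if key not in data]
        if missing:
            raise ValueError(f"{section}: missing keys {missing}")
        report[section] = data
    return report

report = build_report("...full source text...")
```

A failure in one section can then be retried on its own, and each completed section can be displayed as soon as its call returns.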
Testing structured output schemas
Your schema is code. It should have tests. At minimum: test that valid outputs parse correctly, test that common malformed outputs trigger the right validation errors, and test that edge cases (empty strings, None values, extreme numbers) are handled according to your schema's intent. Use Pydantic's model_validate() with pytest parametrize for clean coverage.
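A parametrized test file for a schema could look like the sketch below, assuming Pydantic v2 and pytest are installed. The Item model and its payloads are hypothetical, chosen to cover the edge cases listed above.

```python
import pytest
from pydantic import BaseModel, Field, ValidationError

class Item(BaseModel):
    # Hypothetical schema used only for this example.
    model_config = {"extra": "forbid"}
    name: str = Field(min_length=1)
    confidence: float = Field(ge=0, le=1)
    tags: list[str] = Field(min_length=1)

@pytest.mark.parametrize("payload", [
    {"name": "Alice", "confidence": 0.9, "tags": ["engineer"]},
    {"name": "Bob", "confidence": 0, "tags": ["a", "b"]},
])
def test_valid_payloads_parse(payload):
    item = Item.model_validate(payload)
    assert item.name == payload["name"]

@pytest.mark.parametrize("payload, bad_field", [
    ({"name": "", "confidence": 0.5, "tags": ["x"]}, "name"),          # empty string
    ({"name": "A", "confidence": 1.7, "tags": ["x"]}, "confidence"),   # out of range
    ({"name": "A", "confidence": 0.5, "tags": []}, "tags"),            # empty list
    ({"name": "A", "confidence": None, "tags": ["x"]}, "confidence"),  # None value
    ({"name": "A", "confidence": 0.5, "tags": ["x"], "x": 1}, "x"),    # extra key
])
def test_invalid_payloads_name_the_bad_field(payload, bad_field):
    with pytest.raises(ValidationError) as excinfo:
        Item.model_validate(payload)
    # Each error carries a "loc" tuple whose first element is the field name.
    failing_fields = {error["loc"][0] for error in excinfo.value.errors()}
    assert bad_field in failing_fields
```

Asserting on the failing field, not just on the exception type, catches the case where a payload is rejected for a different reason than the one the test was written for.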