Structured Outputs: JSON Mode, Tool Calling, and Constrained Decoding
How to reliably get JSON, tables, and typed data from LLMs. JSON mode vs tool calling vs grammar-constrained decoding — what each guarantees and where each breaks.
Getting an LLM to return valid JSON sounds trivial until you've spent three hours debugging why your production pipeline intermittently returns a response that starts with 'Sure! Here's the JSON you asked for:' followed by a code block with a trailing comma. Structured outputs are a solved problem — but only if you pick the right tool.
This post maps the four main approaches: prompting, JSON mode, tool/function calling, and grammar-constrained decoding. Each gives different guarantees, and those differences matter at production scale.
Why LLMs Drift from Format Instructions
LLMs are trained to predict the most probable next token, not to follow schema rules. When you write 'respond with JSON', the model has seen billions of examples of humans responding to similar instructions with helpful prose explanations, code blocks, or hybrid formats. The instruction competes with those learned distributions.
- Models hallucinate extra prose before or after the JSON
- Models produce single-quoted strings instead of double-quoted
- Models include trailing commas (valid in JS, invalid in JSON)
- Long arrays get cut off mid-structure when the response hits the output token limit or the context window
- Optional fields get invented when not present in source data
- Nested objects deepen unpredictably when instructions are ambiguous
Failure rate on 'just add JSON instructions' prompting is typically 5–15% in production, depending on model and schema complexity. At 10,000 requests/day that's 500–1,500 parse errors. You need a structural guarantee, not a polite request.
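To make the failure modes above concrete, here is a small sketch that feeds a strict parser one sample of each. The sample strings are invented for illustration, not captured model output.

```python
import json

# Invented samples, one per failure mode listed above (plus one valid control).
samples = {
    "preamble": 'Sure! Here is the JSON you asked for:\n{"name": "Ada"}',
    "single_quotes": "{'name': 'Ada'}",
    "trailing_comma": '{"name": "Ada",}',
    "truncated": '{"tags": ["a", "b"',
    "valid": '{"name": "Ada"}',
}

def parses(text):
    """Return True if the text is a complete, strictly valid JSON document."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

results = {label: parses(text) for label, text in samples.items()}
for label, ok in results.items():
    print(f"{label:15} {'ok' if ok else 'PARSE ERROR'}")
```

Only the control sample parses. Every other variant looks almost right to a human reader, which is why these errors tend to surface late.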
Approach 1: Prompting (No Guarantees)
Prompting-only means asking the model to return JSON in the system or user prompt, with or without a schema. This requires no special API features but gives zero structural guarantees.
system = """You are a data extraction assistant. Always respond with valid JSON.
Schema: {"name": string, "age": integer, "email": string}
Never include any text before or after the JSON object."""
# Works most of the time. Fails 5-15% of the time in production.
# Failures: preamble text, invalid JSON, schema violations
Use prompting only for low-stakes, low-volume scenarios where a parse error is acceptable and retry is cheap.
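If you do stay with prompting, a salvage step can recover some of the responses that wrap the JSON in prose or a code block. The helper below is a minimal sketch using only the standard library. The name extract_first_json_object is hypothetical, and it only rescues responses that contain a well-formed object somewhere in the text.

```python
import json

def extract_first_json_object(text):
    """Return the first well-formed JSON object found in text, or None."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start`
            # and ignores whatever follows it.
            obj, _end = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            # This "{" did not open a valid object; try the next one.
            start = text.find("{", start + 1)
    return None

reply = 'Sure! Here is the JSON you asked for:\n```json\n{"name": "Ada", "age": 36}\n```'
print(extract_first_json_object(reply))
```

This handles preambles and code fences, but not trailing commas or single quotes. Treat it as a way to reduce retries, not as a guarantee.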
Approach 2: JSON Mode
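JSON mode (OpenAI's response_format={"type": "json_object"}, with similar switches on some other providers) is designed to return syntactically valid JSON. It does NOT guarantee your schema: the model can return valid JSON that doesn't match your expected structure. Two caveats apply on OpenAI: the messages must mention JSON somewhere, and a response that is cut off by the token limit (finish_reason == "length") can still be incomplete. Support differs between providers, so check the documentation of the one you use.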
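import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    response_format={"type": "json_object"},
    messages=[
        # JSON mode requires the word "JSON" to appear somewhere in the messages
        {"role": "system", "content": "Extract entity data as JSON with fields: name, age, email"},
        {"role": "user", "content": user_input},  # user_input: the text to extract from
    ],
)
data = json.loads(response.choices[0].message.content)
# Guaranteed: valid JSON syntax (unless the response was truncated)
# NOT guaranteed: correct fields, correct types, no hallucinated fields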
JSON mode prevents syntax errors but not schema errors. You still need Pydantic or Zod validation after parsing. The guarantee is that json.loads() will not throw on a response that finished normally. It says nothing about data shape.
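The gap between valid JSON and the right shape is easy to demonstrate. The sketch below uses a hand-rolled checker as a dependency-free stand-in for Pydantic. The shape_problems helper and the sample payload are hypothetical.

```python
import json

EXPECTED = {"name": str, "age": int, "email": str}

def shape_problems(data, expected=EXPECTED):
    """List the ways a parsed JSON value deviates from a flat field/type spec."""
    if not isinstance(data, dict):
        return [f"expected an object, got {type(data).__name__}"]
    problems = []
    for field, expected_type in expected.items():
        if field not in data:
            problems.append(f"missing field: {field}")
        # bool is a subclass of int in Python, so reject it explicitly.
        elif isinstance(data[field], bool) or not isinstance(data[field], expected_type):
            problems.append(f"wrong type for {field}: {type(data[field]).__name__}")
    for field in data:
        if field not in expected:
            problems.append(f"unexpected field: {field}")
    return problems

# Parses without error, yet is wrong in three different ways.
raw = '{"full_name": "John Smith", "age": "34", "email": "john@example.com"}'
problems = shape_problems(json.loads(raw))
print(problems)
```

In a real pipeline a schema library gives you this plus nested models, coercion rules, and better error messages. The point here is only that json.loads() succeeding tells you nothing about these three problems.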
Approach 3: Tool Calling / Function Calling
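Tool calling (OpenAI function calling, Anthropic tool use) lets you define a JSON Schema for the expected output. The model is trained to call the tool with arguments that match the schema, which makes conforming output far more likely than with prompting alone. On its own this is still best effort: without strict mode, the arguments can occasionally omit a required field, use the wrong type, or fail to parse, so validate them.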
import json

# Assumes client = OpenAI() and user_input are already defined.
tools = [{
    "type": "function",
    "function": {
        "name": "extract_entity",
        "description": "Extract entity information from text",
        "parameters": {
            "type": "object",
            "properties": {
                "name": {"type": "string", "description": "Full name"},
                "age": {"type": "integer", "description": "Age in years"},
                "email": {"type": "string", "format": "email"},
            },
            "required": ["name", "age", "email"],
            "additionalProperties": False,  # Python literal, not JSON's false
        },
    },
}]
response = client.chat.completions.create(
    model="gpt-4o",
    tools=tools,
    # Force the model to call this specific tool
    tool_choice={"type": "function", "function": {"name": "extract_entity"}},
    messages=[{"role": "user", "content": user_input}],
)
args = json.loads(response.choices[0].message.tool_calls[0].function.arguments)
# Very likely: valid JSON + matches schema types + required fields present
# Guaranteed only when strict mode is enabled on the function definition
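OpenAI's 'Structured Outputs' mode (August 2024) extended this further: when you set strict to true on the function definition, decoding is constrained by a grammar derived from your schema, so the model can only produce tokens that remain valid under it. Strict mode accepts a subset of JSON Schema; for example, every property must be listed in required and every object must set additionalProperties to false. This is the highest-reliability option in the OpenAI API.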
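Because strict mode rejects schemas that break its rules, it can help to lint a schema before sending it. The sketch below checks only two requirements, all properties listed in required and additionalProperties set to false on every object, and it is not a complete validator. The strict_mode_issues helper and the sample schema are hypothetical.

```python
def strict_mode_issues(schema, path="$"):
    """Report objects that violate two strict-mode rules (a partial check)."""
    issues = []
    if schema.get("type") == "object":
        props = schema.get("properties", {})
        if schema.get("additionalProperties") is not False:
            issues.append(f"{path}: additionalProperties must be false")
        missing = sorted(set(props) - set(schema.get("required", [])))
        if missing:
            issues.append(f"{path}: not listed in required: {missing}")
        # Recurse into nested properties.
        for name, sub in props.items():
            issues.extend(strict_mode_issues(sub, f"{path}.{name}"))
    elif schema.get("type") == "array" and isinstance(schema.get("items"), dict):
        issues.extend(strict_mode_issues(schema["items"], f"{path}[]"))
    return issues

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "address": {
            "type": "object",
            "properties": {"city": {"type": "string"}, "zip": {"type": "string"}},
            "required": ["city"],
        },
    },
    "required": ["name", "address"],
    "additionalProperties": False,
}
issues = strict_mode_issues(schema)
print("\n".join(issues))
```

The top level passes, but the nested address object fails both checks. Nested objects are where these rules are most often missed.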
Approach 4: Grammar-Constrained Decoding
For self-hosted models, libraries like Outlines, llama.cpp (GBNF grammars), and Guidance provide grammar-constrained decoding: the token sampling is mathematically constrained to only produce tokens that could lead to a valid output under your grammar or schema. Invalid token probabilities are zeroed out at each step.
import outlines
from pydantic import BaseModel

# API as in Outlines 0.x releases; newer versions may expose a different interface
model = outlines.models.transformers("mistralai/Mistral-7B-v0.1")

class Entity(BaseModel):
    name: str
    age: int
    email: str

generator = outlines.generate.json(model, Entity)
# Every completed output is GUARANTEED to be a valid Entity instance
result = generator("Extract: John Smith, 34, john@example.com")
# result is already a validated Pydantic object, not a string
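Outlines/GBNF decoding gives the strongest guarantees, but the masking step can add roughly 10-20% latency per token. The exact overhead depends on the library, the grammar, and how much of the mask can be precomputed. It is worth it for complex nested schemas where validation failures are expensive.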
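The masking step itself is simple to sketch. The toy below is not how Outlines or llama.cpp are implemented. It assumes a five-slot grammar that accepts only {"ok":true} or {"ok":false}, and a made-up table of scores standing in for model logits.

```python
import json
import math

# Toy grammar: each slot lists the tokens allowed at that position.
GRAMMAR_SLOTS = [{"{"}, {'"ok"'}, {":"}, {"true", "false"}, {"}"}]

# Hypothetical scores, identical at every step: this "model" would rather
# chat ("Sure!") or use single quotes than emit valid JSON.
FAKE_LOGITS = {
    "Sure!": 5.0, "'ok'": 4.0, "{": 3.0, '"ok"': 2.5,
    ":": 2.0, "true": 1.5, "false": 1.0, "}": 0.5, ",": 0.2,
}

def greedy_decode(constrained):
    """Pick the highest-scoring token per step, optionally masked by the grammar."""
    output = []
    for allowed in GRAMMAR_SLOTS:
        scores = dict(FAKE_LOGITS)
        if constrained:
            # Mask step: forbidden tokens get -inf, i.e. probability 0 after softmax.
            scores = {tok: (s if tok in allowed else -math.inf) for tok, s in scores.items()}
        output.append(max(scores, key=scores.get))
    return "".join(output)

unconstrained = greedy_decode(constrained=False)
constrained = greedy_decode(constrained=True)
print(unconstrained)
print(constrained, json.loads(constrained))
```

Real libraries track a parser or automaton state instead of a fixed position, and they apply the mask to the full vocabulary. The principle is the same: the model still ranks the tokens, but only among the ones the grammar allows.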
Validation Patterns
Even with JSON mode or tool calling, always validate the parsed output:
import json

from pydantic import BaseModel, ValidationError

class Entity(BaseModel):
    name: str
    age: int
    email: str

def extract_entity(text: str, max_retries: int = 3) -> Entity:
    for attempt in range(max_retries):
        # call_llm_with_json_mode is a placeholder for your own API wrapper;
        # it should return the raw response string
        response = call_llm_with_json_mode(text)
        try:
            data = json.loads(response)
            return Entity(**data)
        except (json.JSONDecodeError, ValidationError) as e:
            if attempt == max_retries - 1:
                raise
            # On retry, add error context to the prompt
            text = f"{text}\nPrevious attempt failed: {e}. Try again."
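To see the retry loop work without an API key, here is a dependency-free variant driven by a stubbed model. Everything in it is hypothetical: validate_entity stands in for the Pydantic model, and fake_llm returns canned responses, the first of which has the wrong type for age.

```python
import json

def validate_entity(data):
    """Stand-in for a schema library: raise ValueError on a bad shape."""
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    if not isinstance(data.get("name"), str):
        raise ValueError("name must be a string")
    age = data.get("age")
    if isinstance(age, bool) or not isinstance(age, int):
        raise ValueError("age must be an integer")
    return data

def extract_with_retries(call_llm, text, max_retries=3):
    prompt = text
    last_error = None
    for _ in range(max_retries):
        raw = call_llm(prompt)
        try:
            return validate_entity(json.loads(raw))
        except ValueError as error:  # json.JSONDecodeError is a ValueError subclass
            last_error = error
            # Rebuild from the original text so error notes do not pile up.
            prompt = f"{text}\nPrevious attempt failed: {error}. Try again."
    raise RuntimeError(f"no valid output after {max_retries} attempts") from last_error

# Stubbed model: first response has a string age, second is correct.
prompts_seen = []
canned = iter(['{"name": "John Smith", "age": "34"}', '{"name": "John Smith", "age": 34}'])

def fake_llm(prompt):
    prompts_seen.append(prompt)
    return next(canned)

entity = extract_with_retries(fake_llm, "Extract: John Smith, 34")
print(entity, len(prompts_seen))
```

One deliberate difference from the function above: this version rebuilds the retry prompt from the original text each time instead of appending to it, which keeps the prompt short across several failures. Either choice is defensible.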
When Each Approach Wins
| Approach | Syntax guarantee | Schema guarantee | Cost | Best for |
|---|---|---|---|---|
| Prompting only | None | None | Lowest | Prototyping, very simple schemas |
| JSON mode | Yes | No | Low | Flexible schemas, catching syntax errors |
| Tool calling (strict) | Yes | Yes | Low | Production on OpenAI/Anthropic APIs |
| Grammar decoding (Outlines) | Yes | Yes (any grammar) | Medium (+latency) | Self-hosted, complex/recursive schemas |
Common Failure Modes
- Nested objects: the deeper the nesting, the more likely the model loses track of which bracket level it's at — especially without constrained decoding
- Optional fields: models often invent values for optional fields rather than omitting them, unless the schema explicitly says 'omit if unknown'
- Long arrays: models truncate arrays mid-item when the response hits the output token limit or the context window; validate array length and last element integrity
- Enum values: models sometimes produce values outside your enum, especially for string enums with many options — explicit enumeration in the schema helps
- Number precision: most JSON parsers read numbers as binary floats, so decimal values are not exact (0.1 + 0.2 evaluates to 0.30000000000000004); use string types for currency and validate with Decimal
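The number precision point can be checked with the standard library alone. This sketch shows both options: parsing JSON numbers straight into Decimal with the parse_float hook of json.loads, and carrying currency as a string. The payload is invented for illustration.

```python
import json
from decimal import Decimal

raw = '{"price": 0.1, "tax": 0.2, "amount": "19.99"}'

# Default parsing: JSON numbers become binary floats.
as_float = json.loads(raw)
float_total = as_float["price"] + as_float["tax"]

# parse_float=Decimal builds each number from its exact decimal text.
as_decimal = json.loads(raw, parse_float=Decimal)
decimal_total = as_decimal["price"] + as_decimal["tax"]

# Currency sent as a string converts to Decimal without passing through float.
amount = Decimal(as_decimal["amount"])

print(float_total == 0.3, decimal_total == Decimal("0.3"), amount)
```

The string route is the more portable of the two, since it does not depend on how each consumer's JSON parser treats numbers.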