
Structured Outputs: JSON Mode, Tool Calling, and Constrained Decoding

How to reliably get JSON, tables, and typed data from LLMs. JSON mode vs tool calling vs grammar-constrained decoding — what each guarantees and where each breaks.

Getting an LLM to return valid JSON sounds trivial until you've spent three hours debugging why your production pipeline intermittently returns a response that starts with 'Sure! Here's the JSON you asked for:' followed by a code block with a trailing comma. Structured outputs are a solved problem — but only if you pick the right tool.

This post maps the four main approaches: prompting, JSON mode, tool/function calling, and grammar-constrained decoding. Each gives different guarantees, and those differences matter at production scale.

Why LLMs Drift from Format Instructions

LLMs are trained to predict the most probable next token, not to follow schema rules. When you write 'respond with JSON', the model has seen billions of examples of humans responding to similar instructions with helpful prose explanations, code blocks, or hybrid formats. The instruction competes with those learned distributions.

Models hallucinate extra prose before or after the JSON
Models produce single-quoted strings instead of double-quoted
Models include trailing commas (valid in JS, invalid in JSON)
Long arrays get cut off mid-structure when the response hits the output token limit
Optional fields get invented when not present in source data
Nested objects deepen unpredictably when instructions are ambiguous
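
Several of these failures are easy to reproduce with the standard library parser. The sketch below feeds a few hypothetical model outputs (invented for illustration, not captured from a real model) to json.loads and records which ones survive.

```python
import json

# Hypothetical model outputs, one per failure mode from the list above.
samples = {
    "preamble": 'Sure! Here is the JSON you asked for:\n{"name": "Ada"}',
    "single_quotes": "{'name': 'Ada'}",
    "trailing_comma": '{"name": "Ada", "age": 36,}',
    "truncated": '{"tags": ["a", "b", "c',
    "valid": '{"name": "Ada", "age": 36}',
}


def parses(text: str) -> bool:
    """Return True if text is a complete, strictly valid JSON document."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


results = {label: parses(text) for label, text in samples.items()}
for label, ok in results.items():
    print(f"{label:15} {'ok' if ok else 'PARSE ERROR'}")
```

Only the last sample parses. The other four look like JSON to a human reader, which is why these bugs tend to surface in production rather than in a quick manual test.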

Failure rate on 'just add JSON instructions' prompting is typically 5–15% in production, depending on model and schema complexity. At 10,000 requests/day that's 500–1,500 parse errors. You need a structural guarantee, not a polite request.

Approach 1: Prompting (No Guarantees)

Prompting-only means asking the model to return JSON in the system or user prompt, with or without a schema. This requires no special API features but gives zero structural guarantees.

system = """You are a data extraction assistant. Always respond with valid JSON.
Schema: {"name": string, "age": integer, "email": string}
Never include any text before or after the JSON object."""

# Works most of the time. Fails 5-15% of the time in production.
# Failures: preamble text, invalid JSON, schema violations

Use prompting only for low-stakes, low-volume scenarios where a parse error is acceptable and retry is cheap.
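
If you do rely on prompting, a common mitigation is to salvage the JSON object from whatever prose surrounds it. The helper below is a minimal sketch of that idea using json.JSONDecoder.raw_decode from the standard library. The function name extract_first_json_object is made up for this example, and the approach only rescues wrapped JSON: it cannot repair invalid JSON such as trailing commas.

```python
import json
from typing import Any


def extract_first_json_object(text: str) -> Any:
    """Best-effort: return the first decodable JSON object found in text.

    Tries every '{' as a possible start. Raises ValueError if none decodes.
    """
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start`
            # and ignores whatever follows it.
            obj, _end = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            start = text.find("{", start + 1)
    raise ValueError("no JSON object found in model output")


reply = 'Sure! Here is the JSON:\n```json\n{"name": "Ada", "age": 36}\n```\nAnything else?'
print(extract_first_json_object(reply))
```

This reduces preamble-related failures but adds no guarantee of its own, so the retry path is still needed.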

Approach 2: JSON Mode

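JSON mode (for example, OpenAI's response_format of type json_object) is designed to return syntactically valid JSON. Provider support varies, so check whether your API offers an equivalent before relying on it. Two caveats apply. First, the output can still be cut off if the response hits the token limit, which leaves you with incomplete JSON. Second, it does NOT guarantee your schema: the model can return valid JSON that doesn't match your expected structure.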

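import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    response_format={"type": "json_object"},
    messages=[
        # JSON mode requires the word "JSON" to appear somewhere in the messages
        {"role": "system", "content": "Extract entity data as JSON with fields: name, age, email"},
        {"role": "user", "content": user_input}  # user_input: the text to extract from
    ]
)

data = json.loads(response.choices[0].message.content)
# Guaranteed: valid JSON syntax, provided finish_reason is not "length" (truncated output)
# NOT guaranteed: correct fields, correct types, no hallucinated fields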

JSON mode prevents syntax errors but not schema errors. You still need Pydantic or Zod validation after parsing. The guarantee is narrow: json.loads() will not throw, as long as the response was not truncated by the token limit. It says nothing about data shape.
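
The checks that JSON mode leaves to you can be sketched in a few lines. This example assumes the Chat Completions response shape, where each choice carries a finish_reason and the value "length" means the output was cut off. The function name parse_json_mode_output and the REQUIRED_KEYS set are illustrative, and the key check is a stand-in for full schema validation.

```python
import json

REQUIRED_KEYS = {"name", "age", "email"}  # fields from the example prompt


def parse_json_mode_output(content: str, finish_reason: str) -> dict:
    """Parse JSON-mode output, rejecting truncated or mis-shaped responses."""
    if finish_reason == "length":
        # Output hit the token limit, so the JSON may stop mid-structure.
        raise ValueError("response truncated by token limit")
    data = json.loads(content)
    if not isinstance(data, dict):
        raise ValueError(f"expected a JSON object, got {type(data).__name__}")
    missing = REQUIRED_KEYS - set(data)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return data


good = parse_json_mode_output(
    '{"name": "Ada", "age": 36, "email": "ada@example.com"}', "stop"
)
print(good["name"])
```

A response like {"full_name": "Ada"} is valid JSON and would pass json.loads, but this function rejects it because the expected fields are missing.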

Approach 3: Tool Calling / Function Calling

Tool calling (OpenAI function calling, Anthropic tool use) lets you define a JSON Schema for the expected output. The model is steered to call the tool with arguments that match the schema, and in practice it usually does. Without a strict mode, though, this is best-effort rather than a hard guarantee: arguments can still omit required fields, use the wrong types, or occasionally fail to parse.

tools = [{
    "type": "function",
    "function": {
        "name": "extract_entity",
        "description": "Extract entity information from text",
        "parameters": {
            "type": "object",
            "properties": {
                "name":  {"type": "string", "description": "Full name"},
                "age":   {"type": "integer", "description": "Age in years"},
                "email": {"type": "string", "format": "email"}
            },
            "required": ["name", "age", "email"],
            "additionalProperties": False  # Python literal; lowercase false raises NameError
        }
    }
}]

response = client.chat.completions.create(
    model="gpt-4o",
    tools=tools,
    tool_choice={"type": "function", "function": {"name": "extract_entity"}},
    messages=[{"role": "user", "content": user_input}]
)

args = json.loads(response.choices[0].message.tool_calls[0].function.arguments)
# Usually: valid JSON that matches schema types, with required fields present
# A hard guarantee only when strict mode is enabled, so validate the arguments anyway

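OpenAI's 'Structured Outputs' mode (August 2024) extended this further. When you set "strict": true on the function definition (or use a json_schema response format with strict enabled), decoding is constrained so the model can only produce tokens that are valid under your schema. OpenAI describes the mechanism as a context-free grammar compiled from the schema rather than a finite automaton, which is what lets it handle nested and recursive schemas. Strict mode accepts only a subset of JSON Schema, so check your schema against the documented restrictions. This is the highest-reliability option in the OpenAI API.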
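
Strict mode places requirements on the schema itself. Two documented ones are that every object sets additionalProperties to false and that every property is listed in required. The sketch below shows a strict version of the extract_entity tool plus a small pre-flight check for those two rules. The helper strict_violations is hypothetical, written for this post, and it does not cover the other strict-mode restrictions.

```python
def strict_violations(schema: dict, path: str = "$") -> list[str]:
    """List places where a schema breaks the two strict-mode rules checked here."""
    problems: list[str] = []
    if schema.get("type") == "object":
        props = schema.get("properties", {})
        if schema.get("additionalProperties") is not False:
            problems.append(f"{path}: additionalProperties must be false")
        not_required = sorted(set(props) - set(schema.get("required", [])))
        if not_required:
            problems.append(f"{path}: not listed in required: {not_required}")
        for name, sub in props.items():
            problems.extend(strict_violations(sub, f"{path}.{name}"))
    elif schema.get("type") == "array" and isinstance(schema.get("items"), dict):
        problems.extend(strict_violations(schema["items"], f"{path}[]"))
    return problems


entity_tool = {
    "type": "function",
    "function": {
        "name": "extract_entity",
        "description": "Extract entity information from text",
        "strict": True,  # opt in to schema-constrained decoding
        "parameters": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "age": {"type": "integer"},
                "email": {"type": "string"},
            },
            "required": ["name", "age", "email"],
            "additionalProperties": False,
        },
    },
}

# A schema with an optional field and open-ended properties fails the check.
loose_schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "nickname": {"type": "string"}},
    "required": ["name"],
}

print(strict_violations(entity_tool["function"]["parameters"]))
print(strict_violations(loose_schema))
```

Running a check like this in CI catches schema mistakes before they surface as API errors at request time.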

Approach 4: Grammar-Constrained Decoding

For self-hosted models, libraries like Outlines, llama.cpp (GBNF grammars), and Guidance provide grammar-constrained decoding: the token sampling is mathematically constrained to only produce tokens that could lead to a valid output under your grammar or schema. Invalid token probabilities are zeroed out at each step.

import outlines
from pydantic import BaseModel

# Note: these entry points are from the Outlines 0.x API. Later releases
# changed them, so check the docs for the version you have installed.
model = outlines.models.transformers("mistralai/Mistral-7B-v0.1")

class Entity(BaseModel):
    name: str
    age: int
    email: str

generator = outlines.generate.json(model, Entity)

# Every completed output is GUARANTEED to be a valid Entity instance
result = generator("Extract: John Smith, 34, john@example.com")
# result is already a validated Pydantic object, not a string

Outlines/GBNF decoding gives the strongest guarantees but adds ~10-20% latency overhead per token (the masking step). Worth it for complex nested schemas where validation failures are expensive.
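
To make the masking step concrete, here is a toy version in plain Python. Everything in it is invented for illustration: the eight-token vocabulary, the hand-written state machine standing in for a compiled grammar, and the fake_logits function standing in for a model that would rather chat than emit JSON. Real libraries apply the same idea to a full tokenizer vocabulary and a grammar compiled from your schema.

```python
import json
import math

VOCAB = ["Sure", "!", "{", "}", '"name"', ":", '"Ada"', '"Bob"']

# Toy grammar for {"name": <string>}: state -> {allowed token: next state}
GRAMMAR = {
    "start": {"{": "key"},
    "key": {'"name"': "colon"},
    "colon": {":": "value"},
    "value": {'"Ada"': "close", '"Bob"': "close"},
    "close": {"}": "done"},
}


def fake_logits() -> list[float]:
    """Stand-in for a model: scores chatty prose tokens highest at every step."""
    return [5.0, 4.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.5]


def constrained_decode() -> str:
    state, out = "start", []
    while state != "done":
        allowed = GRAMMAR[state]
        logits = fake_logits()
        # Mask: tokens the grammar forbids get -inf, i.e. zero probability.
        masked = [
            score if tok in allowed else -math.inf
            for tok, score in zip(VOCAB, logits)
        ]
        best = max(range(len(VOCAB)), key=masked.__getitem__)  # greedy pick
        token = VOCAB[best]
        out.append(token)
        state = allowed[token]
    return "".join(out)


unconstrained_first = VOCAB[max(range(len(VOCAB)), key=fake_logits().__getitem__)]
output = constrained_decode()
print(unconstrained_first)  # the token the toy model would pick without masking
print(output, json.loads(output))
```

Without the mask, the toy model's top choice is the prose token. With it, the model still chooses among the allowed tokens (here it prefers one name over the other), but it cannot leave the grammar. The per-step mask computation is also where the latency overhead comes from.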

Validation Patterns

Even with JSON mode or tool calling, always validate the parsed output:

from pydantic import BaseModel, ValidationError
import json

class Entity(BaseModel):
    name: str
    age: int
    email: str

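def extract_entity(text: str, max_retries: int = 3) -> Entity:
    # call_llm_with_json_mode is a placeholder for your own API wrapper;
    # it should return the raw JSON string produced by the model.
    prompt = text
    for attempt in range(max_retries):
        response = call_llm_with_json_mode(prompt)
        try:
            data = json.loads(response)
            return Entity(**data)
        except (json.JSONDecodeError, ValidationError, TypeError) as e:
            # TypeError covers valid JSON that is not an object (e.g. a list)
            if attempt == max_retries - 1:
                raise
            # On retry, append the latest error to the original prompt
            prompt = f"{text}\n\nPrevious attempt failed: {e}. Try again."
    raise ValueError("max_retries must be at least 1")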

When Each Approach Wins

Approach	Syntax guarantee	Schema guarantee	Cost	Best for
Prompting only	None	None	Lowest	Prototyping, very simple schemas
JSON mode	Yes	No	Low	Flexible schemas, catching syntax errors
Tool calling (strict)	Yes	Yes	Low	Production on OpenAI/Anthropic APIs
Grammar decoding (Outlines)	Yes	Yes (any grammar)	Medium (+latency)	Self-hosted, complex/recursive schemas

Common Failure Modes

Nested objects: the deeper the nesting, the more likely the model loses track of which bracket level it's at — especially without constrained decoding
Optional fields: models often invent values for optional fields rather than omitting them, unless the schema description explicitly says 'omit if unknown'
Long arrays: models truncate arrays mid-item when the response approaches the output token limit — validate array length and last element integrity
Enum values: models sometimes produce values outside your enum, especially for string enums with many options — explicit enumeration in the schema helps
Number precision: JSON numbers are usually parsed into binary floats, which cannot represent most decimal fractions exactly (0.1 + 0.2 evaluates to 0.30000000000000004) — use string types for currency and validate with Decimal
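
The number precision point is easy to demonstrate with the standard library. This sketch uses a made-up payload and shows two options: asking json.loads to build Decimal values through its parse_float hook, and carrying currency as a string that you convert to Decimal yourself.

```python
import json
from decimal import Decimal

# Hypothetical payload: numeric fields plus a currency amount sent as a string.
payload = '{"price": 0.1, "tax": 0.2, "total": "0.30"}'

# Default parsing produces binary floats, so the sum picks up rounding error.
as_float = json.loads(payload)
float_sum = as_float["price"] + as_float["tax"]

# parse_float=Decimal builds exact decimal values from the JSON number text.
as_decimal = json.loads(payload, parse_float=Decimal)
decimal_sum = as_decimal["price"] + as_decimal["tax"]

# Currency sent as a string converts to Decimal without passing through float.
total = Decimal(as_decimal["total"])

print(float_sum == 0.3)      # False
print(decimal_sum == total)  # True
```

Either option works. Sending amounts as strings has the advantage that the exact value survives regardless of which JSON parser the consumer uses.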
