
Structured Outputs from LLMs: JSON Mode, Function Calling, and Tool Use

How to make LLMs output reliable, parseable data: JSON mode, OpenAI function calling, Pydantic validation, and where structured outputs still break.

Getting an LLM to return valid JSON consistently is one of the most common production challenges. The model knows the format — it just doesn't always follow it. Structured outputs are the set of techniques that make reliable machine-readable output possible.

Why free-form text fails in production

An LLM generating natural language text might return: "Sure! Here is the JSON you requested: {\"name\": \"Alice\"}" — with preamble text that breaks JSON.parse(). Or it might use single quotes instead of double quotes. Or omit required fields. Or invent fields that don't exist in your schema. Any of these crashes your pipeline.
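To make these failures concrete, here is a minimal sketch in Python, where json.loads plays the role of JSON.parse(). The sample replies are invented for illustration, not captured from a real model.

```python
import json

# Hypothetical raw model replies illustrating the failure cases described above.
samples = {
    "preamble": 'Sure! Here is the JSON you requested: {"name": "Alice"}',
    "single_quotes": "{'name': 'Alice'}",
    "clean": '{"name": "Alice"}',
}

def try_parse(text):
    """Return (parsed, error_message); exactly one of the two is None."""
    try:
        return json.loads(text), None
    except json.JSONDecodeError as exc:
        return None, str(exc)

results = {label: try_parse(text) for label, text in samples.items()}
for label, (parsed, error) in results.items():
    print(label, "->", "ok" if error is None else f"failed: {error}")
```

Only the clean reply parses. Omitted or invented fields are a different problem: they parse fine and fail later, which is why parsing alone is not enough.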

At 10K requests/day, even a 0.5% malformed output rate means 50 crashes per day. Free-form text output is not acceptable for any production pipeline that needs to parse the response.
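The arithmetic is simple enough to keep as a helper when sizing a pipeline. In this sketch only the 0.5% figure comes from the text; the other rates are illustrative.

```python
def expected_daily_failures(requests_per_day: int, malformed_rate: float) -> float:
    """Expected number of unparseable responses per day."""
    return requests_per_day * malformed_rate

# Compare a few malformed-output rates at the same traffic level.
for rate in (0.005, 0.001, 0.0001):
    failures = expected_daily_failures(10_000, rate)
    print(f"{rate:.2%} of 10,000 requests -> {failures:.0f} failures/day")
```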

Approach 1: JSON mode

Most major APIs offer a JSON mode that constrains the model to emit only valid JSON. The model still chooses the structure, but the output will parse as long as generation finishes normally; a response cut off by the token limit can still end mid-object.

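import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
  model="gpt-4o",
  response_format={"type": "json_object"},
  messages=[
    {"role": "system", "content": "Return a JSON object with keys: name, age, role"},
    {"role": "user", "content": "Extract info: Alice is a 30-year-old engineer"}
  ]
)
# Valid JSON when generation completes normally, but the structure is up to the model
data = json.loads(response.choices[0].message.content)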

Always describe the expected schema in your system prompt when using JSON mode. The mode guarantees valid JSON but not the right keys. Specify every field name and type explicitly.
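Because JSON mode does not check keys, a cheap post-parse check is worth having. The following is a rough sketch using only the standard library; EXPECTED_TYPES and check_keys are hypothetical names, and the three keys mirror the system prompt above.

```python
import json

# Assumed schema for illustration: the three keys requested in the system prompt.
EXPECTED_TYPES = {"name": str, "age": int, "role": str}

def check_keys(raw: str) -> list[str]:
    """Return a list of problems; an empty list means the object matches."""
    data = json.loads(raw)
    problems = []
    for key, expected in EXPECTED_TYPES.items():
        if key not in data:
            problems.append(f"missing key: {key}")
        elif not isinstance(data[key], expected):
            problems.append(f"{key} should be {expected.__name__}")
    # Report keys the model invented, in a stable order.
    for key in sorted(data.keys() - EXPECTED_TYPES.keys()):
        problems.append(f"unexpected key: {key}")
    return problems

print(check_keys('{"full_name": "Alice", "age": "30"}'))
```

A schema library such as Pydantic does the same job more thoroughly; this version just shows what "valid JSON but wrong keys" looks like in practice.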

Approach 2: Structured outputs with schema

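OpenAI's structured outputs feature lets you pass a JSON Schema that the model's output will conform to exactly. Anthropic's tool use also takes a JSON Schema, though how strictly conformance is enforced depends on the provider and the options you enable, so check the current documentation. This is stronger than JSON mode: not only is the output valid JSON, it also matches your schema, including required fields and types.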

from pydantic import BaseModel
from openai import OpenAI

class PersonExtraction(BaseModel):
    name: str
    age: int
    role: str
    confidence: float  # 0-1

client = OpenAI()
response = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": "Alice is a 30-year-old senior engineer"}
    ],
    response_format=PersonExtraction,
)
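message = response.choices[0].message
person = message.parsed  # a PersonExtraction instance, or None if the model refused
if person is not None:
    print(person.name, person.age)  # typed attribute access instead of dict lookups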

Approach 3: Function calling / tool use

Function calling was the original structured output mechanism. You define a function signature with a JSON Schema, and the model decides when to call it and with what arguments. The arguments usually match your schema, but that is only guaranteed when the provider offers a strict mode and you enable it; otherwise, validate them yourself.

import anthropic

client = anthropic.Anthropic()

tools = [{
  "name": "extract_person",
  "description": "Extract person information from text",
  "input_schema": {
    "type": "object",
    "properties": {
      "name":       {"type": "string"},
      "age":        {"type": "integer"},
      "role":       {"type": "string"},
      "seniority":  {"type": "string", "enum": ["junior", "mid", "senior", "staff"]}
    },
    "required": ["name", "age", "role"]
  }
}]

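response = client.messages.create(
  model="claude-opus-4-6",
  max_tokens=1024,
  tools=tools,
  tool_choice={"type": "tool", "name": "extract_person"},  # force this tool to be called
  messages=[{"role": "user", "content": "Alice is a 30yr senior engineer"}]
)
# The tool call is not necessarily the first content block, so search for it
tool_call = next(block for block in response.content if block.type == "tool_use")
tool_input = tool_call.input  # dict of arguments; still worth validating locally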

Approach 4: Constrained decoding

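Libraries like Outlines and Guidance constrain the model's token sampling at decode time so that it can only emit tokens that are valid according to a grammar or regex. This is the strongest guarantee: tokens that would violate the grammar are masked out before sampling, so the model cannot emit structurally invalid output (although generation can still be cut short by a token limit).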

import outlines
from pydantic import BaseModel

class Person(BaseModel):
    name: str
    age: int

# Written against an older Outlines interface; newer releases may expose different entry points
model = outlines.models.transformers("mistralai/Mistral-7B-v0.1")
generator = outlines.generate.json(model, Person)
result = generator("Extract: Alice is 30 years old")
# result is guaranteed to be a valid Person instance

Constrained decoding is the gold standard for open-source/self-hosted models. For API models, structured outputs with schema validation is equivalent in practice — the provider enforces the schema server-side.
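To show the mechanism itself, here is a toy sketch of token masking with no model involved. The vocabulary, the hand-written grammar check, and the uniform random choice (standing in for model logits) are all assumptions made for illustration; real libraries compile a schema or regex into an automaton over the tokenizer's vocabulary.

```python
import json
import random

# Toy vocabulary and a hand-written "grammar" (both illustrative assumptions):
# valid outputs look like {"age":N} where N is 1-3 digits with no leading zero.
VOCAB = ["{", "}", '"age"', ":", " ", "hello", "'", "0", "1", "2", "3", "7", "9"]
HEADER = '{"age":'

def is_valid_prefix(text: str) -> bool:
    """True if `text` is, or can still be extended into, a valid output."""
    if len(text) <= len(HEADER):
        return HEADER.startswith(text)
    if not text.startswith(HEADER):
        return False
    rest = text[len(HEADER):]
    digits = rest.rstrip("}")
    closing_braces = len(rest) - len(digits)
    return (
        closing_braces <= 1
        and digits.isdigit()
        and not digits.startswith("0")
        and len(digits) <= 3
    )

def is_complete(text: str) -> bool:
    return is_valid_prefix(text) and text.endswith("}")

def constrained_generate(rng: random.Random, max_steps: int = 10) -> str:
    text = ""
    for _ in range(max_steps):
        # Mask: keep only tokens that leave the output extendable to a valid one.
        allowed = [tok for tok in VOCAB if is_valid_prefix(text + tok)]
        # A real decoder would renormalize the model's logits over `allowed`;
        # picking uniformly at random stands in for the model here.
        text += rng.choice(allowed)
        if is_complete(text):
            break
    return text

print(constrained_generate(random.Random(0)))
```

Tokens such as "hello" or a single quote are in the vocabulary but never survive the mask, which is the whole point: invalid continuations are removed before sampling rather than repaired afterwards.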

Approach 5: Retry with validation

For legacy setups or models without native structured output support, the fallback is: parse the output, catch validation errors, and retry with the error message included in the next prompt. This is brittle but workable at low volume.

import json, re
from jsonschema import validate, ValidationError

def get_structured(prompt, schema, max_retries=3):
    messages = [{"role": "user", "content": prompt}]
    for attempt in range(max_retries):
        response = llm(messages)  # llm: placeholder for your own call that returns the reply text
        try:
            # Strip markdown code fences if present
            text = re.sub(r'```(?:json)?\n?|\n?```', '', response).strip()
            data = json.loads(text)
            validate(data, schema)
            return data
        except (json.JSONDecodeError, ValidationError) as e:
            messages.append({"role": "assistant", "content": response})
            messages.append({"role": "user", "content": f"Invalid output: {e}. Please fix and return only valid JSON."})
    raise ValueError(f"Failed after {max_retries} retries")

Which approach to use

Approach	Guarantee	Latency hit	Best for
Free-form + regex	None	None	Prototypes only
JSON mode	Valid JSON	~0%	Simple extraction, any structure
Structured outputs	Schema match	~0%	Production API pipelines
Function calling	Schema match (with strict mode)	~0%	Agentic workflows, tool invocation
Constrained decoding	Grammar exact	5–15%	Self-hosted, high-stakes outputs
Retry with validation	Eventually	High	Legacy fallback only

Production failure modes — and how to catch them

Even with structured outputs enabled, production pipelines break. These are the failure modes that get past your JSON parser and into your application logic:

Failure mode	Example	Catch with
Null fields	Model omits an optional field — downstream code throws KeyError	Default values in Pydantic model or explicit None checks
Out-of-range values	confidence: 1.7 (should be 0–1)	Pydantic validator: @field_validator
Enum violation	role: 'contractor' when schema only allows employee|manager	Literal type in Pydantic or enum field
Hallucinated keys	Model adds extra fields not in schema	Pydantic's model_config = {'extra': 'forbid'}
Empty arrays	items: [] when at least 1 was required	min_length constraint on list field
Semantic invalidity	start_date > end_date (structurally valid, logically wrong)	Cross-field validator in Pydantic

from pydantic import BaseModel, field_validator, model_validator
from typing import Literal
from datetime import date

class ProjectExtraction(BaseModel):
    model_config = {"extra": "forbid"}   # reject unknown fields

    name: str
    status: Literal["active", "archived", "draft"]
    priority: int                         # 1-5
    start_date: date
    end_date: date | None = None
    tags: list[str] = []                  # optional, defaults to empty

    @field_validator("priority")
    @classmethod
    def priority_range(cls, v):
        if not 1 <= v <= 5:
            raise ValueError(f"Priority must be 1-5, got {v}")
        return v

    @model_validator(mode="after")
    def end_after_start(self):
        if self.end_date and self.end_date < self.start_date:
            raise ValueError("end_date must be >= start_date")
        return self
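Two rows of the table, out-of-range values and empty arrays, can also be handled declaratively instead of with custom validators. This is a small sketch assuming Pydantic v2; the Extraction model and validate_payload helper are hypothetical names.

```python
from pydantic import BaseModel, Field, ValidationError

class Extraction(BaseModel):  # hypothetical model for illustration
    confidence: float = Field(ge=0.0, le=1.0)   # rejects values such as 1.7
    items: list[str] = Field(min_length=1)      # rejects an empty list

def validate_payload(payload: dict) -> list[str]:
    """Return the names of the fields that failed validation (empty if valid)."""
    try:
        Extraction.model_validate(payload)
        return []
    except ValidationError as exc:
        return [str(err["loc"][0]) for err in exc.errors()]

print(validate_payload({"confidence": 1.7, "items": []}))
print(validate_payload({"confidence": 0.9, "items": ["a"]}))
```

Returning field names rather than raising makes it easy to log which constraint a model keeps violating, or to feed the failures back into a retry prompt.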

Streaming structured outputs

For long structured outputs, waiting for the full JSON object before showing anything creates a poor UX. Both OpenAI and Anthropic support streaming, but a partial JSON document cannot be parsed or validated against a schema until it is complete, so combining streaming with structured outputs takes extra work. The basic pattern: stream raw tokens, accumulate them, and parse the complete object at the end. For partial display, use an incremental JSON parser (such as ijson in Python) to extract completed fields as they arrive.
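The accumulate-then-parse pattern can be sketched with the standard library alone. The chunk list below is a made-up stand-in for a provider's streaming deltas, and collect_stream is a hypothetical helper name.

```python
import json

def can_parse(text: str) -> bool:
    """True if `text` is a complete JSON document."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

def collect_stream(chunks):
    """Accumulate streamed text chunks and parse once the stream ends."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        # A UI could display `chunk` as raw text here; the buffer is
        # usually not parseable until the final chunk has arrived.
    return json.loads(buffer)

# Hypothetical chunks standing in for streaming deltas from an API.
fake_stream = ['{"name": "Al', 'ice", "age"', ': 30, "role": ', '"engineer"}']

print(can_parse(fake_stream[0]))
person = collect_stream(fake_stream)
print(person)
```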

For very long structured outputs (e.g., a detailed report with 20 fields), consider breaking them into multiple API calls with smaller schemas. One call per section, one schema per call. This avoids token budget issues and makes partial streaming much simpler.
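One way to organize the one-call-per-section idea is sketched below. The section names, the schemas, and the call_model callback are all hypothetical; a stub replaces the real API call so the example runs offline.

```python
import json

# Hypothetical per-section schemas: field name -> expected Python type.
SECTION_SCHEMAS = {
    "summary": {"title": str, "abstract": str},
    "metrics": {"revenue": float, "headcount": int},
}

def build_report(source_text, call_model):
    """Make one call per section and merge the checked results."""
    report = {}
    for section, schema in SECTION_SCHEMAS.items():
        raw = call_model(section, schema, source_text)  # expected to return a JSON string
        data = json.loads(raw)
        missing = schema.keys() - data.keys()
        if missing:
            raise ValueError(f"{section}: missing fields {sorted(missing)}")
        report[section] = data
    return report

def fake_call_model(section, schema, source_text):
    """Stub standing in for a real API call."""
    canned = {
        "summary": {"title": "Q3 report", "abstract": "Short text."},
        "metrics": {"revenue": 1.5, "headcount": 12},
    }
    return json.dumps(canned[section])

report = build_report("...source document...", fake_call_model)
print(report)
```

Because each section is validated on its own, a failure in one call can be retried without regenerating the rest of the report.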

Testing structured output schemas

Your schema is code. It should have tests. At minimum: test that valid outputs parse correctly, test that common malformed outputs trigger the right validation errors, and test that edge cases (empty strings, None values, extreme numbers) are handled according to your schema's intent. Use Pydantic's model_validate() with pytest.mark.parametrize for clean coverage.
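A test file along these lines might look like the following sketch, which assumes Pydantic v2 and pytest are installed. The Person schema and its bounds are hypothetical; substitute your own model and the edge cases that matter for it.

```python
import pytest
from pydantic import BaseModel, Field, ValidationError

class Person(BaseModel):  # hypothetical schema under test
    name: str = Field(min_length=1)
    age: int = Field(ge=0, le=150)

@pytest.mark.parametrize("payload", [
    {"name": "Alice", "age": 30},
    {"name": "Bob", "age": 0},               # boundary value
])
def test_valid_payloads_parse(payload):
    person = Person.model_validate(payload)
    assert person.name == payload["name"]

@pytest.mark.parametrize("payload, bad_field", [
    ({"name": "", "age": 30}, "name"),         # empty string
    ({"name": "Alice", "age": None}, "age"),   # None value
    ({"name": "Alice", "age": 10**6}, "age"),  # extreme number
    ({"age": 30}, "name"),                     # missing required field
])
def test_invalid_payloads_name_the_field(payload, bad_field):
    with pytest.raises(ValidationError) as excinfo:
        Person.model_validate(payload)
    # The error should point at the field we expected to fail.
    assert any(err["loc"] == (bad_field,) for err in excinfo.value.errors())
```

Checking which field the error names, rather than only that an error occurred, catches the case where a payload is rejected for the wrong reason.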
