How ChatGPT Works: GPT-4o, RLHF, and the o1 Reasoning Models

From the base GPT model to RLHF fine-tuning to GPT-4o's native multimodality. What OpenAI's model family does and how o1/o3 reasoning models think differently.

ChatGPT didn't just launch a product. It launched a category. When it reached an estimated 100 million users within two months of launch, the fastest adoption of any consumer product up to that point, it wasn't because of the underlying technology alone. It was because OpenAI made the technology feel accessible. How that was achieved, technically, is one of the most instructive case studies in applied AI.

[Video: Andrej Karpathy — Intro to Large Language Models (the canonical 1-hour explanation of pretraining, RLHF, and how ChatGPT-style systems are built)]

The model family (as of 2025)

Model	Context	Multimodal	Best for
GPT-4o	128K	Text + image + audio	Most production tasks — fast and capable
GPT-4o-mini	128K	Text + image	High-volume, cost-sensitive applications
o1 / o3	200K	Text + image	Hard reasoning: math, science, complex coding
o4-mini	200K	Text + image	Efficient reasoning: near o3-class quality at lower cost

From text predictor to assistant: RLHF

The base GPT model, pre-trained on internet text, is a next-token predictor. It's very capable, but not an assistant — it completes patterns, not requests. ChatGPT's conversational, helpful behaviour comes from a fine-tuning process called Reinforcement Learning from Human Feedback (RLHF).

Step 1 — Supervised Fine-Tuning (SFT): OpenAI's labellers write high-quality responses to curated prompts. The model is fine-tuned on these examples.
Step 2 — Reward model training: labellers rank multiple model responses for the same prompt. A separate reward model is trained to predict human preference scores.
Step 3 — PPO RL: the fine-tuned model is further trained to maximise the reward model's score via Proximal Policy Optimisation. This pushes it toward responses humans prefer.
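
Steps 2 and 3 can be made concrete with a small numeric sketch. It assumes the recipe described in the published InstructGPT-style literature (a pairwise preference loss for the reward model, and a KL-style penalty that keeps the policy close to the SFT model during RL); OpenAI's production details are not public. The function names and the `beta` default are illustrative, not taken from any library.

```python
import math


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Reward-model loss for one ranked pair: -log(sigmoid(r_chosen - r_rejected)).

    The loss is small when the reward model scores the human-preferred
    response well above the rejected one, and large when it gets the order wrong.
    """
    margin = reward_chosen - reward_rejected
    # softplus(-margin) equals -log(sigmoid(margin)); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def kl_shaped_reward(reward: float, logp_policy: float, logp_reference: float,
                     beta: float = 0.1) -> float:
    """Per-sample reward for the RL step (assumed form).

    The reward-model score is reduced by beta * (log pi_policy - log pi_reference),
    which discourages the policy from drifting far from the SFT model.
    """
    return reward - beta * (logp_policy - logp_reference)


# Correctly ordered pair gives a lower loss than a mis-ordered pair.
print(pairwise_preference_loss(2.0, 0.0) < pairwise_preference_loss(0.0, 2.0))
```

PPO then updates the policy to increase this shaped reward; the optimiser itself is omitted here because it adds a lot of machinery without changing the idea.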

RLHF is what makes a model feel like an assistant rather than a text predictor. The base model knows language; RLHF teaches it to be helpful, follow instructions, acknowledge uncertainty, and avoid harmful outputs. Variants of the same technique sit behind Claude, Gemini, and most frontier assistants.

GPT-4o: natively multimodal architecture

GPT-4o ("omni") was the first GPT model with native multimodality: text, image, and audio processed within a single unified model rather than separate specialist models stitched together. Before it, ChatGPT's voice mode chained a speech-to-text model, GPT-4, and a text-to-speech model, and GPT-4's vision support is generally understood to have been a text model with a vision encoder attached. GPT-4o was trained end to end across modalities, which improves performance on tasks that require reasoning across modalities simultaneously.

Practically: GPT-4o can read a chart, understand a diagram, or interpret code in a screenshot and reason about it in the same forward pass as the text in your query. It's also significantly faster and cheaper than GPT-4 Turbo — making it the default for most API production deployments.
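
As a rough illustration of what "image and text in one request" looks like from the API side, the sketch below builds a Chat Completions message that pairs a question with an inline base64 image. The helper names `build_image_question` and `ask_about_image` are hypothetical; the `image_url` content part with a data URL is the documented input shape, but check the current API reference before relying on it.

```python
import base64


def build_image_question(question: str, image_bytes: bytes,
                         mime_type: str = "image/png") -> list[dict]:
    """Build a `messages` list with one user turn containing text plus an image."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:{mime_type};base64,{encoded}"},
                },
            ],
        }
    ]


def ask_about_image(question: str, image_path: str) -> str:
    """Send the question and image to GPT-4o (needs `openai` and OPENAI_API_KEY)."""
    # Imported lazily so the payload builder above works without the SDK installed.
    from openai import OpenAI

    with open(image_path, "rb") as f:
        messages = build_image_question(question, f.read())
    response = OpenAI().chat.completions.create(model="gpt-4o", messages=messages)
    return response.choices[0].message.content
```

A call such as `ask_about_image("What trend does this chart show?", "chart.png")` would return the model's answer as a string; the file name here is only an example.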

The o1/o3/o4-mini reasoning models: a different paradigm

OpenAI's reasoning model series (o1, o3, o4-mini) takes a fundamentally different approach to hard problems. Rather than generating the answer directly, they produce an extended internal chain of reasoning — a 'thinking' scratchpad — before emitting a final response.

This inference-time compute scaling lets reasoning models outperform GPT-4o on tasks that benefit from deliberate step-by-step reasoning: competitive math, complex debugging, science problems, multi-step logic. The tradeoff: significantly higher latency (10–60 seconds for hard problems) and higher cost.
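
The latency and cost tradeoff is something you can tune per request. The sketch below assembles Chat Completions arguments for a reasoning model using `reasoning_effort` and `max_completion_tokens`; the helper name `reasoning_request`, the default model, and the token budget are assumptions for illustration, and parameter support varies by model, so confirm against the current API reference.

```python
VALID_EFFORTS = {"low", "medium", "high"}


def reasoning_request(prompt: str, effort: str = "medium",
                      max_completion_tokens: int = 4000,
                      model: str = "o4-mini") -> dict:
    """Return keyword arguments for `client.chat.completions.create`.

    Higher effort lets the model spend more hidden reasoning tokens, which
    usually means better answers on hard problems but more latency and cost.
    """
    if effort not in VALID_EFFORTS:
        raise ValueError(f"effort must be one of {sorted(VALID_EFFORTS)}")
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "reasoning_effort": effort,
        # This budget covers hidden reasoning tokens as well as the visible answer.
        "max_completion_tokens": max_completion_tokens,
    }


# Usage (requires the `openai` package and an API key):
#   from openai import OpenAI
#   response = OpenAI().chat.completions.create(**reasoning_request("Prove ...", "high"))
```

Because reasoning tokens count against the budget, a limit that is comfortable for GPT-4o can cut a reasoning model off before it produces any visible output.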

Task type	GPT-4o	o1 / o3 / o4-mini
General Q&A and writing	Better: fast and fluent	Overkill: slow and expensive
Complex multi-step math	OK	Significantly better
Hard competitive coding	Good	Strongest of the family (OpenAI reports gold-medal-level o3 results on IOI 2024 problems)
Science reasoning (GPQA Diamond)	~50%	~80%+ (o3)
Instruction following	Excellent	Good but less nuanced
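
One way to act on this comparison is a small router that sends only reasoning-heavy work to the slower models. This is a hypothetical sketch: the task labels, the `pick_model` helper, and the specific model choices are assumptions, not an OpenAI recommendation.

```python
# Task categories where the comparison above favours a reasoning model.
REASONING_TASKS = {"math", "competitive_coding", "science"}


def pick_model(task_type: str, latency_sensitive: bool = False) -> str:
    """Choose a model name from a coarse task label.

    Reasoning-heavy tasks go to o3, or to o4-mini when latency matters;
    everything else stays on GPT-4o, which is faster and cheaper.
    """
    if task_type in REASONING_TASKS:
        return "o4-mini" if latency_sensitive else "o3"
    return "gpt-4o"


print(pick_model("writing"))                        # gpt-4o
print(pick_model("math"))                           # o3
print(pick_model("math", latency_sensitive=True))   # o4-mini
```

In practice the task label would come from your own application logic or a cheap classifier, and the thresholds would be set from measured cost and latency.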

The OpenAI API — key patterns for builders

from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()  # uses OPENAI_API_KEY

# Basic generation
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are an expert in AI systems."},
        {"role": "user", "content": "Explain RLHF in 3 sentences."}
    ]
)
print(response.choices[0].message.content)

# Structured output with Pydantic
class SentimentResult(BaseModel):
    sentiment: str  # positive | negative | neutral
    score: float    # 0-1
    reason: str

result = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Analyse: 'This RAG system is great but slow'"}],
    response_format=SentimentResult,
)
print(result.choices[0].message.parsed)  # type-safe, schema-validated

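# Streaming: request chunks with stream=True and print tokens as they arrive
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a haiku about LLMs"}],
    stream=True,
)
for chunk in stream:
    # Some chunks carry no choices (for example a trailing usage chunk)
    if chunk.choices:
        print(chunk.choices[0].delta.content or "", end="", flush=True)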

OpenAI vs Claude — when to choose which

Factor	GPT-4o / OpenAI	Claude / Anthropic
Multimodal (vision + audio)	Native — best-in-class audio	Vision strong, no native audio
Context window	128K	200K
Reasoning models	o1 / o3 / o4-mini: highly capable	Extended thinking: competitive
Tool use / function calling	Mature ecosystem, well-documented	Strong, native tool use
Safety / refusals	RLHF-based; flat refusals are common	Constitutional AI (CAI)-based; contextual, explains reasoning
Prompt caching	Available	Available — strong for long contexts
Community / ecosystem	Largest — most SDKs, integrations	Growing fast
