AI Engineering 8 min read

How ChatGPT Works: GPT-4o, RLHF, and the o1 Reasoning Models

From the base GPT model to RLHF fine-tuning to GPT-4o's native multimodality. What OpenAI's model family does and how o1/o3 reasoning models think differently.

ChatGPT didn't just launch a product. It launched a category. When it reached an estimated 100 million users in two months, making it the fastest-growing consumer application in history to that point, it wasn't because of the underlying technology alone. It was because OpenAI made the technology feel accessible. Understanding how that was achieved, technically, is one of the most instructive case studies in applied AI.

[Video: Andrej Karpathy — Intro to Large Language Models (the canonical 1-hour explanation of pretraining, RLHF, and how ChatGPT-style systems are built)]

The model family (as of 2025)

| Model | Context | Multimodal | Best for |
|---|---|---|---|
| GPT-4o | 128K | Text + image + audio | Most production tasks: fast and capable |
| GPT-4o-mini | 128K | Text + image | High-volume, cost-sensitive applications |
| o1 / o3 | 200K | Text + image | Hard reasoning: math, science, complex coding |
| o4-mini | 200K | Text + image | Efficient reasoning: o3-class quality at lower cost |

From text predictor to assistant: RLHF

The base GPT model, pre-trained on internet text, is a next-token predictor. It's very capable, but not an assistant — it completes patterns, not requests. ChatGPT's conversational, helpful behaviour comes from a fine-tuning process called Reinforcement Learning from Human Feedback (RLHF).

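RLHF is what makes a model feel like an assistant rather than a text predictor. In outline: the model is first fine-tuned on human-written demonstrations, human labellers then rank alternative responses to the same prompt, a reward model is trained to predict those rankings, and the assistant is optimised against that reward model with reinforcement learning. The base model knows language; RLHF teaches it to be helpful, follow instructions, acknowledge uncertainty, and avoid harmful outputs. Variants of the same technique underpin Claude, Gemini, and most frontier assistants.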
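To make the reward-modelling part of RLHF more concrete, here is a minimal sketch of the pairwise preference loss commonly used to train a reward model (a Bradley-Terry style objective). The function name and the scores are illustrative assumptions, not OpenAI's training code; in a real system the scores come from a neural network, and the trained reward model then guides a reinforcement learning step.

```python
import math


def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise loss -log(sigmoid(r_chosen - r_rejected)), written as log(1 + exp(-margin))."""
    margin = reward_chosen - reward_rejected
    return math.log1p(math.exp(-margin))


# Hypothetical reward-model scores for two responses to the same prompt.
# "chosen" is the response a human labeller preferred.
agrees = preference_loss(reward_chosen=2.0, reward_rejected=-1.0)
disagrees = preference_loss(reward_chosen=-1.0, reward_rejected=2.0)

# The loss is small when the reward model ranks the pair the way the human did,
# and large when it ranks them the other way round.
print(f"agrees with human: {agrees:.3f}, disagrees: {disagrees:.3f}")
```

Minimising this loss over many labelled pairs pushes the reward model to score preferred responses higher, which is what lets it stand in for human judgment during the reinforcement learning phase.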

GPT-4o: natively multimodal architecture

GPT-4o ("omni") was the first GPT model with native multimodality: text, image, and audio are processed within a single unified model rather than by separate specialist models stitched together. Earlier GPT-4 vision paired a text model with a separately attached vision encoder. GPT-4o unified them, which improves performance on tasks that require reasoning across modalities simultaneously.

Practically: GPT-4o can read a chart, understand a diagram, or interpret code in a screenshot and reason about it in the same forward pass as the text in your query. It's also significantly faster and cheaper than GPT-4 Turbo — making it the default for most API production deployments.
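As a rough illustration of sending text and an image in one request, the sketch below builds a single Chat Completions user message that carries both. The helper name `build_image_message` and the placeholder bytes are assumptions for this example; the API call itself is left commented out because it needs the `openai` package and an API key.

```python
import base64


def build_image_message(question: str, image_bytes: bytes, mime_type: str = "image/png") -> dict:
    """Build one user message containing a text part and an inline base64 image part."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {
                "type": "image_url",
                "image_url": {"url": f"data:{mime_type};base64,{encoded}"},
            },
        ],
    }


# Placeholder bytes stand in for a real screenshot or chart read from disk.
message = build_image_message("What trend does this chart show?", b"placeholder-image-bytes")

# Sending the message (requires the openai package and OPENAI_API_KEY):
# from openai import OpenAI
# reply = OpenAI().chat.completions.create(model="gpt-4o", messages=[message])
# print(reply.choices[0].message.content)
```

The question and the image travel in the same message, so the model receives them together rather than through a separate captioning step.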

The o1/o3/o4-mini reasoning models: a different paradigm

OpenAI's reasoning model series (o1, o3, o4-mini) takes a fundamentally different approach to hard problems. Rather than generating the answer directly, they produce an extended internal chain of reasoning — a 'thinking' scratchpad — before emitting a final response.

This inference-time compute scaling lets reasoning models outperform GPT-4o on tasks that benefit from deliberate step-by-step reasoning: competitive math, complex debugging, science problems, multi-step logic. The tradeoff is significantly higher latency (10–60 seconds for hard problems) and higher cost, since the hidden reasoning tokens are billed as output tokens.
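One way to see how this tradeoff surfaces for builders is in the request parameters. The sketch below assembles Chat Completions arguments for a reasoning model; `build_reasoning_request` is a hypothetical helper, the default budget is an arbitrary example value, and which models accept `reasoning_effort` should be checked against the current API reference.

```python
ALLOWED_EFFORTS = {"low", "medium", "high"}


def build_reasoning_request(prompt: str, effort: str = "medium",
                            max_completion_tokens: int = 4000) -> dict:
    """Assemble keyword arguments for a reasoning-model chat completion."""
    if effort not in ALLOWED_EFFORTS:
        raise ValueError(f"effort must be one of {sorted(ALLOWED_EFFORTS)}")
    return {
        "model": "o4-mini",
        "messages": [{"role": "user", "content": prompt}],
        # Higher effort generally means more hidden reasoning: slower and costlier.
        "reasoning_effort": effort,
        # This budget covers the hidden reasoning tokens plus the visible answer.
        "max_completion_tokens": max_completion_tokens,
    }


request = build_reasoning_request("Prove that the square root of 2 is irrational.", effort="high")

# Sending the request (requires the openai package and OPENAI_API_KEY):
# from openai import OpenAI
# response = OpenAI().chat.completions.create(**request)
# print(response.choices[0].message.content)
# print(response.usage.completion_tokens_details.reasoning_tokens)
```

Checking the reasoning token count in the usage data is a simple way to see how much of a response's cost went to thinking rather than to the visible answer.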

| Task type | GPT-4o | o1/o3/o4-mini |
|---|---|---|
| General Q&A and writing | Better: fast, fluent | Overkill: slow and expensive |
| Complex multi-step math | OK | Significantly better |
| Hard competitive coding | Good | Best available (o3 tops IOI) |
| Science reasoning (GPQA) | ~50% | ~80%+ (o3) |
| Instruction following | Excellent | Good but less nuanced |
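The comparison above can be turned into a simple routing rule. The sketch below is a hypothetical helper, not an OpenAI feature: the task labels, the choice of `o3`, and the 10-second threshold (taken from the lower end of the latency range mentioned earlier) are all assumptions to adapt to your own workload.

```python
# Task types where the comparison favours a reasoning model.
REASONING_TASKS = {"math", "competitive_coding", "science"}


def choose_model(task_type: str, max_latency_seconds: float = 60.0) -> str:
    """Pick a model name from the task type and how long the caller can wait."""
    needs_reasoning = task_type in REASONING_TASKS
    can_wait = max_latency_seconds >= 10.0
    # Use the reasoning model only when the task benefits and the latency budget allows.
    return "o3" if needs_reasoning and can_wait else "gpt-4o"


print(choose_model("math"))
print(choose_model("general_qa"))
print(choose_model("science", max_latency_seconds=3.0))
```

In practice a router like this would sit in front of the API client, so that everyday requests stay on the faster, cheaper model and only the hard ones pay for extended reasoning.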

The OpenAI API — key patterns for builders

from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()  # uses OPENAI_API_KEY

# Basic generation
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are an expert in AI systems."},
        {"role": "user", "content": "Explain RLHF in 3 sentences."}
    ]
)
print(response.choices[0].message.content)

# Structured output with Pydantic
class SentimentResult(BaseModel):
    sentiment: str  # positive | negative | neutral
    score: float    # 0-1
    reason: str

result = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Analyse: 'This RAG system is great but slow'"}],
    response_format=SentimentResult,
)
print(result.choices[0].message.parsed)  # type-safe, schema-validated

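# Streaming: stream=True yields chunks as tokens are generated
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a haiku about LLMs"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices:  # some chunks (e.g. usage-only) carry no choices
        print(chunk.choices[0].delta.content or "", end="", flush=True)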

OpenAI vs Claude — when to choose which

| Factor | GPT-4o / OpenAI | Claude / Anthropic |
|---|---|---|
| Multimodal (vision + audio) | Native, best-in-class audio | Vision strong, no native audio |
| Context window | 128K | 200K |
| Reasoning models | o1/o3/o4-mini: highly capable | Extended thinking: competitive |
| Tool use / function calling | Mature ecosystem, well-documented | Strong, native tool use |
| Safety / refusals | RLHF-based: binary refusals common | CAI-based: contextual, explains reasoning |
| Prompt caching | Available | Available, strong for long contexts |
| Community / ecosystem | Largest: most SDKs, integrations | Growing fast |
