AI Engineering 10 min read

The GPT-4 Technical Report: What OpenAI Told Us (and What They Didn't)

The most-read model paper that reveals almost nothing about architecture or training. What the report actually contains — benchmark analysis, safety evaluations, system card — and how to read it.

In March 2023, OpenAI released GPT-4 and published its technical report. By design, it's one of the least informative papers ever published about a major model. No architecture. No training data. No parameter count. No training process.

And yet it's one of the most-read AI documents — because it introduced the template every frontier model release now follows, and what it does disclose tells you a great deal about how to evaluate and deploy frontier models responsibly.

What the report actually contains

Benchmark results on academic evaluations (MMLU, HumanEval, bar exam, medical licensing exams)
A system card — known failure modes, risk evaluations, and red team findings
First public description of GPT-4's multimodal capabilities (image input)
Calibration analysis: how well stated confidence corresponds to actual accuracy
A predictable scaling result: GPT-4's benchmark performance was predictable from smaller model training runs

What the report deliberately omits: architecture, parameter count, training dataset, compute, RLHF methodology, safety training details. OpenAI cites 'competitive landscape and safety implications'. This set the precedent that frontier model papers are marketing documents with evaluation data attached.

The benchmark results in context

Benchmark	GPT-4 Score	What It Tests
MMLU	86.4%	Knowledge breadth across 57 academic subjects
HumanEval	67%	Python function completion from docstrings
Bar exam	90th percentile	Legal reasoning and memorisation
LMSYS Chatbot Arena	Varies	Human preference in head-to-head — more reliable

Benchmark results from a model's own technical report require scepticism. Always cross-reference with third-party evaluations like HELM and Chatbot Arena, which show independent rankings.

The system card: what red teams found

Hallucination: GPT-4 still confidently produces incorrect information, including fabricated citations
Sycophancy: the model agrees with users even when wrong, and can be talked into incorrect answers with pushback
Jailbreaking: adversarial prompts could bypass safety training (patched before release)
Unsafe content in low-resource languages: safety training less effective in underrepresented languages

Every major model ships with a system card. Reading it before deploying in production is as important as reading benchmark results. The system card tells you the known failure modes — ignore it and you'll rediscover them yourself.

Compare GPT-4 with other frontier models →: Benchmark GPT-4 against Claude and Gemini on standardised tasks.

Try it interactively

GenAI Systems Lab is a free platform for AI engineers — configure real failure modes, break things, and build the judgment that gets you hired.

Open GenAI Systems Lab →