The GPT-4 Technical Report: What OpenAI Told Us (and What They Didn't)
The most-read model paper that reveals almost nothing about architecture or training. What the report actually contains — benchmark analysis, safety evaluations, system card — and how to read it.
In March 2023, OpenAI released GPT-4 and published its technical report. By design, it's one of the least informative papers ever published about a major model. No architecture. No training data. No parameter count. No training process.
And yet it's one of the most-read AI documents — because it introduced the template every frontier model release now follows, and what it does disclose tells you a great deal about how to evaluate and deploy frontier models responsibly.
What the report actually contains
- Benchmark results on academic evaluations (MMLU, HumanEval, bar exam, medical licensing exams)
- A system card — known failure modes, risk evaluations, and red team findings
- First public description of GPT-4's multimodal capabilities (image input)
- Calibration analysis: how well stated confidence corresponds to actual accuracy
- A predictable scaling result: GPT-4's benchmark performance was predictable from smaller model training runs
What the report deliberately omits: architecture, parameter count, training dataset, compute, RLHF methodology, safety training details. OpenAI cites 'competitive landscape and safety implications'. This set the precedent that frontier model papers are marketing documents with evaluation data attached.
The benchmark results in context
| Benchmark | GPT-4 Score | What It Tests |
|---|---|---|
| MMLU | 86.4% | Knowledge breadth across 57 academic subjects |
| HumanEval | 67% | Python function completion from docstrings |
| Bar exam | 90th percentile | Legal reasoning and memorisation |
| LMSYS Chatbot Arena | Varies | Human preference in head-to-head — more reliable |
Benchmark results from a model's own technical report require scepticism. Always cross-reference with third-party evaluations like HELM and Chatbot Arena, which show independent rankings.
The system card: what red teams found
- Hallucination: GPT-4 still confidently produces incorrect information, including fabricated citations
- Sycophancy: the model agrees with users even when wrong, and can be talked into incorrect answers with pushback
- Jailbreaking: adversarial prompts could bypass safety training (patched before release)
- Unsafe content in low-resource languages: safety training less effective in underrepresented languages
Every major model ships with a system card. Reading it before deploying in production is as important as reading benchmark results. The system card tells you the known failure modes — ignore it and you'll rediscover them yourself.
Compare GPT-4 with other frontier models →: Benchmark GPT-4 against Claude and Gemini on standardised tasks.
Try it interactively
GenAI Systems Lab is a free platform for AI engineers — configure real failure modes, break things, and build the judgment that gets you hired.
Open GenAI Systems Lab →