Process Reward Models: How o1 and o3 Learn to Think Step by Step
Why outcome-based RLHF struggles on multi-step reasoning, and how process reward models, which reward each step rather than only the final answer, relate to reasoning models such as o1, o3, and DeepSeek-R1.
Standard RLHF trains a reward model to score the final output. Good final answer: high reward. Bad final answer: low reward. For many tasks — chat quality, writing style, instruction following — this outcome-based reward works well. But for tasks that require multi-step reasoning, it fails in a specific and predictable way.
Imagine a model solving a math problem. It takes 8 reasoning steps. At step 4, it makes a subtle error. Steps 5–8 are internally consistent with the error. The final answer is wrong. An outcome reward model scores it low — but it can't tell the model where it went wrong. The model learns 'this reasoning trace is bad' without learning which step was the problem.
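To make the credit-assignment gap concrete, here is a small sketch of the 8-step example above. The step labels are invented for illustration, and marking steps 5 to 8 as incorrect because they build on the error is one possible labelling convention, not the only one.

```python
from typing import List, Optional

# Hypothetical step-level correctness for the 8-step trace described above:
# steps 1-3 are fine, step 4 introduces the error, steps 5-8 inherit it.
step_correct: List[bool] = [True, True, True, False, False, False, False, False]


def outcome_signal(final_answer_correct: bool, num_steps: int) -> List[float]:
    """An outcome reward gives one score; every step shares it."""
    reward = 1.0 if final_answer_correct else 0.0
    return [reward] * num_steps


def process_signal(labels: List[bool]) -> List[float]:
    """A process reward gives one score per step."""
    return [1.0 if ok else 0.0 for ok in labels]


def first_error(labels: List[bool]) -> Optional[int]:
    """Return the 1-based index of the first incorrect step, or None."""
    for i, ok in enumerate(labels, start=1):
        if not ok:
            return i
    return None


orm = outcome_signal(final_answer_correct=False, num_steps=len(step_correct))
prm = process_signal(step_correct)

print("outcome signal:", orm)   # same value for every step
print("process signal:", prm)   # differs per step
print("first error at step:", first_error(step_correct))
```

With the outcome signal, the three correct steps are penalised exactly as much as the faulty one; the process signal keeps them apart.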
Process Reward Models (PRMs) address this by rewarding each intermediate step instead of only the final answer. Step-level supervision of this kind is widely associated with 'thinking' models that produce extended chain-of-thought traces, such as OpenAI's o1 and o3, although OpenAI has not confirmed the details, and DeepSeek-R1 reports using outcome-based rewards rather than a PRM.
Outcome reward models vs. process reward models
| Aspect | Outcome Reward Model (ORM) | Process Reward Model (PRM) |
|---|---|---|
| What gets scored | Final answer only | Each reasoning step |
| Training signal | Sparse — one score per rollout | Dense — one score per step |
| Data required | Final answer correct/incorrect labels | Step-level correctness labels (expensive) |
| Good for | Instruction following, style, simple QA | Multi-step math, code, complex reasoning |
| Failure mode | Can't tell sound reasoning from flawed reasoning that happens to reach the right answer; can't localise the faulty step when the answer is wrong | Expensive to label; can overfit to step-level patterns |
How PRM training data is collected
The most expensive part of PRM training is labelling. For each reasoning trace, a human (or automated verifier) labels each step as correct or incorrect. This requires domain expertise — labelling whether step 4 of a calculus proof is logically valid is not a task you can outsource to general annotators.
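For domains with verifiable correctness (mathematics, formal logic, code execution), automated methods can replace much of the human labelling. Final answers can be checked mechanically, and step labels can be estimated from them, for example by sampling several completions from a given step and recording how often they reach the correct answer. This is why math was the proving ground for process reward models.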
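As a toy illustration of automated step labelling, the sketch below checks arithmetic steps of the form `a op b = c`. The step format and the `verify_step` and `label_steps` helpers are assumptions made for this example; real verifiers (proof checkers, unit tests, rollout-based estimators) are far more involved.

```python
import operator
import re
from typing import List

# Assumed step format for this sketch: "<int> <op> <int> = <int>".
STEP_PATTERN = re.compile(r"^\s*(-?\d+)\s*([+\-*])\s*(-?\d+)\s*=\s*(-?\d+)\s*$")
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def verify_step(step: str) -> bool:
    """Return True if the step parses and the stated result is right."""
    match = STEP_PATTERN.match(step)
    if match is None:
        return False  # unparseable steps are treated as incorrect
    left, op, right, claimed = match.groups()
    return OPS[op](int(left), int(right)) == int(claimed)


def label_steps(steps: List[str]) -> List[bool]:
    """Label every step of a trace independently."""
    return [verify_step(s) for s in steps]


# Hypothetical trace: the second step contains an arithmetic slip.
trace = ["12 * 4 = 48", "48 + 7 = 56", "56 - 6 = 50"]
labels = label_steps(trace)
print(labels)
```

The third step is locally valid even though it builds on the wrong value from the second step. Whether such steps count as correct is a labelling policy decision that a real pipeline has to make explicitly.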
The key insight of PRMs: for tasks with verifiable intermediate steps, you can provide a much denser reward signal than outcome-based approaches allow. This dense feedback can substantially improve reasoning quality, but only for tasks where intermediate steps can be verified.
Best-of-N search with PRMs
PRMs are often used not just for training but for inference-time search. The procedure is simple: generate N candidate reasoning traces, score each step of each trace with the PRM, combine the step scores into one score per trace, and select the trace with the highest score. This 'best-of-N' search trades extra inference compute for better answers on hard reasoning problems. Whether o1 and similar models use it internally has not been confirmed, but it is a well-documented way to spend more compute on verifying reasoning quality.
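A minimal sketch of the selection step, assuming the PRM has already produced a per-step score in [0, 1] for each candidate. The scores below are made up, and taking the minimum or the product of step scores are two common aggregation choices, not a description of any particular model's implementation.

```python
import math
from typing import Callable, List, Sequence


def aggregate_min(step_scores: Sequence[float]) -> float:
    """A trace is only as good as its weakest step."""
    return min(step_scores)


def aggregate_product(step_scores: Sequence[float]) -> float:
    """Treat scores as independent step-correctness probabilities."""
    return math.prod(step_scores)


def best_of_n(
    candidates: List[Sequence[float]],
    aggregate: Callable[[Sequence[float]], float] = aggregate_min,
) -> int:
    """Return the index of the candidate with the highest trace score."""
    if not candidates:
        raise ValueError("need at least one candidate")
    trace_scores = [aggregate(scores) for scores in candidates]
    return max(range(len(candidates)), key=lambda i: trace_scores[i])


# Hypothetical PRM step scores for N = 3 candidate traces.
candidates = [
    [0.9, 0.9, 0.2, 0.9],   # one weak step
    [0.8, 0.7, 0.8, 0.7],   # uniformly decent
    [0.95, 0.6, 0.9],       # shorter, one middling step
]

print(best_of_n(candidates, aggregate_min))
print(best_of_n(candidates, aggregate_product))
```

The two aggregators disagree on this data. The product penalises longer traces because every extra step multiplies in a factor below 1, which is worth keeping in mind when candidates differ in length.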
What this means for o1 and o3
OpenAI hasn't published full details of o1's training methodology, but the pattern is consistent with: (1) a base reasoning model trained with SFT and RLHF on chain-of-thought traces, (2) a PRM trained on step-level correctness labels for math and code tasks, (3) RL training using the PRM as the reward signal, (4) inference-time best-of-N search or beam search guided by the PRM. Under this reading, the 'thinking' behind o1's responses would be a CoT trace that was scored and selected by the PRM, though this remains an inference rather than a documented fact.
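DeepSeek-R1 showed that strong reasoning performance can be reached with openly released weights and a published recipe. Its report describes RL with Group Relative Policy Optimisation (GRPO) instead of PPO, which drops the separate value model and so reduces training cost. Notably, it uses rule-based outcome rewards (answer accuracy and output format) rather than a PRM.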
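The core idea of GRPO, computing advantages relative to a group of sampled completions rather than from a learned value model, can be sketched as below. This shows only the advantage normalisation; the clipped policy-gradient objective and KL penalty are omitted, the reward values are invented, and using the population standard deviation is a choice made for this example.

```python
import statistics
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalise each reward against the group's mean and std deviation."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps avoids division by zero when every completion gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for 4 completions sampled for the same prompt
# (1.0 = final answer verified correct, 0.0 = incorrect).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Completions that beat the group average get a positive advantage and the rest get a negative one. If every completion in the group receives the same reward, all advantages are zero and that prompt contributes no learning signal.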
For teams building reasoning-intensive AI systems: you don't need to train your own PRM. You can approximate PRM-guided inference by using an LLM-as-judge to score intermediate reasoning steps at inference time, then selecting the best trace from N rollouts. This is expensive at inference time but requires no training.
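One way to wire this up is sketched below. The `generate` and `judge` callables are placeholders for your own model calls (no real API is assumed), and the stubs at the bottom exist only so the example runs end to end.

```python
from typing import Callable, List, Tuple

Trace = List[str]  # one string per reasoning step
Judge = Callable[[str, Trace, str], float]  # (question, previous steps, step) -> score


def score_trace(question: str, trace: Trace, judge: Judge) -> float:
    """Score each step given the steps before it; keep the weakest score."""
    scores = [judge(question, trace[:i], step) for i, step in enumerate(trace)]
    return min(scores) if scores else 0.0


def judged_best_of_n(
    question: str,
    n: int,
    generate: Callable[[str], Trace],
    judge: Judge,
) -> Tuple[Trace, float]:
    """Sample n traces and return the best one together with its score."""
    rollouts = [generate(question) for _ in range(n)]
    scored = [(score_trace(question, t, judge), t) for t in rollouts]
    best_score, best_trace = max(scored, key=lambda pair: pair[0])
    return best_trace, best_score


# Stubs standing in for real model calls (assumptions for the demo only).
_canned = iter([
    ["2 + 2 = 5", "5 * 3 = 15"],
    ["2 + 2 = 4", "4 * 3 = 12"],
])
BAD_STEPS = {"2 + 2 = 5"}


def fake_generate(question: str) -> Trace:
    return next(_canned)


def fake_judge(question: str, previous: Trace, step: str) -> float:
    # A real judge would prompt an LLM; this stub looks up known-bad steps.
    return 0.0 if step in BAD_STEPS else 1.0


trace, score = judged_best_of_n("What is (2 + 2) * 3?", 2, fake_generate, fake_judge)
print(trace, score)
```

The cost is n generations plus one judge call per step per trace, so it grows with both the number of rollouts and the trace length.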