
How Perplexity Serves Millions of AI Searches: Search → LLM → Citations at Scale

Perplexity's real-time search synthesis pipeline — parallel source retrieval, grounding LLM output to sources, citation accuracy, model routing, and how they hit <2s TTFT at production scale.

Perplexity is a real-time search engine that replaces ten blue links with a synthesized answer that cites its sources. Every query triggers a pipeline that retrieves from the live web, synthesizes an answer, and cites exactly which source each claim came from — in under two seconds.

Building this at scale means solving three hard problems simultaneously: freshness (the web changes by the minute), groundedness (every claim must trace back to a source), and latency (users expect near-instant results). Most RAG systems have to solve only one of these at a time.

The pipeline: search → retrieve → synthesize

When a query arrives, Perplexity runs multiple operations in parallel:
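As a rough illustration of this kind of fan-out, the sketch below queries several retrieval backends concurrently and drops any that miss a deadline. The backend names, simulated latencies, and the `fetch_source` helper are all hypothetical stand-ins; nothing here reflects Perplexity's actual services.

```python
import asyncio


async def fetch_source(name: str, delay: float, query: str) -> dict:
    # Stand-in for a real network call (search API, index lookup, page fetch).
    await asyncio.sleep(delay)
    return {"source": name, "passages": [f"{name} result for {query!r}"]}


async def retrieve_parallel(query: str, timeout: float = 0.5) -> list[dict]:
    # Hypothetical backends mapped to simulated latencies in seconds.
    backends = {"web_index": 0.05, "news_index": 0.08, "slow_crawler": 2.0}
    tasks = [
        asyncio.create_task(fetch_source(name, delay, query))
        for name, delay in backends.items()
    ]
    # Wait for everything that finishes before the deadline.
    done, pending = await asyncio.wait(tasks, timeout=timeout)
    # Drop stragglers instead of letting one slow source block the answer.
    for task in pending:
        task.cancel()
    await asyncio.gather(*pending, return_exceptions=True)
    return sorted((t.result() for t in done), key=lambda r: r["source"])


results = asyncio.run(retrieve_parallel("latest python release"))
for r in results:
    print(r["source"], r["passages"])
```

With this shape, retrieval time is bounded by the deadline rather than by the slowest backend, at the cost of sometimes answering from fewer sources.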

The hardest part isn't retrieval; it's grounding. An LLM synthesizing from multiple sources will naturally blend information, add context from its training data, and sometimes fabricate details that 'fit' with what it retrieved. Perplexity's system prompt is engineered to suppress all three behaviors.
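One common way to push a model toward grounded output is to number the retrieved passages and instruct it to answer only from them. The sketch below shows that general idea; the wording of the instructions and the `build_grounded_prompt` helper are illustrative assumptions, not Perplexity's actual prompt.

```python
def build_grounded_prompt(query: str, passages: list[str]) -> str:
    # Number passages from 1 so citation markers like [2] map back to a source.
    numbered = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer the question using ONLY the numbered sources below.\n"
        "Cite the supporting source after each sentence as [n].\n"
        "If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{numbered}\n\n"
        f"Question: {query}\n"
    )


prompt = build_grounded_prompt(
    "When was X released?",
    ["X was released in 2021.", "X is written in Rust."],
)
print(prompt)
```

Instructions like these reduce blending and fabrication but do not guarantee grounding, which is why a separate verification step is still needed.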

Citation accuracy: the hardest problem

Every sentence in a Perplexity answer has a citation number. Click it and you go to the source. This looks simple. It is not.

The challenge: LLMs don't naturally generate sentence-level citations. They synthesize from multiple sources simultaneously. A sentence might blend information from three different sources — which one do you cite?

The production solution is post-generation attribution. After the LLM generates the answer, a separate step checks each sentence against the retrieved passages using natural language inference (NLI) entailment. If a sentence is entailed by passage 3, it gets citation [3]. If a sentence isn't entailed by any retrieved passage, it is treated as unsupported (most likely recalled from training data rather than taken from the sources) and gets flagged or removed.
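A minimal sketch of that attribution loop follows. The entailment scorer is pluggable; `toy_entailment_score` is only a word-overlap placeholder so the example runs without a model, and a real system would swap in an NLI classifier. The function names and the 0.8 threshold are assumptions for illustration.

```python
import re
from typing import Callable, Optional


def _tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def toy_entailment_score(premise: str, hypothesis: str) -> float:
    # Placeholder ONLY: fraction of hypothesis words that appear in the premise.
    # This is lexical overlap, not real entailment; replace with an NLI model.
    hyp = _tokens(hypothesis)
    return len(hyp & _tokens(premise)) / len(hyp) if hyp else 0.0


def attribute(
    answer: str,
    passages: list[str],
    score: Callable[[str, str], float] = toy_entailment_score,
    threshold: float = 0.8,
) -> list[tuple[str, Optional[int]]]:
    # Naive sentence split on terminal punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    if not passages:
        return [(s, None) for s in sentences]
    attributed = []
    for sent in sentences:
        scores = [score(p, sent) for p in passages]
        best = max(range(len(passages)), key=scores.__getitem__)
        # 1-based citation number, or None when no passage supports the sentence.
        attributed.append((sent, best + 1 if scores[best] >= threshold else None))
    return attributed


passages = [
    "The Eiffel Tower is 330 metres tall.",
    "The Eiffel Tower opened in 1889 in Paris.",
]
answer = "The Eiffel Tower opened in 1889. It was painted green in 2020."
attributed = attribute(answer, passages)
for sentence, citation in attributed:
    print(f"[{citation}]" if citation else "[UNSUPPORTED]", sentence)
```

Here the first sentence is attributed to passage 2, while the second matches no passage and would be flagged or dropped.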

NLI-based citation verification only catches claims that no retrieved passage supports. It won't catch cases where the model subtly misrepresents what a source says while staying technically consistent with it. Citation verification is a floor, not a ceiling.

Model routing: Sonar and beyond

Not every query needs GPT-4 or Claude Opus. Perplexity uses model routing to match query complexity to model capability:

This is a textbook model routing pattern: use a cheap, specialized model for the common case, route to expensive frontier models for the tail. The savings are significant at Perplexity's scale — millions of queries per day.
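The routing pattern can be sketched with a simple heuristic classifier. Everything below is an assumption for illustration: the tier names, the keyword list, and the word-count cutoff are hypothetical, and a production router would more likely use a trained classifier than string matching.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    model: str
    reason: str


# Hypothetical tier names; not Perplexity's actual model identifiers.
FAST_MODEL = "small-search-model"
FRONTIER_MODEL = "frontier-model"

# Crude signals that a query needs multi-step reasoning.
REASONING_HINTS = ("compare", "why", "prove", "step by step", "trade-off", "analyze")


def route(query: str, max_fast_words: int = 20) -> Route:
    q = query.lower()
    if any(hint in q for hint in REASONING_HINTS):
        return Route(FRONTIER_MODEL, "reasoning keyword")
    if len(q.split()) > max_fast_words:
        return Route(FRONTIER_MODEL, "long query")
    # Common case: short factual lookups go to the cheap model.
    return Route(FAST_MODEL, "default")


for q in ["capital of France", "Compare Postgres and MySQL for analytics"]:
    print(q, "->", route(q))
```

The design choice is that the default path is the cheap one: a query has to show evidence of complexity before it is escalated.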

Latency engineering: how they hit <2s

Two seconds end-to-end for search + retrieval + LLM synthesis sounds impossible. It's achievable with aggressive parallelism and streaming:

Streaming changes perceived latency more than it changes actual latency. Users rate a 3-second response with streaming as faster than a 2-second response that appears all at once. Prioritize time to first token (TTFT) over total latency when designing user-facing LLM systems.
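To make the distinction concrete, the sketch below measures time to first token separately from total latency on a simulated token stream. `fake_token_stream` and its delays are stand-ins for a real streaming LLM response; the timings are arbitrary.

```python
import asyncio
import time


async def fake_token_stream(n_tokens: int, first_delay: float, per_token: float):
    # Stand-in for a streaming LLM response: a startup delay, then steady tokens.
    await asyncio.sleep(first_delay)
    for i in range(n_tokens):
        yield f"tok{i} "
        await asyncio.sleep(per_token)


async def measure(stream) -> dict:
    start = time.perf_counter()
    ttft = None
    count = 0
    async for _token in stream:
        if ttft is None:
            # The user sees output from this moment on.
            ttft = time.perf_counter() - start
        count += 1
    total = time.perf_counter() - start
    return {"ttft": ttft, "total": total, "tokens": count}


stats = asyncio.run(measure(fake_token_stream(10, first_delay=0.05, per_token=0.01)))
print(f"TTFT: {stats['ttft']:.3f}s, total: {stats['total']:.3f}s")
```

Tracking the two numbers separately makes it clear which optimizations help the user's first impression and which only shorten the tail of the response.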

Lessons from Perplexity's architecture
