How I'd Build an Internal Knowledge Base Search in 2025

A concrete walkthrough of the architecture, vendor choices, chunking strategy, and eval harness for an AI search system over company documentation.

Here's the brief: a Slack-scale company (2,000 employees) with 200,000 documents across Notion, Confluence, and Google Docs. They want AI-powered search — someone types a question in Slack, gets an answer with citations within 2 seconds at P95. They've tried OpenAI embeddings + a basic vector search and it's producing mediocre results. They want it done right this time.

This is a concrete system design walkthrough. Every major decision has a reason. Some of those reasons are things I got wrong the first time.

The Architecture Decision: Hybrid Search

The first decision is the most impactful: pure dense vector search or hybrid? After running both on a sample of 500 queries from the company's actual Slack history, hybrid search (dense vectors + BM25 keyword search, fused with Reciprocal Rank Fusion) outperformed pure vector search on 68% of queries, with particularly large wins on queries containing product names, employee names, and technical terms.

Dense vector search finds semantically similar content. BM25 finds exact keyword matches. Hybrid search with RRF gives you both, and for enterprise knowledge bases full of proper nouns, product names, and technical jargon, the exact-match component is often the difference between a useful answer and a useless one.

Query Type	Dense Only	BM25 Only	Hybrid (RRF)
'What is our parental leave policy?'	✓ Good	✓ Good	✓ Best
'Jira ticket PROD-2847'	✗ Often misses	✓ Good	✓ Good
'How does ProjectNova handle authentication?'	✗ Miss on code name	✓ Keyword match	✓ Best
'explain our deployment process'	✓ Good	✗ Too literal	✓ Best
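
To make the fusion step concrete, here is a minimal sketch of Reciprocal Rank Fusion in plain Python. The document IDs are made up for illustration, and k=60 is the commonly used default rather than a value tuned for this corpus. A managed store with built-in hybrid search does this for you; the sketch only shows what happens under the hood.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs into one list, best first.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID so output is deterministic
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results for 'How does ProjectNova handle authentication?'
dense_hits = ["doc_auth_overview", "doc_sso_guide", "doc_deploy"]
bm25_hits = ["doc_projectnova_auth", "doc_auth_overview", "doc_jira_faq"]

fused = reciprocal_rank_fusion([dense_hits, bm25_hits])
print(fused)
```

A document that appears in both lists (here doc_auth_overview) rises to the top, while the BM25-only hit on the code name still lands near the top instead of being lost.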

Build vs. Buy: The Vector Store Decision

The team was tempted to build on top of their existing Elasticsearch cluster. I've seen this go wrong three times now. Elasticsearch with the dense_vector field works. But a hybrid stack that runs BM25 in Elasticsearch and dense search in a sidecar pgvector instance, with your own RRF layer on top, means you own three systems instead of one. The operational overhead is significant.

The decision came down to: pgvector (if you're already on Postgres) for up to ~5M vectors, managed Weaviate for 5M+ vectors or if you want built-in BM25 hybrid search, and Qdrant for cost-sensitive self-hosted deployments. For 200K documents (roughly 1–2M chunks), managed Weaviate at ~$200/month is the right call. You get native hybrid search, built-in RRF, metadata filtering, and you don't manage infrastructure.

pgvector: best if you're already on Postgres, <5M vectors, want to minimize new dependencies
Weaviate managed: best for hybrid search out-of-the-box, 1M–50M vectors, willing to pay for managed infra
Qdrant: best for cost-sensitive self-hosted, excellent performance, more ops work
Pinecone: easiest to start with, expensive at scale, limited query-time metadata filtering options

Chunking Strategy: Markdown-Aware Semantic Chunking

The naive approach — fixed 512-token chunks with 64-token overlap — produces mediocre results for Notion/Confluence documents. These documents have headers, bullet points, tables, and code blocks. A fixed-size chunk frequently splits in the middle of a table or a header-to-content unit, destroying the semantic coherence of the chunk.

The better approach: markdown-aware chunking that respects document structure. Parse headers (H1, H2, H3) as natural boundaries. Never split in the middle of a table. Preserve each H2 section as a unit when it fits within the token limit. Apply fixed-size chunking only within sections that exceed the limit.

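# Requires the langchain-text-splitters and tiktoken packages
from langchain_text_splitters import (
    MarkdownHeaderTextSplitter,
    RecursiveCharacterTextSplitter,
)

headers_to_split = [
    ("#", "h1"), ("##", "h2"), ("###", "h3")
]

# Split on headers first; each chunk's metadata holds its header path
splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split)
header_chunks = splitter.split_text(document_content)

# Then apply token-level chunking within each header section.
# Sections under the limit pass through whole; metadata is carried over.
token_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=512, chunk_overlap=64
)
chunks = token_splitter.split_documents(header_chunks)

for chunk in chunks:
    # Snapshot the header path under its own key
    chunk.metadata["headers"] = dict(chunk.metadata)
    # chunk.metadata["doc_id"] = document.id
    # chunk.metadata["last_modified"] = document.last_modified
    # chunk.metadata["user_groups"] = document.permissions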

Target chunk size: 512 tokens (sweet spot for retrieval precision vs. context coverage)
Overlap: 64 tokens (enough for continuity without excessive duplication)
Preserve headers as metadata on each chunk — not just the content
Preserve doc_id, last_modified, and user_groups as metadata for access control and freshness filtering

The Embedding Pipeline

Embedding model choice: Voyage AI's voyage-3 offers the best quality-to-cost ratio for English enterprise content as of 2025. In the comparison below it scores higher than OpenAI text-embedding-3-large on MTEB retrieval at less than half the price per token. For a 200K-document corpus that needs daily incremental updates, the math matters.

Model	MTEB Retrieval Score	Cost per 1M tokens	Dim
Voyage voyage-3	~70.1	$0.06	1024
OpenAI text-embedding-3-large	~64.6	$0.13	3072
OpenAI text-embedding-3-small	~62.3	$0.02	1536
Cohere embed-v3	~64.5	$0.10	1024

For the initial 200K document corpus (roughly 300M tokens after chunking), total embedding cost is approximately $18. Daily incremental updates (assuming 1% churn = 2,000 docs/day) add roughly $0.50/day. Embedding is not the cost center — LLM synthesis at query time is.
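
The one-time figure is simple arithmetic, sketched below so you can plug in your own corpus size. The token count and prices are the ones quoted in this post; the function name is mine.

```python
def embedding_cost_usd(tokens, price_per_million_usd):
    """Cost of embedding `tokens` tokens at a given price per 1M tokens."""
    return tokens / 1_000_000 * price_per_million_usd


CORPUS_TOKENS = 300_000_000  # ~300M tokens after chunking

# Prices per 1M tokens, taken from the comparison table above
voyage_3_cost = embedding_cost_usd(CORPUS_TOKENS, 0.06)
openai_large_cost = embedding_cost_usd(CORPUS_TOKENS, 0.13)

print(f"voyage-3: ${voyage_3_cost:.2f}")
print(f"text-embedding-3-large: ${openai_large_cost:.2f}")
```

Even the more expensive model costs tens of dollars for a full re-index, which is why the embedding price rarely decides the architecture.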

The pipeline: document webhook → parse and clean → markdown-aware chunk → batch embed (512 chunks per API call) → upsert to Weaviate with metadata. To limit stale results (the most common failure mode), store last_modified on every chunk and re-embed whenever a webhook reports a document update.
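
The batching step can be sketched as below. The embed_batch callable is a placeholder for whatever embedding client you use (it is not a real library function), and the stub at the bottom exists only so the sketch runs without an API key. Check your provider's maximum batch size before settling on 512.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def embed_in_batches(
    chunks: Sequence[str],
    embed_batch: Callable[[List[str]], List[List[float]]],
    batch_size: int = 512,
) -> List[List[float]]:
    """Embed all chunks, one call per batch, preserving input order."""
    vectors: List[List[float]] = []
    for batch in batched(chunks, batch_size):
        vectors.extend(embed_batch(batch))
    return vectors


# Stub embedder (assumption): records batch sizes, returns 1-dim vectors
calls: List[int] = []


def fake_embed_batch(texts: List[str]) -> List[List[float]]:
    calls.append(len(texts))
    return [[float(len(t))] for t in texts]


vectors = embed_in_batches([f"chunk {i}" for i in range(1100)], fake_embed_batch)
print(calls)
```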

Access Control: Filter at Query Time, Not Index Time

This is the decision that most teams get wrong. If you have 2,000 employees with different document permissions, you have two choices:

Index-time isolation: maintain separate indices per permission group. Operationally painful — N indices to maintain, queries hit one index at a time.
Query-time filtering: index everything together. At query time, pass the user's group membership as a metadata filter to restrict which chunks are returned.

Query-time filtering is the right choice for enterprise knowledge bases. Weaviate supports metadata filters on vector queries natively. Store user_groups as a list on each chunk (e.g., ['engineering', 'all-company']). At query time, return only chunks whose user_groups list contains at least one of the requesting user's groups. This gives you correct access control without index proliferation.
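
The filter semantics are "contains any", which the plain-Python sketch below makes explicit. In production you would push this condition down to the vector store as a metadata filter so restricted chunks never leave the database; the chunk dictionaries and group names here are illustrative.

```python
def visible_chunks(chunks, requester_groups):
    """Keep chunks that share at least one group with the requester."""
    allowed = set(requester_groups)
    return [c for c in chunks if allowed & set(c["user_groups"])]


chunks = [
    {"id": "c1", "user_groups": ["all-company"]},
    {"id": "c2", "user_groups": ["engineering"]},
    {"id": "c3", "user_groups": ["finance", "leadership"]},
]

# An engineer who is also in the company-wide group
engineer_view = visible_chunks(chunks, ["engineering", "all-company"])
print([c["id"] for c in engineer_view])
```

A user with no groups sees nothing, which is the safe default.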

One critical gotcha: if you're using a caching layer (e.g., Redis) to cache query results, you must cache per (query, user_groups) pair — not just per query. A cached result for a query from an all-access admin should never be served to a restricted user.
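
A minimal sketch of a permission-aware cache key, assuming you key the cache on a hash. The "kbsearch:" prefix and the query normalization are my choices, not requirements. Sorting the groups ensures the same user gets the same key regardless of the order their groups arrive in.

```python
import hashlib
import json


def cache_key(query, user_groups):
    """Build a cache key from the normalized query and the user's group set."""
    normalized_query = " ".join(query.lower().split())
    payload = json.dumps(
        {"q": normalized_query, "groups": sorted(set(user_groups))},
        separators=(",", ":"),
    )
    return "kbsearch:" + hashlib.sha256(payload.encode("utf-8")).hexdigest()


admin_key = cache_key("parental leave policy", ["all-company", "hr-admin"])
restricted_key = cache_key("parental leave policy", ["all-company"])
print(admin_key != restricted_key)
```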

The Eval Harness

An AI search system without an eval harness is a guess that's hard to improve. We built a minimal but rigorous harness:

100 human-labeled query → relevant document pairs, sampled from actual Slack search queries
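Primary metric: nDCG@5 (normalized Discounted Cumulative Gain at rank 5), which gives a relevant doc less credit the lower it ranks, so a hit at rank 4 counts for less than a hit at rank 1
Secondary metric: MRR@10 (Mean Reciprocal Rank), the average of 1/rank of the first relevant result within the top 10, so it measures how high the first good result lands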
Weekly automated regression suite: if nDCG@5 drops >3% vs. baseline, flag for review before shipping
Monthly human refresh: 10 new labeled queries added to the eval set each month from recent Slack history

We deliberately chose a small, high-quality eval set over a large noisy one. 100 carefully labeled queries catch regressions reliably. 1,000 noisily labeled queries have enough label errors to obscure real regressions.
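
The metrics and the regression gate fit in a few lines. This sketch assumes binary relevance labels (a document is either relevant or not) and reads the 3% threshold as a relative drop against the baseline. Both are assumptions you may want to change.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k positions (rank starts at 1)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains[:k], start=1))


def ndcg_at_k(ranked_ids, relevant_ids, k=5):
    """nDCG@k with binary relevance: 1.0 if the doc is labeled relevant, else 0.0."""
    gains = [1.0 if doc_id in relevant_ids else 0.0 for doc_id in ranked_ids]
    ideal_gains = [1.0] * min(len(relevant_ids), k)
    ideal = dcg_at_k(ideal_gains, k)
    return dcg_at_k(gains, k) / ideal if ideal else 0.0


def reciprocal_rank_at_k(ranked_ids, relevant_ids, k=10):
    """1/rank of the first relevant doc in the top k, or 0.0 if there is none."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def is_regression(current, baseline, max_relative_drop=0.03):
    """True if the metric fell by more than the allowed relative drop."""
    return current < baseline * (1 - max_relative_drop)


# One labeled query: the single relevant doc was retrieved at rank 2
ranked = ["doc_b", "doc_a", "doc_c"]
relevant = {"doc_a"}
print(ndcg_at_k(ranked, relevant), reciprocal_rank_at_k(ranked, relevant))
```

Average ndcg_at_k and reciprocal_rank_at_k over all labeled queries to get the two headline numbers, then compare them to the stored baseline with is_regression.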

Real Numbers: Monthly Cost Breakdown

Component	Monthly Cost	Notes
Embedding (incremental updates)	~$12	~200M tokens/month at $0.06/1M
Weaviate managed (Starter tier)	$200	Up to 5M vectors, includes hybrid search
LLM synthesis (GPT-4o mini)	~$800	~4M queries/month at ~$0.0002 avg query cost
Redis caching layer	$50	Caches hot queries, reduces LLM calls by ~40%
Total	~$1,062/month	~$0.53/user/month for 2,000 users

LLM synthesis is the largest cost, roughly 75% of the total. This is typical. The database and embedding costs are relatively fixed; the synthesis cost scales with query volume and response length. Every optimization that cuts unnecessary LLM calls (caching, query classification to route simple queries to cheaper models) has an outsized impact.

Failure Modes We Hit in Production

Stale Document Retrieval

The most common failure. A document is updated in Notion, but the old version is still in the vector index. The webhook fires, but the embedding pipeline is queued and doesn't run for 4 hours. User asks a question that touches the updated doc, gets the old answer with full confidence.

Fix: store last_modified as metadata on every chunk. Add a freshness signal to the retrieval score: slightly down-rank chunks from documents that haven't been touched in 180+ days for time-sensitive query types (policy, process). For critical documents, re-embed immediately on any update, bypassing the queue.
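
A minimal sketch of the freshness signal, assuming retrieval scores are positive with higher meaning better. The 0.9 penalty is a placeholder I picked for illustration, so tune it against your eval set. The query type is assumed to come from an upstream classifier.

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(days=180)
TIME_SENSITIVE_TYPES = {"policy", "process"}
STALE_PENALTY = 0.9  # assumption: a mild multiplicative down-rank


def freshness_adjusted_score(score, last_modified, query_type, now=None):
    """Down-rank stale chunks, but only for time-sensitive query types."""
    now = now or datetime.now(timezone.utc)
    is_stale = now - last_modified >= STALE_AFTER
    if query_type in TIME_SENSITIVE_TYPES and is_stale:
        return score * STALE_PENALTY
    return score


now = datetime(2025, 6, 1, tzinfo=timezone.utc)
old_doc = now - timedelta(days=200)
adjusted = freshness_adjusted_score(0.8, old_doc, "policy", now=now)
print(round(adjusted, 2))
```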

Query-to-Document Length Mismatch

Short queries (3–5 words) retrieve poorly against long chunks. The embedding model produces a dense, information-rich embedding for a 400-token chunk; a 4-word query's embedding sits far from any chunk in the vector space even when semantically related.

Fix: HyDE (Hypothetical Document Embeddings). Instead of embedding the raw query, use a small LLM to generate a hypothetical answer, then embed that. The embedding of a hypothetical answer sits much closer to real answers in the vector space. For short queries, this improves retrieval quality measurably.
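
The idea can be sketched as a thin wrapper around your existing query embedding step. The generate and embed callables stand in for your LLM and embedding clients, the prompt wording is mine, and the five-word cutoff is an assumption based on the short-query range above. The stubs at the bottom only make the sketch runnable offline.

```python
from typing import Callable, List

HYDE_PROMPT = (
    "Write a short passage from internal company documentation "
    "that answers the question.\nQuestion: {query}\nPassage:"
)


def hyde_query_vector(
    query: str,
    generate: Callable[[str], str],
    embed: Callable[[str], List[float]],
    max_words_for_hyde: int = 5,
) -> List[float]:
    """Embed a hypothetical answer for short queries, the raw query otherwise."""
    if len(query.split()) <= max_words_for_hyde:
        text_to_embed = generate(HYDE_PROMPT.format(query=query))
    else:
        text_to_embed = query
    return embed(text_to_embed)


# Offline stand-ins (assumptions) for the LLM and the embedding model
embedded_texts: List[str] = []


def fake_generate(prompt: str) -> str:
    return "Employees may take up to N weeks of parental leave."


def fake_embed(text: str) -> List[float]:
    embedded_texts.append(text)
    return [float(len(text))]


short_vec = hyde_query_vector("parental leave policy", fake_generate, fake_embed)
long_vec = hyde_query_vector(
    "how do I request parental leave if I joined last month",
    fake_generate,
    fake_embed,
)
```

The extra LLM call adds latency, so gating it on query length keeps the 2-second P95 budget intact for queries that don't need it.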

PDF Table Extraction Failures

15% of the document corpus is PDFs exported from Google Docs. PDF parsing with standard libraries (PyMuPDF, pdfplumber) loses table formatting — what was a structured comparison table becomes unparseable concatenated text. The embedding model can't recover semantic structure from jumbled table cells.

Fix: route PDFs through a vision-capable model for table pages. Detect table regions (most PDF parsing libraries return bounding boxes), send the rendered page image to Claude or GPT-4o Vision, get back structured markdown. Expensive per-page but only needed for table-heavy pages.
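
The routing decision can be sketched as below. It assumes page objects that expose a find_tables() method returning a list of detected tables, which is how pdfplumber pages behave; the file name in the usage comment is hypothetical. Rendering the page and calling the vision model are left out.

```python
def split_pages_by_tables(pages):
    """Partition 0-based page indexes into (table_pages, text_pages)."""
    table_pages, text_pages = [], []
    for index, page in enumerate(pages):
        # find_tables() returns a list of detected tables (empty if none)
        if page.find_tables():
            table_pages.append(index)
        else:
            text_pages.append(index)
    return table_pages, text_pages


# Usage with pdfplumber (not executed here):
#
#   import pdfplumber
#   with pdfplumber.open("handbook.pdf") as pdf:
#       table_pages, text_pages = split_pages_by_tables(pdf.pages)
#
# Send table_pages to the vision model; parse text_pages the cheap way.
```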

What We'd Do Differently

Build the eval harness before the system, not after. We spent two weeks arguing about whether retrieval quality was 'good enough' before we had numbers. The harness takes one day to build.
Implement query logging from day one. You need actual user queries to build a realistic eval set. Synthetic queries are a poor substitute.
Start with 300-token chunks instead of 512. We've consistently found that smaller chunks improve precision at retrieval (less irrelevant content per chunk), even if they require more chunks per answer.
Budget for a document quality gate from the start. Not all 200K documents are worth indexing. Outdated policy docs, archived proposals, and draft documents create more noise than signal. A simple filter on last-modified date and document status before indexing would have saved significant cleanup work.
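
The quality gate from the last point can start as small as this sketch. The status values, the document fields, and the two-year cutoff are all assumptions for illustration, so map them to whatever your Notion, Confluence, and Google Docs connectors actually expose.

```python
from datetime import datetime, timedelta, timezone

EXCLUDED_STATUSES = {"archived", "draft"}
MAX_AGE = timedelta(days=730)  # assumption: skip docs untouched for ~2 years


def should_index(doc, now=None):
    """Return True if the document passes the status and age checks."""
    now = now or datetime.now(timezone.utc)
    if doc["status"] in EXCLUDED_STATUSES:
        return False
    return now - doc["last_modified"] <= MAX_AGE


now = datetime(2025, 6, 1, tzinfo=timezone.utc)
docs = [
    {"id": "d1", "status": "published", "last_modified": now - timedelta(days=30)},
    {"id": "d2", "status": "draft", "last_modified": now - timedelta(days=5)},
    {"id": "d3", "status": "published", "last_modified": now - timedelta(days=1200)},
]
to_index = [d["id"] for d in docs if should_index(d, now=now)]
print(to_index)
```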
