How I'd Build an Internal Knowledge Base Search in 2025

A concrete walkthrough of the architecture, vendor choices, chunking strategy, and eval harness for an AI search system over company documentation.

Here's the brief: a Slack-scale company (2,000 employees) with 200,000 documents across Notion, Confluence, and Google Docs. They want AI-powered search — someone types a question in Slack, gets an answer with citations within 2 seconds at P95. They've tried OpenAI embeddings + a basic vector search and it's producing mediocre results. They want it done right this time.

This is a concrete system design walkthrough. Every major decision has a reason. Some of those reasons are things I got wrong the first time.

The Architecture Decision: Hybrid Search

The first decision is the most impactful: pure dense vector search or hybrid? After running both on a sample of 500 queries from the company's actual Slack history, hybrid search (dense vectors + BM25 keyword search, fused with Reciprocal Rank Fusion) outperformed pure vector search on 68% of queries, with particularly large wins on queries containing product names, employee names, and technical terms.

Dense vector search finds semantically similar content. BM25 finds exact keyword matches. Hybrid search with RRF fusion gives you both — and for enterprise knowledge bases with proper nouns, product names, and technical jargon, the exact-match component is often the difference between a useful answer and a useless one.

| Query Type | Dense Only | BM25 Only | Hybrid (RRF) |
| --- | --- | --- | --- |
| 'What is our parental leave policy?' | ✓ Good | ✓ Good | ✓ Best |
| 'Jira ticket PROD-2847' | ✗ Often misses | ✓ Good | ✓ Good |
| 'How does ProjectNova handle authentication?' | ✗ Miss on code name | ✓ Keyword match | ✓ Best |
| 'explain our deployment process' | ✓ Good | ✗ Too literal | ✓ Best |
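
For reference, Reciprocal Rank Fusion itself is only a few lines. The sketch below is a minimal illustration, not the fusion code any particular vector store runs: each document scores the sum of 1 / (k + rank) over every ranked list it appears in. The function name `rrf_fuse`, the constant k = 60 (a commonly used default), and the document IDs are assumptions for the example.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of doc IDs with Reciprocal Rank Fusion.

    Each document receives 1 / (k + rank) from every list it appears in,
    where rank starts at 1. k=60 is a commonly used default (assumption).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists from the two retrievers.
dense = ["doc_a", "doc_b", "doc_c"]
bm25 = ["doc_c", "doc_a", "doc_d"]
fused = rrf_fuse([dense, bm25])
```

Documents that rank well in both lists (here `doc_a` and `doc_c`) rise above documents that only one retriever found, which is the behavior that helps with proper nouns and ticket IDs.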

Build vs. Buy: The Vector Store Decision

The team was tempted to build on top of their existing Elasticsearch cluster. I've seen this go wrong three times now. Elasticsearch with a dense_vector field works, but the stack teams tend to end up with (BM25 in Elasticsearch, dense search in a sidecar pgvector instance, and a hand-written RRF fusion layer on top) means you own three systems instead of one. The operational overhead is significant.

The decision came down to: pgvector (if you're already on Postgres) for up to ~5M vectors, managed Weaviate for 5M+ vectors or if you want built-in BM25 hybrid search, and Qdrant for cost-sensitive self-hosted deployments. For 200K documents (roughly 1–2M chunks), managed Weaviate at ~$200/month is the right call. You get native hybrid search, built-in RRF, metadata filtering, and you don't manage infrastructure.

Chunking Strategy: Markdown-Aware Semantic Chunking

The naive approach — fixed 512-token chunks with 64-token overlap — produces mediocre results for Notion/Confluence documents. These documents have headers, bullet points, tables, and code blocks. A fixed-size chunk frequently splits in the middle of a table or a header-to-content unit, destroying the semantic coherence of the chunk.

The better approach: markdown-aware chunking that respects document structure. Parse headers (H1, H2, H3) as natural boundaries. Never split in the middle of a table. Preserve each H2 section as a unit when it fits within the token limit. Apply fixed-size chunking only within sections that exceed the limit.

# Recent LangChain releases ship the splitters in the langchain-text-splitters package
from langchain_text_splitters import MarkdownHeaderTextSplitter

headers_to_split = [
    ("#", "h1"), ("##", "h2"), ("###", "h3")
]

splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split)
header_chunks = splitter.split_text(document_content)

# Then apply token-level chunking within each header section
# but preserve header metadata for each chunk
for chunk in header_chunks:
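    # Copy the header fields (h1/h2/h3) into their own dict; assigning
    # chunk.metadata to itself would create a self-referencing dict.
    chunk.metadata["headers"] = dict(chunk.metadata)  # propagate header path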
    # chunk.metadata["doc_id"] = document.id
    # chunk.metadata["last_modified"] = document.last_modified
    # chunk.metadata["user_groups"] = document.permissions
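
The second stage, splitting a section that exceeds the token limit, can be sketched in plain Python. This is one possible approach under stated assumptions, not the splitter's built-in behavior: it packs blank-line-separated blocks (paragraphs, lists, tables) greedily and never cuts inside a block, so a table always stays whole. The names `count_tokens` and `split_section` are hypothetical, the whitespace token count is a crude stand-in for a real tokenizer, and the input is assumed to be raw section text that still has blank lines between blocks.

```python
def count_tokens(text):
    # Crude stand-in for a real tokenizer: whitespace-separated words.
    return len(text.split())


def split_section(text, max_tokens=512):
    """Greedily pack blank-line-separated blocks into chunks.

    A block (paragraph, list, or table) is never split, so a single
    block larger than max_tokens becomes its own over-limit chunk.
    """
    blocks = [b.strip() for b in text.split("\n\n") if b.strip()]
    chunks, current, current_tokens = [], [], 0
    for block in blocks:
        n = count_tokens(block)
        # Start a new chunk if adding this block would exceed the limit.
        if current and current_tokens + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_tokens = [], 0
        current.append(block)
        current_tokens += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks


section = "one two three\n\n| a | b |\n| 1 | 2 |\n\nfour five"
pieces = split_section(section, max_tokens=8)
```

With a limit of 8 tokens the example yields three chunks, and the two-row table lands intact in the middle one even though it is over the limit on its own.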

The Embedding Pipeline

Embedding model choice: Voyage AI's voyage-3 offers the best quality-to-cost ratio for English enterprise content as of 2025. In the comparison below it scores higher than OpenAI text-embedding-3-large on the MTEB retrieval benchmark at less than half the price per token. For a 200K-document corpus that needs daily incremental updates, the math matters.

| Model | MTEB Retrieval Score | Cost per 1M tokens | Dimensions |
| --- | --- | --- | --- |
| Voyage voyage-3 | ~70.1 | $0.06 | 1024 |
| OpenAI text-embedding-3-large | ~64.6 | $0.13 | 3072 |
| OpenAI text-embedding-3-small | ~62.3 | $0.02 | 1536 |
| Cohere embed-v3 | ~64.5 | $0.10 | 1024 |

For the initial 200K document corpus (roughly 300M tokens after chunking), total embedding cost is approximately $18. Daily incremental updates (assuming 1% churn = 2,000 docs/day) add roughly $0.50/day. Embedding is not the cost center — LLM synthesis at query time is.
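
The initial-corpus figure is simple arithmetic, sketched here so you can plug in your own token counts. Prices are the per-1M-token figures from the table above; the helper name `embedding_cost` is an assumption for illustration.

```python
# Price per 1M tokens in USD, taken from the comparison table above.
PRICE_PER_1M_TOKENS = {
    "voyage-3": 0.06,
    "text-embedding-3-large": 0.13,
    "text-embedding-3-small": 0.02,
    "embed-v3": 0.10,
}


def embedding_cost(tokens, price_per_1m):
    """Cost in USD to embed `tokens` tokens at the given price."""
    return tokens / 1_000_000 * price_per_1m


corpus_tokens = 300_000_000  # ~300M tokens after chunking
initial_costs = {
    model: round(embedding_cost(corpus_tokens, price), 2)
    for model, price in PRICE_PER_1M_TOKENS.items()
}
```

For voyage-3 this gives the $18 quoted above; even the most expensive model in the table stays under $40 for the full corpus.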

The pipeline: document webhook → parse and clean → markdown-aware chunk → batch embed (512 chunks per API call) → upsert to Weaviate with metadata. To guard against stale documents (the most common failure mode), store last_modified on each chunk and re-embed whenever a webhook reports a document update.
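
The batch-embed step can be sketched independently of any vendor SDK. In this minimal version the embedding call is injected as a function, so `embed_batch` is a placeholder for whatever client you use and `fake_embed_batch` exists only to make the example runnable; check your provider's actual per-request batch limit before fixing the batch size.

```python
def batched(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_all(chunks, embed_batch, batch_size=512):
    """Embed all chunks, calling `embed_batch` once per batch.

    `embed_batch` is a caller-supplied function (hypothetical here) that
    takes a list of strings and returns one vector per string.
    """
    vectors = []
    for batch in batched(chunks, batch_size):
        vectors.extend(embed_batch(list(batch)))
    return vectors


call_sizes = []


def fake_embed_batch(texts):
    # Stand-in for a real embedding API call; records batch sizes.
    call_sizes.append(len(texts))
    return [[float(len(t))] for t in texts]


vectors = embed_all([f"chunk {i}" for i in range(1200)], fake_embed_batch)
```

For 1,200 chunks this makes three calls (512, 512, and 176 chunks) and returns vectors in the same order as the input, which keeps the upsert step a simple zip over chunks and vectors.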

Access Control: Filter at Query Time, Not Index Time

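This is the decision that most teams get wrong. If you have 2,000 employees with different document permissions, you have two choices: build a separate index per permission group at index time, or keep a single index and filter by permission at query time.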

Query-time filtering is the right choice for enterprise knowledge bases. Weaviate supports metadata filters on vector queries natively. Store user_groups as a list on each chunk (e.g., ['engineering', 'all-company']). At query time, filter where user_groups contains the requesting user's groups. This gives you correct access control without index proliferation.
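
The filter semantics are worth pinning down precisely: a chunk is visible when its user_groups list shares at least one group with the requesting user. The sketch below expresses that predicate in plain Python purely to illustrate the logic; in production the same condition should run inside the vector store as part of the query, because filtering after retrieval can leave you with fewer than the requested number of results. The function names and sample data are hypothetical.

```python
def visible_to(chunk_groups, user_groups):
    """True if the chunk and the user share at least one group."""
    return not set(chunk_groups).isdisjoint(user_groups)


def filter_hits(hits, user_groups):
    """Keep only the hits the requesting user is allowed to see."""
    return [h for h in hits if visible_to(h["user_groups"], user_groups)]


# Hypothetical search hits with the user_groups metadata stored per chunk.
hits = [
    {"chunk_id": "c1", "user_groups": ["engineering", "all-company"]},
    {"chunk_id": "c2", "user_groups": ["finance"]},
    {"chunk_id": "c3", "user_groups": ["all-company"]},
]
visible = filter_hits(hits, user_groups=["engineering", "all-company"])
```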

One critical gotcha: if you're using a caching layer (e.g., Redis) to cache query results, you must cache per (query, user_groups) pair — not just per query. A cached result for a query from an all-access admin should never be served to a restricted user.
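
A minimal sketch of such a cache key, assuming a string-keyed cache like Redis: hash the normalized query together with the sorted set of the user's groups, so group order doesn't matter but group membership does. The `kbsearch:` prefix and the normalization rules are assumptions for illustration.

```python
import hashlib
import json


def cache_key(query, user_groups):
    """Build a cache key scoped to both the query and the user's groups."""
    # Normalize case and whitespace so trivially different queries share a key.
    normalized_query = " ".join(query.lower().split())
    # Sort and de-duplicate groups so their order does not change the key.
    payload = json.dumps([normalized_query, sorted(set(user_groups))])
    return "kbsearch:" + hashlib.sha256(payload.encode("utf-8")).hexdigest()


admin_key = cache_key("Parental leave policy", ["all-company", "hr-admin"])
same_admin_key = cache_key("parental  leave policy", ["hr-admin", "all-company"])
restricted_key = cache_key("parental leave policy", ["all-company"])
```

The two admin keys match, while the restricted user gets a different key and therefore never reads a result that was computed with broader permissions.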

The Eval Harness

An AI search system without an eval harness is a guess that's hard to improve. We built a minimal but rigorous harness.

We deliberately chose a small, high-quality eval set over a large noisy one. One hundred carefully labeled queries catch regressions reliably; 1,000 noisily labeled queries contain enough label errors to obscure real ones.
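
As one way to score a labeled set like this, the sketch below computes recall@k and mean reciprocal rank over (query, relevant document IDs) pairs. The choice of these two metrics, the function names, and the toy data are assumptions for illustration rather than a description of the harness above; `search` stands in for whatever retrieval function you are evaluating.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant docs that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant result, or 0.0 if none is found."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(eval_set, search, k=5):
    """Average recall@k and MRR of `search` over a labeled eval set."""
    recalls, rrs = [], []
    for example in eval_set:
        retrieved = search(example["query"])
        recalls.append(recall_at_k(retrieved, example["relevant"], k))
        rrs.append(reciprocal_rank(retrieved, example["relevant"]))
    n = len(eval_set)
    return {"recall@k": sum(recalls) / n, "mrr": sum(rrs) / n}


# Toy eval set and a canned search function, for illustration only.
eval_set = [
    {"query": "q1", "relevant": {"d2"}},
    {"query": "q2", "relevant": {"d9"}},
]
canned_results = {"q1": ["d1", "d2", "d3"], "q2": ["d4", "d5"]}
metrics = evaluate(eval_set, lambda q: canned_results[q], k=5)
```

Running the same eval set before and after every chunking, embedding, or ranking change turns "it feels better" into a number you can compare.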

Real Numbers: Monthly Cost Breakdown

| Component | Monthly Cost | Notes |
| --- | --- | --- |
| Embedding (incremental updates) | ~$12 | ~200M tokens/month at $0.06/1M |
| Weaviate managed (Starter tier) | $200 | Up to 5M vectors, includes hybrid search |
| LLM synthesis (GPT-4o mini) | ~$800 | ~4M queries/month at ~$0.0002 avg query cost |
| Redis caching layer | $50 | Caches hot queries, reduces LLM calls by ~40% |
| Total | ~$1,062/month | ~$0.53/user/month for 2,000 users |

LLM synthesis is the largest cost, about 75% of the total. This is typical. The database and embedding costs are relatively fixed; the synthesis cost scales with query volume and response length. Every optimization that reduces unnecessary LLM calls (caching, query classification to route simple queries to cheaper models) has an outsized impact.

Failure Modes We Hit in Production

Stale Document Retrieval

The most common failure. A document is updated in Notion, but the old version is still in the vector index. The webhook fires, but the embedding pipeline is queued and doesn't run for 4 hours. User asks a question that touches the updated doc, gets the old answer with full confidence.

Fix: store last_modified as metadata on every chunk. Add a freshness signal to the retrieval score: for time-sensitive query types (policy, process), slightly down-rank chunks from documents that haven't been touched in 180+ days. For critical documents, re-embed immediately on any update, bypassing the queue.
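
A minimal sketch of the freshness signal, assuming retrieval scores where higher is better and a query classifier that has already labeled the query type. The function name, the 0.9 multiplier, and the query-type labels are illustrative assumptions; only the 180-day threshold and the policy/process categories come from the text above.

```python
from datetime import datetime, timedelta, timezone

TIME_SENSITIVE_QUERY_TYPES = {"policy", "process"}


def freshness_adjusted(score, last_modified, query_type, now,
                       stale_after_days=180, penalty=0.9):
    """Down-rank stale chunks, but only for time-sensitive query types."""
    if query_type not in TIME_SENSITIVE_QUERY_TYPES:
        return score
    if now - last_modified >= timedelta(days=stale_after_days):
        return score * penalty  # mild multiplicative penalty (assumption)
    return score


now = datetime(2025, 6, 1, tzinfo=timezone.utc)
old_doc = now - timedelta(days=200)
recent_doc = now - timedelta(days=10)

stale_policy_score = freshness_adjusted(0.8, old_doc, "policy", now)
fresh_policy_score = freshness_adjusted(0.8, recent_doc, "policy", now)
stale_howto_score = freshness_adjusted(0.8, old_doc, "how_to", now)
```

Keeping the penalty mild matters: an old document that is still the only correct answer should drop a few positions, not disappear from the results.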

Query-to-Document Length Mismatch

Short queries (3–5 words) retrieve poorly against long chunks. The embedding model produces a dense, information-rich embedding for a 400-token chunk; a 4-word query's embedding sits far from any chunk in the vector space even when semantically related.

Fix: HyDE (Hypothetical Document Embeddings). Instead of embedding the raw query, use a small LLM to generate a hypothetical answer, then embed that. The embedding of a hypothetical answer sits much closer to real answers in the vector space. For short queries, this improves retrieval quality measurably.
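
The HyDE step can be sketched with the LLM and embedding calls injected as functions, so nothing here depends on a specific vendor. `generate` and `embed` are placeholders for your own clients, the prompt wording is illustrative, and the word-count threshold for deciding when a query counts as short is an assumption; the stub functions exist only to make the example runnable.

```python
def hyde_query_vector(query, generate, embed, min_words_for_raw=6):
    """Return the vector to search with, using HyDE for short queries.

    Long queries are embedded directly. Short queries are first expanded
    into a hypothetical answer passage, and that passage is embedded.
    """
    if len(query.split()) >= min_words_for_raw:
        return embed(query)
    prompt = f"Write a short passage that answers the question: {query}"
    hypothetical_answer = generate(prompt)
    return embed(hypothetical_answer)


# Stubs standing in for a real LLM and a real embedding model.
def fake_generate(prompt):
    return "Placeholder passage standing in for an LLM answer to: " + prompt


def fake_embed(text):
    return [float(len(text.split()))]


short_vec = hyde_query_vector("parental leave policy", fake_generate, fake_embed)
long_vec = hyde_query_vector(
    "how do I request parental leave this year", fake_generate, fake_embed
)
```

Note that HyDE adds an LLM call to every short query, so it trades latency and cost for retrieval quality; the hypothetical answer is used only for retrieval and is never shown to the user.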

PDF Table Extraction Failures

15% of the document corpus is PDFs exported from Google Docs. PDF parsing with standard libraries (PyMuPDF, pdfplumber) loses table formatting — what was a structured comparison table becomes unparseable concatenated text. The embedding model can't recover semantic structure from jumbled table cells.

Fix: route PDFs through a vision-capable model for table pages. Detect table regions (most PDF parsing libraries return bounding boxes), send the rendered page image to Claude or GPT-4o Vision, get back structured markdown. Expensive per-page but only needed for table-heavy pages.

What We'd Do Differently
