How I'd Build Document Intelligence
Parsing PDFs, tables, and images; chunking strategies for structured documents; extraction pipelines with citations.
Document intelligence is harder than it looks. PDFs aren't text files — they're a layout format. Tables, headers, footers, multi-column layouts, embedded images, and scanned pages all require different handling. Here's the pipeline I'd build.
Parsing: Don't Use PyPDF2
PyPDF2 (now maintained as pypdf) and basic pdfplumber text extraction work for simple text-only PDFs. For anything else (forms, tables, scanned pages, multi-column layouts) use a purpose-built parser: pdfminer.six for layout-aware extraction, Camelot or pdfplumber's table extraction for tables, Docling or Unstructured.io for mixed content. Scanned PDFs need OCR: Tesseract is free, while AWS Textract and Azure Document Intelligence tend to do better on complex layouts.
| Document Type | Parser | Table Extraction | Notes |
|---|---|---|---|
| Text PDF | pdfplumber | Camelot | Fast, accurate for clean PDFs |
| Scanned PDF | AWS Textract / Azure DI | Built-in | OCR quality matters |
| Mixed content | Unstructured.io / Docling | Built-in | Best for mixed layouts |
| Word/Excel | python-docx / openpyxl | Native | Preserve structure |
| HTML | BeautifulSoup | pandas.read_html | Clean semantic structure |
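One way to turn the table above into code is a small router that picks a parser family per document. This is a minimal sketch, not a definitive implementation: the function name `choose_parser`, the 50-characters-per-page threshold, and the returned labels are assumptions for illustration, and `page_texts` is assumed to come from a cheap first-pass text extraction.

```python
from pathlib import Path

# Assumed threshold: pages with fewer extracted characters than this are
# treated as image-only. Tune it on your own documents.
MIN_CHARS_PER_PAGE = 50

# Non-PDF formats map directly to a parser family (labels are illustrative).
PARSER_BY_SUFFIX = {
    ".docx": "python-docx",
    ".xlsx": "openpyxl",
    ".html": "BeautifulSoup",
    ".htm": "BeautifulSoup",
}


def choose_parser(filename: str, page_texts: list[str]) -> str:
    """Pick a parser family from the file type and a first-pass text extract."""
    suffix = Path(filename).suffix.lower()
    if suffix in PARSER_BY_SUFFIX:
        return PARSER_BY_SUFFIX[suffix]
    if suffix != ".pdf":
        raise ValueError(f"unsupported file type: {suffix or filename}")

    # Count pages whose text layer is effectively empty (likely scans).
    sparse = sum(1 for t in page_texts if len(t.strip()) < MIN_CHARS_PER_PAGE)
    if not page_texts or sparse == len(page_texts):
        return "ocr"    # no usable text layer: Textract, Azure DI, or Tesseract
    if sparse > 0:
        return "mixed"  # some pages lack text: Unstructured.io or Docling
    return "text"       # clean text layer: pdfplumber, with Camelot for tables


print(choose_parser("report.pdf", ["x" * 500, "y" * 800]))
print(choose_parser("scan.pdf", ["", "  "]))
```

A character-count heuristic is crude (a scanned page with a stamped header can slip through), so in practice I'd log the routing decision next to each document and review the borderline cases.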
Chunking Strategy for Structured Documents
Don't apply generic fixed-size chunking to structured documents. Respect the document's structure: chunk at section boundaries (identified by the heading hierarchy), never split a table across chunks, and keep figure captions with their figure or table. For financial reports and legal documents, the section hierarchy is itself semantic information, and losing it degrades retrieval quality significantly.
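The structure-aware approach can be sketched as follows. This assumes the parser has already produced a flat list of typed blocks; the `Block` and `Chunk` classes, the `chunk_by_section` name, and the `max_chars` limit are hypothetical, and caption handling is left out to keep the example short.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Block:
    kind: str       # "heading", "paragraph", or "table"
    text: str
    level: int = 0  # heading depth (1 = top level); unused for other kinds


@dataclass
class Chunk:
    heading_path: list[str]
    parts: list[str] = field(default_factory=list)

    @property
    def text(self) -> str:
        return "\n\n".join(self.parts)


def chunk_by_section(blocks: list[Block], max_chars: int = 1500) -> list[Chunk]:
    """Start a new chunk at every heading; split long sections between blocks only."""
    chunks: list[Chunk] = []
    path: list[str] = []
    current: Optional[Chunk] = None
    size = 0

    for block in blocks:
        if block.kind == "heading":
            # Keep the ancestors above this heading's level, then append it.
            path = path[: block.level - 1] + [block.text]
            current = None  # the next content block opens a new chunk
            continue
        too_big = current is not None and size + len(block.text) > max_chars
        if current is None or too_big:
            current = Chunk(heading_path=list(path))
            chunks.append(current)
            size = 0
        # A block (including a whole table) is never split, even if it
        # exceeds max_chars on its own.
        current.parts.append(block.text)
        size += len(block.text)
    return chunks


doc = [
    Block("heading", "Financial Results", level=1),
    Block("heading", "Margins", level=2),
    Block("paragraph", "Gross margin improved year over year."),
    Block("table", "| Metric | Q3 2024 |\n|---|---|\n| Gross Margin | 14.2% |"),
    Block("heading", "Risks", level=2),
    Block("paragraph", "Currency exposure remains the main risk."),
]
for c in chunk_by_section(doc):
    print(" > ".join(c.heading_path), "|", len(c.text), "chars")
```

Carrying the full heading path on each chunk means it can be prepended to the text before embedding, so a chunk from "Margins" still knows it lives under "Financial Results".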
Handling Tables
Tables are the hardest part. A table cell that says '14.2%' means nothing without the row label ('Gross Margin') and column header ('Q3 2024'). Strategies: serialize tables to Markdown before embedding (preserves structure, works well with LLMs), store tables separately in a structured store and query them with SQL or pandas, or use a multimodal model that can 'see' the table as an image.
Citations and Provenance
Every extracted claim must be traceable. Store with each chunk: document ID, page number, section heading, and character offset. When generating answers, require the LLM to cite sources inline and verify each citation against the stored chunks. This is the difference between a demo and a production document intelligence system.
```python
# Chunk with provenance metadata.
# `parsed_doc` and `doc_id` are assumed to come from the parsing step.
chunks = []
for section in parsed_doc.sections:
    chunk_text = section.text
    chunks.append({
        "text": chunk_text,
        "metadata": {
            "doc_id": doc_id,
            "page": section.page_number,
            "section": section.heading,
            "char_offset": section.start_char,
        },
    })
```
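The verification half could look roughly like this. It is a sketch under stated assumptions: the `[chunk_id]` citation marker, the `verify_citations` name, and the chunk IDs are hypothetical, and token overlap is only a crude lexical stand-in for a real support check (an entailment model or human review would be stronger).

```python
import re

CITATION = re.compile(r"\[(\w+)\]")


def _tokens(text: str) -> set[str]:
    # Numbers (with optional decimals and %) and lowercase words.
    return set(re.findall(r"\d+(?:\.\d+)?%?|[a-z]+", text.lower()))


def verify_citations(answer: str, chunks: dict[str, dict],
                     min_overlap: float = 0.5) -> list[tuple[str, str]]:
    """Return (sentence, problem) pairs; an empty list means every check passed."""
    problems = []
    # Naive sentence split: ., !, or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        if not sentence:
            continue
        cited_ids = CITATION.findall(sentence)
        if not cited_ids:
            problems.append((sentence, "no citation"))
            continue
        claim = _tokens(CITATION.sub("", sentence))
        for cid in cited_ids:
            chunk = chunks.get(cid)
            if chunk is None:
                problems.append((sentence, f"unknown chunk {cid}"))
                continue
            overlap = len(claim & _tokens(chunk["text"])) / max(len(claim), 1)
            if overlap < min_overlap:
                problems.append((sentence, f"weak support from {cid}"))
    return problems


# Chunk store keyed by a hypothetical chunk ID; values follow the
# text-plus-metadata shape used above.
chunks = {
    "c1": {
        "text": "Gross Margin for Q3 2024 was 14.2%, up from 13.8%.",
        "metadata": {"doc_id": "report-1", "page": 12,
                     "section": "Margins", "char_offset": 4810},
    },
}
answer = ("Gross margin was 14.2% in Q3 2024 [c1]. "
          "Revenue doubled [c9]. Headcount grew.")
problems = verify_citations(answer, chunks)
for sentence, problem in problems:
    print(f"{problem}: {sentence}")
```

Anything this check flags should be regenerated or dropped before the answer reaches the user; the metadata on the cited chunk is what lets the UI link back to the page and section.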
Eval: What to Measure
- Extraction accuracy: Do key fields (dates, amounts, names) extract correctly? Test on 50+ labeled documents.
- Citation precision: Are cited sources actually the source of the claim? Human eval sample.
- Table QA accuracy: Ask factual questions about tables, verify against ground truth.
- Hallucination rate: Does the system answer questions not answerable from the document? Should refuse, not hallucinate.
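Two of these metrics are simple enough to sketch directly. The function names, the `NOT_IN_DOCUMENT` refusal marker, and the sample records are assumptions for illustration; exact string match is the strictest possible scoring, and real pipelines often normalize dates and amounts first.

```python
def field_accuracy(predicted: list[dict], gold: list[dict],
                   fields: list[str]) -> dict[str, float]:
    """Exact-match accuracy per field across labeled documents."""
    if not gold or len(predicted) != len(gold):
        raise ValueError("predicted and gold must be non-empty and equal length")
    return {
        f: sum(p.get(f) == g.get(f) for p, g in zip(predicted, gold)) / len(gold)
        for f in fields
    }


def hallucination_rate(answers: list[str],
                       refusal_marker: str = "NOT_IN_DOCUMENT") -> float:
    """Share of unanswerable questions that got an answer instead of a refusal."""
    if not answers:
        raise ValueError("answers must be non-empty")
    answered = sum(refusal_marker not in a for a in answers)
    return answered / len(answers)


# Made-up labeled examples for illustration.
gold = [{"date": "2024-09-30", "amount": "1200.00"},
        {"date": "2024-06-30", "amount": "980.50"}]
pred = [{"date": "2024-09-30", "amount": "1200.00"},
        {"date": "2024-06-30", "amount": "985.00"}]
accuracy = field_accuracy(pred, gold, ["date", "amount"])
print(accuracy)

# Responses to questions that the document cannot answer.
responses = ["NOT_IN_DOCUMENT", "The CEO is Jane Doe.",
             "NOT_IN_DOCUMENT", "NOT_IN_DOCUMENT"]
rate = hallucination_rate(responses)
print(rate)
```

Reporting accuracy per field rather than as one number makes it obvious when, say, dates extract cleanly but amounts do not.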