
How I'd Build Document Intelligence

Parsing PDFs, tables, and images; chunking strategies for structured documents; extraction pipelines with citations.

Document intelligence is harder than it looks. PDFs aren't text files; they're a layout format that records where glyphs sit on a page, not the order a reader would follow. Tables, headers, footers, multi-column layouts, embedded images, and scanned pages all require different handling. Here's the pipeline I'd build.

Parsing: Don't Use PyPDF2

PyPDF2 (now maintained under the name pypdf) is fine for simple, text-only PDFs, and pdfplumber adds layout-aware text and basic table extraction on top of pdfminer.six. For anything harder, such as forms, complex tables, scanned pages, or multi-column layouts, use a purpose-built parser: pdfminer.six for layout-aware extraction, Camelot or pdfplumber for tables, Docling or Unstructured.io for mixed content. Scanned PDFs need OCR: Tesseract is free, while AWS Textract and Azure Document Intelligence tend to handle complex layouts better.

Document Type	Parser	Table Extraction	Notes
Text PDF	pdfplumber	Camelot	Fast, accurate for clean PDFs
Scanned PDF	AWS Textract / Azure DI	Built-in	OCR quality matters
Mixed content	Unstructured.io / Docling	Built-in	Best for mixed layouts
Word/Excel	python-docx / openpyxl	Native	Preserve structure
HTML	BeautifulSoup	pandas.read_html	Clean semantic structure
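
The routing in the table above can be sketched as a small lookup. This is a minimal sketch, not a definitive implementation: `ROUTES`, `classify`, and `choose_parser` are hypothetical names, and the `has_text_layer` and `has_complex_layout` flags are assumed to come from a cheap pre-check you run yourself (for example, whether the first pages yield any extractable text).

```python
from pathlib import Path

# Hypothetical routing table mirroring the one above: (parser, table extraction).
ROUTES = {
    "text_pdf": ("pdfplumber", "Camelot"),
    "scanned_pdf": ("AWS Textract / Azure DI", "built-in"),
    "mixed_pdf": ("Unstructured.io / Docling", "built-in"),
    "docx": ("python-docx", "native"),
    "xlsx": ("openpyxl", "native"),
    "html": ("BeautifulSoup", "pandas.read_html"),
}


def classify(path, has_text_layer=True, has_complex_layout=False):
    """Map a file to a key in ROUTES using its extension and two assumed flags."""
    ext = Path(path).suffix.lower()
    if ext == ".pdf":
        if not has_text_layer:
            return "scanned_pdf"  # no text layer means OCR is required
        return "mixed_pdf" if has_complex_layout else "text_pdf"
    if ext == ".docx":
        return "docx"
    if ext == ".xlsx":
        return "xlsx"
    if ext in (".html", ".htm"):
        return "html"
    raise ValueError(f"unsupported document type: {ext or path}")


def choose_parser(path, **flags):
    """Return the (parser, table extraction) pair suggested for this file."""
    return ROUTES[classify(path, **flags)]
```

Keeping the decision in one place makes it easy to log which route each document took, which helps when extraction quality varies by document type.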

Chunking Strategy for Structured Documents

Don't apply fixed-size chunking to structured documents. Respect the document's structure instead: chunk at section boundaries (identified by the heading hierarchy), never split a table across chunks, and keep each caption with its figure or table. For financial reports and legal documents, the section hierarchy is itself semantic information, and losing it degrades retrieval quality significantly.
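
As a rough illustration of those rules, the sketch below assumes the parser has already emitted Markdown-like text with `#` headings and blank lines between blocks. `chunk_by_structure` and `max_chars` are hypothetical names. Oversized sections are split only between blocks, so a table (and a caption on an adjacent line) always stays in one chunk, even if that chunk exceeds the limit.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_structure(markdown_text, max_chars=1200):
    """Split at headings; split oversized sections only between blank-line-separated blocks."""
    chunks = []
    path = []     # current heading hierarchy, e.g. ["Report", "Margins"]
    blocks = []   # finished blocks (paragraphs, tables) of the current section
    current = []  # lines of the block being read

    def end_block():
        if current:
            blocks.append("\n".join(current))
            current.clear()

    def end_section():
        end_block()
        buf, size = [], 0
        for block in blocks:
            # Start a new chunk when adding this block would exceed the limit.
            if buf and size + len(block) > max_chars:
                chunks.append({"section_path": list(path), "text": "\n\n".join(buf)})
                buf, size = [], 0
            buf.append(block)
            size += len(block)
        if buf:
            chunks.append({"section_path": list(path), "text": "\n\n".join(buf)})
        blocks.clear()

    for line in markdown_text.splitlines():
        match = HEADING.match(line)
        if match:
            end_section()  # flush the previous section under its own heading path
            level = len(match.group(1))
            del path[level - 1:]
            path.append(match.group(2).strip())
        elif line.strip():
            current.append(line)
        else:
            end_block()
    end_section()
    return chunks
```

Storing the full `section_path` with each chunk, rather than only the nearest heading, keeps the hierarchy available for filtering and for display in citations.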

Handling Tables

Tables are the hardest part. A table cell that says '14.2%' means nothing without the row label ('Gross Margin') and column header ('Q3 2024'). There are three workable strategies: serialize tables to Markdown before embedding (preserves structure, works well with LLMs), store tables separately in a structured store and query them with SQL or pandas, or use a multimodal model that can 'see' the table as an image.
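
The first two strategies can be sketched with the standard library alone. This is a minimal sketch under assumptions: the table has already been extracted into a header list and row lists, the first column holds the row label, and `table_to_markdown`, `load_table`, and the `cells` schema are hypothetical names chosen for illustration.

```python
import sqlite3


def table_to_markdown(header, rows):
    """Serialize a table to Markdown so headers travel with the values."""
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in rows]
    return "\n".join(lines)


def load_table(conn, header, rows):
    """Store the table in long form: one record per cell, with its row label and column header."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cells (row_label TEXT, column_header TEXT, value TEXT)"
    )
    conn.executemany(
        "INSERT INTO cells VALUES (?, ?, ?)",
        [(row[0], col, val) for row in rows for col, val in zip(header[1:], row[1:])],
    )


# The single cell from the example above.
header = ["Metric", "Q3 2024"]
rows = [["Gross Margin", "14.2%"]]

markdown = table_to_markdown(header, rows)

conn = sqlite3.connect(":memory:")
load_table(conn, header, rows)
value = conn.execute(
    "SELECT value FROM cells WHERE row_label = ? AND column_header = ?",
    ("Gross Margin", "Q3 2024"),
).fetchone()[0]
```

The long form trades compactness for lookups that never lose context: every value is addressed by the same two labels a reader would use.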

Citations and Provenance

Every extracted claim must be traceable. Store with each chunk: document ID, page number, section heading, and character offset. When generating answers, require the LLM to cite sources inline and verify each citation against the stored chunks. This is the difference between a demo and a production document intelligence system.

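# Chunk with provenance metadata.
# Assumes `parsed_doc` comes from your parser and exposes `.sections`, each with
# `.text`, `.page_number`, `.heading`, and `.start_char`; `doc_id` is your own stable ID.
chunks = []
for section in parsed_doc.sections:
    chunks.append({
        "text": section.text,
        "metadata": {
            "doc_id": doc_id,                   # which document the chunk came from
            "page": section.page_number,        # page number reported by the parser
            "section": section.heading,         # heading the chunk sits under
            "char_offset": section.start_char,  # character offset of the section start
        },
    })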
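
Verifying citations against those stored chunks could look like the sketch below. The inline format `[doc_id:pN]` and the name `verify_citations` are assumptions for illustration, not something the pipeline above prescribes. The check is deliberately narrow: it confirms that each citation resolves to a stored chunk, not that the chunk actually supports the claim, which still needs the human review described in the eval section.

```python
import re

# Assumed inline citation format: [doc_id:pN], e.g. [10q-2024:p12].
CITATION = re.compile(r"\[([\w-]+):p(\d+)\]")


def verify_citations(answer, chunks):
    """Return (valid, invalid) lists of (doc_id, page) pairs cited in the answer."""
    known = {(c["metadata"]["doc_id"], c["metadata"]["page"]) for c in chunks}
    valid, invalid = [], []
    for doc_id, page in CITATION.findall(answer):
        key = (doc_id, int(page))
        (valid if key in known else invalid).append(key)
    return valid, invalid


# Made-up sample data for illustration only.
stored = [{
    "text": "Gross margin was 14.2% in Q3 2024.",
    "metadata": {"doc_id": "10q-2024", "page": 12, "section": "Margins", "char_offset": 0},
}]
answer = "Gross margin was 14.2% [10q-2024:p12]. Revenue grew [10q-2024:p99]."
valid, invalid = verify_citations(answer, stored)
```

An answer with any entry in `invalid` can be rejected or regenerated before it reaches the user.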

Eval: What to Measure

Extraction accuracy: Do key fields (dates, amounts, names) extract correctly? Test on 50+ labeled documents.
Citation precision: Does each cited chunk actually support the claim it is attached to? Check a sample by hand.
Table QA accuracy: Ask factual questions about tables and verify the answers against ground truth.
Hallucination rate: Does the system answer questions that can't be answered from the document? It should refuse rather than invent an answer.
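
These metrics reduce to simple ratios once the labels exist. The sketch below is one possible scoring harness, with hypothetical function names and an assumed refusal marker (`NOT_IN_DOCUMENT`) that the system would be prompted to emit when a question can't be answered from the document.

```python
def extraction_accuracy(predicted, expected):
    """Fraction of labeled fields whose extracted value matches the label exactly."""
    if not expected:
        return 0.0
    correct = sum(1 for field, value in expected.items() if predicted.get(field) == value)
    return correct / len(expected)


def citation_precision(judgments):
    """judgments: booleans from human review, True if the cited chunk supports the claim."""
    return sum(judgments) / len(judgments) if judgments else 0.0


def hallucination_rate(responses, refusal_marker="NOT_IN_DOCUMENT"):
    """Share of unanswerable questions that received an answer instead of a refusal."""
    if not responses:
        return 0.0
    answered = sum(1 for response in responses if refusal_marker not in response)
    return answered / len(responses)
```

Exact match is a strict choice for extraction accuracy; dates and amounts usually need normalizing to one format before comparison, or the score will understate real quality.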
