AI Engineering 11 min read

How I'd Build Document Intelligence

Parsing PDFs, tables, and images; chunking strategies for structured documents; extraction pipelines with citations.

Document intelligence is harder than it looks. PDFs aren't text files — they're a layout format. Tables, headers, footers, multi-column layouts, embedded images, and scanned pages all require different handling. Here's the pipeline I'd build.

Parsing: Don't Use PyPDF2

PyPDF2 (now maintained as pypdf) and pdfplumber's plain text extraction work for simple, text-only PDFs. For anything else (forms, tables, scanned pages, multi-column layouts), use a purpose-built parser: pdfminer.six (the maintained fork of PDFMiner) for layout-aware extraction, Camelot or pdfplumber's table API for tables, and Docling or Unstructured.io for mixed content. Scanned PDFs need OCR: Tesseract is free, while AWS Textract and Azure Document Intelligence are better for complex layouts.

| Document Type | Parser | Table Extraction | Notes |
| --- | --- | --- | --- |
| Text PDF | pdfplumber | Camelot | Fast, accurate for clean PDFs |
| Scanned PDF | AWS Textract / Azure DI | Built-in | OCR quality matters |
| Mixed content | Unstructured.io / Docling | Built-in | Best for mixed layouts |
| Word/Excel | python-docx / openpyxl | Native | Preserve structure |
| HTML | BeautifulSoup | pandas.read_html | Clean semantic structure |
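
The table above can be turned into a small routing function. This is a minimal sketch, not a definitive implementation: `ROUTES`, `choose_parser`, and the flags `has_text_layer` and `mixed_layout` are hypothetical names, and the function only returns tool names as strings. Detecting whether a PDF has a text layer or a mixed layout is left to the caller.

```python
from __future__ import annotations

from pathlib import Path

# Hypothetical routing table mirroring the article's table.
# Values are (parser, table extractor) names, not imported libraries.
ROUTES = {
    "text_pdf": ("pdfplumber", "Camelot"),
    "scanned_pdf": ("AWS Textract / Azure DI", "built-in"),
    "mixed": ("Unstructured.io / Docling", "built-in"),
    ".docx": ("python-docx", "native"),
    ".xlsx": ("openpyxl", "native"),
    ".html": ("BeautifulSoup", "pandas.read_html"),
}


def choose_parser(
    filename: str,
    has_text_layer: bool = True,
    mixed_layout: bool = False,
) -> tuple[str, str]:
    """Return (parser, table_extractor) names for a file.

    The two flags are assumptions: the caller decides them, for example
    by checking whether a quick text extraction returns any characters.
    """
    ext = Path(filename).suffix.lower()
    if ext == ".pdf":
        if not has_text_layer:
            key = "scanned_pdf"
        elif mixed_layout:
            key = "mixed"
        else:
            key = "text_pdf"
    elif ext in (".htm", ".html"):
        key = ".html"
    else:
        key = ext
    if key not in ROUTES:
        raise ValueError(f"No parser configured for {filename!r}")
    return ROUTES[key]
```

Failing loudly on unknown file types is a deliberate choice here: silently falling back to a plain-text parser tends to hide extraction problems until retrieval quality drops.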

Chunking Strategy for Structured Documents

Don't apply generic (for example, fixed-size) chunking to structured documents. Respect the document's structure: chunk at section boundaries (identified by the heading hierarchy), never split a table across chunks, and keep each caption with the figure or table it describes. For financial reports and legal documents, the section hierarchy is itself semantic information, and losing it degrades retrieval quality significantly.
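
A rough sketch of section-boundary chunking follows. It assumes the parser has already produced a flat list of typed blocks; the `Block` and `Chunk` classes and the `chunk_by_section` function are hypothetical names for illustration, not any particular library's API. Chunks break only at headings, so a table and its caption are never separated from each other or split internally.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Block:
    kind: str       # "heading", "paragraph", "table", or "caption"
    text: str
    level: int = 0  # heading depth (1 = top level); unused for other kinds


@dataclass
class Chunk:
    heading_path: list[str]  # e.g. ["Financials", "Revenue"]
    blocks: list[Block] = field(default_factory=list)

    @property
    def text(self) -> str:
        return "\n\n".join(b.text for b in self.blocks)


def chunk_by_section(blocks: list[Block]) -> list[Chunk]:
    """Group blocks into one chunk per section, keeping the heading path."""
    chunks: list[Chunk] = []
    path: list[str] = []
    current: Chunk | None = None
    for block in blocks:
        if block.kind == "heading":
            # Drop headings at the same or deeper level, then add this one.
            path = path[: block.level - 1] + [block.text]
            current = Chunk(heading_path=list(path))
            chunks.append(current)
        else:
            if current is None:
                # Content before the first heading gets an empty path.
                current = Chunk(heading_path=[])
                chunks.append(current)
            current.blocks.append(block)
    # Headings with no body of their own survive only in child paths.
    return [c for c in chunks if c.blocks]
```

This sketch has no size limit. If a section exceeds the embedding model's input limit, one option is to split between blocks (never inside one) and copy `heading_path` onto each piece.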

Handling Tables

Tables are the hardest part. A table cell that says '14.2%' means nothing without the row label ('Gross Margin') and column header ('Q3 2024'). Strategies: serialize tables to Markdown before embedding (preserves structure, works well with LLMs), store tables separately in a structured store and query them with SQL or pandas, or use a multimodal model that can 'see' the table as an image.
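
The first strategy can be sketched in a few lines. This assumes the table has already been extracted into a header row plus data rows of strings, with the row label in the first column. Both function names are hypothetical. The second function shows one possible way to keep each value attached to its row label and column header.

```python
from __future__ import annotations


def _escape(cell: str) -> str:
    # A literal pipe inside a cell would otherwise break the Markdown row.
    return cell.replace("|", "\\|")


def table_to_markdown(header: list[str], rows: list[list[str]]) -> str:
    """Serialize an extracted table to a Markdown pipe table."""
    lines = [
        "| " + " | ".join(_escape(h) for h in header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(_escape(c) for c in row) + " |")
    return "\n".join(lines)


def cells_as_sentences(header: list[str], rows: list[list[str]]) -> list[str]:
    """Emit one self-describing line per cell: 'row label, column: value'."""
    out = []
    for row in rows:
        label = row[0]
        for column, value in zip(header[1:], row[1:]):
            out.append(f"{label}, {column}: {value}")
    return out


header = ["Metric", "Q3 2024"]
rows = [["Gross Margin", "14.2%"]]
print(table_to_markdown(header, rows))
print(cells_as_sentences(header, rows))
```

The per-cell form is more verbose, but each line stays meaningful even if it is retrieved on its own.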

Citations and Provenance

Every extracted claim must be traceable. Store with each chunk: document ID, page number, section heading, and character offset. When generating answers, require the LLM to cite sources inline and verify each citation against the stored chunks. This is the difference between a demo and a production document intelligence system.

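```python
# Chunk with provenance metadata.
# Assumes `parsed_doc` is your parser's output, exposing `sections` whose items
# have `text`, `page_number`, `heading`, and `start_char`, and that `doc_id`
# is a stable identifier for the source document.
chunks = []
for section in parsed_doc.sections:
    chunks.append({
        "text": section.text,
        "metadata": {
            "doc_id": doc_id,
            "page": section.page_number,
            "section": section.heading,
            "char_offset": section.start_char,
        },
    })
```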
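
Citation verification against the stored chunks might look like the sketch below. It assumes the LLM is prompted to cite in the form `[doc_id:p<page>]`, which is a convention invented for this example, and it reuses the chunk dictionaries built above. `verify_citations` is a hypothetical helper.

```python
from __future__ import annotations

import re

# Assumed inline citation format: [doc_id:p<page>], e.g. [10k-2024:p12]
CITATION = re.compile(r"\[([\w.-]+):p(\d+)\]")


def verify_citations(answer: str, chunks: list[dict]) -> dict:
    """Split an answer's citations into those that match a stored chunk
    and those that point at nothing we indexed."""
    known = {
        (c["metadata"]["doc_id"], c["metadata"]["page"]) for c in chunks
    }
    cited = [(doc, int(page)) for doc, page in CITATION.findall(answer)]
    return {
        "valid": [c for c in cited if c in known],
        "invalid": [c for c in cited if c not in known],
    }
```

This only confirms that a citation points at a real chunk. Checking that the chunk actually supports the claim is a separate and harder step, for example a string match on quoted figures or a second model pass.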

Eval: What to Measure

