Multimodal in Production: Latency, Cost, and Deployment Realities
Vision tokens are expensive — GPT-4V charges per tile. The real latency and cost profile of multimodal calls vs. text-only, how to pre-process images to cut costs, and the caching strategies that matter for production workloads.
Multimodal models have a fundamentally different cost and latency profile from text-only LLMs. A text query of 500 tokens costs a fraction of what a 500-token query with a high-resolution image costs — because images expand into hundreds of additional tokens that go through the full attention stack. Understanding this is critical before you design a multimodal system.
The Token Cost of Images
GPT-4V bills images by 512×512 tile. As a rough guide, a 512×512 image is 1 tile, or 85 tokens. A 1024×1024 image is 4 tiles: 340 tokens plus an 85-token base, for 425 tokens. A 2048×2048 image is about 16 tiles, or roughly 1,360 tokens. At GPT-4o pricing ($5 per 1M input tokens), a single high-resolution document page costs $0.01–0.02 in image tokens alone, before you write a word of prompt or receive a response. Exact counts depend on the provider's resizing rules and detail setting, so treat these figures as estimates.
| Image Size | GPT-4V Tiles | Approx Tokens | Cost @ $5/1M |
|---|---|---|---|
| 512×512 (low) | 1 | ~85 | $0.0004 |
| 1024×1024 | 4 + base | ~425 | $0.002 |
| 1792×1024 | 7 | ~595 | $0.003 |
| 2048×2048 (high) | ~16 | ~1360 | $0.007 |
| 4096×4096 (max) | ~64 | ~5440 | $0.027 |
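The arithmetic behind these estimates can be sketched in a few lines. This is a minimal sketch, not a provider's billing formula: it assumes a simplified rule of 85 tokens per 512-pixel tile with an optional base charge and no server-side resizing. The function names and parameters (`estimate_image_tokens`, `tokens_per_tile`, `base_tokens`) are hypothetical; set them from your provider's published pricing rules, which typically include a resize step before tiling.

```python
import math


def estimate_image_tokens(width, height, tile_px=512,
                          tokens_per_tile=85, base_tokens=0):
    """Rough estimate: count the tiles needed to cover the image.

    Assumption: no resizing is applied before tiling. Real providers
    usually resize first, so treat the result as an upper-level guess.
    """
    tiles = math.ceil(width / tile_px) * math.ceil(height / tile_px)
    return tiles * tokens_per_tile + base_tokens


def estimate_image_cost(tokens, usd_per_million=5.0):
    """Convert a token count to dollars at a given input-token price."""
    return tokens * usd_per_million / 1_000_000


if __name__ == "__main__":
    for side in (512, 2048, 4096):
        tokens = estimate_image_tokens(side, side)
        cost = estimate_image_cost(tokens)
        print(f"{side}x{side}: ~{tokens} tokens, ~${cost:.4f}")
```

Running an estimator like this over a sample of real uploads, before launch, gives a first-order budget that can later be checked against the token counts the API actually reports.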
Latency Profile
Multimodal calls add two sources of latency compared with text-only calls: image preprocessing (resize, tile, encode) and attention over the extra tokens. With 576–2048 visual tokens added to the prompt, the prefill step (processing the input tokens before the first output token is generated) does noticeably more work, so time-to-first-token (TTFT) on streaming responses is worse than for an equivalent text-only call.
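To compare TTFT between text-only and multimodal calls, it helps to measure it the same way for both. The sketch below times any iterable of streamed chunks; `measure_stream` and `fake_stream` are hypothetical names, and the fake stream only simulates prefill with a sleep. In a real system you would pass in the chunk iterator returned by your client library instead.

```python
import time
from typing import Iterable, Iterator


def measure_stream(chunks: Iterable[str]) -> dict:
    """Consume a stream and record time-to-first-token and total time."""
    start = time.perf_counter()
    ttft = None
    pieces = []
    for chunk in chunks:
        if ttft is None:
            # First chunk arrived: everything before this is queueing + prefill.
            ttft = time.perf_counter() - start
        pieces.append(chunk)
    total = time.perf_counter() - start
    return {"ttft_s": ttft, "total_s": total, "text": "".join(pieces)}


def fake_stream(prefill_delay_s: float) -> Iterator[str]:
    """Stand-in for a model stream; the sleep simulates prefill work."""
    time.sleep(prefill_delay_s)
    yield "Hello"
    yield ", world"


if __name__ == "__main__":
    for label, delay in (("text-only", 0.02), ("with image", 0.10)):
        result = measure_stream(fake_stream(delay))
        print(f"{label}: TTFT {result['ttft_s'] * 1000:.0f} ms")
```

Logging TTFT separately from total latency makes it easier to tell whether a slowdown comes from larger inputs (prefill) or longer outputs (decode).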
Cost Reduction Strategies
- Downscale before sending: Most tasks (document classification, simple Q&A over text-heavy pages) don't need high resolution. Sending at 512px instead of 2048px cuts token cost by 16×. Always benchmark accuracy at each resolution before assuming high-res is needed.
- Pre-extract text where possible: If a PDF page is native text (not scanned), extract the text with a PDF parser and send it as text tokens. Text is dramatically cheaper than image tokens for the same information.
- Route by content type: Use a cheap classifier to decide whether a document page needs multimodal processing (has a chart or diagram) or can be handled as text-only (pure text page). Most enterprise PDFs are >80% text-only pages.
- Cache visual embeddings: If the same image is queried multiple times (for example, in a product catalog), cache the image encoding rather than reprocessing it each time.
- Use smaller models for simpler tasks: Phi-3-Vision (3.8B) and LLaVA-1.6-7B are competitive with GPT-4V for structured extraction tasks at 50–100× lower cost.
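Two of the strategies above, downscaling and caching, can be sketched without any image library. The helper below only computes the target dimensions; the actual resize would be done with whatever imaging tool you already use. The cache is keyed by a SHA-256 hash of the image bytes and stores whatever the supplied `compute` callable returns (an encoding or a model response). All names here (`downscale_size`, `ImageResultCache`, `get_or_compute`) are hypothetical, and the in-memory dict is an assumption standing in for a real cache backend.

```python
import hashlib


def downscale_size(width, height, max_side=512):
    """Target (width, height) with the longest side capped, aspect ratio kept."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # never upscale
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


class ImageResultCache:
    """In-memory cache keyed by the SHA-256 digest of the image bytes."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, image_bytes, compute):
        key = hashlib.sha256(image_bytes).hexdigest()
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            # compute() is the expensive step: encoding or a model call.
            self._store[key] = compute(image_bytes)
        return self._store[key]


if __name__ == "__main__":
    print(downscale_size(2048, 1024))
    cache = ImageResultCache()
    for _ in range(3):
        cache.get_or_compute(b"same-image-bytes", lambda b: len(b))
    print(f"hits={cache.hits} misses={cache.misses}")
```

Hashing the bytes rather than the filename means re-uploads of an identical image hit the cache, while an edited image with the same name does not.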
Self-Hosting vs. API
For high-volume multimodal workloads, self-hosting an open-weight model (LLaVA-1.6-34B, InternVL2-26B) on owned GPUs can be 20–50× cheaper than the GPT-4V API at scale. The break-even point is roughly 50k–100k multimodal calls per month, depending on your GPU cost and model choice. Below that volume, the API is cheaper once you factor in engineering and infrastructure cost.
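The break-even volume follows from a simple comparison: self-hosting carries a fixed monthly cost plus a small marginal cost per call, while the API has only a per-call cost. A minimal sketch, with every dollar figure in the example being a hypothetical placeholder rather than a measured price:

```python
import math


def break_even_calls(fixed_monthly_usd, api_usd_per_call,
                     self_hosted_usd_per_call):
    """Monthly call volume above which self-hosting is cheaper.

    Returns None when self-hosting never pays off, i.e. its marginal
    cost per call is not lower than the API's.
    """
    saving_per_call = api_usd_per_call - self_hosted_usd_per_call
    if saving_per_call <= 0:
        return None
    return math.ceil(fixed_monthly_usd / saving_per_call)


if __name__ == "__main__":
    # Hypothetical inputs: $2,000/month for GPUs and upkeep,
    # $0.03 per API call, $0.001 marginal cost per self-hosted call.
    print(break_even_calls(2000, 0.03, 0.001))
```

The fixed cost should include engineering time and on-call burden, not just GPU rental, since those are what push the break-even point upward for small teams.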
Monitoring Multimodal Systems
- Track visual hallucination rate on a sample of outputs — spot check a few hundred responses weekly against the source images.
- Monitor resolution distribution of incoming images — alert when users start sending higher-res images than your cost model assumed.
- Track per-call token counts broken down by image vs. text — image token cost spikes are often caused by a single user uploading large images.
- Add a response validation step for structured extraction tasks (table → JSON) — validate schema compliance and numeric plausibility before returning to the user.
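The validation step in the last item can be sketched as a function that checks parsed model output against required fields and plausible numeric ranges. This is one possible shape under stated assumptions: the output is a JSON list of row objects, and the field names (`item`, `qty`) and range limits in the example are hypothetical.

```python
import json


def validate_rows(raw_json, required, ranges):
    """Return a list of problems; an empty list means the output passed.

    required: field name -> expected type (or tuple of types)
    ranges:   field name -> (low, high) inclusive bounds for numbers
    """
    try:
        rows = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(rows, list):
        return ["top-level value must be a list of row objects"]

    problems = []
    for i, row in enumerate(rows):
        if not isinstance(row, dict):
            problems.append(f"row {i}: not an object")
            continue
        for key, expected in required.items():
            if key not in row:
                problems.append(f"row {i}: missing '{key}'")
                continue
            value = row[key]
            # bool is a subclass of int, so reject it unless bool was asked for.
            wrong_bool = isinstance(value, bool) and expected is not bool
            if wrong_bool or not isinstance(value, expected):
                problems.append(f"row {i}: '{key}' has wrong type")
        for key, (low, high) in ranges.items():
            value = row.get(key)
            is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
            if is_number and not (low <= value <= high):
                problems.append(f"row {i}: '{key}'={value} outside [{low}, {high}]")
    return problems


if __name__ == "__main__":
    schema = {"item": str, "qty": (int, float)}
    limits = {"qty": (0, 10_000)}
    output = '[{"item": "bolt", "qty": -3}, {"qty": 5}]'
    for problem in validate_rows(output, schema, limits):
        print(problem)
```

Returning a list of problems rather than raising on the first one makes it easier to log failure patterns and decide whether to retry, escalate to a larger model, or return a partial result.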
The cheapest multimodal production stack for document processing: classify page type (text vs. visual) → extract text for text pages → LLaVA-1.6-7B (self-hosted) for visual pages → GPT-4V only for complex visual reasoning on ambiguous pages. This tiered approach typically cuts multimodal API cost by 70–90% with minimal accuracy loss.
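The tiered flow can be expressed as a small routing function. This sketch assumes an upstream classifier has already labeled each page; the `Page` fields and the tier names are hypothetical, and how a page gets flagged as ambiguous (for example, low confidence from the smaller model) is left to your system.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Page:
    has_visual: bool         # classifier output: chart, diagram, or scan present
    ambiguous: bool = False  # assumption: set when the small model is unsure


def route(page: Page) -> str:
    """Pick the cheapest tier that can plausibly handle the page."""
    if not page.has_visual:
        return "text_extraction"
    if page.ambiguous:
        return "large_api_model"
    return "small_self_hosted_model"


def tier_counts(pages) -> dict:
    """Count pages per tier, useful for checking the cost model."""
    return dict(Counter(route(p) for p in pages))


if __name__ == "__main__":
    batch = [Page(False), Page(False), Page(True), Page(True, ambiguous=True)]
    print(tier_counts(batch))
```

Tracking the share of pages that reach the most expensive tier over time shows whether the routing is actually delivering the expected savings.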