GenAI Systems Lab
AI Engineering

Multimodal in Production: Latency, Cost, and Deployment Realities

Vision tokens are expensive — GPT-4V charges per tile. The real latency and cost profile of multimodal calls vs. text-only, how to pre-process images to cut costs, and the caching strategies that matter for production workloads.

Multimodal models have a fundamentally different cost and latency profile from text-only LLMs. A text query of 500 tokens costs a fraction of what a 500-token query with a high-resolution image costs — because images expand into hundreds of additional tokens that go through the full attention stack. Understanding this is critical before you design a multimodal system.

The Token Cost of Images

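GPT-4V charges for images by tile, and the count depends on the detail setting. In low-detail mode, any image costs a flat 85 tokens. In high-detail mode, OpenAI's documented rule is to scale the image to fit within 2048×2048, scale it down again so the shortest side is 768 px, and count the 512×512 tiles that cover the result; each tile costs 170 tokens, plus 85 base tokens. A 512×512 image = 1 tile = 170 + 85 = 255 tokens. A 1024×1024 image is resized to 768×768 = 4 tiles = 680 tokens + 85 base = 765 tokens. A 2048×2048 or 4096×4096 image is resized to the same 768×768 and also costs 765 tokens, so resolution beyond that point does not add tokens. At gpt-4o pricing ($5/1M input tokens), a single high-resolution document page costs roughly $0.004–0.006 in image tokens alone, before you write a word of prompt or receive a response.

| Image size (detail) | GPT-4V tiles after resize | Approx. tokens | Cost @ $5/1M |
|---|---|---|---|
| 512×512 (low) | flat rate | 85 | $0.0004 |
| 512×512 (high) | 1 | 255 | $0.0013 |
| 1024×1024 (high) | 4 | 765 | $0.0038 |
| 1792×1024 (high) | 6 | 1,105 | $0.0055 |
| 2048×2048 (high) | 4 | 765 | $0.0038 |
| 4096×4096 (high) | 4 | 765 | $0.0038 |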
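
To estimate image cost before sending a request, the tile arithmetic can be scripted. The following is a minimal sketch, assuming OpenAI's published rule for GPT-4V-style billing: a flat 85 tokens in low detail; in high detail, fit within 2048×2048, shrink the shortest side to 768 px, then charge 170 tokens per 512 px tile plus an 85-token base. The function names and the $5/1M default are illustrative and not part of any SDK.

```python
import math

BASE_TOKENS = 85        # charged once per image
TOKENS_PER_TILE = 170   # charged per 512x512 tile in high-detail mode
TILE_PX = 512


def image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate input tokens for one image under the assumed tiling rule."""
    if detail == "low":
        return BASE_TOKENS

    # Step 1: scale down (never up) to fit inside a 2048x2048 square.
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale

    # Step 2: scale down (never up) so the shortest side is at most 768 px.
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale

    # Step 3: count the 512 px tiles needed to cover the resized image.
    tiles = math.ceil(w / TILE_PX) * math.ceil(h / TILE_PX)
    return BASE_TOKENS + TOKENS_PER_TILE * tiles


def image_cost_usd(tokens: int, usd_per_million: float = 5.0) -> float:
    """Convert a token count to dollars at a given input-token price."""
    return tokens * usd_per_million / 1_000_000
```

Under these assumptions, `image_tokens(1024, 1024)` returns 765 and `image_tokens(4096, 4096, "low")` returns 85. Prices and tiling rules change between model versions, so treat the constants as inputs to verify against the provider's current documentation.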

Latency Profile

Multimodal calls add two sources of latency compared with text-only calls: image preprocessing (resize, tile, encode) and the extra attention work for the visual tokens. Prefill, the step that processes all input tokens before the first output token is generated, scales with input length, so adding 576–2048 visual tokens to a short text prompt makes it significantly more expensive. For streaming responses, this shows up directly as a noticeably worse time-to-first-token (TTFT).
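
To compare TTFT between text-only and multimodal calls, it helps to time the stream itself rather than the whole request. The sketch below is a minimal, client-agnostic timer: it assumes your client exposes the response as a lazy iterable of text chunks, which is why `measure_stream` takes any iterable. `fake_stream` is a hypothetical stand-in for a real streaming call.

```python
import time
from typing import Iterable, Iterator, Tuple


def measure_stream(chunks: Iterable[str]) -> Tuple[float, float, str]:
    """Return (ttft_seconds, total_seconds, full_text) for a chunk stream.

    Timing starts when iteration begins, so pass a lazy stream that only
    issues the request once it is consumed.
    """
    start = time.perf_counter()
    ttft = None
    parts = []
    for chunk in chunks:
        if ttft is None:
            # First chunk received: this is the time-to-first-token.
            ttft = time.perf_counter() - start
        parts.append(chunk)
    total = time.perf_counter() - start
    # An empty stream has no first token; report total time instead.
    return (ttft if ttft is not None else total), total, "".join(parts)


def fake_stream(first_token_delay: float) -> Iterator[str]:
    """Hypothetical stream that simulates prefill delay before the first chunk."""
    time.sleep(first_token_delay)
    yield "Hello"
    yield ", world"
```

Running `measure_stream` over the same prompt with and without an image attached gives a direct view of how much prefill the visual tokens add.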

Cost Reduction Strategies
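
Two levers named in the introduction are image pre-processing and caching. The following is a minimal sketch of both, using only the standard library: `fit_within` computes downscaled dimensions to hand to whatever image library you use, and `CachedVisionCall` memoizes results by a hash of the image bytes and prompt. Both names are hypothetical, and `call_model` stands in for your actual model client.

```python
import hashlib
from typing import Callable, Dict, Tuple


def fit_within(width: int, height: int, max_side: int) -> Tuple[int, int]:
    """Scale dimensions down (never up) so the longest side is <= max_side."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


class CachedVisionCall:
    """In-memory cache keyed by image content and prompt (illustrative only)."""

    def __init__(self, call_model: Callable[[bytes, str], str]) -> None:
        self._call_model = call_model
        self._cache: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(image_bytes: bytes, prompt: str) -> str:
        # Hash the image first so the key has a fixed-length image prefix,
        # then mix in the prompt; identical inputs map to the same key.
        image_digest = hashlib.sha256(image_bytes).digest()
        return hashlib.sha256(image_digest + prompt.encode("utf-8")).hexdigest()

    def __call__(self, image_bytes: bytes, prompt: str) -> str:
        key = self._key(image_bytes, prompt)
        if key in self._cache:
            self.hits += 1
            return self._cache[key]
        self.misses += 1
        result = self._call_model(image_bytes, prompt)
        self._cache[key] = result
        return result
```

Exact-match caching only pays off when the same image and prompt recur, for example repeated templates or retried jobs. A production version would likely use a shared store with expiry instead of a process-local dictionary.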

Self-Hosting vs. API

For high-volume multimodal workloads, self-hosting an open-weight model (LLaVA-1.6-34B, InternVL2-26B) on owned GPUs can be 20–50× cheaper than the GPT-4V API at scale. The break-even point is roughly 50k–100k multimodal calls per month, depending on your GPU cost and model choice. Below that volume, the API is cheaper once you factor in engineering and infrastructure cost.
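
The break-even volume can be estimated with simple arithmetic: fixed monthly self-hosting cost divided by the per-call saving. The sketch below assumes you can express self-hosting as a fixed monthly cost (GPUs plus engineering time) and a small marginal cost per call; all dollar figures in the usage note are hypothetical placeholders, not measurements.

```python
import math


def break_even_calls(
    fixed_monthly_usd: float,
    api_cost_per_call_usd: float,
    self_host_cost_per_call_usd: float = 0.0,
) -> float:
    """Monthly call volume above which self-hosting is cheaper than the API.

    Returns math.inf when self-hosting never wins on a per-call basis.
    """
    saving_per_call = api_cost_per_call_usd - self_host_cost_per_call_usd
    if saving_per_call <= 0:
        return math.inf
    return math.ceil(fixed_monthly_usd / saving_per_call)
```

For example, with a hypothetical $1,500 per month in fixed cost, an all-in API cost of $0.02 per call, and a $0.001 marginal cost per self-hosted call, `break_even_calls(1500, 0.02, 0.001)` comes out at about 79,000 calls per month.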

Monitoring Multimodal Systems

The cheapest multimodal production stack for document processing: classify page type (text vs. visual) → extract text for text pages → LLaVA-1.6-7B (self-hosted) for visual pages → GPT-4V only for complex visual reasoning on ambiguous pages. This tiered approach typically cuts multimodal API cost by 70–90% with minimal accuracy loss.
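
The tiered flow above can be sketched as a small router. This is an illustration under stated assumptions, not a reference implementation: `text_fraction` (how much of the page is extractable text) and `self_hosted_confidence` (a score from the self-hosted model) are hypothetical signals, and the thresholds are placeholders to tune on your own data.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, Optional


class Tier(Enum):
    TEXT_EXTRACTION = "text_extraction"   # plain text extraction, no model call
    SELF_HOSTED_VLM = "self_hosted_vlm"   # e.g. a self-hosted LLaVA-1.6-7B
    API_VLM = "api_vlm"                   # e.g. GPT-4V, reserved for hard pages


@dataclass
class Page:
    text_fraction: float                             # hypothetical classifier output, 0..1
    self_hosted_confidence: Optional[float] = None   # None until the small model has run


def route_page(page: Page, text_threshold: float = 0.9,
               min_confidence: float = 0.7) -> Tier:
    """Pick the cheapest tier that is expected to handle the page."""
    if page.text_fraction >= text_threshold:
        return Tier.TEXT_EXTRACTION
    # Escalate to the API only when the self-hosted model was unsure.
    if (page.self_hosted_confidence is not None
            and page.self_hosted_confidence < min_confidence):
        return Tier.API_VLM
    return Tier.SELF_HOSTED_VLM


def api_share(pages: Iterable[Page]) -> float:
    """Fraction of pages that end up on the paid API tier."""
    tiers = [route_page(p) for p in pages]
    return tiers.count(Tier.API_VLM) / len(tiers) if tiers else 0.0
```

Tracking `api_share` over time gives a direct read on the savings: if only 10–30% of pages reach the API tier, API spend drops by roughly the complementary share.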
