Debugging AI Integrations at Customer Sites: Latency, Context Window Surprises, Prompt Drift
Instrument each latency phase separately. Silent context window truncation. Token cost explosions from full-document insertion. Prompt drift: always pin explicit model version strings, run nightly golden-set evals. The customer debugging conversation.
The AI integration that worked in your dev environment fails at the customer site. The symptoms: responses are slower, outputs degrade over time, and the context window fills up in unexpected ways. Forward-deployed engineers (FDEs) and integration-focused AI engineers spend a disproportionate amount of their time on this class of problem.
The Latency Investigation
Latency in AI integrations has four sources: network (API call to model provider), prompt construction (building the context from customer data), model inference (time for the model to generate), and post-processing (parsing and applying the output).
- Instrument each phase separately before debugging anything. 'The AI is slow' is not a bug report.
- Network latency: model provider round-trip. Measure with a minimal prompt. If this is slow, it's infrastructure (VPC peering, geographic routing, rate limiting).
- Prompt construction: if you're doing retrieval, reranking, or document processing, time this independently.
- Inference latency: scales with output token count. If responses are long, latency is high. Add a max_tokens constraint.
- Post-processing: regex parsing, JSON extraction. Sometimes this is the bottleneck, not the model.
# Latency instrumentation pattern
# retriever, build_prompt, llm, and parse_response stand in for your integration's own components.
import time

def timed_integration(query, context_docs):
    t0 = time.perf_counter()
    # Phase 1: retrieval
    retrieved = retriever.search(query, top_k=5)
    t1 = time.perf_counter()
    # Phase 2: prompt construction
    prompt = build_prompt(query, retrieved)
    t2 = time.perf_counter()
    # Phase 3: API call (network round-trip plus model inference)
    response = llm.complete(prompt, max_tokens=512)
    t3 = time.perf_counter()
    # Phase 4: post-processing
    result = parse_response(response)
    t4 = time.perf_counter()
    print(f'retrieval:{t1-t0:.2f}s prompt:{t2-t1:.2f}s inference:{t3-t2:.2f}s parse:{t4-t3:.2f}s')
    return result
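A single timed request rarely tells the whole story, so it can help to aggregate phase timings across many requests and look at the median and tail. The following is a minimal sketch, not a production metrics client: the PhaseTimer class and its method names are hypothetical, and the p95 is a simple nearest-rank estimate.

```python
import statistics
import time
from collections import defaultdict
from contextlib import contextmanager


class PhaseTimer:
    """Collects per-phase durations (seconds) across many requests."""

    def __init__(self):
        self.samples = defaultdict(list)  # phase name -> list of durations

    @contextmanager
    def phase(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the duration even if the phase raised an exception.
            self.samples[name].append(time.perf_counter() - start)

    def summary(self):
        """Return {phase: (p50, p95)} using a nearest-rank p95."""
        out = {}
        for name, values in self.samples.items():
            ordered = sorted(values)
            p95_index = min(len(ordered) - 1, int(0.95 * len(ordered)))
            out[name] = (statistics.median(ordered), ordered[p95_index])
        return out
```

Wrapping each phase in `with timer.phase("retrieval"): ...` and printing `timer.summary()` after a batch of requests shows which phase dominates the tail, which is usually more useful than a single measurement.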
Context Window Surprises
Context window issues show up in three ways: truncation (input exceeds the maximum context and part of it, often the earliest content, is silently dropped), cost overruns (the customer's documents are longer than estimated, so token costs are 10x projected), and attention dilution (very long contexts reduce the model's focus on the relevant parts).
- Always log token counts per request in production. You will be surprised by the distribution.
- Set hard limits on retrieved document length. A customer document that's 50 pages will blow your context budget.
- Chunk before retrieval, not after. If you retrieve full documents and then chunk, you've wasted retrieval quality.
- For conversation history: use summarization-based memory, not unbounded history. A 100-turn conversation will eventually exceed the context window and the model will lose early context silently.
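Summarization-based memory can be sketched as: keep the newest turns verbatim within a token budget and fold everything older into one summary. This is a rough illustration under stated assumptions. The count_tokens heuristic (about 4 characters per token) is only an approximation, and compact_history and the summarize callable are hypothetical names; in practice summarize would call your model, and count_tokens would use the tokenizer that matches it.

```python
def count_tokens(text):
    # Assumption: roughly 4 characters per token for English text.
    # Replace with the tokenizer that matches your model for real budgets.
    return max(1, len(text) // 4)


def compact_history(turns, budget_tokens, summarize):
    """Keep the newest turns verbatim within budget; summarize the rest.

    turns: list of strings, oldest first.
    summarize: callable taking a list of strings and returning one string.
    """
    kept = []
    used = 0
    # Walk backwards so the most recent turns are preserved verbatim.
    for turn in reversed(turns):
        cost = count_tokens(turn)
        if used + cost > budget_tokens:
            break
        kept.append(turn)
        used += cost
    kept.reverse()
    older = turns[: len(turns) - len(kept)]
    if not older:
        return kept
    # The summary replaces all older turns with a single entry.
    return [summarize(older)] + kept
```

Note that the summary's own token cost is not counted against budget_tokens here, so leave headroom for it when choosing the budget.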
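The silent truncation failure: not every layer warns you when input exceeds max_context. Hosted model APIs commonly reject an over-limit request with an error, but some self-hosted inference servers, SDK wrappers, and framework memory or retrieval layers trim the input to fit without warning. The model then sees a partial prompt and generates a response that seems plausible but is based on incomplete information. Add token counting before every API call.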
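One way to make truncation loud instead of silent is a pre-flight check that fails before the request is sent. The sketch below is an assumption-laden example, not a specific provider's API: check_budget, ContextBudgetExceeded, and estimate_tokens are hypothetical names, and the default estimator uses a crude 4-characters-per-token heuristic that should be swapped for the model's real tokenizer.

```python
class ContextBudgetExceeded(Exception):
    """Raised when a prompt would not fit in the context window."""


def estimate_tokens(text):
    # Assumption: roughly 4 characters per token; use a real tokenizer in production.
    return max(1, len(text) // 4)


def check_budget(prompt, max_context_tokens, max_output_tokens, count=estimate_tokens):
    """Return the prompt token count, or raise if prompt + reserved output overflow."""
    prompt_tokens = count(prompt)
    if prompt_tokens + max_output_tokens > max_context_tokens:
        raise ContextBudgetExceeded(
            f"prompt={prompt_tokens} + output={max_output_tokens} "
            f"exceeds context={max_context_tokens}"
        )
    return prompt_tokens
```

Calling check_budget immediately before the API call, and logging its return value, covers both the per-request token logging and the over-limit guard in one place.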
Prompt Drift
Prompt drift: the model's behavior changes over time without you changing the prompt. Causes: the model provider updates the underlying model, the customer's data distribution shifts, or the retrieved context quality degrades as the document store grows.
- Version-lock your model: use explicit model version strings (gpt-4-0613, claude-3-5-sonnet-20241022), not aliases (gpt-4, claude-3-5-sonnet-latest). Aliases change underneath you.
- Run a nightly eval on a fixed golden set of 20–50 prompt/response pairs. Alert if accuracy drops by 5% or more from the baseline.
- When customer data grows, retrieval quality can degrade: as more data is added, the chunking strategy may need retuning and the embedding model may need retraining.
- For classification tasks: track the class distribution of outputs. If the model starts over-classifying one category, it's a signal that the prompt or data has drifted.
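The nightly golden-set check and the class-distribution check can be sketched in a few lines. This is an illustrative example under assumptions: scoring is exact string match after trimming whitespace (real evals often need fuzzier scoring), the 5% threshold is treated as an absolute drop in accuracy, and all function names are hypothetical.

```python
from collections import Counter


def golden_set_accuracy(golden, predict):
    """golden: list of (prompt, expected) pairs; predict: callable prompt -> output."""
    if not golden:
        raise ValueError("golden set is empty")
    correct = sum(
        1 for prompt, expected in golden
        if predict(prompt).strip() == expected.strip()  # assumption: exact match
    )
    return correct / len(golden)


def should_alert(baseline_accuracy, current_accuracy, max_drop=0.05):
    """True when accuracy fell by max_drop or more (absolute, e.g. 0.90 -> 0.85)."""
    # Small tolerance guards against floating-point rounding at the boundary.
    return (baseline_accuracy - current_accuracy) >= max_drop - 1e-9


def class_share_shift(baseline_labels, current_labels):
    """Largest absolute change in any class's share of the outputs."""
    if not baseline_labels or not current_labels:
        raise ValueError("both label lists must be non-empty")
    base, cur = Counter(baseline_labels), Counter(current_labels)
    return max(
        abs(cur[c] / len(current_labels) - base[c] / len(baseline_labels))
        for c in set(base) | set(cur)
    )
```

Run golden_set_accuracy against the pinned model version each night, store the result, and compare it to the stored baseline with should_alert; class_share_shift gives a single number to chart for classification outputs.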
The Customer Debugging Conversation
Debugging at a customer site is a communication problem as much as a technical one. The customer's engineers believe the AI is 'broken.' Your job is to be methodical, transparent, and non-defensive while you investigate.
- Ask for logs before hypothesizing. 'Can you show me the last 3 failing requests?' prevents 2 hours of speculation.
- Explain what you're ruling out, not just what you're checking. Customers want to understand the investigation.
- Give a time estimate for each hypothesis. 'I'll know in 10 minutes if this is the issue.'
- If you can't find it quickly: scope it. 'This isn't the integration — it's the model provider's behavior on this input class. Here's the workaround while we investigate.'