Guardrails for LLMs: Input/Output Filtering in Production
How guardrail pipelines work — input classifiers, output validators, topic filters, PII redaction, and toxicity detection. What fails at scale.
Guardrails are the safety layer between your users and your model. They intercept inputs before they reach the LLM and outputs before they reach the user, filtering, transforming, or blocking content that violates your policies.
Input guardrails
- Topic filter: block off-topic or out-of-scope queries before they consume tokens
- PII detector: identify and redact phone numbers, emails, SSNs, credit cards before sending to the model
- Injection detector: classify whether input looks like a prompt injection attempt
- Toxicity classifier: block abusive inputs using a fast binary classifier
- Jailbreak detector: catch common jailbreak patterns (DAN, role-play escapes, encoding tricks)
```python
def check_input(user_message: str) -> tuple[bool, str]:
    """Returns (is_allowed, reason)"""
    # contains_pii, is_on_topic, and injection_score are assumed to be
    # defined elsewhere (regex/NER, small classifier, embedding model).
    # 1. PII detection (fast regex + NER)
    if contains_pii(user_message):
        return False, "pii_detected"
    # 2. Topic relevance (small classifier, <10ms)
    if not is_on_topic(user_message, allowed_topics=["product", "support"]):
        return False, "off_topic"
    # 3. Injection risk (embedding similarity to known attacks)
    if injection_score(user_message) > 0.85:
        return False, "injection_detected"
    return True, "ok"
```
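The list above calls for redacting PII, whereas `check_input` simply rejects the message. A minimal sketch of the redaction alternative might look like the following. It assumes US-style formats and uses regex only; the pattern set, the `redact_pii` and `luhn_valid` names, and the placeholder labels are illustrative assumptions rather than any library's API. A production detector would usually pair patterns like these with an NER model.

```python
import re

# Illustrative patterns only (assumed US-style formats).
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")
OTHER_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(
        r"(?<!\d)(?:\+?1[ .-]?)?\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}(?!\d)"
    ),
}


def luhn_valid(number: str) -> bool:
    """Luhn checksum, used to cut false positives on long digit runs."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def redact_pii(text: str) -> tuple[str, list[str]]:
    """Replace detected PII with placeholders; return (text, labels found)."""
    found: list[str] = []

    def _card_sub(match: re.Match) -> str:
        if luhn_valid(match.group()):
            found.append("CREDIT_CARD")
            return "[CREDIT_CARD]"
        return match.group()  # leave digit runs that fail the checksum

    # Cards go first so the phone pattern cannot match part of a card number.
    text = CARD_RE.sub(_card_sub, text)
    for label, pattern in OTHER_PATTERNS.items():
        text, count = pattern.subn(f"[{label}]", text)
        if count:
            found.append(label)
    return text, found
```

Redacting instead of rejecting lets the request proceed with placeholders in place of the sensitive values, which is often the better user experience for support-style products.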
Output guardrails
- Hallucination check: verify claims against retrieved context using an NLI model
- PII leak detector: ensure model didn't reproduce PII from context into the response
- Toxicity filter: block harmful outputs before delivery
- Format validator: ensure structured outputs match the expected schema
- Citation checker: verify that cited sources actually support the claims made
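Of the output checks listed above, the format validator is the simplest to sketch with the standard library alone. The example below assumes the model was asked for a JSON object; the field names in `EXPECTED_FIELDS` and the `validate_output` function are hypothetical, chosen only for illustration. In practice a schema library such as Pydantic or jsonschema would likely replace the hand-written type checks.

```python
import json

# Hypothetical schema for illustration: field name -> required Python type.
EXPECTED_FIELDS = {"answer": str, "confidence": float, "sources": list}


def validate_output(
    raw: str, expected: dict[str, type] = EXPECTED_FIELDS
) -> tuple[bool, str]:
    """Returns (is_valid, reason) for a raw model response."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return False, f"invalid_json: {exc.msg}"
    if not isinstance(data, dict):
        return False, "not_an_object"

    missing = [key for key in expected if key not in data]
    if missing:
        return False, "missing_fields: " + ", ".join(missing)

    for key, typ in expected.items():
        value = data[key]
        if typ is float:
            # Accept ints for numeric fields, but not bools
            # (bool is a subclass of int in Python).
            ok = isinstance(value, (int, float)) and not isinstance(value, bool)
        else:
            ok = isinstance(value, typ)
        if not ok:
            return False, f"wrong_type: {key}"
    return True, "ok"
```

Returning a reason string alongside the boolean mirrors `check_input`, so input and output failures can be logged in the same shape.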
Architecture: where to place guardrails
Guardrails can run synchronously (blocking — adds latency) or asynchronously (non-blocking — you deliver the response and log violations for review). For safety-critical applications, synchronous input + output checks are mandatory. For high-volume consumer applications, async output checking with human review is more practical.
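The difference between the two modes can be sketched with `asyncio`. Everything here is a stand-in under stated assumptions: `slow_output_check` simulates an expensive check with a keyword match and a short sleep, and `review_queue` is a plain list standing in for a real logging or review system.

```python
import asyncio

review_queue: list[dict] = []  # stand-in for a real review/logging sink
_background: set[asyncio.Task] = set()


async def slow_output_check(response: str) -> list[str]:
    """Placeholder for an expensive check such as an LLM judge."""
    await asyncio.sleep(0.05)  # simulated latency
    return ["toxicity"] if "idiot" in response.lower() else []


async def audit(response: str) -> None:
    """Run the check and log any violations for human review."""
    violations = await slow_output_check(response)
    if violations:
        review_queue.append({"response": response, "violations": violations})


async def respond_sync(response: str) -> str:
    """Blocking mode: the user waits, and violating output never ships."""
    violations = await slow_output_check(response)
    return "Sorry, I can't help with that." if violations else response


async def respond_async(response: str) -> str:
    """Non-blocking mode: ship immediately, audit in the background."""
    task = asyncio.create_task(audit(response))
    _background.add(task)  # hold a reference so the task is not garbage-collected
    task.add_done_callback(_background.discard)
    return response


async def demo() -> tuple[str, str]:
    bad = "You idiot."
    blocked = await respond_sync(bad)
    shipped = await respond_async(bad)
    await asyncio.gather(*_background)  # let background audits finish
    return blocked, shipped
```

In the non-blocking path the violating response has already reached the user by the time it lands in `review_queue`, which is the trade-off the paragraph above describes.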
The fastest guardrails add very little latency: regex checks run in under 1ms and small classifiers in 5–20ms. The most accurate ones, LLM-based judges, typically take several hundred milliseconds. Design your pipeline to run fast checks first and only invoke expensive checks when the cheap ones raise flags.
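One way to express "fast checks first, expensive checks only on a flag" is a two-threshold cascade. The sketch below is an assumption-laden illustration: `cheap_score` and `expensive_judge` are keyword-based stand-ins for a real classifier and a real LLM judge, and the `low` and `high` thresholds are arbitrary values that would need tuning on your own traffic.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    allowed: bool
    reason: str
    escalated: bool  # True if the expensive check was invoked


def cheap_score(text: str) -> float:
    """Stand-in for a fast classifier: fraction of suspicious phrases present."""
    phrases = ("ignore previous instructions", "system prompt", "developer mode")
    hits = sum(p in text.lower() for p in phrases)
    return hits / len(phrases)


def expensive_judge(text: str) -> bool:
    """Stand-in for an LLM judge; returns True if the text should be blocked."""
    return "ignore previous instructions" in text.lower()


def tiered_check(
    text: str,
    low: float = 0.1,
    high: float = 0.6,
    fast: Callable[[str], float] = cheap_score,
    slow: Callable[[str], bool] = expensive_judge,
) -> Verdict:
    score = fast(text)
    if score >= high:  # confident block: no need for the judge
        return Verdict(False, "blocked_by_fast_check", False)
    if score <= low:  # confident pass: no need for the judge
        return Verdict(True, "ok", False)
    # Ambiguous band: pay for the expensive check.
    blocked = slow(text)
    return Verdict(not blocked, "blocked_by_judge" if blocked else "ok", True)
```

Only inputs that fall between `low` and `high` pay the judge's latency, so the average cost depends on how much traffic lands in that ambiguous band.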
Off-the-shelf vs. custom
| Option | Typical latency | Strengths | Customisability |
|---|---|---|---|
| Llama Guard (Meta) | 50–200ms | Good for common safety categories | Fine-tuneable |
| Azure AI Content Safety | 100–300ms | Strong on hate, sexual, violence, and self-harm content | Limited |
| Guardrails AI | Varies | Modular validators, schema validation | High (composable) |
| NeMo Guardrails (NVIDIA) | 100–400ms | Dialogue flows and policies | High |
| Custom classifier | 5–50ms | Best for domain-specific policies | Full control |
The cost of guardrails — latency budget
Guardrails add latency. A full synchronous pipeline (input check → LLM → output check) can add 100–600ms depending on which classifiers you use. For real-time chat this is often unacceptable. The solution: run fast synchronous checks (regex, small classifier, <20ms) and offload slow checks (LLM judge, NLI model) to async post-processing that logs violations for review. Only synchronously block on high-confidence, high-severity signals.
| Guardrail type | Latency | Sync or async? | Use for |
|---|---|---|---|
| Regex / pattern match | <1ms | Sync | PII, obvious injection patterns |
| Small classifier (e.g. DistilBERT) | 5–20ms | Sync | Toxicity, topic filter, jailbreak |
| Llama Guard | 50–200ms | Sync (critical) / async (standard) | Safety categories |
| LLM-as-judge | 300–800ms | Async only | Hallucination check, faithfulness |
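The sync/async split in the table can be treated as a budgeting exercise. The sketch below takes the upper-bound latencies from the table and a latency budget, then greedily keeps the cheapest checks synchronous until the budget is used up. It assumes checks run one after another rather than in parallel, and the `plan` function and guardrail names are hypothetical.

```python
# (name, worst-case latency in ms), upper bounds taken from the table above.
GUARDRAILS = [
    ("regex", 1),
    ("small_classifier", 20),
    ("llama_guard", 200),
    ("llm_judge", 800),
]


def plan(
    budget_ms: int, guardrails: list[tuple[str, int]] = GUARDRAILS
) -> tuple[list[str], list[str], int]:
    """Split guardrails into (sync, async, sync latency spent in ms)."""
    sync: list[str] = []
    deferred: list[str] = []
    spent = 0
    # Cheapest first: once one check does not fit, no later one will either.
    for name, latency in sorted(guardrails, key=lambda g: g[1]):
        if spent + latency <= budget_ms:
            sync.append(name)
            spent += latency
        else:
            deferred.append(name)
    return sync, deferred, spent


if __name__ == "__main__":
    for budget in (50, 250):
        sync, deferred, spent = plan(budget)
        print(f"budget={budget}ms sync={sync} async={deferred} spent={spent}ms")
```

With a 50ms budget only the regex and small-classifier checks stay synchronous (21ms worst case); raising the budget to 250ms also admits Llama Guard (221ms). Budget is not the only criterion, though: severity should still decide which signals are allowed to block.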
Further reading
- Constitutional AI: Harmlessness from AI Feedback (Bai et al., Anthropic, 2022)
- NeMo Guardrails (NVIDIA, GitHub repository)
- Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations (Inan et al., 2023)