Guardrails for LLMs: Input/Output Filtering in Production

How guardrail pipelines work — input classifiers, output validators, topic filters, PII redaction, and toxicity detection. What fails at scale.

Guardrails are the safety layer between your users and your model. They intercept inputs before they reach the LLM and outputs before they reach the user, filtering, transforming, or blocking content that violates your policies.

Input guardrails

Topic filter: block off-topic or out-of-scope queries before they consume tokens
PII detector: identify and redact phone numbers, emails, SSNs, credit cards before sending to the model
Injection detector: classify whether input looks like a prompt injection attempt
Toxicity classifier: block abusive inputs using a fast binary classifier
Jailbreak detector: catch common jailbreak patterns (DAN, role-play escapes, encoding tricks)

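INJECTION_THRESHOLD = 0.85  # illustrative cutoff

def check_input(user_message: str) -> tuple[bool, str]:
    """Return (is_allowed, reason), running the cheapest checks first.

    contains_pii, is_on_topic, and injection_score stand in for your own
    detectors; they are not defined here.
    """
    # 1. PII detection (fast regex + NER). This version rejects the message;
    #    a redacting variant would rewrite it and continue.
    if contains_pii(user_message):
        return False, "pii_detected"

    # 2. Topic relevance (small classifier, <10ms)
    if not is_on_topic(user_message, allowed_topics=["product", "support"]):
        return False, "off_topic"

    # 3. Injection risk (embedding similarity to known attacks)
    if injection_score(user_message) > INJECTION_THRESHOLD:
        return False, "injection_detected"

    return True, "ok"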
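
The function above blocks on PII, while the list describes redaction. A minimal sketch of the redacting variant might look like the following. The patterns and the `redact_pii` name are illustrative assumptions (US-style phone numbers and SSNs only), and a real deployment would usually pair regex with an NER model.

```python
import re

# Illustrative patterns only (assumed, not exhaustive).
PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    # 13 to 16 digits, optionally separated by single spaces or hyphens
    "CREDIT_CARD": re.compile(r"\b\d(?:[ -]?\d){12,15}\b"),
    "PHONE": re.compile(
        r"(?<!\d)(?:\+?1[ .-]?)?\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}(?!\d)"
    ),
}

def redact_pii(text: str) -> tuple[str, list[str]]:
    """Replace each match with a [LABEL] placeholder and return the labels found."""
    found = []
    for label, pattern in PII_PATTERNS.items():
        text, count = pattern.subn(f"[{label}]", text)
        if count:
            found.append(label)
    return text, found
```

Redacting rather than blocking lets the request proceed with placeholders in place of the sensitive values, and the returned labels can be logged without storing the values themselves.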

Output guardrails

Hallucination check: verify claims against retrieved context using an NLI model
PII leak detector: ensure model didn't reproduce PII from context into the response
Toxicity filter: block harmful outputs before delivery
Format validator: ensure structured outputs match the expected schema
Citation checker: verify that cited sources actually support the claims made
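
Of the output checks above, the format validator is the simplest to sketch. The example below assumes the model is asked to return a JSON object. The schema, field names, and `validate_output` are hypothetical, and a production system might use a schema library instead of hand-written type checks.

```python
import json

# Hypothetical schema: field name -> required Python type.
EXPECTED_SCHEMA = {"answer": str, "sources": list, "confidence": float}

def validate_output(raw: str, schema: dict[str, type] = EXPECTED_SCHEMA) -> tuple[bool, str]:
    """Return (is_valid, reason) for a model response that should be a JSON object."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False, "invalid_json"
    if not isinstance(data, dict):
        return False, "not_an_object"
    for field, expected_type in schema.items():
        if field not in data:
            return False, f"missing_field:{field}"
        if not isinstance(data[field], expected_type):
            return False, f"wrong_type:{field}"
    return True, "ok"
```

Returning a reason string alongside the boolean mirrors the input check and makes it easy to count which failure modes occur most often.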

Architecture: where to place guardrails

Guardrails can run synchronously (blocking, which adds latency) or asynchronously (non-blocking: you deliver the response and log violations for review). For safety-critical applications, synchronous input and output checks are mandatory. For high-volume consumer applications, asynchronous output checking with human review is more practical.
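
The difference between the two modes can be sketched with `asyncio`. Everything here is a stand-in: `slow_output_check` simulates an expensive check with a short sleep, and `violation_log` plays the role of a review queue.

```python
import asyncio

violation_log: list[dict] = []  # stand-in for a real review queue

async def slow_output_check(response: str) -> bool:
    """Placeholder for an expensive check; returns True if the response passes."""
    await asyncio.sleep(0.01)  # simulated latency
    return "forbidden" not in response.lower()

async def deliver_blocking(response: str) -> str:
    """Synchronous mode: the user waits for the check and never sees a violation."""
    if not await slow_output_check(response):
        return "Sorry, I can't help with that."
    return response

async def audit(response: str) -> None:
    """Run the check after delivery and record failures for human review."""
    if not await slow_output_check(response):
        violation_log.append({"response": response, "status": "needs_review"})

async def deliver_non_blocking(response: str) -> tuple[str, asyncio.Task]:
    """Asynchronous mode: return immediately and check in the background."""
    task = asyncio.create_task(audit(response))
    return response, task
```

In the non-blocking path the user has already seen the response by the time a violation is logged, which is why that mode suits lower-severity policies.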

The fastest guardrails run in under 20ms: regex matching takes less than 1ms and small classifiers take 5–20ms. The most accurate are the slowest, with LLM-based judges taking 300–800ms. Design your pipeline to run fast checks first and only invoke expensive checks when the cheap ones raise flags.
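
A minimal sketch of that cascade, assuming a regex as the cheap check and a placeholder function standing in for the expensive one. The pattern and both check functions are invented for illustration.

```python
import re

# Assumed pattern for illustration; real injection patterns are broader.
SUSPICIOUS = re.compile(
    r"ignore (all |any )?(previous|prior) instructions|system prompt", re.I
)

def cheap_check(message: str) -> bool:
    """Fast pattern match; True means 'flagged, look closer'."""
    return bool(SUSPICIOUS.search(message))

def expensive_check(message: str) -> bool:
    """Placeholder for a slow classifier or LLM judge; True means 'confirmed'."""
    return "reveal" in message.lower()

def tiered_check(message: str) -> tuple[bool, str, list[str]]:
    """Return (is_allowed, reason, checks_run)."""
    checks_run = ["cheap"]
    if not cheap_check(message):
        return True, "ok", checks_run
    # Only flagged messages pay for the expensive check.
    checks_run.append("expensive")
    if expensive_check(message):
        return False, "injection_confirmed", checks_run
    return True, "flag_cleared", checks_run
```

The trade-off is that anything the cheap check misses never reaches the expensive one, so the cheap check should be tuned for recall rather than precision.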

Off-the-shelf vs. custom

Option	Latency	Strengths	Customisability
Llama Guard (Meta)	50–200ms	Good for common categories	Fine-tuneable
Azure AI Content Safety	100–300ms	Strong on hate, sexual content, violence, self-harm	Limited
Guardrails AI	Varies	Modular, schema validation	High — composable
NeMo Guardrails	100–400ms	Dialogue flows + policies	High
Custom classifier	5–50ms	Best for domain-specific	Full control

The cost of guardrails — latency budget

Guardrails add latency. A full synchronous pipeline (input check → LLM → output check) can add 100–600ms depending on which classifiers you use. For real-time chat this is often unacceptable. The solution: run fast synchronous checks (regex, small classifier, <20ms) and offload slow checks (LLM judge, NLI model) to async post-processing that logs violations for review. Only synchronously block on high-confidence, high-severity signals.
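
The rule of blocking only on high-confidence, high-severity signals can be written as a small decision function. The thresholds, the `Signal` fields, and the three-way `Action` split are assumptions made for this sketch.

```python
from dataclasses import dataclass
from enum import Enum

class Action(Enum):
    ALLOW = "allow"
    LOG_FOR_REVIEW = "log_for_review"
    BLOCK = "block"

@dataclass
class Signal:
    check: str
    confidence: float  # 0.0 to 1.0, as reported by the detector
    severity: str      # "low", "medium", or "high"

# Assumed thresholds for illustration; tune them on your own traffic.
BLOCK_CONFIDENCE = 0.9
FLAG_CONFIDENCE = 0.5

def decide(signal: Signal) -> Action:
    """Block only when the signal is both high-severity and high-confidence."""
    if signal.severity == "high" and signal.confidence >= BLOCK_CONFIDENCE:
        return Action.BLOCK
    if signal.confidence >= FLAG_CONFIDENCE:
        return Action.LOG_FOR_REVIEW
    return Action.ALLOW
```

Anything that falls short of the blocking bar but still looks suspicious goes to the review log instead of interrupting the user.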

Guardrail type	Latency	Sync or async?	Use for
Regex / pattern match	<1ms	Sync	PII, obvious injection patterns
Small classifier (DistilBERT)	5–20ms	Sync	Toxicity, topic filter, jailbreak
Llama Guard	50–200ms	Sync (critical) / async (standard)	Safety categories
LLM-as-judge	300–800ms	Async only	Hallucination check, faithfulness
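
One way to turn the table into a routing decision is to keep checks synchronous, cheapest first, until a latency budget is spent. The sketch below uses the upper bound of each range in the table and assumes checks run one after another. The `split_by_budget` helper and the budget values are hypothetical.

```python
# Upper-bound latencies in ms, taken from the table above.
WORST_CASE_MS = {
    "regex": 1,
    "small_classifier": 20,
    "llama_guard": 200,
    "llm_judge": 800,
}

def split_by_budget(checks: list[str], budget_ms: int) -> tuple[list[str], list[str]]:
    """Greedily keep the cheapest checks synchronous until the budget is spent."""
    sync, deferred, spent = [], [], 0
    for name in sorted(checks, key=WORST_CASE_MS.__getitem__):
        cost = WORST_CASE_MS[name]
        if spent + cost <= budget_ms:
            sync.append(name)
            spent += cost
        else:
            deferred.append(name)  # run asynchronously after delivery
    return sync, deferred
```

With a 50ms budget, only the regex and small-classifier checks stay synchronous. Raising the budget to 250ms also admits Llama Guard, while the LLM judge remains asynchronous in both cases.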
