Deterministic Guardrails: When to Use Hooks and When to Trust the LLM
Rule-based hooks and LLM-based safety serve different threat models. Hooks are fast, predictable, and auditable — but brittle on edge cases. LLM classifiers generalise better but add latency and fail probabilistically. The right architecture uses both in the right layers.
When something happens in your agent (a tool is called, a response is generated, an external message arrives), you have two ways to inspect and potentially modify it: a deterministic hook or an LLM-based classifier. Choosing wrong adds latency, costs money, or lets bad outputs through. Which one to use is a systems design decision, not a matter of preference.
What hooks are
Hooks are deterministic code that executes at specific points in the agent lifecycle. Pre-tool hooks fire before a tool is called — you can inspect the tool name and arguments and block or modify the call. Post-model hooks fire after the model generates output — you can inspect the response and reject, modify, or log it. Input hooks fire on every incoming message before the model sees it.
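As a rough illustration of a pre-tool hook, here is a minimal Python sketch. The `ToolCall` and `HookDecision` types, the tool names (`delete_database`, `search`), and the limit of 50 are all hypothetical and not taken from any particular agent framework. The sketch shows both outcomes described above: blocking a call outright and modifying its arguments before it runs.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class ToolCall:
    name: str
    args: dict[str, Any]


@dataclass
class HookDecision:
    allowed: bool
    reason: str
    call: Optional[ToolCall] = None  # the call to execute, possibly modified


# Hypothetical policy values, chosen only for illustration.
BLOCKED_TOOLS = {"delete_database"}
MAX_SEARCH_LIMIT = 50


def pre_tool_hook(call: ToolCall) -> HookDecision:
    """Inspect a tool call before it runs; block it, patch it, or pass it through."""
    if call.name in BLOCKED_TOOLS:
        return HookDecision(False, f"tool '{call.name}' is blocked")
    if call.name == "search" and call.args.get("limit", 0) > MAX_SEARCH_LIMIT:
        # Modify rather than block: clamp the argument and let the call proceed.
        patched = ToolCall(call.name, {**call.args, "limit": MAX_SEARCH_LIMIT})
        return HookDecision(True, "limit clamped", patched)
    return HookDecision(True, "ok", call)
```

The hook returns a decision object instead of raising, so the caller can log the reason for every outcome, including the ones that were allowed.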
Hooks are fast (typically microseconds for simple checks), predictable (the same input always produces the same output), and auditable (you can log exactly what was blocked and why). Their weakness is that they operate on patterns, not meaning. A hook can block any response containing a competitor's name. It cannot understand whether a response is actually harmful in context.
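The trade-off can be seen in a small sketch of a post-model hook. The competitor names (`acmecorp`, `globex`) and the audit record fields are made up for illustration. The hook records exactly which rule fired and on what text, which is the auditability described above, but it has no way to tell a harmful mention from a harmless one.

```python
import re

# Hypothetical banned names; a real list would come from your own policy.
BANNED = re.compile(r"\b(acmecorp|globex)\b", re.IGNORECASE)

# In-memory audit trail for the sketch; a real system would write to durable logs.
audit_log: list[dict] = []


def post_model_hook(response: str) -> bool:
    """Return True if the response may be sent, False if it is blocked."""
    match = BANNED.search(response)
    if match:
        audit_log.append(
            {"rule": "banned_name", "matched": match.group(0), "action": "block"}
        )
        return False
    return True
```

Under these assumptions, "We are cheaper than Globex." and "I can't comment on Globex." are both blocked, even though the second is a refusal. The pattern matches either way.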
What LLM classifiers are
An LLM-based safety classifier is a separate model call that evaluates a piece of content against safety criteria. It can understand context, nuance, and intent in ways a rule-based hook cannot. It catches adversarial inputs that are specifically designed to evade pattern matching. It handles novel threat vectors that were not anticipated when the hooks were written.
LLM classifiers add roughly 100-500ms of latency per check and cost money per call. They fail probabilistically: a well-crafted adversarial input can still fool one. They are also harder to audit: when a classifier blocks something, explaining exactly why in deterministic terms is difficult.
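A classifier check might be wrapped along these lines. This is a sketch, not a reference implementation: `call_model` stands in for whatever model client you use, `fake_model` is a toy substitute so the example runs without a network call, and the SAFE/UNSAFE response format is an assumption of this example. Treating a classifier error as "unsafe" (failing closed) is also a design choice made here, not something the article prescribes.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    safe: bool
    rationale: str
    latency_ms: float


def _ms_since(start: float) -> float:
    return (time.perf_counter() - start) * 1000.0


def classify(text: str, call_model: Callable[[str], str], policy: str) -> Verdict:
    """Ask a model whether `text` complies with `policy`, and time the call."""
    prompt = (
        f"Policy:\n{policy}\n\nContent:\n{text}\n\n"
        "Reply with SAFE or UNSAFE on the first line and a short reason on the second."
    )
    start = time.perf_counter()
    try:
        raw = call_model(prompt)
    except Exception as exc:
        # Assumption: fail closed, so a classifier outage blocks rather than allows.
        return Verdict(False, f"classifier error: {exc}", _ms_since(start))
    label, _, reason = raw.strip().partition("\n")
    return Verdict(label.strip().upper() == "SAFE", reason.strip(), _ms_since(start))


def fake_model(prompt: str) -> str:
    """Toy stand-in for a real model call; it only looks for one keyword."""
    if "password" in prompt.lower():
        return "UNSAFE\nasks for credentials"
    return "SAFE\nno issue found"
```

Recording `latency_ms` on every verdict makes the latency cost of this layer measurable instead of estimated, and the free-text rationale is the closest thing to an audit trail the classifier provides.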
The layered architecture
- Layer 1 — Input hooks: fast pattern matching on incoming messages. Block known bad patterns (prompt injection templates, PII regexes, known jailbreak prefixes) before the model sees them. Cost: ~0ms.
- Layer 2 — LLM input classifier: semantic safety check on inputs that passed layer 1. Catches novel attacks, contextual violations, subtle policy breaches. Cost: 100-200ms, $0.001-0.01 per check.
- Layer 3 — Output hooks: structural validation on model responses. Format compliance, length limits, forbidden content patterns. Cost: ~0ms.
- Layer 4 — LLM output classifier: faithfulness and safety check on responses before they reach the user. Catches hallucination, policy violations in generated content. Use on a sampled basis (1-10%) or for high-stakes responses only.
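The four layers above can be sketched as a single request handler. Everything named here is an assumption for illustration: the injection regex, the 2000-character limit, the 5% default sample rate, and the convention that a classifier is a callable returning True for safe content. The `generate`, `input_clf`, and `output_clf` callables are placeholders for your model and classifier calls.

```python
import random
import re
from typing import Callable

# Hypothetical layer 1 pattern and layer 3 limit.
INJECTION = re.compile(r"ignore (all )?previous instructions", re.IGNORECASE)
MAX_CHARS = 2000

Classifier = Callable[[str], bool]  # returns True when the content is judged safe


def handle(
    message: str,
    generate: Callable[[str], str],
    input_clf: Classifier,
    output_clf: Classifier,
    sample_rate: float = 0.05,
    rng: Callable[[], float] = random.random,
) -> tuple[str, str]:
    """Run one message through the four layers; return (status, detail)."""
    # Layer 1: cheap deterministic check runs first and short-circuits.
    if INJECTION.search(message):
        return ("blocked", "layer1:input_hook")
    # Layer 2: semantic check only on inputs that passed layer 1.
    if not input_clf(message):
        return ("blocked", "layer2:input_classifier")
    response = generate(message)
    # Layer 3: structural validation of the response.
    if len(response) > MAX_CHARS:
        return ("blocked", "layer3:output_hook")
    # Layer 4: sampled LLM check; skipped for most responses to save cost.
    if rng() < sample_rate and not output_clf(response):
        return ("blocked", "layer4:output_classifier")
    return ("ok", response)
```

Ordering the deterministic checks before the model-based ones means the expensive layers only see traffic the cheap layers could not decide. Setting `sample_rate=1.0` turns layer 4 into a check on every response, which is the high-stakes configuration.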
The design decision
The right architecture depends on your threat model and latency budget. If your primary risk is known bad patterns (competitor names, PII categories, known injection templates), hooks handle it at negligible latency cost. If your risk is adversarial users actively probing for policy violations, you need LLM classifiers on the input path. If you serve high-stakes outputs (medical, legal, financial), running the LLM output classifier on 100% of responses may be justified despite the cost.
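To make the cost side of this decision concrete, here is a back-of-the-envelope sketch. The function name and the traffic volume are hypothetical; the per-check price is taken from the $0.001-0.01 range quoted for the input classifier, and the sketch assumes, for simplicity, that output checks cost the same.

```python
def monthly_classifier_cost(
    requests: int,
    cost_per_check: float,
    input_rate: float,
    output_rate: float,
) -> float:
    """Estimate classifier spend given the fraction of traffic each layer checks."""
    input_checks = requests * input_rate
    output_checks = requests * output_rate
    return (input_checks + output_checks) * cost_per_check


# Hypothetical workload: 1,000,000 requests per month at $0.001 per check.
sampled = monthly_classifier_cost(1_000_000, 0.001, input_rate=1.0, output_rate=0.05)
full = monthly_classifier_cost(1_000_000, 0.001, input_rate=1.0, output_rate=1.0)
print(f"5% output sampling: ${sampled:,.2f}   100% output checks: ${full:,.2f}")
```

Under these assumptions, checking every output roughly doubles classifier spend compared with 5% sampling (about $2,000 versus about $1,050 per month). That may be acceptable for medical, legal, or financial responses and hard to justify elsewhere.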
A common mistake: replacing deterministic hooks entirely with LLM classifiers because 'the LLM is smarter'. The LLM is smarter on hard cases. On easy cases (block any message mentioning a banned keyword), hooks are faster, cheaper, and more auditable. Use each layer for what it is actually good at.