Safety, guardrails, and steering¶
Three layers cooperate inside an agent run:
- Validation — typed tool arguments are JSON-schema-checked before the call lands. No opt-in needed.
- Guardrails — content policy, PII redaction, dangerous-tool blocking, prompt/result length caps. Runs as a hook on the prompt-in / output-out boundaries.
- Steering — a second model votes on every tool call before it fires. The judge sees the system prompt, the user goal, and the tool-call arguments, and emits approve / reject / rewrite.
Each layer plugs in independently. You can turn one on without the others.
When to reach for which layer¶
| Situation | Layer |
|---|---|
| Tool args from the model are sometimes malformed | Validation — already on; nothing to do |
| Public-facing agent — block prompt injection, SQL/command/path-traversal patterns, cap input length | GuardrailsHook with the default GuardrailConfig |
| Customer-facing answer where leaking PII (emails, SSN, credit cards, IPs) is a compliance issue | GuardrailsHook with PII patterns enabled |
| High-stakes tools (send_email, transfer_funds, delete_*) — want a second model to sanity-check the call | SteeringHook with a judge model and a policy string |
| Domain restriction — "the user came in for flights, reject anything else" | SteeringHook with that policy verbatim |
| Internal-only agent, trusted prompts, low-stakes tools | none of the above; default validation is enough |
Getting started¶
Guardrails — block dangerous tools and redact PII¶
```python
from locus import Agent
from locus.hooks.builtin.guardrails import (
    GuardrailsHook, GuardrailConfig, GuardrailAction,
)

config = GuardrailConfig(
    block_dangerous_tools=frozenset({"shell", "exec", "rm", "drop"}),
    max_prompt_length=50_000,
    default_action=GuardrailAction.BLOCK,
)

agent = Agent(
    model="oci:openai.gpt-5.5",
    tools=[search, summarise],
    hooks=[GuardrailsHook(config=config)],
)
```
GuardrailsHook ships with sensible defaults — the empty
GuardrailConfig() already blocks eval, exec, system, shell,
rm, delete, drop, truncate; detects email / phone / SSN /
credit-card / IP patterns; and watches for SQL-injection,
path-traversal, and command-injection shapes in tool inputs.
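Under the hood the PII detection is plain regex matching and substitution. The following stdlib-only sketch shows the mechanics; the patterns here are illustrative, not the library's actual set:

```python
import re

# Illustrative patterns only -- GuardrailConfig ships its own, more complete set.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b"),
}

def redact(text: str) -> str:
    """Replace each PII match with a labelled placeholder."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label.upper()}]", text)
    return text
```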
Topic and content policies — domain restriction¶
```python
from locus import Agent
from locus.hooks.builtin.guardrails import (
    GuardrailsHook, GuardrailConfig, TopicPolicy, ContentPolicy,
)

topic_policy = TopicPolicy(
    blocked_topics={"weapons", "hacking"},
    keywords={
        "weapons": ["gun", "rifle", "ammunition"],
        "hacking": ["exploit", "zero-day", "rootkit"],
    },
)

content_policy = ContentPolicy(
    enabled_categories={"hate_speech", "self_harm", "illegal_activity"},
)

agent = Agent(
    model="oci:openai.gpt-5.5",
    tools=[...],
    hooks=[GuardrailsHook(
        config=GuardrailConfig(),
        topic_policy=topic_policy,
        content_policy=content_policy,
    )],
)
```
Both policies are simple keyword classifiers — fast, predictable,
auditable. For production-grade content moderation, swap in an
ML-backed policy (Oracle Content Moderation, OpenAI Moderation, etc.)
behind the same Policy.check(text) -> str | None shape.
Steering — a second model judges every tool call¶
```python
from locus import Agent
from locus.hooks.builtin.steering import SteeringHook

agent = Agent(
    model="oci:openai.gpt-5.5",
    tools=[search_flights, send_email, transfer],
    hooks=[
        SteeringHook(
            judge_model="oci:openai.gpt-5-mini",
            policy=(
                "The user came in to book a flight. "
                "Reject any tool call unrelated to flights."
            ),
        ),
    ],
)
```
Before send_email or transfer fires, the judge sees the system
prompt, the user goal, and the proposed tool call. Three possible
verdicts:
- approve — the call goes through.
- reject — the call is replaced with an error the model sees, triggering a re-plan.
- rewrite — the judge can hand back modified arguments (for scoping a query, redacting a recipient, etc.).
Use the smallest model that gives reliable verdicts — a mini /
flash / haiku is usually enough.
Validation (you don't have to do anything)¶
The @tool decorator builds a JSON schema from the function's typed
signature. Every model tool call goes through that schema before the
function body runs. Schema violations come back to the model as a
tool error so it can retry with corrected arguments — you don't have
to write any of that defensively.
For example, a call with seat_class="business" against a tool that types seat_class as a Literal of fare codes is rejected before the body runs; the model sees the typed-error message and retries with "C".
Common gotchas¶
| Symptom | Likely cause |
|---|---|
| PII redaction over-aggressive | The default IP regex matches version strings too. Drop ip_address from pii_patterns or tighten to a CIDR-aware pattern. |
| Steering rejects almost everything | Judge model is too strict. Tune the policy or move to a stronger model — a nano is often too small for nuanced judgement. |
| GuardrailsHook blocks a legitimate message | Inspect hook._violations after the run for the violation type, then add an action override (action_overrides={"sql_injection": ALLOW}) or trim the regex. |
| Validation error swallows a tool-arg bug | The error came back to the model — it's in the trace, look for ToolCompleteEvent.error. |
Source and tutorials¶
- tutorial_19_guardrails_security.py — basic guardrails.
- tutorial_30_guardrails_advanced.py — topic + content + PII layered.
- tutorial_33_steering.py — judge-model approval.
- locus.hooks.builtin.guardrails
- locus.hooks.builtin.steering
See also¶
- Hooks — how GuardrailsHook and SteeringHook plug into the lifecycle.
- Tools — the @tool decorator and its schema validation.
- Reasoning: grounding — the answer-side analogue, claim-by-claim.