Evaluation¶
An agent that worked yesterday may not work today — the model changed, a tool was renamed, the prompt got a one-line tweak. locus ships a small evaluation harness so regressions become failing tests, not customer tickets.
```python
from locus.evaluation import EvalCase, EvalRunner

cases = [
    EvalCase(
        name="weather_lookup",
        prompt="What's the weather in NYC?",
        expected_tools=["get_weather"],
        expected_output_contains=["temperature", "New York"],
        max_iterations=5,
    ),
]

# `agent` is any already-constructed locus agent.
report = EvalRunner(agent=agent).run(cases)
print(report.summary())
```
When to reach for an eval suite¶
| Situation | Run evals? |
|---|---|
| You changed a tool's signature, default args, or system prompt | yes — every commit that touches it |
| You're swapping models (gpt-4o → gpt-5, llama-3.3 → llama-4) | yes — same suite, two providers, diff the report (see the sketch below the table) |
| You're debating "is the agent better than last week?" | yes — nightly soak with n=20 per case to see variance |
| One-shot exploration, scratch agent | no — overhead's not worth it |
| Heavy LLM-as-judge needed (open-ended quality) | partly — the harness covers structural checks; pair it with a custom judge tool for free-text grading |
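For the model-swap row in the table above, the workflow is to run the identical suite against each provider and compare the reports. A minimal sketch, assuming two already-configured agents (`agent_current` and `agent_candidate` are placeholder names):

```python
from locus.evaluation import EvalRunner

# Same `cases` list as in the quick-start above; only the agent under test changes.
report_current = EvalRunner(agent=agent_current).run(cases)
report_candidate = EvalRunner(agent=agent_candidate).run(cases)

# Diff the two summaries: any case that flips from PASS to FAIL is a
# regression introduced by the model swap, not by your code.
print(report_current.summary())
print(report_candidate.summary())
```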
Getting started¶
1. Define cases¶
`EvalCase` is a Pydantic model — every field is optional except `name` and `prompt`. The runner only checks fields you set.
```python
from locus.evaluation import EvalCase

books_real = EvalCase(
    name="books_real_flight",
    prompt="Book TK-12 for customer C-42.",
    expected_tools=["book_flight"],
    expected_output_contains=["TK-12", "booked"],
    max_iterations=4,
)

rejects_unknown = EvalCase(
    name="rejects_unknown_flight",
    prompt="Book ZZ-999.",
    expected_output_contains=["not found"],
    expected_output_not_contains=["booked", "confirmed"],
)
```
2. Run them¶
```python
from locus.evaluation import EvalRunner

runner = EvalRunner(agent=agent)
report = runner.run([books_real, rejects_unknown])
print(report.summary())
# Eval Report: 2/2 passed (avg score: 1.00)
# Total duration: 4321ms
# [PASS] books_real_flight (score: 1.00, 1872ms)
# [PASS] rejects_unknown_flight (score: 1.00, 2449ms)
```
`run()` returns an `EvalReport` — a Pydantic model with per-case results, aggregate pass/fail counts, average score, and total duration. It's JSON-serialisable, so you can drop it straight into CI artifacts.
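Because the report is a Pydantic model, persisting it for CI is one call. A minimal sketch, assuming Pydantic v2's `model_dump_json()`:

```python
from pathlib import Path

# Write the full report (per-case results included) as a CI artifact.
Path("eval-report.json").write_text(report.model_dump_json(indent=2))
```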
3. Wire it into CI¶
```python
# tests/test_agent_evals.py
import pytest

from locus.evaluation import EvalRunner

def test_agent_passes_eval_suite(agent):
    # `agent` is a pytest fixture; load_cases() returns the EvalCase list.
    report = EvalRunner(agent=agent).run(load_cases())
    failures = [r for r in report.results if not r.passed]
    assert not failures, report.summary()
```
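`load_cases()` and the `agent` fixture are whatever your project provides. A minimal `load_cases` sketch (illustrative only, not part of the locus API):

```python
# tests/test_agent_evals.py (continued): illustrative helper
from locus.evaluation import EvalCase

def load_cases() -> list[EvalCase]:
    """Return the full eval suite for test_agent_passes_eval_suite."""
    return [
        EvalCase(
            name="books_real_flight",
            prompt="Book TK-12 for customer C-42.",
            expected_tools=["book_flight"],
            expected_output_contains=["TK-12", "booked"],
        ),
        # ...the rest of the suite
    ]
```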
Built-in checks¶
Every check runs only when the corresponding field is set on the case. Each check contributes equally to the per-case score.
| Field | Passes when |
|---|---|
| `expected_tools` | All listed tools appear in the run's tool executions. |
| `expected_output_contains` | Every string is a case-insensitive substring of the final message. |
| `expected_output_not_contains` | None of the strings appear in the final message. |
| `max_iterations` | The run finished in ≤ N ReAct turns. |
| `max_duration_ms` | Wall-clock duration ≤ N milliseconds. |
A case passes when every check passes; its score is the fraction of checks that passed (handy for partial-credit scoring across a soak).
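For the nightly-soak workflow from the table above, one approach is to run the same suite repeatedly and aggregate the per-case scores. A minimal sketch, assuming each per-case result exposes `name` and `score` fields (an assumption beyond what `summary()` prints):

```python
from collections import defaultdict
from statistics import mean, stdev

# Run the suite N times and look at per-case score variance.
# Assumes each result in report.results has `name` and `score` fields.
N = 20
scores: dict[str, list[float]] = defaultdict(list)
for _ in range(N):
    report = runner.run(cases)
    for r in report.results:
        scores[r.name].append(r.score)

for case_name, xs in scores.items():
    print(f"{case_name}: mean={mean(xs):.2f}  stdev={stdev(xs):.2f}  "
          f"full-pass rate={sum(s == 1.0 for s in xs) / N:.0%}")
```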
Tags and filtering¶
```python
EvalCase(name="...", prompt="...", tags=["smoke", "happy-path"])
EvalCase(name="...", prompt="...", tags=["adversarial"])

# Run only smoke cases on every commit; full suite nightly.
smoke = [c for c in all_cases if "smoke" in c.tags]
runner.run(smoke)
```
tags is just a list — slice it however your CI matrix expects.
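One way to wire that into a CI matrix is to read the wanted tags from an environment variable, so the same script serves both the per-commit and nightly jobs. A hedged sketch (the `EVAL_TAGS` variable is an example name, not a locus convention):

```python
import os

# EVAL_TAGS="smoke" on every commit; leave it unset for the full nightly run.
wanted = {t for t in os.environ.get("EVAL_TAGS", "").split(",") if t}
selected = [c for c in all_cases if not wanted or wanted & set(c.tags)]
report = runner.run(selected)
print(report.summary())
```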
LLM-as-judge for open-ended quality¶
The built-in checks are structural ("did the right tool fire?", "did the answer mention 'temperature'?"). For free-text quality ("is this answer empathetic?", "is the explanation correct?"), wrap a judge model as a tool and key on its verdict:
```python
from locus.tools.decorator import tool

@tool
def judge(answer: str) -> str:
    """LLM-graded quality verdict (0.0–1.0 + reasoning)."""
    # judge_model is a separately configured grading agent.
    return judge_model.run_sync(f"Grade this answer: {answer}").message

# Then in the case:
EvalCase(
    name="empathetic_response",
    prompt="My order is late and I'm upset.",
    expected_tools=["judge"],
    expected_output_contains=["sorry"],  # at minimum
)
```
A future locus release may bundle a typed judge directly into
EvalCase; for today, this pattern is the path.
Common gotchas¶
| Symptom | Likely cause |
|---|---|
| Case passes locally, fails in CI | Model output varies between runs. Pin the model id, lower temperature, run with n=5 and look at variance. |
| `max_duration_ms` flakes | Cold-start network latency. Use a wall-clock budget at the suite level, not per-case, or bump the per-case budget by 2×. |
| `expected_tools` reports failure even though the tool ran | Case-sensitive name match — `book_flight` != `Book_Flight`. |
| Score is 0.5 every time | One of two checks is consistently failing. Read `result.checks` — it carries the full pass/fail map. |
Source and tutorial¶
- `tutorial_26_evaluation.py` — runnable end-to-end suite.
- `locus.evaluation.framework` — `EvalCase`, `EvalRunner`, `EvalReport`.
See also¶
- Reasoning — `reflexion=True` and `grounding=True` reduce the kind of failures you'd otherwise catch only in evals.
- Termination — `max_iterations` on `EvalCase` mirrors `MaxIterations` on the agent.
- Hooks — record per-eval traces with a `TelemetryHook` for offline review.