Evaluation

An agent that worked yesterday may not work today — the model changed, a tool was renamed, the prompt got a one-line tweak. locus ships a small evaluation harness so regressions become failing tests, not customer tickets.

from locus.evaluation import EvalCase, EvalRunner

cases = [
    EvalCase(
        name="weather_lookup",
        prompt="What's the weather in NYC?",
        expected_tools=["get_weather"],
        expected_output_contains=["temperature", "New York"],
        max_iterations=5,
    ),
]

report = EvalRunner(agent=agent).run(cases)  # agent is any configured locus agent
print(report.summary())

When to reach for an eval suite

| Situation | Run evals? |
| --- | --- |
| You changed a tool's signature, default args, or system prompt | yes — every commit that touches it |
| You're swapping models (gpt-4o → gpt-5, llama-3.3 → llama-4) | yes — same suite, two providers, diff the report (sketch below) |
| You're debating "is the agent better than last week?" | yes — nightly soak with n=20 per case to see the variance |
| One-shot exploration, scratch agent | no — the overhead isn't worth it |
| Heavy LLM-as-judge needed (open-ended quality) | partly — the harness covers structural checks; pair it with a custom judge tool for free-text grading |
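For the model-swap row, the comparison is just two runs of the same suite. A minimal sketch, where agent_gpt4o and agent_gpt5 are placeholder agents you'd construct against each provider, and the result fields (results, name, passed) follow the report shape described in step 2 below:

from locus.evaluation import EvalRunner

report_a = EvalRunner(agent=agent_gpt4o).run(cases)
report_b = EvalRunner(agent=agent_gpt5).run(cases)

# Flag every case whose pass/fail verdict changed between providers.
for old, new in zip(report_a.results, report_b.results):
    if old.passed != new.passed:
        print(f"{old.name}: {old.passed} -> {new.passed}")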

Getting started

1. Define cases

EvalCase is a Pydantic model — every field is optional except name and prompt. The runner checks only the fields you set.

from locus.evaluation import EvalCase

books_real = EvalCase(
    name="books_real_flight",
    prompt="Book TK-12 for customer C-42.",
    expected_tools=["book_flight"],
    expected_output_contains=["TK-12", "booked"],
    max_iterations=4,
)

rejects_unknown = EvalCase(
    name="rejects_unknown_flight",
    prompt="Book ZZ-999.",
    expected_output_contains=["not found"],
    expected_output_not_contains=["booked", "confirmed"],
)

2. Run them

from locus.evaluation import EvalRunner

runner = EvalRunner(agent=agent)
report = runner.run([books_real, rejects_unknown])

print(report.summary())
# Eval Report: 2/2 passed (avg score: 1.00)
# Total duration: 4321ms
#   [PASS] books_real_flight (score: 1.00, 1872ms)
#   [PASS] rejects_unknown_flight (score: 1.00, 2449ms)

run() returns an EvalReport — a Pydantic model with per-case results, aggregate pass/fail counts, the average score, and total duration. It's JSON-serialisable, so it drops straight into CI artifacts.
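A minimal sketch of the CI-artifact step, assuming locus is on Pydantic v2 (model_dump_json; on v1 the call would be report.json()):

import pathlib

# Write the report where your CI job collects artifacts.
pathlib.Path("eval_report.json").write_text(report.model_dump_json(indent=2))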

3. Wire it into CI

# tests/test_agent_evals.py
import pytest
from locus.evaluation import EvalRunner

def test_agent_passes_eval_suite(agent):  # agent is a pytest fixture you provide
    report = EvalRunner(agent=agent).run(load_cases())
    failures = [r for r in report.results if not r.passed]
    assert not failures, report.summary()
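load_cases() is yours to define. One possible sketch, assuming the cases live in a JSON file next to the tests — the path and layout are illustrative, not a locus convention:

import json
import pathlib

from locus.evaluation import EvalCase

def load_cases() -> list[EvalCase]:
    # Each entry in the file is a dict of EvalCase fields, e.g.
    # {"name": "weather_lookup", "prompt": "...", "expected_tools": [...]}
    raw = json.loads(pathlib.Path("tests/eval_cases.json").read_text())
    return [EvalCase(**case) for case in raw]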

Built-in checks

Every check runs only when the corresponding field is set on the case. Each check contributes equally to the per-case score.

| Field | Passes when |
| --- | --- |
| expected_tools | All listed tools appear in the run's tool executions. |
| expected_output_contains | Every string is a case-insensitive substring of the final message. |
| expected_output_not_contains | None of the strings appear in the final message. |
| max_iterations | The run finished in ≤ N ReAct turns. |
| max_duration_ms | Wall-clock duration is ≤ N milliseconds. |

A case passes when every check passes; its score is the fraction of checks that passed (handy for partial-credit scoring across a soak).
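For example, a case that sets three checks and passes two of them scores 2/3 ≈ 0.67 and fails overall. A small sketch that uses the score for triage, assuming the result fields (results, name, score) described in step 2:

# Sort cases by score so the weakest ones surface first.
for result in sorted(report.results, key=lambda r: r.score):
    print(f"{result.name}: {result.score:.2f}")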

Tags and filtering

EvalCase(name="...", prompt="...", tags=["smoke", "happy-path"])
EvalCase(name="...", prompt="...", tags=["adversarial"])

# Run only smoke cases on every commit; full suite nightly.
smoke = [c for c in all_cases if "smoke" in c.tags]
runner.run(smoke)

tags is just a list — slice it however your CI matrix expects.
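One way to map tags onto a CI matrix is to pick the suite from an environment variable set per job. EVAL_SUITE is an assumption, not a locus convention; all_cases and runner are from the snippet above:

import os

# Each CI job sets EVAL_SUITE ("smoke" on commits, "adversarial" nightly).
suite = os.environ.get("EVAL_SUITE", "smoke")
selected = [c for c in all_cases if suite in c.tags]
report = runner.run(selected)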

LLM-as-judge for open-ended quality

The built-in checks are structural ("did the right tool fire?", "did the answer mention 'temperature'?"). For free-text quality ("is this answer empathetic?", "is the explanation correct?"), wrap a judge model as a tool and key on its verdict:

from locus.tools.decorator import tool

@tool
def judge(answer: str) -> str:
    """LLM-graded quality verdict (0.0–1.0 plus reasoning)."""
    # judge_model is a separately configured grading agent/model.
    return judge_model.run_sync(f"Grade this answer: {answer}").message

# Then in the case:
EvalCase(
    name="empathetic_response",
    prompt="My order is late and I'm upset.",
    expected_tools=["judge"],
    expected_output_contains=["sorry"],  # at minimum
)

A future locus release may bundle a typed judge directly into EvalCase; for today, this pattern is the path.

Common gotchas

| Symptom | Likely cause |
| --- | --- |
| Case passes locally, fails in CI | Model output varies between runs. Pin the model id, lower the temperature, and run with n=5 to look at the variance (sketch below). |
| max_duration_ms flakes | Cold-start network latency. Set a wall-clock budget at the suite level rather than per case, or double the per-case budget. |
| expected_tools reports a failure even though the tool ran | Tool-name matching is case-sensitive — book_flight != Book_Flight. |
| Score is 0.5 every time | One of two checks is consistently failing. Read result.checks — it carries the full pass/fail map. |
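For the n=5 variance check in the first row, a minimal sketch (result fields per step 2):

from collections import Counter

from locus.evaluation import EvalRunner

N = 5
pass_counts = Counter()
for _ in range(N):
    for result in EvalRunner(agent=agent).run(cases).results:
        pass_counts[result.name] += int(result.passed)

# A case that isn't N/N or 0/N is flaky, not regressed.
for name, passes in pass_counts.items():
    print(f"{name}: passed {passes}/{N}")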

See also

  • Reasoning — reflexion=True and grounding=True reduce the kind of failures you'd otherwise catch only in evals.
  • Termination — max_iterations on EvalCase mirrors MaxIterations on the agent.
  • Hooks — record per-eval traces with a TelemetryHook for offline review.