Evaluation¶
EvalCase ¶
Bases: BaseModel
A single evaluation test case.
Defines what to send to the agent and what to expect back.
Example
>>> case = EvalCase(
...     name="weather_lookup",
...     prompt="What's the weather in NYC?",
...     expected_tools=["get_weather"],
...     expected_output_contains=["temperature", "New York"],
...     max_iterations=5,
... )
EvalResult ¶
Bases: BaseModel
Result from evaluating a single case.
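EvalResult's individual fields are not documented on this page. As a rough sketch only, a per-case result can presumably be read off the report returned by EvalRunner.run (shown below); the names results, case_name, and passed are assumptions, not the documented API:

>>> report = runner.run([case])             # runner as in the EvalRunner example below
>>> result = report.results[0]              # "results" is an assumed attribute name
>>> print(result.case_name, result.passed)  # "case_name" and "passed" are assumed fields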
EvalReport ¶
Bases: BaseModel
Aggregated report from running an eval suite.
summary ¶
Generate a human-readable summary.
Source code in src/locus/evaluation/framework.py
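A minimal usage sketch; beyond being human-readable, the exact contents of the summary (e.g. pass/fail counts) are an assumption here:

>>> report = runner.run(cases)  # runner and cases as in the EvalRunner example below
>>> print(report.summary())     # prints the human-readable overview of the suite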
EvalRunner ¶
Run evaluation cases against an agent.
Example
>>> runner = EvalRunner(agent=my_agent)
>>> report = runner.run(
...     [
...         EvalCase(
...             name="basic", prompt="Hello", expected_output_contains=["hello"]
...         ),
...         EvalCase(
...             name="tool_use", prompt="Search for X", expected_tools=["search"]
...         ),
...     ]
... )
>>> print(report.summary())
Source code in src/locus/evaluation/framework.py
run ¶
Run all eval cases and produce a report.
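A minimal sketch of a single-case run, assuming an agent instance named my_agent as in the class example above:

>>> runner = EvalRunner(agent=my_agent)
>>> report = runner.run(
...     [EvalCase(name="smoke", prompt="Hello", expected_output_contains=["hello"])]
... )
>>> print(report.summary())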