
Evaluation

Treat the agent like any other piece of code. Declare cases, run them, read the report. Locus ships a small, dependency-free harness so you don't need an external eval framework for the common cases.

  • EvalCase declares a prompt plus the substrings the output must contain (positive) or must not contain (negative).
  • EvalRunner runs the agent against every case.
  • EvalReport summarises pass/fail counts and an average score.
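To make the three roles above concrete, here is a minimal, dependency-free sketch of substring-based scoring. It is not the Locus implementation: `SketchCase`, `score_case`, `run_cases`, and `fake_agent` are hypothetical names, and both the case-insensitive matching and the "fraction of checks satisfied" score are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class SketchCase:
    """Hypothetical stand-in for an eval case; not the Locus EvalCase class."""

    name: str
    prompt: str
    must_contain: list[str] = field(default_factory=list)
    must_not_contain: list[str] = field(default_factory=list)


def score_case(case: SketchCase, output: str) -> float:
    """Return the fraction of substring checks the output satisfies."""
    text = output.lower()  # assumption: matching is case-insensitive
    checks = [s.lower() in text for s in case.must_contain]
    checks += [s.lower() not in text for s in case.must_not_contain]
    if not checks:
        return 1.0  # a case with no checks trivially passes
    return sum(checks) / len(checks)


def run_cases(agent: Callable[[str], str], cases: list[SketchCase]) -> dict:
    """Run every case through the agent and aggregate pass/fail counts."""
    scores = [score_case(c, agent(c.prompt)) for c in cases]
    passed = sum(1 for s in scores if s == 1.0)
    return {
        "total": len(scores),
        "passed": passed,
        "failed": len(scores) - passed,
        "avg_score": sum(scores) / len(scores) if scores else 0.0,
    }


def fake_agent(prompt: str) -> str:
    """Canned answers standing in for a real model; the math answer is wrong on purpose."""
    answers = {
        "What is the capital of France?": "The capital is Paris.",
        "What is 15 * 7?": "It is 100.",
    }
    return answers.get(prompt, "")


cases = [
    SketchCase("basic_knowledge", "What is the capital of France?", must_contain=["paris"]),
    SketchCase("math", "What is 15 * 7?", must_contain=["105"]),
    SketchCase(
        "no_hallucination",
        "What is the capital of France?",
        must_not_contain=["berlin", "london"],
    ),
]

report = run_cases(fake_agent, cases)
print(report)
```

With the canned agent, two of the three cases pass and the math case fails, so the sketch reports one failure and an average score of two thirds. The real harness may weight or score cases differently.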

Run it (OCI Generative AI is the default; auto-detected from ~/.oci/config):

python examples/notebook_55_evaluation.py

Offline, with no credentials (uses the mock provider):

LOCUS_MODEL_PROVIDER=mock python examples/notebook_55_evaluation.py

Source

# Copyright (c) 2025, 2026 Oracle and/or its affiliates.
# Licensed under the Universal Permissive License v1.0 as shown at
# https://oss.oracle.com/licenses/upl/
"""Notebook 55: Evaluation. Score an agent against a test suite.

Treat the agent like any other piece of code. Declare cases, run them,
read the report. Locus ships a small, dependency-free harness so you
don't need an external eval framework for the common cases.

- EvalCase declares a prompt plus the substrings the output must contain
  (positive) or must not contain (negative).
- EvalRunner runs the agent against every case.
- EvalReport summarises pass/fail counts and an average score.

Run it
    # Default: OCI Generative AI auto-detected from ~/.oci/config
    python examples/notebook_55_evaluation.py

    # Offline / no credentials:
    LOCUS_MODEL_PROVIDER=mock python examples/notebook_55_evaluation.py
"""

from config import get_model

from locus.agent import Agent, AgentConfig
from locus.evaluation import EvalCase, EvalRunner


def example_evaluation():
    """Run a systematic evaluation of an agent."""
    print("=== Agent Evaluation ===\n")

    model = get_model()

    agent = Agent(
        config=AgentConfig(
            system_prompt="You are a helpful assistant. Answer concisely.",
            max_iterations=3,
            model=model,
        )
    )

    cases = [
        EvalCase(
            name="basic_knowledge",
            prompt="What is the capital of France?",
            expected_output_contains=["paris"],
            max_iterations=3,
        ),
        EvalCase(
            name="math",
            prompt="What is 15 * 7?",
            expected_output_contains=["105"],
        ),
        # Negative check: the answer must not name another country's capital.
        EvalCase(
            name="no_hallucination",
            prompt="What is the capital of France?",
            expected_output_not_contains=["berlin", "london"],
        ),
    ]

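    # Run every case against the agent and collect the results in one report.
    runner = EvalRunner(agent=agent)
    report = runner.run(cases)

    # Human-readable summary first, then the raw counts and the average score.
    print(report.summary())
    print(f"\nTotal: {report.total_cases}, Passed: {report.passed}, Failed: {report.failed}")
    print(f"Average score: {report.avg_score:.2f}")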


if __name__ == "__main__":
    example_evaluation()
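
Since the point is to treat the agent like any other piece of code, one natural follow-up is to turn the report into a pass/fail signal for CI. The sketch below is an illustration, not part of Locus: `gate` and `min_avg_score` are hypothetical names, and it relies only on the `failed` and `avg_score` attributes that the example above prints. A `SimpleNamespace` stands in for a real report so the snippet runs without credentials.

```python
from types import SimpleNamespace


def gate(report, min_avg_score: float = 0.9) -> int:
    """Return a process exit code: 0 if the report clears the bar, 1 otherwise."""
    if report.failed > 0 or report.avg_score < min_avg_score:
        return 1
    return 0


# Stand-in object exposing only the two attributes the gate reads.
demo_report = SimpleNamespace(failed=1, avg_score=0.67)
exit_code = gate(demo_report)
print(f"exit code: {exit_code}")
```

In a real script the returned value could be passed to `sys.exit` after `runner.run(cases)`, so that a failing case fails the build. The threshold is a design choice: a strict gate fails on any failed case, while a looser one could check only the average score.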