GSAR — typed grounding

Imagine an incident-response agent that pulls three log lines, two metric points, and one alert. Every fact it cites is real. But the synthesis says "the outage was caused by a config push," and the evidence only supports "a config push happened thirty seconds before the outage." Vanilla Agent(grounding=True) — a single LLM-as-judge scalar over the answer as a whole — often misses this. Each claim is grounded; the conclusion over-reaches.

GSAR (Grounding-Stratified Adaptive Replanning, from Kamelhar 2026) is the upgrade. It breaks the synthesis into claims, partitions them four ways, scores the partition with per-evidence-type weights, and picks one of three responses: proceed if the synthesis holds up, regenerate if the evidence is fine but the wording is loose (rewrites without re-running tools), or replan if the evidence itself is missing or contradicted. The math is small (one equation), the integration is one Pydantic type, and six monotonicity / adversarial-robustness properties are formally provable.

Use GSAR for high-stakes pipelines — operational incidents, regulated diagnostics, anything where the "evidence fine, conclusion loose" failure mode is a real cost. Use vanilla grounding=True for everything else; binary verdicts are cheaper and good enough most of the time.

What it adds

| | Vanilla grounding | GSAR |
| --- | --- | --- |
| Output | is_grounded ∈ {true, false} + scalar s ∈ [0, 1] | Four-way partition G ⊔ U ⊔ X ⊔ K, scalar S, abstain channel |
| Evidence weighting | Uniform | Per-type weights w: T → [0, 1] (tool_match weighted higher than inference) |
| Recovery | Binary {stop, replan} | Three-tier {proceed, regenerate, replan} — middle tier rewrites the synthesis without re-running expensive tools |
| Adversarial robustness | Score inflates if a contradicted claim is silently dropped | Asymmetric contradiction penalty ρ keeps X in the denominator |
| Budget | Implicit | Explicit K_max replan budget, degraded flag on exhaustion |

The score

For a claim partition — G grounded, U ungrounded, X contradicted, K complementary — an evidence-type weight map w, and a contradiction penalty ρ ∈ [0, 1], the GSAR score is:

$$ S = \frac{W(\mathcal G) + W(\mathcal K)}{W(\mathcal G) + W(\mathcal U) + \rho \cdot W(\mathcal X) + W(\mathcal K)} $$

where W(P) = Σ_{c ∈ P} w(type(c)). The empty partition gives 0/0, so S is defined there as 0.5 (epistemic indifference). The score lives in [0, 1]; six monotonicity and adversarial-robustness properties are proven in Appendix A of the paper and locked under unit tests in tests/unit/test_gsar.py.
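
The equation transcribes directly. Here is a minimal standalone sketch of it, for intuition only; the names (gsar_score, the weights dict, the list-of-type-names cells) are illustrative, not the library API, which you reach through GSAREvaluator:

# Illustrative sketch of the GSAR score equation, not the library's code.
# Each partition cell is a list of evidence-type names, one entry per claim.
def gsar_score(
    grounded: list[str],
    ungrounded: list[str],
    contradicted: list[str],
    complementary: list[str],
    weights: dict[str, float],
    rho: float = 0.5,
) -> float:
    def W(cell: list[str]) -> float:
        return sum(weights[t] for t in cell)

    numerator = W(grounded) + W(complementary)
    denominator = W(grounded) + W(ungrounded) + rho * W(contradicted) + W(complementary)
    # Empty partition is 0/0; the paper defines S = 0.5 there.
    return 0.5 if denominator == 0 else numerator / denominator

Note that ρ scales W(X) in the denominator only: a contradicted claim never adds support, it only dilutes whatever support exists, and more harshly as ρ approaches 1.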

The decision

δ(s) = proceed      if s ≥ τ_proceed
       regenerate   if τ_regenerate ≤ s < τ_proceed
       replan       if s < τ_regenerate

The reference thresholds are τ_proceed = 0.80, τ_regenerate = 0.65 (Appendix B); the paper recommends per-deployment recalibration on a small (100–200) human-graded held-out set. The regenerate tier is the critical middle band — it rewrites the synthesis without re-dispatching the specialists, catching the "evidence is fine, synthesis is loose" mode that dominates real production logs.
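
A direct transcription of δ with the Appendix-B defaults baked in (a sketch with illustrative names; in the real wiring, GSARThresholds carries these values):

def decide(s: float, tau_proceed: float = 0.80, tau_regenerate: float = 0.65) -> str:
    """Illustrative transcription of δ; the evaluator applies this internally."""
    if s >= tau_proceed:
        return "proceed"        # ship the synthesis as-is
    if s >= tau_regenerate:
        return "regenerate"     # rewrite the wording, keep the evidence
    return "replan"             # evidence gap: re-dispatch specialists

With the defaults, decide(0.72) returns "regenerate": the evidence mostly holds, so only the wording is redone.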

Wiring

from locus.models.native.openai import OpenAIModel
from locus.reasoning.gsar import GSARThresholds
from locus.reasoning.gsar_evaluator import GSAREvaluator
from locus.reasoning.gsar_judge import JudgeOutput, StructuredOutputGSARJudge

judge = StructuredOutputGSARJudge(
    model=OpenAIModel(model="gpt-4o-mini", max_tokens=2048),
)

async def regenerate(synthesis: str, judge_output: JudgeOutput) -> str:
    """Cheap branch: rewrite synthesis from existing evidence."""
    ...

async def replan(synthesis: str, evidence: str, jo: JudgeOutput) -> tuple[str, str]:
    """Expensive branch: revise plan, re-dispatch specialists."""
    ...

evaluator = GSAREvaluator(
    judge=judge,
    regenerate_fn=regenerate,
    replan_fn=replan,
    thresholds=GSARThresholds(),     # Appendix-B defaults
    contradiction_penalty=0.5,        # ρ
    k_max=2,                          # bounded replan budget
)

result = await evaluator.evaluate(
    report_synthesis=initial_report,
    evidence_corpus=evidence,
)
# result.final_report, result.final_score, result.final_decision,
# result.trajectory  (every iteration logged for audit)
# result.degraded    (True when the budget exhausted without proceed)

The evaluator runs Algorithm 1 from the paper to convergence (δ = proceed) or budget exhaustion (degraded = True, returning a "degraded but honest" report rather than looping indefinitely or silently shipping un-grounded claims).
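
The internals aren't shown here, but the control flow that paragraph describes reduces to a short loop. A hedged sketch, with illustrative names throughout, including the assumed jo.score field (the real entry point is GSAREvaluator.evaluate, which additionally logs every iteration into result.trajectory):

async def gsar_loop(synthesis, evidence, judge, decide, regenerate, replan, k_max=2):
    """Illustrative control flow only, not the library's internals."""
    replans = 0
    while True:
        jo = await judge(synthesis, evidence)            # partition + score
        decision = decide(jo.score)
        if decision == "proceed":
            return synthesis, False                      # (report, degraded=False)
        if decision == "regenerate":
            synthesis = await regenerate(synthesis, jo)  # cheap: no tool re-runs
        elif replans < k_max:
            replans += 1
            synthesis, evidence = await replan(synthesis, evidence, jo)
        else:
            return synthesis, True                       # degraded but honest

A production loop would also bound regenerations; the sketch leaves that out for brevity.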

Evidence taxonomy

The default EvidenceType enum mirrors the paper's reference instantiation (Appendix B). Tool-side annotations populate it; in production you'd map your tool taxonomy onto these.

| Evidence type | When to use it | Default weight |
| --- | --- | --- |
| tool_match | Claim directly traceable to a tool output row | 1.00 |
| specific_data | Cites a structured field of a step output | 0.95 |
| signal_match | References the originating alert / signal | 0.90 |
| complementary_finding | Non-redundant alternative perspective | 0.85 |
| synthesis | Cross-specialist combination | 0.80 |
| neg_evidence | Absence-of-signal observation | 0.70 |
| inference | Model-internal inference, no tool support | 0.60 |
| domain | Textbook / runbook fact | 0.60 |
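
To see the weights bite, run the score by hand. Take a synthesis with three grounded tool_match claims and one ungrounded inference claim (nothing contradicted, nothing complementary). With the default weights, S = 3.00 / (3.00 + 0.60) ≈ 0.83, which clears τ_proceed = 0.80. Under uniform weights the same partition scores 3/4 = 0.75 and falls into the regenerate band: a weakly supported leftover claim costs less than a missing tool-backed one.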

When to use GSAR vs vanilla grounding=True

  • Vanilla is right for most tasks. Binary verdict, one scalar, cheap. If you're not in a regulated / safety-critical setting, start here.
  • GSAR is right when (a) the cost of a wrong "ship it" decision outweighs the extra LLM judge call, (b) you want auditable per-claim evidence-type provenance in your checkpoint stream, or (c) your synthesis layer can plausibly be looser than the underlying evidence (the regenerate tier earns its keep here).

Source and tests