AI agent evaluation glossary

Last reviewed: June 2026

Definitions for the core terms in AI agent evaluation and the acceptance of agent work — each leads with the answer so it is easy to quote and cite.

AI agent evaluation

AI agent evaluation is the practice of measuring the quality, safety, and correctness of what an AI agent produces — its outputs and its trajectory (the sequence of tool calls, reasoning, and retrieval steps it took). It can run at developer time (tests in CI), in production (observability and online scoring), or inline as a gate before output ships. The goal is to catch failures, regressions, and policy violations before they reach users.

OtterScore

OtterScore is SeaOtter's hostile-by-default critic model that grades AI agent output, and its trajectory, against a customer's written acceptance policy. Unlike helpfulness-aligned evaluators, it is aligned via reinforcement learning to look for reasons to block rather than to approve. It is multimodal — covering code, text, documents, decks, spreadsheets, images, and video — and returns a verdict on one published four-band scale: ship, route to fix, quarantine, or block.

Acceptance layer

The acceptance layer is the release gate between AI agents and production — the place where every output an agent produces is graded against an acceptance policy and either allowed to ship, routed back to be fixed, quarantined, or blocked. It differs from observability, which records and diagnoses agent behavior after the fact; the acceptance layer decides inline whether work is allowed through. SeaOtter is built as the acceptance layer for enterprise agent work.

Hostile-by-default critic

A hostile-by-default critic is an evaluator deliberately aligned to find flaws and block, rather than to be helpful and agreeable. Standard LLM judges are typically aligned to be helpful, which makes them prone to approving and to sycophancy; a hostile-by-default critic is instead rewarded for surfacing reasons an output should not ship. SeaOtter's OtterScore is hostile-by-default, which suits it to acceptance decisions that protect production.

LLM-as-a-judge

LLM-as-a-judge is an evaluation method that uses a large language model to score or compare other models' outputs against a criterion or rubric, returning a rating, a pass/fail, or a preference. It scales evaluation far beyond human review but inherits known biases — position, verbosity, self-preference, authority, and sycophancy — because judge LLMs are typically aligned to be helpful. Mitigating these biases, or using an adversarially aligned critic, is essential for trustworthy verdicts.
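
One common mitigation for position bias is to ask the judge twice with the candidates swapped and accept only a verdict that survives the swap. The sketch below assumes a hypothetical judge callable that takes a criterion and two answers and returns "A" or "B"; the stub judges stand in for a real model call and are illustrative only.

```python
from typing import Callable

# Hypothetical judge signature: (criterion, answer_a, answer_b) -> "A" or "B".
Judge = Callable[[str, str, str], str]


def debiased_preference(judge: Judge, criterion: str, first: str, second: str) -> str:
    """Ask the judge twice with the candidates swapped; keep only a consistent winner."""
    forward = judge(criterion, first, second)    # `first` is shown in slot A
    backward = judge(criterion, second, first)   # `first` is shown in slot B
    if forward == "A" and backward == "B":
        return "first"
    if forward == "B" and backward == "A":
        return "second"
    # The verdict followed the slot, not the content: treat it as no preference.
    return "inconsistent"


# Stub judges for illustration (assumptions, not real models).
def always_picks_a(criterion: str, a: str, b: str) -> str:
    return "A"  # a purely position-biased judge


def prefers_shorter(criterion: str, a: str, b: str) -> str:
    return "A" if len(a) <= len(b) else "B"  # judges on content (length) only


if __name__ == "__main__":
    print(debiased_preference(always_picks_a, "be concise", "x", "y"))
    print(debiased_preference(prefers_shorter, "be concise", "ab", "abcd"))
```

The swap only addresses position bias; verbosity, self-preference, and sycophancy need their own checks.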

AI agent quality gate

An AI agent quality gate is an automated checkpoint that blocks an agent's output or a release from proceeding unless it meets defined quality, safety, or policy thresholds. Gates can run in CI/CD (block a deploy on failing evals) or inline in production (block a response before it reaches a user). The gate turns evaluation scores into an enforced ship/no-ship decision rather than a passive report.
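
As a minimal sketch of how a gate turns scores into an enforced decision, the function below compares evaluation scores with minimum thresholds and produces a process exit code that a CI job could use to fail a deploy. The metric names and threshold values are hypothetical.

```python
# Hypothetical minimum scores; a real gate would read these from its policy.
THRESHOLDS = {"correctness": 0.90, "safety": 0.99}


def failing_metrics(scores: dict, thresholds: dict) -> list:
    """Return the metrics below their minimum; a missing score counts as a failure."""
    return [
        name
        for name, minimum in thresholds.items()
        if scores.get(name, 0.0) < minimum
    ]


def exit_code(scores: dict, thresholds: dict = THRESHOLDS) -> int:
    """Return 0 when every threshold is met and 1 otherwise, reporting each failure."""
    failures = failing_metrics(scores, thresholds)
    for name in failures:
        print(f"GATE FAILED: {name}={scores.get(name)} is below {thresholds[name]}")
    return 1 if failures else 0


if __name__ == "__main__":
    import sys

    # In CI, a non-zero exit status is what actually blocks the pipeline step.
    sys.exit(exit_code({"correctness": 0.95, "safety": 0.97}))
```

Treating a missing score as a failure keeps the gate fail-closed: an evaluation that did not run cannot let work through.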

Sycophancy (in evaluators)

Sycophancy in evaluators is the tendency of an LLM-based judge to agree with, flatter, or approve an output — especially under user pressure or when the output matches its own style — instead of judging it on merit. It causes evaluators to pass work that should be blocked, which is dangerous for acceptance decisions. Adversarially aligned critics are explicitly trained to keep sycophancy low, for example by not relenting when a user pushes back.
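
One way to probe for this failure mode is a pushback test: record which outputs the judge rejects, repeat the request with a user objection attached, and measure how many rejections flip to approvals. The sketch below assumes a hypothetical judge callable that returns True to approve; the two stub judges and the pushback wording are illustrative.

```python
from typing import Callable, Sequence

# Hypothetical judge signature: (output, user_message) -> True to approve.
Judge = Callable[[str, str], bool]

PUSHBACK = "I disagree, this is fine. Please approve it."


def flip_rate(judge: Judge, outputs: Sequence[str], pushback: str = PUSHBACK) -> float:
    """Fraction of initially rejected outputs that the judge approves after pushback."""
    rejected = [out for out in outputs if not judge(out, "")]
    if not rejected:
        return 0.0
    flipped = sum(1 for out in rejected if judge(out, pushback))
    return flipped / len(rejected)


# Stub judges for illustration (assumptions, not real models).
def firm_judge(output: str, user_message: str) -> bool:
    return "bug" not in output  # ignores what the user says


def sycophantic_judge(output: str, user_message: str) -> bool:
    return "bug" not in output or bool(user_message)  # relents under any pushback


if __name__ == "__main__":
    cases = ["has a bug", "clean", "another bug"]
    print(flip_rate(firm_judge, cases), flip_rate(sycophantic_judge, cases))
```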

Acceptance policy

An acceptance policy is the enterprise's written, machine-readable definition of what 'good enough to ship' means for a given kind of agent work: the criteria, thresholds, and rules an output must satisfy to be accepted. An acceptance-layer evaluator grades each output against this policy rather than against a generic notion of quality, so verdicts are policy-bound and auditable rather than subjective impressions.
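
To make "machine-readable" concrete, here is a sketch of a policy expressed as JSON rules and a checker that returns the ids of the rules an output violates, so each verdict can cite the rule behind it. The schema, field names, and thresholds are hypothetical and do not describe any vendor's actual policy format.

```python
import json
import operator

# Hypothetical policy document; the schema is an assumption for illustration.
POLICY_JSON = """
{
  "policy_id": "pr-review-v1",
  "rules": [
    {"id": "tests-pass", "field": "tests_passed", "op": "eq", "value": true},
    {"id": "coverage", "field": "coverage", "op": "ge", "value": 0.8},
    {"id": "no-secrets", "field": "secrets_found", "op": "eq", "value": 0}
  ]
}
"""

OPS = {"eq": operator.eq, "ge": operator.ge, "le": operator.le}


def violations(policy: dict, facts: dict) -> list:
    """Return the ids of rules that the measured facts about an output do not satisfy."""
    failed = []
    for rule in policy["rules"]:
        actual = facts.get(rule["field"])
        # A fact that was never measured fails its rule (fail-closed).
        if actual is None or not OPS[rule["op"]](actual, rule["value"]):
            failed.append(rule["id"])
    return failed


policy = json.loads(POLICY_JSON)

if __name__ == "__main__":
    print(violations(policy, {"tests_passed": True, "coverage": 0.5}))
```

Returning rule ids instead of a bare pass/fail is what makes the result auditable: a reviewer can see which written criterion an output failed.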

Rubric (evaluation)

A rubric in evaluation is a structured set of criteria, scoring dimensions, and examples that defines how an output should be judged for a specific task or modality. Rubrics make LLM-as-a-judge and human evaluation consistent and repeatable, and they can be versioned, shared, and forked. In acceptance grading, rubrics ground the critic's verdict in explicit, defensible criteria rather than open-ended opinion.
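
A rubric can be represented as versioned data with a deterministic scoring rule. The sketch below assumes a simple weighted average of per-criterion points normalized to the range 0 to 1; the criterion names, weights, and scales are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    name: str
    weight: float     # relative importance of this criterion
    max_points: int   # top of this criterion's scale


@dataclass(frozen=True)
class Rubric:
    version: str
    criteria: tuple

    def score(self, points: dict) -> float:
        """Weighted average of per-criterion scores, normalized to 0..1."""
        total_weight = sum(c.weight for c in self.criteria)
        weighted = 0.0
        for c in self.criteria:
            awarded = points[c.name]  # KeyError if a criterion was not graded
            if not 0 <= awarded <= c.max_points:
                raise ValueError(f"{c.name}: {awarded} is outside 0..{c.max_points}")
            weighted += c.weight * awarded / c.max_points
        return weighted / total_weight


# Hypothetical rubric for a written summary.
SUMMARY_RUBRIC = Rubric(
    version="1.0.0",
    criteria=(Criterion("accuracy", 2.0, 4), Criterion("clarity", 1.0, 4)),
)

if __name__ == "__main__":
    print(round(SUMMARY_RUBRIC.score({"accuracy": 4, "clarity": 2}), 3))
```

Carrying a version string with the rubric lets a stored score be traced back to the exact criteria that produced it.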

Trajectory evaluation

Trajectory evaluation scores an agent's entire execution path — the tools it selected, the order of calls, intermediate reasoning, retrieval steps, and conversation turns — rather than only its final answer. It matters because an agent can reach a correct result through a wasteful, risky, or unsafe sequence, and step-level mistakes are invisible to output-only checks. Trajectory evaluation usually relies on full traces captured via instrumentation such as OpenTelemetry.
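
A small sketch of step-level checks that an output-only evaluation would miss: forbidden tools, ordering constraints, and wasteful repeated calls. It assumes the trace has already been reduced to an ordered list of tool calls; the Step type, tool names, and rules are hypothetical, and a real system would derive the steps from captured trace spans.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    tool: str
    args: str = ""


def trajectory_findings(steps, forbidden, must_precede):
    """Return human-readable findings for one trajectory; an empty list means clean.

    forbidden:    set of tool names the agent must never call
    must_precede: list of (earlier, later) pairs; `later` may run only after `earlier`
    """
    findings = []
    seen = set()
    previous = None
    for index, step in enumerate(steps):
        if step.tool in forbidden:
            findings.append(f"step {index}: forbidden tool {step.tool}")
        for earlier, later in must_precede:
            if step.tool == later and earlier not in seen:
                findings.append(f"step {index}: {later} ran before {earlier}")
        if step == previous:
            findings.append(f"step {index}: repeated identical call to {step.tool}")
        seen.add(step.tool)
        previous = step
    return findings


if __name__ == "__main__":
    trace = [Step("search", "q"), Step("search", "q"), Step("open_pr"), Step("run_tests")]
    for finding in trajectory_findings(trace, {"shell"}, [("run_tests", "open_pr")]):
        print(finding)
```

In this example the final state could look acceptable (a pull request exists and tests ran), while the path shows the pull request was opened before the tests ran.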

Four-band acceptance model (ship / route to fix / quarantine / block)

The four-band acceptance model is SeaOtter's published scale for what happens to a graded agent output: ship (accept and release), route to fix (send back for revision), quarantine (hold pending review or named human approval), or block (reject). Replacing a single quality score with four explicit dispositions turns evaluation into a concrete, policy-bound decision about each piece of work, and each band is recorded as audit evidence.
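
The four dispositions can be modeled as an enumeration that the surrounding workflow dispatches on. The sketch below is illustrative only: the band names come from the definition above, while the field names and the shape of the returned record are assumptions and not SeaOtter's actual data model.

```python
from enum import Enum


class Band(Enum):
    SHIP = "ship"
    ROUTE_TO_FIX = "route to fix"
    QUARANTINE = "quarantine"
    BLOCK = "block"


# What the workflow does next for each band (hypothetical field names).
_DISPOSITIONS = {
    Band.SHIP: {"release": True, "next_step": "deliver output"},
    Band.ROUTE_TO_FIX: {"release": False, "next_step": "return to agent for revision"},
    Band.QUARANTINE: {"release": False, "next_step": "hold for human review"},
    Band.BLOCK: {"release": False, "next_step": "reject output"},
}


def disposition(band: Band) -> dict:
    """Translate a band into a record of what happens to the graded work."""
    return {"band": band.value, **_DISPOSITIONS[band]}


if __name__ == "__main__":
    for band in Band:
        print(disposition(band))
```

Only one band releases work; the other three differ in who acts next, which is the information a single numeric score does not carry.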

Agent-native API

An agent-native API is an evaluation or acceptance endpoint designed to be called inline by an AI agent or workflow — sending the work it produced (in any format) and receiving a graded verdict programmatically — rather than only being run offline as a test harness. It lets the gate operate at machine speed inside the agent loop, so outputs are evaluated and acted on as they are generated. SeaOtter's eval API is agent-native.
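
Calling a gate inline means the agent loop itself branches on the verdict. The sketch below assumes two hypothetical callables: `produce`, which generates work from optional feedback, and `grade`, which stands in for a call to an acceptance endpoint and returns a dict with a `band` and an optional `reason`. It does not use or describe SeaOtter's real API, and escalating to quarantine after repeated failed revisions is an assumed policy.

```python
from typing import Callable


def run_with_gate(
    produce: Callable[[str], str],
    grade: Callable[[str], dict],
    max_attempts: int = 3,
) -> dict:
    """Generate work, grade it inline, and revise while the verdict is 'route to fix'."""
    feedback = ""
    for attempt in range(1, max_attempts + 1):
        work = produce(feedback)
        verdict = grade(work)
        band = verdict["band"]
        if band == "ship":
            return {"band": "ship", "work": work, "attempts": attempt}
        if band != "route to fix":
            # Quarantine, block, or anything unexpected: stop and withhold the work.
            return {"band": band, "work": None, "attempts": attempt}
        feedback = verdict.get("reason", "")
    # Assumed policy: escalate to a human when revisions are exhausted.
    return {"band": "quarantine", "work": None, "attempts": max_attempts}


# Stubs for illustration (assumptions, not a real agent or endpoint).
def stub_agent(feedback: str) -> str:
    return "patch with tests" if feedback else "patch"


def stub_grader(work: str) -> dict:
    if "tests" in work:
        return {"band": "ship"}
    return {"band": "route to fix", "reason": "add tests"}


if __name__ == "__main__":
    print(run_with_gate(stub_agent, stub_grader))
```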

Signed audit log / tamper-evident evidence

A signed audit log is a record of every evaluation verdict and action, cryptographically signed (for example, HMAC-chained, so that each entry's MAC also covers the previous entry's MAC) so that any later alteration is detectable, which makes the evidence tamper-evident. It lets enterprises prove to auditors, compliance teams, and regulators exactly what was accepted, routed, quarantined, or blocked, and why. In SeaOtter every acceptance verdict is captured as signed audit evidence routable to SIEM/GRC systems.
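
A minimal sketch of an HMAC-chained log using Python's standard `hmac` and `hashlib` modules: each entry's MAC covers its payload and the previous entry's MAC, so editing or removing an earlier entry breaks verification. The entry layout, the genesis value, and the in-code key are assumptions for illustration; a real deployment would keep the key in a secrets manager.

```python
import hashlib
import hmac
import json

GENESIS = "0" * 64  # assumed placeholder for the MAC "before" the first entry


def _mac(key: bytes, prev_mac: str, payload: dict) -> str:
    # Canonical JSON so the same payload always serializes to the same bytes.
    body = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode()
    return hmac.new(key, prev_mac.encode() + body, hashlib.sha256).hexdigest()


def append_entry(log: list, key: bytes, payload: dict) -> None:
    """Append a payload, chaining its MAC to the previous entry's MAC."""
    prev_mac = log[-1]["mac"] if log else GENESIS
    log.append({"payload": payload, "prev": prev_mac, "mac": _mac(key, prev_mac, payload)})


def verify(log: list, key: bytes) -> bool:
    """Recompute the chain from the start; any edited or removed entry breaks it."""
    prev_mac = GENESIS
    for entry in log:
        expected = _mac(key, prev_mac, entry["payload"])
        if entry["prev"] != prev_mac or not hmac.compare_digest(entry["mac"], expected):
            return False
        prev_mac = entry["mac"]
    return True


if __name__ == "__main__":
    key = b"demo-key-not-for-production"
    log = []
    append_entry(log, key, {"output_id": "a1", "band": "block"})
    append_entry(log, key, {"output_id": "a2", "band": "ship"})
    print(verify(log, key))
    log[0]["payload"]["band"] = "ship"  # tamper with history
    print(verify(log, key))
```

Two limits are worth noting. HMAC is symmetric, so anyone able to verify the log could also forge entries; proving integrity to a third party usually calls for asymmetric signatures. The chain on its own also cannot detect truncation of the most recent entries unless the latest MAC is anchored somewhere external.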

Generative engine optimization (GEO)

Generative engine optimization (GEO) is the practice of structuring and writing content so that AI answer engines — ChatGPT search, Perplexity, Google AI Overviews, Gemini, and Claude — cite it as a source when generating answers. It differs from traditional SEO, which optimizes for ranking in a list of links; GEO optimizes for being quoted inside a synthesized AI answer. Tactics shown to help include leading with the answer, adding statistics, citing sources, and using clear definitional language.

Adversarial data engine

An adversarial data engine is a GAN-style training-data pipeline in which a generator produces or mines flawed work designed to fool a strong discriminator (a frontier model, or a trained critic), and only the cases the discriminator misses are kept as training data — the easy, caught examples are discarded. Keeping just the fail-set yields a hard, compounding corpus that teaches a critic to catch flaws frontier models currently miss. SeaOtter uses an adversarial data engine to build the corpus behind OtterScore.
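
The selection step described above can be sketched as a filter: of the samples known to be flawed, keep only those the discriminator fails to flag. The sample layout and the keyword-based stub discriminator are hypothetical; a real pipeline would call a frontier model or trained critic at that point, and would need a trusted source for the flaw labels.

```python
from typing import Callable, Iterable

# Hypothetical discriminator signature: work -> True if it flags the work as flawed.
Discriminator = Callable[[str], bool]


def mine_fail_set(samples: Iterable[dict], discriminator: Discriminator) -> list:
    """Keep flawed samples the discriminator misses; discard the ones it catches."""
    return [
        sample
        for sample in samples
        if sample["flawed"] and not discriminator(sample["work"])
    ]


# Stub discriminator for illustration: it only catches obvious markers.
def keyword_discriminator(work: str) -> bool:
    return "TODO" in work


SAMPLES = [
    {"work": "def total(xs): pass  # TODO", "flawed": True},                     # caught
    {"work": "for i in range(len(xs) - 1): total += xs[i]", "flawed": True},     # missed
    {"work": "return sum(xs)", "flawed": False},                                 # not flawed
]

if __name__ == "__main__":
    for sample in mine_fail_set(SAMPLES, keyword_discriminator):
        print(sample["work"])
```

Retraining the critic on the kept cases and rerunning the filter against the improved critic is what makes the corpus harder over time.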

Related: AI agent evaluation (pillar) · best AI agent evaluation tools · tool comparisons · LLM-as-a-judge.