Question 1

What is AI agent evaluation?

Accepted Answer

AI agent evaluation is the practice of measuring the quality, safety, and correctness of what an AI agent produces — its outputs and its trajectory (the sequence of tool calls, reasoning, and retrieval steps it took). It can run at developer time (tests in CI), in production (observability and online scoring), or inline as a gate before output ships. The goal is to catch failures, regressions, and policy violations before they reach users.

Question 2

What is OtterScore?

Accepted Answer

OtterScore is SeaOtter's hostile-by-default critic model that grades AI agent output, and its trajectory, against a customer's written acceptance policy. Unlike helpfulness-aligned evaluators, it is aligned via reinforcement learning to look for reasons to block rather than to approve. It is multimodal — covering code, text, documents, decks, spreadsheets, images, and video — and returns a verdict on one published four-band scale: ship, route to fix, quarantine, or block.

Question 3

What is the acceptance layer?

Accepted Answer

The acceptance layer is the release gate between AI agents and production — the place where every output an agent produces is graded against an acceptance policy and either allowed to ship, routed back to be fixed, quarantined, or blocked. It differs from observability, which records and diagnoses agent behavior after the fact; the acceptance layer decides inline whether work is allowed through. SeaOtter is built as the acceptance layer for enterprise agent work.

Question 4

What is a hostile-by-default critic?

Accepted Answer

A hostile-by-default critic is an evaluator deliberately aligned to find flaws and block, rather than to be helpful and agreeable. Standard LLM judges are typically aligned to be helpful, which makes them prone to sycophancy; a hostile-by-default critic is instead rewarded for surfacing reasons an output should not ship. SeaOtter's OtterScore is hostile-by-default, which makes it suited to acceptance decisions that protect production.

Question 5

What is LLM-as-a-judge?

Accepted Answer

LLM-as-a-judge is an evaluation method that uses a large language model to score or compare other models' outputs against a criterion or rubric, returning a rating, a pass/fail, or a preference. It scales evaluation far beyond human review but inherits known biases — position, verbosity, self-preference, authority, and sycophancy — because judge LLMs are typically aligned to be helpful. Mitigating these biases, or using an adversarially aligned critic, is essential for trustworthy verdicts.
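
As an illustration of mitigating one of these biases, the sketch below handles position bias in pairwise comparison by asking the judge twice with the candidates swapped and accepting only a consistent preference. The `Judge` signature and both toy judges are hypothetical stand-ins for a real LLM call, not any particular vendor's API.

```python
from typing import Callable

# Hypothetical judge interface: (criterion, first, second) -> "first" or "second".
# A real implementation would wrap an LLM call here.
Judge = Callable[[str, str, str], str]


def debiased_preference(judge: Judge, criterion: str, a: str, b: str) -> str:
    """Return "a", "b", or "tie" after querying the judge in both orders."""
    first_pass = judge(criterion, a, b)   # a is shown first
    second_pass = judge(criterion, b, a)  # b is shown first
    pick_1 = "a" if first_pass == "first" else "b"
    pick_2 = "b" if second_pass == "first" else "a"
    # Only a preference that survives the swap is trusted.
    return pick_1 if pick_1 == pick_2 else "tie"


def position_biased_judge(criterion: str, first: str, second: str) -> str:
    """Toy judge that always prefers whatever it sees first."""
    return "first"


def concise_judge(criterion: str, first: str, second: str) -> str:
    """Toy judge that prefers the shorter answer regardless of position."""
    return "first" if len(first) <= len(second) else "second"
```

With the position-biased toy judge the two passes disagree, so the result is a tie rather than a spurious win; with the content-based toy judge the same candidate wins in both orders.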

Question 6

What is an AI agent quality gate?

Accepted Answer

An AI agent quality gate is an automated checkpoint that blocks an agent's output or a release from proceeding unless it meets defined quality, safety, or policy thresholds. Gates can run in CI/CD (block a deploy on failing evals) or inline in production (block a response before it reaches a user). The gate turns evaluation scores into an enforced ship/no-ship decision rather than a passive report.
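
A minimal sketch of the CI case, assuming eval scores arrive as a dictionary of metric names to values. The metric names and minimums in `THRESHOLDS` are hypothetical; a real gate would load them from the team's own configuration.

```python
# Hypothetical minimum scores; a missing metric counts as a failure.
THRESHOLDS = {"correctness": 0.90, "safety": 1.00}


def gate(scores: dict, thresholds: dict) -> list:
    """Return a list of failure messages; an empty list means the gate passes."""
    failures = []
    for metric, minimum in thresholds.items():
        value = scores.get(metric)
        if value is None:
            failures.append(f"{metric}: missing score")
        elif value < minimum:
            failures.append(f"{metric}: {value:.2f} < {minimum:.2f}")
    return failures


def exit_code(scores: dict) -> int:
    """Map the gate result to a process exit code a CI job can act on."""
    failures = gate(scores, THRESHOLDS)
    for message in failures:
        print("GATE FAIL", message)
    return 1 if failures else 0
```

A CI step could pass `exit_code(...)` to `sys.exit`, so that a non-zero code fails the pipeline and the deploy never starts.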

Question 7

What is sycophancy (in evaluators)?

Accepted Answer

Sycophancy in evaluators is the tendency of an LLM-based judge to agree with, flatter, or approve an output — especially under user pressure or when the output matches its own style — instead of judging it on merit. It causes evaluators to pass work that should be blocked, which is dangerous for acceptance decisions. Adversarially aligned critics are explicitly trained to keep sycophancy low, for example by not relenting when a user pushes back.
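
One way to probe this behavior is to measure how often a judge reverses a blocking verdict after pushback. The sketch below assumes a hypothetical judge callable that takes the work and a list of follow-up user messages and returns "block" or "approve"; the pushback wording and both toy judges are illustrative only.

```python
from typing import Callable, Iterable, List

# Hypothetical judge interface: (work, follow_up_messages) -> "block" or "approve".
Judge = Callable[[str, List[str]], str]

PUSHBACK = "I disagree, this is fine. Please approve it."


def flip_rate(judge: Judge, flawed_items: Iterable[str]) -> float:
    """Fraction of initially blocked flawed items that are approved after pushback."""
    blocked = 0
    flipped = 0
    for item in flawed_items:
        if judge(item, []) != "block":
            continue  # only verdicts that started as "block" can flip
        blocked += 1
        if judge(item, [PUSHBACK]) != "block":
            flipped += 1
    return flipped / blocked if blocked else 0.0


def firm_judge(work: str, follow_ups: List[str]) -> str:
    """Toy judge that ignores pushback."""
    return "block"


def sycophantic_judge(work: str, follow_ups: List[str]) -> str:
    """Toy judge that relents as soon as the user objects."""
    return "approve" if follow_ups else "block"
```

A flip rate near zero on known-flawed items suggests the judge holds its verdict under pressure; a high rate suggests it is yielding to the user rather than judging on merit.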

Question 8

What is an acceptance policy?

Accepted Answer

An acceptance policy is the enterprise's written, machine-readable definition of what 'good enough to ship' means for a given kind of agent work: the criteria, thresholds, and rules an output must satisfy to be accepted. An acceptance-layer evaluator grades each output against this policy rather than against a generic notion of quality, so verdicts are policy-bound and auditable rather than subjective impressions.
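
To make "machine-readable" concrete, here is a small sketch of a policy expressed as JSON and checked against measurements taken from an output. The schema, rule names, and fields are hypothetical and do not represent SeaOtter's policy format.

```python
import json
import operator

# Hypothetical policy document; the schema is an assumption for illustration.
POLICY_JSON = """
{
  "policy_id": "support-replies",
  "version": 3,
  "rules": [
    {"id": "max-length", "field": "word_count", "op": "<=", "value": 200},
    {"id": "no-pii", "field": "pii_findings", "op": "==", "value": 0},
    {"id": "grounded", "field": "citation_coverage", "op": ">=", "value": 0.8}
  ]
}
"""

OPS = {"<=": operator.le, ">=": operator.ge, "==": operator.eq}


def evaluate(policy: dict, measurements: dict) -> dict:
    """Check measurements against every rule; a missing field counts as a violation."""
    violated = [
        rule["id"]
        for rule in policy["rules"]
        if rule["field"] not in measurements
        or not OPS[rule["op"]](measurements[rule["field"]], rule["value"])
    ]
    return {
        "policy": f'{policy["policy_id"]}@v{policy["version"]}',
        "accepted": not violated,
        "violated_rules": violated,
    }


policy = json.loads(POLICY_JSON)
```

Because the result names the policy version and the specific rules violated, each verdict can be traced back to a written criterion.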

Question 9

What is a rubric (evaluation)?

Accepted Answer

A rubric in evaluation is a structured set of criteria, scoring dimensions, and examples that defines how an output should be judged for a specific task or modality. Rubrics make LLM-as-a-judge and human evaluation consistent and repeatable, and they can be versioned, shared, and forked. In acceptance grading, rubrics ground the critic's verdict in explicit, defensible criteria rather than open-ended opinion.
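
A rubric can be represented as versioned data plus a scoring rule. The sketch below assumes a simple weighted average of per-criterion points; the criterion names, weights, and point scales are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Criterion:
    name: str
    weight: float
    max_points: int


@dataclass(frozen=True)
class Rubric:
    name: str
    version: str
    criteria: Tuple[Criterion, ...]

    def score(self, points: Dict[str, int]) -> float:
        """Weighted average of awarded points, normalized to the range 0..1."""
        total_weight = sum(c.weight for c in self.criteria)
        weighted = 0.0
        for c in self.criteria:
            awarded = points[c.name]  # KeyError if a criterion was not graded
            if not 0 <= awarded <= c.max_points:
                raise ValueError(f"{c.name}: {awarded} outside 0..{c.max_points}")
            weighted += c.weight * awarded / c.max_points
        return weighted / total_weight


# Hypothetical rubric for a written summary.
summary_rubric = Rubric(
    name="summary-quality",
    version="1.0.0",
    criteria=(
        Criterion("accuracy", weight=3.0, max_points=4),
        Criterion("clarity", weight=1.0, max_points=4),
    ),
)
```

Keeping the rubric immutable and versioned means a stored score can always be tied to the exact criteria that produced it.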

Question 10

What is trajectory evaluation?

Accepted Answer

Trajectory evaluation scores an agent's entire execution path — the tools it selected, the order of calls, intermediate reasoning, retrieval steps, and conversation turns — rather than only its final answer. It matters because an agent can reach a correct result through a wasteful, risky, or unsafe sequence, and step-level mistakes are invisible to output-only checks. Trajectory evaluation usually relies on full traces captured via instrumentation such as OpenTelemetry.
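
The sketch below shows a few step-level checks that an output-only check cannot perform: a step budget, forbidden tools, ordering constraints, and repeated identical calls. The `Step` shape and the tool names are hypothetical; in practice the steps would be reconstructed from captured traces.

```python
import json
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple


@dataclass
class Step:
    tool: str
    args: dict


def evaluate_trajectory(
    steps: List[Step],
    forbidden: Set[str],
    must_precede: Iterable[Tuple[str, str]],
    max_steps: int,
) -> List[str]:
    """Return a list of findings; an empty list means no step-level issue was found."""
    findings = []
    tools = [s.tool for s in steps]
    if len(steps) > max_steps:
        findings.append(f"too many steps: {len(steps)} > {max_steps}")
    for tool in tools:
        if tool in forbidden:
            findings.append(f"forbidden tool used: {tool}")
    for earlier, later in must_precede:
        # The first call to `later` must come after some call to `earlier`.
        if later in tools and (
            earlier not in tools or tools.index(earlier) > tools.index(later)
        ):
            findings.append(f"{later} called without prior {earlier}")
    seen = set()
    for s in steps:
        key = (s.tool, json.dumps(s.args, sort_keys=True))
        if key in seen:
            findings.append(f"repeated call: {s.tool}")
        seen.add(key)
    return findings
```

For example, a refund issued before the order was looked up produces a finding even if the final message to the customer is correct.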

Question 11

What is the four-band acceptance model (ship / route to fix / quarantine / block)?

Accepted Answer

The four-band acceptance model is SeaOtter's published scale for what happens to a graded agent output: ship (accept and release), route to fix (send back for revision), quarantine (hold pending review or named human approval), or block (reject). Replacing a single quality score with four explicit dispositions turns evaluation into a concrete, policy-bound decision about each piece of work, and each band is recorded as audit evidence.
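
The four dispositions map naturally onto an enumeration. In the sketch below, the rule that picks a band from finding severities is an assumption made for illustration (the worst finding wins); it is not SeaOtter's actual decision logic.

```python
from enum import Enum
from typing import Iterable


class Band(Enum):
    SHIP = "ship"
    ROUTE_TO_FIX = "route to fix"
    QUARANTINE = "quarantine"
    BLOCK = "block"


# Bands ordered from least to most restrictive.
ORDER = [Band.SHIP, Band.ROUTE_TO_FIX, Band.QUARANTINE, Band.BLOCK]

# Hypothetical mapping from a finding's severity to the band it requires.
SEVERITY_TO_BAND = {
    "minor": Band.ROUTE_TO_FIX,
    "major": Band.QUARANTINE,
    "critical": Band.BLOCK,
}


def disposition(finding_severities: Iterable[str]) -> Band:
    """Pick the most restrictive band required by any finding; no findings means ship."""
    band = Band.SHIP
    for severity in finding_severities:
        candidate = SEVERITY_TO_BAND[severity]
        if ORDER.index(candidate) > ORDER.index(band):
            band = candidate
    return band


def audit_record(output_id: str, finding_severities: list) -> dict:
    """Build the record that would be stored as evidence for this decision."""
    band = disposition(finding_severities)
    return {"output_id": output_id, "band": band.value, "findings": list(finding_severities)}
```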

Question 12

What is an agent-native API?

Accepted Answer

An agent-native API is an evaluation or acceptance endpoint designed to be called inline by an AI agent or workflow — sending the work it produced (in any format) and receiving a graded verdict programmatically — rather than only being run offline as a test harness. It lets the gate operate at machine speed inside the agent loop, so outputs are evaluated and acted on as they are generated. SeaOtter's eval API is agent-native.
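
The loop below sketches what calling a gate inline might look like. The `grade` callable stands in for a network call to an acceptance endpoint; its verdict shape (`band`, `reasons`) and the retry behavior are assumptions, not SeaOtter's documented API.

```python
from typing import Callable, Optional

TERMINAL_BANDS = {"quarantine", "block"}


def run_with_gate(
    produce: Callable[[Optional[list]], str],
    grade: Callable[[str], dict],
    max_attempts: int = 3,
) -> dict:
    """Produce work, grade it inline, and revise while the verdict is 'route to fix'."""
    feedback = None
    for attempt in range(1, max_attempts + 1):
        work = produce(feedback)
        verdict = grade(work)
        band = verdict["band"]
        if band == "ship":
            return {"status": "shipped", "work": work, "attempts": attempt}
        if band in TERMINAL_BANDS:
            return {"status": band, "work": None, "attempts": attempt}
        if band != "route to fix":
            raise ValueError(f"unknown band: {band!r}")
        feedback = verdict.get("reasons", [])
    # Assumption: work that never passes is held for human review.
    return {"status": "quarantine", "work": None, "attempts": max_attempts}


def toy_agent(feedback: Optional[list]) -> str:
    """Toy agent that adds a policy reference once it receives feedback."""
    if not feedback:
        return "Refund approved."
    return "Refund approved per policy section 4.2."


def toy_grader(work: str) -> dict:
    """Toy grader that requires a policy reference."""
    if "policy section" in work:
        return {"band": "ship", "reasons": []}
    return {"band": "route to fix", "reasons": ["cite the refund policy"]}
```

Here the first draft is routed back with a reason, the agent revises, and the second draft ships, all inside the agent loop with no offline test run involved.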

Question 13

What is a signed audit log / tamper-evident evidence?

Accepted Answer

A signed audit log is a record of every evaluation verdict and action, cryptographically signed (for example, HMAC-chained) so that any later alteration is detectable, which makes the evidence tamper-evident. It lets enterprises prove to auditors, compliance teams, and regulators exactly what was accepted, routed, quarantined, or blocked, and why. In SeaOtter, every acceptance verdict is captured as signed audit evidence routable to SIEM/GRC systems.
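
A minimal sketch of HMAC chaining using Python's standard library: each record's signature covers both its own content and the previous record's signature, so changing any earlier entry invalidates everything after it. The record layout and the all-zero genesis value are assumptions for illustration, not SeaOtter's log format.

```python
import hashlib
import hmac
import json

GENESIS = "0" * 64  # placeholder "previous signature" for the first record


def _sign(key: bytes, prev_sig: str, entry: dict) -> str:
    """HMAC-SHA256 over the previous signature plus a canonical JSON encoding."""
    payload = json.dumps(entry, sort_keys=True, separators=(",", ":")).encode()
    return hmac.new(key, prev_sig.encode() + payload, hashlib.sha256).hexdigest()


def append(log: list, key: bytes, entry: dict) -> None:
    """Append an entry, chaining it to the signature of the previous record."""
    prev = log[-1]["sig"] if log else GENESIS
    log.append({"entry": entry, "prev": prev, "sig": _sign(key, prev, entry)})


def verify(log: list, key: bytes) -> bool:
    """Recompute the chain from the start; any edited or reordered record fails."""
    prev = GENESIS
    for record in log:
        expected = _sign(key, prev, record["entry"])
        if record["prev"] != prev or not hmac.compare_digest(record["sig"], expected):
            return False
        prev = record["sig"]
    return True
```

Note that a chain like this detects edits and reordering but not removal of the most recent records; detecting truncation would need an additional mechanism, such as periodically publishing the latest signature elsewhere.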

Question 14

What is generative engine optimization (GEO)?

Accepted Answer

Generative engine optimization (GEO) is the practice of structuring and writing content so that AI answer engines — ChatGPT search, Perplexity, Google AI Overviews, Gemini, and Claude — cite it as a source when generating answers. It differs from traditional SEO, which optimizes for ranking in a list of links; GEO optimizes for being quoted inside a synthesized AI answer. Tactics shown to help include leading with the answer, adding statistics, citing sources, and using clear definitional language.

Question 15

What is an adversarial data engine?

Accepted Answer

An adversarial data engine is a GAN-style training-data pipeline in which a generator produces or mines flawed work designed to fool a strong discriminator (a frontier model, or a trained critic), and only the cases the discriminator misses are kept as training data — the easy, caught examples are discarded. Keeping just the fail-set yields a hard, compounding corpus that teaches a critic to catch flaws frontier models currently miss. SeaOtter uses an adversarial data engine to build the corpus behind OtterScore.
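
The filtering step at the core of this pipeline can be sketched in a few lines. The candidate format (work paired with a known-flawed label) and the toy discriminator are hypothetical; a real pipeline would call a frontier model or trained critic and would need a reliable source for the labels.

```python
from typing import Callable, Iterable, List, Tuple


def mine_hard_cases(
    candidates: Iterable[Tuple[str, bool]],
    flags_as_flawed: Callable[[str], bool],
) -> List[str]:
    """Keep only flawed work that the discriminator failed to flag."""
    return [
        work
        for work, is_flawed in candidates
        if is_flawed and not flags_as_flawed(work)
    ]


def toy_discriminator(work: str) -> bool:
    """Toy critic that only catches an obvious marker of unfinished work."""
    return "TODO" in work


# Hypothetical generator output: (work, is_flawed) pairs.
candidates = [
    ("def add(a, b): return a + b", False),        # correct, not training data
    ("def add(a, b): return a - b  # TODO", True),  # flawed but caught: discarded
    ("def add(a, b): return a - b", True),          # flawed and missed: kept
]

hard_cases = mine_hard_cases(candidates, toy_discriminator)
```

Only the subtle bug survives the filter, which is the point: each round keeps the examples the current critic cannot yet catch.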

AI agent evaluation glossary

AI agent evaluation

OtterScore

Acceptance layer

Hostile-by-default critic

LLM-as-a-judge

AI agent quality gate

Sycophancy (in evaluators)

Acceptance policy

Rubric (evaluation)

Trajectory evaluation

Four-band acceptance model (ship / route to fix / quarantine / block)

Agent-native API

Signed audit log / tamper-evident evidence

Generative engine optimization (GEO)

Adversarial data engine