AI agent evaluation
How to evaluate and gate AI agent output — code, text, documents, decks, spreadsheets, images, and video — against an acceptance policy before it ships.
The problem
AI agents now produce work faster than any team can review it. Most models are aligned to be helpful and agreeable, so an agent tends to approve its own output. At scale that means unreviewed agent work reaches production. AI agent evaluation closes that gap: it grades each output — and the trajectory that produced it — against an explicit acceptance policy, so only work that clears the bar ships.
The four-band acceptance model
OtterScore is a hostile-by-default critic: aligned to find reasons to block, not to flatter. It returns a score (0.0–1.0, where 1.0 = ship and 0.0 = must block; lower means more flawed) and one of four bands, so the decision is policy-bound, not a vibe:
- ship: meets the policy; accept it.
- route_to_fix: close, but send it back with located flaws and concrete upgrades.
- quarantine: hold for review; do not ship yet.
- block: fails the policy; must not reach production.
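The four bands map naturally onto a gate in a shell pipeline. The sketch below is one possible mapping, not part of the OtterScore API: the function name `gate_on_band`, the action strings, and the exit codes are all assumptions chosen for illustration. Only the four band names come from the list above.

```shell
# Hypothetical helper: turn a band into an action string and an exit status.
# The band names come from the four-band model; the action strings and
# exit codes are illustrative assumptions, not part of the API.
gate_on_band() {
  case "$1" in
    ship)         echo "accept";          return 0 ;;
    route_to_fix) echo "return_to_agent"; return 1 ;;
    quarantine)   echo "hold_for_review"; return 2 ;;
    block)        echo "reject";          return 3 ;;
    *)            echo "unknown band: $1" >&2; return 64 ;;
  esac
}
```

In a CI step, `gate_on_band "$band" || exit 1` would let only `ship` through. Treating an unrecognised band as a failure keeps the gate closed by default.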
Evaluate your work in one call
OtterScore is agent-native: an agent can self-onboard with no human in the loop. The canonical machine-readable contract is /llms.txt.
# 1. get a free key (no human in the loop), then export it as OTTER_KEY
curl -s https://api.seaotter.ai/api/v1/agent-keys/signup \
  -H 'Content-Type: application/json' -d '{"email":"you@example.com"}'
# 2. evaluate (async; tolerates the GPU cold-start), then set JOB_ID from the job ID in the response
curl -s https://api.seaotter.ai/api/v1/eval/jobs \
  -H "Authorization: Bearer $OTTER_KEY" -H 'Content-Type: application/json' \
  -d '{"submission":"async","user_prompt":"<what the work was for>",
       "artifact_parts":[{"mime_type":"text/plain","text":"<your work>"}]}'
# 3. poll until completed (warm = seconds; a cold GPU can take a few minutes)
curl -s "https://api.seaotter.ai/api/v1/eval/jobs/$JOB_ID" \
  -H "Authorization: Bearer $OTTER_KEY"
# -> { "status":"completed", "result_summary":{ "band":"ship", "score":0.95 } }
Prefer MCP? Connect the hosted server by URL, no install: https://mcp.seaotter.ai/mcp. Whole-workflow scoring (a topology-aware composite plus per-step critique) is one more endpoint; see /docs/agent-native and /developers.
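The polling request in step 3 is normally run in a loop. The following is a minimal sketch of such a loop, assuming the job response always carries a top-level "status" string as in the sample output. The names `poll_job`, `json_string_field`, and `fetch_job`, the attempt limit, and the delay are hypothetical. Any status other than `completed` is treated as still running, which may not match how the API reports failures. The fetch step is passed in as a command so the loop can be exercised without network access.

```shell
# Pull a quoted string field out of a small JSON body with sed.
# Good enough for a flat sample like the one shown for step 3; use a real
# JSON parser (for example jq) for anything more complex.
json_string_field() {
  printf '%s' "$2" | tr -d '\n' |
    sed -n 's/.*"'"$1"'"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# poll_job FETCH_CMD [MAX_ATTEMPTS] [DELAY_SECONDS]
# Runs FETCH_CMD until its output reports status "completed", then prints
# that output. Returns 1 if the job has not completed after MAX_ATTEMPTS polls.
poll_job() {
  fetch_cmd=$1
  max_attempts=${2:-60}   # assumed default: 60 polls
  delay=${3:-5}           # assumed default: 5 seconds between polls
  attempt=1
  while :; do
    body=$($fetch_cmd) || body=""
    if [ "$(json_string_field status "$body")" = "completed" ]; then
      printf '%s\n' "$body"
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# The real fetch is the step-3 request; OTTER_KEY and JOB_ID must be set.
fetch_job() {
  curl -s "https://api.seaotter.ai/api/v1/eval/jobs/$JOB_ID" \
    -H "Authorization: Bearer $OTTER_KEY"
}
```

A caller could then write `result=$(poll_job fetch_job) && band=$(json_string_field band "$result")` and hand `$band` to whatever gate it uses.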
How it differs from an LLM-as-a-judge
- Adversarial, not agreeable. Trained on agent work that fooled a strong discriminator, then aligned to push back.
- Policy- and rubric-bound. Graded against your acceptance criteria, not a generic notion of quality.
- Output and trajectory. It evaluates how the work was produced, not just the final artifact.
- Multimodal and auditable. One published band across modalities, with a signed verdict for every decision.
Frequently asked questions
What is AI agent evaluation?
AI agent evaluation is the practice of grading an AI agent's output, and the trajectory it took to get there, against an explicit acceptance policy before that work is shipped. It answers the question downstream teams care about: is this good enough to accept, or does it need to be fixed, quarantined, or blocked?
How do you evaluate AI agent output before it ships?
Send the artifact (code, text, a document, a deck, a spreadsheet, an image, or video) plus the task it was for to an evaluation API. OtterScore returns a score from 0.0 to 1.0, a band (ship / route_to_fix / quarantine / block), the specific flaws and where they occur, and concrete upgrades — in one call, with no human in the loop.
What is an AI agent quality gate?
A quality gate is an automated checkpoint between an agent and production that blocks work which fails your acceptance policy and routes it back to be fixed. Instead of relying on a model that is aligned to approve its own output, the gate runs a hostile-by-default critic and records a signed audit verdict for every decision.
Is OtterScore just an LLM-as-a-judge?
No. A general LLM-as-a-judge is aligned to be helpful and agreeable, so it tends to approve. OtterScore is aligned the opposite way — to find reasons to block — and is trained adversarially on agent work that fooled a strong discriminator. It grades against your policy and rubric, evaluates the trajectory as well as the output, and is multimodal.
How much does AI agent evaluation cost?
There is a free tier: an agent can self-onboard to get an API key and a free quota of grades on the hosted critic, with metered usage after that. Per-policy local critics cost roughly £0.001–0.03 per artifact to serve, versus around £0.30 per artifact on a frontier judge.
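To put those per-artifact figures in volume terms, the arithmetic can be sketched as below. The unit prices are the approximate ones quoted in the answer above; the batch size of 10,000 artifacts and the helper name `batch_cost` are assumptions for illustration only.

```shell
# Hypothetical helper: total cost in GBP for COUNT artifacts at UNIT_PRICE each.
batch_cost() {
  LC_ALL=C awk -v n="$1" -v p="$2" 'BEGIN { printf "%.2f\n", n * p }'
}

batch_cost 10000 0.001   # local critic, low end:  10.00
batch_cost 10000 0.03    # local critic, high end: 300.00
batch_cost 10000 0.30    # frontier judge:         3000.00
```

At that assumed volume the frontier judge comes out between 10 and 300 times more expensive than a local critic, depending on where in the quoted range the local critic lands.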
Can it evaluate code, images, and whole workflows?
Yes. Evaluation is multimodal — code, text, documents, decks, spreadsheets, images, and video — and there is a workflow endpoint that scores an end-to-end agent pipeline with a topology-aware composite plus a per-step critique.