LLM-as-a-judge

Why the evaluator that grades your AI agents should be hostile-by-default, not helpful — and how to run one.

The sycophancy problem

Using one LLM to judge another’s output is now the default way to evaluate AI at scale. The trouble is the model you reach for. Frontier models are aligned to be helpful and agreeable — so when you ask “is this good enough to ship?” they lean toward yes. That is exactly the wrong reflex for an evaluator. A judge that agrees under pressure doesn’t catch flaws; it manufactures false confidence and lets bad agent work through.

This is the part teams get wrong: they point a friendly model at the work, get reassuring scores, and wonder why production still breaks. A grader is only useful if it is aligned to disagree.

Invert the alignment

The fix is an evaluator aligned the opposite way from the agent that produced the work. OtterScore is hostile-by-default: it is trained and prompted to look for reasons to block, it grades against your acceptance policy, and it judges the trajectory (how the work was produced) as well as the final artifact. Instead of a thumbs-up, it returns:

a score (0.0–1.0, where 1.0 = ship and 0.0 = must block);
a band — ship / route_to_fix / quarantine / block;
the located flaws (with where they occur) and concrete upgrades.
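
One way an agent or CI job might act on the band is to map it to an exit code. The sketch below is an assumption layered on top of the four band names above: the function name band_to_exit and the exit-code policy are hypothetical, not part of the OtterScore contract, and the per-band comments describe one plausible reading of each band.

```shell
#!/bin/sh
# Hypothetical helper: turn a band string into a process exit code.
# The numeric codes are an illustrative policy, not documented behavior.
band_to_exit() {
  case "$1" in
    ship)         return 0 ;;   # accept the work
    route_to_fix) return 1 ;;   # assumed: send back to the agent with the flaws
    quarantine)   return 2 ;;   # assumed: hold the work aside
    block)        return 3 ;;   # reject outright
    *)            return 64 ;;  # unrecognized band: fail closed
  esac
}
```

Unknown bands fail closed, which fits the hostile-by-default stance: anything the gate does not recognize is treated as not shippable.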

Run it over HTTP or MCP

It is agent-native — an agent can self-onboard and iterate to a passing band with no human in the loop. Canonical contract: /llms.txt.

# grade your work (async — tolerates a cold GPU)
curl -s https://api.seaotter.ai/api/v1/eval/jobs \
  -H "Authorization: Bearer $OTTER_KEY" -H 'Content-Type: application/json' \
  -d '{"submission":"async","user_prompt":"<what the work was for>",
       "artifact_parts":[{"mime_type":"text/plain","text":"<your work>"}]}'
# -> { "job_id":"...", "status":"queued" }; poll GET /api/v1/eval/jobs/{id}
#    -> { "status":"completed", "result_summary":{ "band":"route_to_fix", "score":0.4 } }
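
The polling step in the comments above could be wrapped in a small loop. This is a minimal sketch, not an official client: fetch_job, json_field, and poll_job are hypothetical names, it assumes the GET endpoint takes the same bearer token as the POST, and it treats any status other than "completed" as "keep waiting" because the full set of status values is not listed here. The sed-based field extraction is deliberately naive and only handles flat, single-line responses like the examples above; a real client would use a proper JSON parser.

```shell
#!/bin/sh
# Fetch one job by id (assumption: same Authorization header as the POST).
fetch_job() {
  curl -s "https://api.seaotter.ai/api/v1/eval/jobs/$1" \
    -H "Authorization: Bearer $OTTER_KEY"
}

# Naive field extraction from single-line JSON on stdin.
# $1 = field name; prints the string or number value that follows it.
json_field() {
  sed -n 's/.*"'"$1"'"[[:space:]]*:[[:space:]]*"\{0,1\}\([^",} ]*\).*/\1/p'
}

# Poll until the job reports "completed" or the attempts run out.
# $1 = job id, $2 = max attempts, $3 = seconds to sleep between attempts.
# Prints the final response body on success; returns 1 on timeout or fetch error.
poll_job() {
  attempt=0
  while [ "$attempt" -lt "$2" ]; do
    body=$(fetch_job "$1") || return 1
    status=$(printf '%s\n' "$body" | json_field status)
    if [ "$status" = "completed" ]; then
      printf '%s\n' "$body"
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$3"
  done
  return 1
}
```

For example, piping the output of poll_job "$job_id" 30 2 into json_field band would print the band once the job completes, and print nothing if it does not complete within 30 attempts.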

Prefer MCP? Connect the hosted server by URL, no install: https://mcp.seaotter.ai/mcp.

How it stays honest

An adversarial critic is only as good as its training data, and the only data worth training on is the work that fools a strong discriminator. OtterScore is trained on a hard corpus built by keeping the cases a strong critic misses and discarding the easy ones, so it pushes back where a friendly judge waves things through. Promotion of a new critic model is gated on a low sycophancy rate under adversarial probing.
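
As a rough illustration of that filtering idea (not OtterScore's actual pipeline, which is not described here), the sketch below keeps only the cases where a critic passed work that was in fact flawed. The tab-separated layout (case id, true label, critic verdict), the label values "flawed" and "pass", and the function name keep_hard_cases are all assumptions made for the example.

```shell
#!/bin/sh
# Hypothetical hard-case filter.
# stdin: one case per line as  id <TAB> true_label <TAB> critic_verdict
# stdout: ids of cases the critic missed (truly flawed, yet passed).
keep_hard_cases() {
  awk -F '\t' '$2 == "flawed" && $3 == "pass" { print $1 }'
}
```

Cases the critic already catches, and cases that were fine to begin with, are dropped; only the misses survive into the corpus.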

Frequently asked questions

What is an LLM-as-a-judge?

An LLM-as-a-judge is a large language model used to score or grade another model's output, instead of (or before) a human reviewer. You give it the work and a rubric or set of criteria, and it returns a judgment. It's the standard pattern for evaluating AI agent output at scale.

Why do general LLM judges rubber-stamp?

Frontier models are aligned to be helpful and agreeable, so when asked 'is this good?' they lean toward yes. That sycophancy is fine for a chat assistant and fatal for an evaluator: a judge that agrees under pressure manufactures false confidence and lets flawed work through. The reward signal it was trained on is the opposite of the one an acceptance gate needs.

What's the alternative to a friendly LLM judge?

An evaluator aligned the opposite way: hostile-by-default (trained and prompted to find reasons to block, not to flatter), grading against your explicit acceptance policy, and evaluating the trajectory (how the work was produced) as well as the final artifact. It returns a score, a band (ship / route_to_fix / quarantine / block), located flaws, and concrete fixes.

Can I run an adversarial critic over an API?

Yes. OtterScore is a hostile-by-default critic you call over HTTP or a hosted MCP server. An agent can self-onboard for a free key, grade its own work, read the flaws, and iterate to a passing band with no human in the loop.
