
Automatic AI agent validation

Wire a hostile-by-default critic into your coding agent’s end-of-task hook, so it grades the work it just produced and blocks the finish until it clears the bar — automatically, every task.

The one command

AI agents already have an evaluation API available, but not the habit of using it. The installer makes external validation automatic. Pick your harness:

curl -fsSL https://seaotter.ai/install.sh | sh -s -- claude
curl -fsSL https://seaotter.ai/install.sh | sh -s -- codex
curl -fsSL https://seaotter.ai/install.sh | sh -s -- openclaw
curl -fsSL https://seaotter.ai/install.sh | sh -s -- cursor
curl -fsSL https://seaotter.ai/install.sh | sh -s -- hermes
curl -fsSL https://seaotter.ai/install.sh | sh -s -- git

The installer is stdlib-only (it needs just python3 and curl) and idempotent, so re-running it is safe. Prefer a package? The same functionality ships as a CLI: run pip install agent-eval-kit, then agent-eval init claude and agent-eval validate.

What it installs

For each harness, one command wires three things:

  • the MCP otter_score tool, so the agent can grade on demand;
  • an end-of-task hook that runs the validator and blocks the finish until the verdict clears the bar;
  • a standing-instruction block in your AGENTS.md / CLAUDE.md / SOUL.md that makes “validate before done” the rule.
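
As an illustration of what idempotent wiring of the end-of-task hook could look like for Claude Code, the sketch below merges a Stop hook entry into a settings dictionary only if it is not already present. The nested hooks / Stop / command layout follows Claude Code's settings.json hook format; the helper name add_stop_hook and the exact command string are assumptions for illustration, not necessarily what install.sh writes.

```python
import json


def add_stop_hook(settings: dict, command: str) -> dict:
    """Add a Stop hook that runs `command`, unless one is already registered.

    Mutates and returns `settings`. Hypothetical helper, not part of the installer.
    """
    groups = settings.setdefault("hooks", {}).setdefault("Stop", [])
    already_wired = any(
        hook.get("command") == command
        for group in groups
        for hook in group.get("hooks", [])
    )
    if not already_wired:
        groups.append({"hooks": [{"type": "command", "command": command}]})
    return settings


# Assumed command; the validator path is the one quoted elsewhere on this page.
VALIDATE_CMD = "python3 ~/.otter/validate.py"

settings = add_stop_hook({}, VALIDATE_CMD)
settings = add_stop_hook(settings, VALIDATE_CMD)  # second run changes nothing
print(json.dumps(settings, indent=2))
```

Running the helper twice leaves a single Stop entry, which is the property that makes re-running an installer safe.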

How each harness hooks in

| Harness | What gets wired |
| --- | --- |
| claude (Claude Code) | Stop hook + MCP otter_score + CLAUDE.md rule |
| codex (OpenAI Codex) | MCP + AGENTS.md (optional blocking Stop hook) |
| openclaw (OpenClaw) | agent_end plugin + SOUL.md |
| cursor (Cursor) | MCP + always-on rule |
| hermes (Hermes / OpenAI-compatible) | tools spec + system-prompt fragment |
| git (any harness) | git pre-push gate |

What the hook does

At the end of a task the hook grades the work (the git diff, named files, or the harness’s own payload) with OtterScore — a critic aligned to find reasons to block, not to flatter — and returns one of four bands:

  • ship — meets the policy; the agent finishes.
  • route_to_fix — close; the located flaws are fed back so the agent fixes them and re-validates.
  • quarantine — hold for review.
  • block — fails the policy; must not ship.

The hook fails open on an infrastructure error (a hiccup never wedges the agent) and fails closed on a real verdict. It is loop-safe (it releases after a few stuck attempts on the same diff) and grades against your acceptance policy via OTTER_POLICY_ID. You can also grade anything by hand: python3 ~/.otter/validate.py --files report.pdf.
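
The fail-open / fail-closed behaviour can be sketched as follows. This is a minimal illustration, not the validator's actual code: the may_finish helper and the grade callable are hypothetical, and treating every band other than ship as "finish withheld" is an assumption made for the sketch.

```python
from enum import Enum
from typing import Callable, Tuple


class Band(Enum):
    SHIP = "ship"
    ROUTE_TO_FIX = "route_to_fix"
    QUARANTINE = "quarantine"
    BLOCK = "block"


def may_finish(grade: Callable[[], str]) -> Tuple[bool, str]:
    """Return (allowed, reason). `grade` is a hypothetical call to the critic."""
    try:
        raw = grade()
    except OSError as exc:
        # Infrastructure error (network down, timeout, ...): fail open.
        return True, f"fail-open: validator unreachable ({exc})"
    band = Band(raw)  # a real verdict: fail closed on anything but ship
    if band is Band.SHIP:
        return True, "ship: meets the policy"
    return False, f"{band.value}: finish withheld"
```

Only an exception from the grading call opens the gate; once a verdict arrives, the band alone decides the outcome.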

Why automatic beats on-demand

Connecting a critic once is not the point — validating every task is. A model aligned to be helpful tends to approve its own output, so unreviewed agent work reaches production. Putting the hostile critic in the loop, at the natural end-of-task checkpoint, is what turns “seek external validation” from a thing an agent might remember into a thing that always happens. The canonical machine-readable contract is /llms.txt.

Frequently asked questions

How do I make my AI agent validate its work automatically?

Run one command: curl -fsSL https://seaotter.ai/install.sh | sh -s -- claude (or codex, openclaw, cursor, hermes, git). It wires SeaOtter's OtterScore critic into your harness's end-of-task hook, so the work the agent produced is graded and the finish is blocked until it clears the bar — you never have to remember to validate.

What is a validation hook for an AI coding agent?

A validation hook is a command your agent harness runs at the end of a task. SeaOtter installs one that grades the git diff (or any artifact) with a hostile critic and returns exit 0 to ship or exit 2 with the located flaws to block — which Claude Code, Codex, and OpenClaw feed back into the model so it fixes the flaws and re-validates.
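
A minimal sketch of that exit-code contract is shown below. The shape of the verdict dictionary (the band and flaws keys, and the file, line, and reason fields) is hypothetical; only the exit codes 0 and 2 come from the description above.

```python
import sys

EXIT_SHIP = 0   # the harness lets the agent finish
EXIT_BLOCK = 2  # the harness feeds stderr back to the model


def hook_exit(verdict: dict) -> int:
    """Translate a verdict into an exit code, reporting flaws on stderr."""
    if verdict.get("band") == "ship":
        return EXIT_SHIP
    for flaw in verdict.get("flaws", []):
        # One located flaw per line so the agent can act on each.
        print(f"{flaw['file']}:{flaw['line']}: {flaw['reason']}", file=sys.stderr)
    return EXIT_BLOCK
```

A hook script built on this would end with sys.exit(hook_exit(verdict)), leaving the harness to decide what to do with the code and the stderr text.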

Which agent harnesses are supported?

Claude Code (Stop hook), OpenAI Codex (AGENTS.md + MCP, optional blocking Stop hook), OpenClaw (agent_end plugin), Cursor (MCP + rule), Hermes and any OpenAI-compatible model (a published tools spec + system-prompt fragment), and any harness at all through a git pre-push gate.

Does it block bad work or just warn?

It blocks. On a real verdict below your bar the hook exits non-zero and feeds the located flaws back into the agent loop, so the agent must fix and re-validate before it can finish. It fails open on infrastructure errors (so a hiccup never wedges the agent) and is loop-safe (it releases after a few stuck attempts on the same diff).
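
One way loop safety could work is to fingerprint the diff and count consecutive blocked attempts against the same fingerprint. The sketch below assumes a limit of three attempts and keeps state in memory; the LoopGuard class is hypothetical, and a real hook would need to persist the counter between runs.

```python
import hashlib

MAX_STUCK_ATTEMPTS = 3  # assumption: the page only says "a few"


class LoopGuard:
    """Track repeated blocked attempts on an unchanged diff."""

    def __init__(self, limit: int = MAX_STUCK_ATTEMPTS) -> None:
        self.limit = limit
        self.last_digest = None
        self.attempts = 0

    def should_release(self, diff: str) -> bool:
        """Call once per blocked attempt; True means stop blocking."""
        digest = hashlib.sha256(diff.encode("utf-8")).hexdigest()
        if digest == self.last_digest:
            self.attempts += 1
        else:
            # The agent changed something, so start counting again.
            self.last_digest = digest
            self.attempts = 1
        return self.attempts >= self.limit
```

Resetting the counter whenever the diff changes means an agent that is making progress keeps being held to the bar, while one that is stuck is eventually let go.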

Do I need to install a package?

No. The installer and the validator are a single stdlib Python script downloaded by the one-line command, so it needs only python3 and curl. There is also an agent-eval CLI (pip install agent-eval-kit) with the same init and validate commands.

Can it grade images, decks, spreadsheets, and other files?

Yes — validation is multimodal. Point it at any artifact (python3 ~/.otter/validate.py --files report.pdf) and it grades code, text, images, decks, spreadsheets, documents, audio, and video against your acceptance policy.
