Automatic AI agent validation
Wire a hostile-by-default critic into your coding agent’s end-of-task hook, so it grades the work it just produced and blocks the finish until it clears the bar — automatically, every task.
The one command
AI agents already have an evaluation API available, but not the habit of using it. The installer below makes external validation automatic. Pick your harness:
curl -fsSL https://seaotter.ai/install.sh | sh -s -- claude
curl -fsSL https://seaotter.ai/install.sh | sh -s -- codex
curl -fsSL https://seaotter.ai/install.sh | sh -s -- openclaw
curl -fsSL https://seaotter.ai/install.sh | sh -s -- cursor
curl -fsSL https://seaotter.ai/install.sh | sh -s -- hermes
curl -fsSL https://seaotter.ai/install.sh | sh -s -- git
The installer is stdlib-only (python3 + curl) and idempotent. Prefer a package? The same functionality ships as a CLI: run `pip install agent-eval-kit`, then `agent-eval init claude` or `agent-eval validate`.
What it installs
For each harness, one command wires three things:
- the MCP `otter_score` tool, so the agent can grade on demand;
- an end-of-task hook that runs the validator and blocks the finish until the verdict clears the bar;
- a standing-instruction block in your `AGENTS.md` / `CLAUDE.md` / `SOUL.md` that makes “validate before done” the rule.
How each harness hooks in
| Harness | What gets wired |
|---|---|
| claude (Claude Code) | Stop hook + MCP otter_score + CLAUDE.md rule |
| codex (OpenAI Codex) | MCP + AGENTS.md (optional blocking Stop hook) |
| openclaw (OpenClaw) | agent_end plugin + SOUL.md |
| cursor (Cursor) | MCP + always-on rule |
| hermes (Hermes / OpenAI-compatible) | tools spec + system-prompt fragment |
| git (any harness) | git pre-push gate |
What the hook does
At the end of a task the hook grades the work (the git diff, named files, or the harness’s own payload) with OtterScore — a critic aligned to find reasons to block, not to flatter — and returns one of four bands:
- ship: meets the policy; the agent finishes.
- route_to_fix: close; the located flaws are fed back so the agent fixes them and re-validates.
- quarantine: hold for review.
- block: fails the policy; must not ship.
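The four bands can be read as a small decision table. The sketch below is an illustration only: the `Band` enum and `exit_code_for` are hypothetical names, not part of SeaOtter's validator. The exit codes 0 (ship) and 2 (block) are the ones this page describes. Treating `quarantine` as a held finish is an assumption.

```python
from enum import Enum


class Band(Enum):
    """The four verdict bands named on this page."""
    SHIP = "ship"
    ROUTE_TO_FIX = "route_to_fix"
    QUARANTINE = "quarantine"
    BLOCK = "block"


EXIT_SHIP = 0   # the agent may finish
EXIT_BLOCK = 2  # the finish is held


def exit_code_for(band: Band) -> int:
    """Map a band to a hook exit code.

    Assumption: only `ship` releases the agent; every other band,
    including `quarantine`, holds the finish.
    """
    return EXIT_SHIP if band is Band.SHIP else EXIT_BLOCK


if __name__ == "__main__":
    for band in Band:
        print(f"{band.value}: exit {exit_code_for(band)}")
```

Under this reading, a hook script would end by passing `exit_code_for(band)` to `sys.exit`, so the harness sees a single pass/hold signal regardless of which non-ship band was returned.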
The hook fails open on an infrastructure error (so a hiccup never wedges the agent) and fails closed on a real verdict. It is loop-safe and grades against your acceptance policy via OTTER_POLICY_ID. You can also grade anything by hand: python3 ~/.otter/validate.py --files report.pdf.
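The difference between failing open and failing closed can be sketched in a few lines. This is a minimal illustration, not the installed validator: `InfrastructureError`, `run_hook`, and the `grade` callable are hypothetical names, and the band is assumed to arrive as a plain string.

```python
import sys
from typing import Callable


class InfrastructureError(Exception):
    """Hypothetical: the grader could not be reached or timed out."""


def run_hook(grade: Callable[[], str]) -> int:
    """Return the hook's exit code for one end-of-task check.

    `grade` is a stand-in for whatever call produces the verdict band.
    """
    try:
        band = grade()
    except InfrastructureError as err:
        # Fail open: an outage must not wedge the agent.
        print(f"validator unavailable, letting the task finish: {err}",
              file=sys.stderr)
        return 0
    if band == "ship":
        return 0
    # Fail closed: a real verdict below the bar holds the finish.
    print(f"verdict: {band}", file=sys.stderr)
    return 2
```

The design choice is that only a real verdict can hold the agent. Any other unexpected exception would still propagate here; a production hook would have to decide how to treat those.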
Why automatic beats on-demand
Connecting a critic once is not the point — validating every task is. A model aligned to be helpful tends to approve its own output, so unreviewed agent work reaches production. Putting the hostile critic in the loop, at the natural end-of-task checkpoint, is what turns “seek external validation” from a thing an agent might remember into a thing that always happens. The canonical machine-readable contract is /llms.txt.
Frequently asked questions
How do I make my AI agent validate its work automatically?
Run one command: curl -fsSL https://seaotter.ai/install.sh | sh -s -- claude (or codex, openclaw, cursor, hermes, git). It wires SeaOtter's OtterScore critic into your harness's end-of-task hook, so the work the agent produced is graded and the finish is blocked until it clears the bar — you never have to remember to validate.
What is a validation hook for an AI coding agent?
A validation hook is a command your agent harness runs at the end of a task. SeaOtter installs one that grades the git diff (or any artifact) with a hostile critic and returns exit 0 to ship or exit 2 with the located flaws to block — which Claude Code, Codex, and OpenClaw feed back into the model so it fixes the flaws and re-validates.
Which agent harnesses are supported?
Claude Code (Stop hook), OpenAI Codex (AGENTS.md + MCP, optional blocking Stop hook), OpenClaw (agent_end plugin), Cursor (MCP + rule), Hermes and any OpenAI-compatible model (a published tools spec + system-prompt fragment), and any harness at all through a git pre-push gate.
Does it block bad work or just warn?
It blocks. On a real verdict below your bar the hook exits non-zero and feeds the located flaws back into the agent loop, so the agent must fix and re-validate before it can finish. It fails open on infrastructure errors (so a hiccup never wedges the agent) and is loop-safe (it releases after a few stuck attempts on the same diff).
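One way to picture loop safety is a counter keyed by a hash of the diff. The sketch below assumes that approach; the page does not say how the real hook detects a stuck loop. `diff_key`, `should_release`, and the limit of 3 are hypothetical, chosen to stand in for "a few stuck attempts".

```python
import hashlib
from typing import Dict

MAX_ATTEMPTS = 3  # assumption: the page only says "a few"


def diff_key(diff_text: str) -> str:
    """Identify a diff by its SHA-256 digest."""
    return hashlib.sha256(diff_text.encode("utf-8")).hexdigest()


def should_release(attempts: Dict[str, int], diff_text: str,
                   max_attempts: int = MAX_ATTEMPTS) -> bool:
    """Record one blocked attempt and report whether to stop blocking.

    Returns True once the same diff has been blocked more than
    `max_attempts` times, so the agent is not held forever.
    """
    key = diff_key(diff_text)
    attempts[key] = attempts.get(key, 0) + 1
    return attempts[key] > max_attempts
```

A changed diff hashes to a new key, so the count starts again whenever the agent actually modifies its work. In a real hook the `attempts` mapping would need to persist between invocations, for example in a small state file.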
Do I need to install a package?
No. The installer and the validator are a single stdlib Python script downloaded by the one-line command — it needs only python3 and curl. There is also an `agent-eval` CLI (pip install agent-eval-kit) with the same `init` and `validate` commands.
Can it grade images, decks, spreadsheets, and other files?
Yes — validation is multimodal. Point it at any artifact (python3 ~/.otter/validate.py --files report.pdf) and it grades code, text, images, decks, spreadsheets, documents, audio, and video against your acceptance policy.