SeaOtter docs

Guides for evaluating and gating AI agent output with OtterScore, a hostile-by-default critic that grades agent work (code, text, documents, decks, spreadsheets, images, and video) against your acceptance policy before it ships. Start with the pillar guide, then jump to your modality or to the API.

Core concepts

AI agent evaluation
The pillar guide: what AI agent evaluation is, the four-band acceptance model, and how to gate agent output before it ships.

Grade AI agent work (the API)
The hands-on loop: send an artifact + the task to the eval API and get a score, band, located flaws, and concrete fixes.

AI agent quality gate
Put an automated checkpoint between an agent and production that blocks work which fails your acceptance policy.

LLM as a judge
Why a general LLM judge approves too much, and why your evaluator should be hostile-by-default, not helpful.

Automatic agent validation
Wire OtterScore into your harness's end-of-task hook so every task is graded and the finish is blocked until it clears the bar.
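A quality gate reduces to a pass/fail decision on the grading result. The sketch below shows that decision in Python, assuming a result that carries a score, one of four bands, located flaws, and fixes. The `EvalResult` class, the `gate` function, and the band names are hypothetical placeholders, not OtterScore's actual response format or band labels.

```python
from dataclasses import dataclass, field

# Hypothetical shape of a grading result; the real eval API response may differ.
@dataclass
class EvalResult:
    score: float                                     # numeric grade
    band: str                                        # one of four acceptance bands (placeholder names)
    flaws: list[str] = field(default_factory=list)   # located flaws
    fixes: list[str] = field(default_factory=list)   # suggested fixes

# Placeholder band names ordered worst to best; NOT OtterScore's real labels.
BANDS = ["reject", "rework", "minor_fixes", "ship"]

def gate(result: EvalResult, min_band: str = "ship") -> bool:
    """Return True only if the result's band is at or above min_band."""
    if result.band not in BANDS or min_band not in BANDS:
        # Hostile by default: an unknown band never passes the gate.
        return False
    return BANDS.index(result.band) >= BANDS.index(min_band)

if __name__ == "__main__":
    draft = EvalResult(score=62.0, band="rework", flaws=["wrong total in table 2"])
    print("ship" if gate(draft) else "blocked")
```

Treating an unrecognized band as a failure keeps the gate closed when the policy and the grader disagree, which matches the hostile-by-default stance.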

Evaluate by modality

Evaluate AI-generated code
Catch hallucinated APIs, wrong logic, and missing edge cases before you merge.

Evaluate AI customer support
Grade agent replies for off-policy answers and wrong facts before they send.

Grade AI marketing copy
Block unsupported claims and off-brand voice before launch.

Evaluate AI-generated documents
Catch fabricated facts, ungrounded claims, and wrong numbers before they ship.

Evaluate AI-generated slide decks
Catch cross-slide drift, broken narratives, and uncited claims.

Build with the API

Developer reference
API key, MCP setup, .mcp.json, SDK, curl, and the /api/v1/eval score & iterate endpoints.

Agent-native discovery
The machine-readable contract (llms.txt, ai-agent.json) agents use to self-onboard.
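The score and iterate flow amounts to a loop: grade the artifact, apply the suggested fixes, and grade again until it passes or you run out of rounds. A minimal Python sketch, assuming you wrap the score call in a `grade(artifact, task)` function that returns a pass flag plus a list of fixes, and supply a `revise(artifact, fixes)` function. Both functions and `iterate_until_pass` are hypothetical names for illustration, not part of the SeaOtter SDK or the /api/v1/eval request format.

```python
from typing import Callable

# Hypothetical signatures: grade() stands in for a call to the score endpoint,
# revise() stands in for the agent applying the suggested fixes.
Grader = Callable[[str, str], tuple[bool, list[str]]]
Reviser = Callable[[str, list[str]], str]

def iterate_until_pass(artifact: str, task: str, grade: Grader,
                       revise: Reviser, max_rounds: int = 3) -> tuple[str, bool]:
    """Grade, revise with the returned fixes, and re-grade.

    Stops as soon as a grade passes, or after max_rounds revisions,
    returning the latest artifact and whether it passed.
    """
    for _ in range(max_rounds):
        passed, fixes = grade(artifact, task)
        if passed:
            return artifact, True
        artifact = revise(artifact, fixes)
    # Grade the last revision so the caller gets a final verdict.
    passed, _ = grade(artifact, task)
    return artifact, passed
```

Capping the rounds keeps a stubborn failure from looping forever; the final `False` is the signal to block the finish and hand the work to a human.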

Ready to try it? Run the live demo, get a free API key, or read /llms.txt.