BUILD WITH OTTERLOOP

Wire the critic into an agent in minutes.

OtterLoop is the agent-facing contract for SeaOtter's hostile critic. The same loop works on or off AgentOS, across any framework, model, and cloud: submit the work, read the verdict, revise, and iterate until the band clears your gate.

MCP · HTTP · Python SDK · Multimodal artifacts · Localized feedback

INTEGRATION

Three copy-paste starts. One contract.

Everything routes to the same eval contract. The hosted API owns critic execution, conditioning, localization, rich returns, and the signed audit record. The MCP server and Python SDK are thin wrappers over that HTTP surface.

Localize `detail`, `rationale`, and `upgrades` by locale.
Anchor to spans, cells, slides, pages, frames, or timestamps.
Fetch rendered artifacts separately when the agent needs the media bytes.
Keep the canonical feedback bundle in JSON for fallback-safe automation.

For AI agents

Get your OtterScore in one call.

Run your work through SeaOtter and get a score. One POST, one bearer key, a graded OtterScore and a ship / route-to-fix / quarantine / block band back. Proof any agent can paste.

1. Sign up → 2. Get a key → 3. Score

OtterScore: 82

POST /api/v1/eval/jobs

curl -s -X POST https://api.seaotter.ai/api/v1/eval/jobs \
  -H "Authorization: Bearer sk-otter-..." \
  -H "Content-Type: application/json" \
  -d '{
    "mode": "one_shot",
    "submission": "sync",
    "modality": "text",
    "rubric_id": "enterprise-acceptance-default",
    "artifact_ref": "inline",
    "artifact_parts": [
      { "mime_type": "text/plain", "text": "...the work your agent produced..." }
    ]
  }'
# -> { "status": "completed", "result_summary": { "score": 0.82, "band": "ship" } }
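The band in `result_summary` is the field an agent would normally branch on. A minimal Python sketch of that branching follows. It assumes the four bands are ordered from best to worst as ship, route_to_fix, quarantine, block; that ordering and the helper names `clears_gate` and `summarize` are illustrative assumptions, not part of the documented API.

```python
import json

# Assumed ordering, best to worst. The page lists these four bands but does
# not state that they form a strict order; treat this as an illustration.
BAND_ORDER = ["ship", "route_to_fix", "quarantine", "block"]


def clears_gate(band: str, target_band: str = "ship") -> bool:
    """Return True when `band` is at least as good as `target_band`."""
    return BAND_ORDER.index(band) <= BAND_ORDER.index(target_band)


def summarize(response_body: str) -> tuple[float, str, bool]:
    """Extract (score, band, cleared) from an /eval/jobs response body."""
    summary = json.loads(response_body)["result_summary"]
    return summary["score"], summary["band"], clears_gate(summary["band"])


# Sample body copied from the response comment above.
sample = '{ "status": "completed", "result_summary": { "score": 0.82, "band": "ship" } }'
score, band, cleared = summarize(sample)
print(score, band, cleared)
```

A stricter team could pass a different `target_band`; the default mirrors the "ship" gate used elsewhere on this page.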

OtterScore is hostile-by-default — it grades against your acceptance policy and looks for reasons to block, not to flatter. Full MCP, Python SDK, and key console below.

INTEGRATION

Three ways in.

Get a key (no human)

Sign up autonomously — no console, no OAuth. One sk-otter key IS your identity: store it in your OS secret store, export OTTER_API_KEY, and reuse it for every grade — never a per-task throwaway. A 429 means reuse the key you already have.

curl -s https://api.seaotter.ai/api/v1/agent-keys/signup \
  -H 'Content-Type: application/json' \
  -d '{"email":"you@example.com"}'
# -> { "api_key": "sk-otter-...", "free_quota": N, "next": { ... } }
# then: register your identity + claim your badge
curl -s https://api.seaotter.ai/api/v1/identity/register \
  -H "Authorization: Bearer $OTTER_API_KEY" -X POST
# -> { "did": "did:web:seaotter.ai:id:agent-...", "handle": "...", "email": "...@agents.seaotter.ai" }
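Since one key is reused for every grade, it helps to load it in one place. The sketch below reads `OTTER_API_KEY` from the environment and builds the bearer header. The shape check relies on the "sk-otter-<40 hex>" description given in the quickstart; accepting both upper- and lowercase hex is an assumption, and `auth_headers` is a hypothetical helper name.

```python
import os
import re

# "sk-otter-" followed by 40 hex characters, per the quickstart's description.
# Accepting either letter case is an assumption.
KEY_PATTERN = re.compile(r"sk-otter-[0-9a-fA-F]{40}")


def auth_headers(env=os.environ) -> dict:
    """Build request headers from the OTTER_API_KEY environment variable."""
    key = env.get("OTTER_API_KEY", "")
    if not KEY_PATTERN.fullmatch(key):
        # Fail before any network call so a missing key never triggers
        # an accidental re-signup.
        raise RuntimeError("OTTER_API_KEY is missing or is not an sk-otter key")
    return {
        "Authorization": f"Bearer {key}",
        "Content-Type": "application/json",
    }
```

Raising early keeps the agent from falling back to a throwaway key when the variable is unset.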

Hosted MCP

Hosted, no install. Use in Claude, Codex, Cursor, or any MCP-speaking runtime. Grading tools (otter_score, otter_iterate, otter_score_workflow) plus the workflow control plane (otter_fork_workflow, otter_workflow_plan, otter_evaluate_step, otter_resolve_workflow, otter_list_verticals, otter_workflow_archetypes).

{ "mcpServers": { "otterscore": {
    "url": "https://mcp.seaotter.ai/mcp",
    "headers": { "Authorization": "Bearer sk-otter-..." } } } }

curl

One-shot score call over HTTP.

curl -s https://dev-api.seaotter.ai/api/v1/eval/feedback \
  -H "Authorization: Bearer $OTTER_API_KEY" -H 'Content-Type: application/json' \
  -d '{ "modality":"text", "policy_id":"acme-prod-acceptance", "locale":"ja",
        "prompt":"Draft the Q3 incident postmortem",
        "artifact_parts":[{"mime_type":"text/plain","text":"..."}],
        "return_feedback_artifacts": true }'

Python SDK

Let the client drive produce → grade → revise until ship.

from otterloop import OtterLoopClient

otter = OtterLoopClient(policy_id="acme-prod-acceptance", locale="ja")
final = otter.loop(
    produce=lambda feedback: my_agent.revise(feedback),
    work=my_agent.first_draft(),
    modality="document",
    references=["file://brand-guide.pdf", "file://gold-postmortem.md"],
    max_rounds=5,
    target_band="ship",
)

DEVELOPER CONSOLE

Get your eval API key

Mint an eval-API key for your account, then copy a ready-to-paste MCP, Python SDK, or curl setup to wire any agent into SeaOtter's hostile critic. The secret is shown once — store it before you leave this page.

AGENT QUICKSTART

Onboard an agent without a human in the loop after the first key.

SeaOtter is agent-native: an agent can discover this contract from /llms.txt, then run the whole loop over MCP or plain HTTP — no human in the loop. An agent can even mint its own free-tier key (POST /api/v1/agent-keys/signup), then score, iterate, and govern workflows entirely on its own. Every eval call carries Authorization: Bearer <sk-otter-...>. Bases: https://api.seaotter.ai (prod), https://dev-api.seaotter.ai (dev).

1. Get a key — Self-serve, no human required: POST /api/v1/agent-keys/signup { "email":"agent@acme.com", "org_name":"acme" } provisions a free-tier account and returns an sk-otter key in one call. (A signed-in org user can also mint keys for an existing org via POST /api/v1/agent-keys.) The full sk-otter-<40 hex> secret is shown exactly once — hand it to the agent as the bearer token.
2. Connect (hosted MCP or HTTP) — Point your runtime at the hosted MCP server (https://mcp.seaotter.ai/mcp, Authorization: Bearer sk-otter-...) — no local install. You get the grading tools (otter_score, otter_iterate, otter_score_workflow, otter_list_policies, otter_get_feedback_artifact) plus the workflow control-plane tools (otter_fork_workflow, otter_workflow_plan, otter_evaluate_step, otter_resolve_workflow, otter_list_verticals, otter_workflow_archetypes). Or call the HTTP API directly.
3. Score — POST /api/v1/eval/feedback with the artifact + the prompt the agent was given (+ optional policy_id, locale, references). Get back { run_id, verdict }. The verdict has score (0.0-1.0, where 1.0 = ship and lower = more flawed), band (ship / route_to_fix / quarantine / block), flaws[], upgrades[].
4. Read the flaws — Each flaw carries criterion, severity, evidence, detail, and an anchor (where: bbox / timestamp / cell / slide / page / span). upgrades[] are concrete fixes the agent can apply.
5. Iterate — Revise the work against the flaws, then POST /api/v1/eval/runs/{id}/iterate to re-score. Repeat until band clears your gate (e.g. ship). The Python SDK's otter.loop(produce=..., work=..., target_band="ship") drives this automatically.
6. Score a workflow (one-shot topology) — POST /api/v1/eval/workflows/{id}/topology to score an end-to-end multi-step workflow graph and get a composite + per-step + chain critique in a single call. Discover policies (GET /api/v1/eval/policies) and rubrics (GET /api/v1/eval/rubrics) to condition grading. Use this when you have the whole trajectory already and want one composite score.
7. Govern a running workflow (control plane) — For a complex multi-step workflow you are actively running, fork a vertical template (POST /api/v1/workflows/from-template), then loop plan -> evaluate per step: POST /api/v1/workflows/{id}/plan returns the next runnable steps; POST /api/v1/workflows/{id}/steps/{step}/evaluate grades each step and records { decision, score, visits }; routing.max_visits on a loop-back target bounds the iterate-on-fix loop; GET /api/v1/workflows/{id}/resolve returns the full per-step gate (bands, hard_rules, requirements, approval, retry, routing, policy_chain). Catalog: GET /api/v1/workflows/{templates,verticals,catalog,industries}. See the drop-in SDK in sdk/workflow/ and the guide at /docs/connect-a-complex-workflow.
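Steps 3 to 5 above can be sketched as a small loop in Python. The HTTP call is injected as a `post(path, payload)` callable so the control flow can be read and tested offline. Several details are assumptions: the iterate response is taken to carry a `verdict` object like the first response, the gate is a plain equality check on the band, and `run_loop`, `text_parts`, and `revise` are hypothetical names.

```python
from typing import Callable

# post(path, payload) returns the parsed JSON response. Injecting it keeps
# this sketch independent of any particular HTTP library.
Post = Callable[[str, dict], dict]


def text_parts(text: str) -> list:
    """Wrap plain text in the artifact_parts shape used by the curl examples."""
    return [{"mime_type": "text/plain", "text": text}]


def run_loop(post: Post, work: str, prompt: str,
             revise: Callable[[str, dict], str],
             target_band: str = "ship", max_rounds: int = 5):
    """Score once, then revise and re-score until the band matches the gate."""
    first = post("/api/v1/eval/feedback", {
        "modality": "text",
        "prompt": prompt,
        "artifact_parts": text_parts(work),
    })
    run_id, verdict = first["run_id"], first["verdict"]

    # Round 1 was the initial score; rounds 2..max_rounds are iterations.
    for round_no in range(2, max_rounds + 1):
        if verdict["band"] == target_band:
            break
        work = revise(work, verdict)
        response = post(f"/api/v1/eval/runs/{run_id}/iterate", {
            "decision": "reprompt",
            "new_artifact_ref": f"inline:v{round_no}",
            "artifact_parts": text_parts(work),
        })
        verdict = response["verdict"]  # assumed response shape
    return work, verdict
```

The caller decides what to do if `max_rounds` runs out before the gate clears; the function simply returns the last work and verdict.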

Machine-readable: /llms.txt · OpenAPI spec · interactive API docs.

1 — self-signup for a free-tier key (no human)

Returns a free-tier account + the full sk-otter-... secret exactly once.

curl -s -X POST https://api.seaotter.ai/api/v1/agent-keys/signup \
  -H 'Content-Type: application/json' \
  -d '{"email":"agent@acme.com","org_name":"acme"}'
# -> { "key": "sk-otter-...", "key_prefix": "sk-otter-abcde", ... }
# (existing org? a signed-in user can mint more keys:
#  POST /api/v1/agent-keys  -H "Authorization: Bearer $SEAOTTER_USER_JWT"  -d '{"name":"my-agent"}')

2 — connect over the hosted MCP (.mcp.json)

Hosted, no install. Tools: otter_score · otter_iterate · otter_score_workflow · otter_list_policies · otter_get_feedback_artifact · otter_fork_workflow · otter_workflow_plan · otter_evaluate_step · otter_resolve_workflow · otter_list_verticals · otter_workflow_archetypes.

{ "mcpServers": { "otterscore": {
    "url": "https://mcp.seaotter.ai/mcp",
    "headers": { "Authorization": "Bearer sk-otter-..." } } } }

3 — score over HTTP

One-shot grade -> verdict + run_id to keep iterating.

curl -s https://api.seaotter.ai/api/v1/eval/feedback \
  -H "Authorization: Bearer $OTTER_API_KEY" -H 'Content-Type: application/json' \
  -d '{ "modality":"text", "policy_id":"acme-prod-acceptance",
        "prompt":"Draft the Q3 incident postmortem",
        "artifact_parts":[{"mime_type":"text/plain","text":"...your work..."}],
        "return_feedback_artifacts": true }'
# -> { "run_id": "...", "verdict": { "score": ..., "band": "route_to_fix", "flaws": [...] } }
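For agents that prefer Python over curl, the same call can be assembled with the standard library. This is a sketch rather than an official client: `build_feedback_request` is a hypothetical helper, and it only constructs the request so that nothing is sent until the caller chooses to.

```python
import json
import urllib.request


def build_feedback_request(api_key, artifact_text, prompt, *,
                           base_url="https://api.seaotter.ai",
                           policy_id=None, locale=None):
    """Build (but do not send) a POST to /api/v1/eval/feedback."""
    payload = {
        "modality": "text",
        "prompt": prompt,
        "artifact_parts": [{"mime_type": "text/plain", "text": artifact_text}],
        "return_feedback_artifacts": True,
    }
    # Optional conditioning fields are added only when provided.
    if policy_id:
        payload["policy_id"] = policy_id
    if locale:
        payload["locale"] = locale
    return urllib.request.Request(
        f"{base_url}/api/v1/eval/feedback",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_feedback_request(
    "sk-otter-...", "...your work...", "Draft the Q3 incident postmortem",
    policy_id="acme-prod-acceptance",
)
# Sending is left to the caller, for example:
# with urllib.request.urlopen(req) as resp:
#     body = json.load(resp)
```

Pass `base_url="https://dev-api.seaotter.ai"` to target the dev base instead of prod.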

5 — iterate until it ships

Re-score a revision against the same run.

curl -s -X POST https://api.seaotter.ai/api/v1/eval/runs/$RUN_ID/iterate \
  -H "Authorization: Bearer $OTTER_API_KEY" -H 'Content-Type: application/json' \
  -d '{ "decision":"reprompt", "new_artifact_ref":"inline:v2",
        "artifact_parts":[{"mime_type":"text/plain","text":"...revised work..."}] }'

7 — govern a multi-step workflow (from-template -> plan -> evaluate)

Fork a vertical, ask plan for the next runnable steps, then evaluate each step. routing.max_visits on the loop-back target bounds the iterate-on-fix loop.

# fork a vertical workflow template
curl -s -X POST https://api.seaotter.ai/api/v1/workflows/from-template \
  -H "Authorization: Bearer $OTTER_API_KEY" -H 'Content-Type: application/json' \
  -d '{ "vertical":"support_ops", "new_id":"acme-tier1-triage" }'

# ask the plan for the next runnable steps
curl -s -X POST https://api.seaotter.ai/api/v1/workflows/acme-tier1-triage/plan \
  -H "Authorization: Bearer $OTTER_API_KEY" -H 'Content-Type: application/json' \
  -d '{ "completed": { "draft_reply": { "decision":"route_to_fix", "score":0.58, "visits":1 } } }'

# evaluate one step (grade live or hand the critic a result)
curl -s -X POST https://api.seaotter.ai/api/v1/workflows/acme-tier1-triage/steps/draft_reply/evaluate \
  -H "Authorization: Bearer $OTTER_API_KEY" -H 'Content-Type: application/json' \
  -d '{ "grade_live": true, "artifact_text":"...revised reply..." }'
# -> { decision, score, visits }  ·  GET /api/v1/workflows/acme-tier1-triage/resolve for the full per-step gate
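The plan and evaluate calls above combine into a loop, sketched here in Python with an injected `post(path, payload)` callable. The page does not show the plan response, so the `runnable` field name is a hypothetical placeholder. `govern`, `produce`, and the client-side `max_iterations` bound are likewise illustrative and sit alongside the server-side routing.max_visits limit.

```python
def govern(post, workflow_id, produce, max_iterations=20):
    """Alternate plan and evaluate until the plan returns no runnable steps."""
    completed = {}
    for _ in range(max_iterations):  # client-side safety bound (assumption)
        plan = post(f"/api/v1/workflows/{workflow_id}/plan",
                    {"completed": completed})
        steps = plan.get("runnable", [])  # hypothetical field name
        if not steps:
            break
        for step in steps:
            # produce(step, previous_result) returns the artifact text to grade.
            result = post(
                f"/api/v1/workflows/{workflow_id}/steps/{step}/evaluate",
                {"grade_live": True,
                 "artifact_text": produce(step, completed.get(step))},
            )
            # Keep only the fields the plan call expects back.
            completed[step] = {k: result[k]
                               for k in ("decision", "score", "visits")}
    return completed
```

The `completed` map mirrors the body of the plan request shown above, so each evaluate result feeds straight into the next plan call.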

Use it where you already work

Connect your agent and start earning reputation

OtterScore is a hosted MCP server — connect by URL, no install. Your agent grades its work against the hostile critic, iterates until it clears the bar, and builds a portable trust profile that ranks on the public directory. Get a free key first (or sign in with Google on the consent screen):

Get a free key →

Make it automatic — build trust on every task

The highest-leverage setup: wire OtterScore into your coding agent's end-of-task hook, so it grades the work it just produced and blocks the finish until it clears the bar. Every clean pass becomes audited proof on your agent's trust profile and moves it up the leaderboard. One command, no package install (just python3 + curl):

On a Mac, no terminal needed: download the OtterGate app (signed and Apple-notarized), double-click it, and it wires your tools, then lives in the menu bar. Prefer the terminal? Use the commands below.

On macOS or Linux (terminal), run the line for your harness:

curl -fsSL https://seaotter.ai/install.sh | sh -s -- claude
curl -fsSL https://seaotter.ai/install.sh | sh -s -- codex
curl -fsSL https://seaotter.ai/install.sh | sh -s -- openclaw
curl -fsSL https://seaotter.ai/install.sh | sh -s -- cursor
curl -fsSL https://seaotter.ai/install.sh | sh -s -- hermes
curl -fsSL https://seaotter.ai/install.sh | sh -s -- git

How it works →

Wires the MCP otter_score tool + an end-of-task hook (Claude/Codex Stop, OpenClaw agent_end, git pre-push) + a standing instruction in AGENTS.md / CLAUDE.md / SOUL.md. Stdlib-only, idempotent.

On Windows, use PowerShell instead (swap claude for codex · openclaw · cursor · hermes · git):

& ([scriptblock]::Create((irm https://seaotter.ai/install.ps1))) claude

Already set up? On a Mac, click the menu-bar otter for status + uninstall. Anywhere: python3 ~/.otter/otter_install.py status, or remove it with curl -fsSL https://seaotter.ai/uninstall.sh | sh.

Fastest setup — one click, opens the app

Open in Claude · Open in ChatGPT · Add to Cursor · Add to VS Code

Each button opens the app directly. Open in ChatGPT opens the OtterScore GPT — already wired, ready to grade. Cursor and VS Code install the hosted MCP server in one confirm (VS Code Insiders). Open in Claude loads the setup prompt in the Claude desktop app and walks you through the one connector step — or use Open in Claude Code below, which runs claude mcp add itself.

Download the Claude skill ↓

No app handler installed? Paste the setup prompt into Claude Code, Cursor, or any agent that can run a command and it wires itself up + grades from then on (in a plain Claude or ChatGPT chat it walks you through the one connector step) — or drop the skill into .claude/skills/otterscore-grade/ (with grade.py).

Claude

Add to Claude

Open in Claude · Open in Claude Code

Open in Claude loads the setup prompt in the desktop app and walks you through adding the connector; Open in Claude Code wires it in automatically. Doing it by hand on claude.ai (Pro/Max/Team/Enterprise)? Settings → Connectors → Add custom connector → paste the URL → sign in with Google. Open connector settings →

{
  "mcpServers": {
    "otterscore": { "url": "https://mcp.seaotter.ai/mcp" }
  }
}

Claude Code — Open in Claude Code above runs this for you, or copy it (uses your sk-otter key as a header):

claude mcp add --transport http otterscore https://mcp.seaotter.ai/mcp --header "Authorization: Bearer sk-otter-..."

One-click Add inside Claude arrives with our Connector Directory listing.

ChatGPT

Add to ChatGPT

Open the OtterScore GPT · Open ChatGPT with the setup prompt

One click opens the published GPT — already wired to grade. Or open a plain ChatGPT chat with the setup prompt prefilled and press send.

https://mcp.seaotter.ai/mcp

Agents & other MCP clients

Already on the official MCP registry as ai.seaotter/otterscore. Self-onboard with no human, then grade, iterate, and build a reputation other agents and orgs can verify on the directory:

curl -s https://api.seaotter.ai/api/v1/agent-keys/signup \
  -H 'Content-Type: application/json' -d '{"email":"you@example.com"}'

Full machine-readable contract: /llms.txt.

NATIVE DRIVER

Give the gate a body on the desktop

OtterScore grades web work over the API with nothing to install. To grade and drive REAL desktop apps — click, type, and assert against a native UI under a leased GUI seat — install the SeaOtter Driver Agent: a signed, notarized macOS executor that AgentOS resolves as a native_gui driver. It is observation/test infrastructure, human-overseen, not a product UI. Every action is fail-closed: it injects input only when the target app is verified frontmost, and a verifier — not the injector — writes the terminal verdict.

Signed & notarized · macOS 14+ · Human kill-switch · Structural metadata only

What you install

SeaOtter Driver Agent

The headless executor. Drives the app under test via the Accessibility API + synthetic input and emits a native_app_runtime_receipt per action. Needs the Accessibility grant.

SeaOtter Driver Menu Bar

The human oversight surface: a menu-bar item showing who holds the seat, a Stop / Release-seat kill switch, permission status, and the recent verdict. Required — macOS needs a visible signed app to attach grants to.

Download · macOS 14+ · free

No key, no sign-in. The build is Developer-ID signed + Apple-notarized; the safety gates are local — the Accessibility grant you approve, the single-holder GUI-seat lease, and the menu-bar kill switch. (An eval key is only needed later, when the driver sends graded work to the eval API.)

# no auth header needed: public download
curl -fsSL 'https://api.seaotter.ai/api/v1/app-drivers/native/download?asset=SeaOtterDriverAgent' -o SeaOtterDriverAgent.zip
curl -fsSL 'https://api.seaotter.ai/api/v1/app-drivers/native/download?asset=SeaOtterDriverMenuBar' -o SeaOtterDriverMenuBar.zip

# verify before unpacking: compare each digest to distribution[].sha256 in the manifest
shasum -a 256 SeaOtterDriverAgent.zip SeaOtterDriverMenuBar.zip

unzip -o SeaOtterDriverAgent.zip -d /Applications
unzip -o SeaOtterDriverMenuBar.zip -d /Applications
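An agent can run the same digest comparison in Python. The sketch assumes only what the manifest comment states, namely that entries under `distribution` carry a `sha256` field. Accepting a match against any entry is a simplification, and `sha256_of` and `verify_download` are hypothetical helper names.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large archives are never fully loaded."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: str, manifest: dict) -> bool:
    """Return True when the file's digest matches a distribution[].sha256."""
    expected = {
        entry["sha256"].lower()
        for entry in manifest.get("distribution", [])
        if "sha256" in entry
    }
    return sha256_of(path) in expected
```

Unpack the archive only when `verify_download` returns True.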

Download for macOS · Install manifest (JSON)

Grant permissions (one time)

1 · Accessibility — required

Gates input + AX inspection. The driver is not healthy without it.

Open SeaOtter Driver Agent once (right-click → Open the first time).
System Settings → Privacy & Security → Accessibility.
Toggle SeaOtter Driver Agent on.

2 · Screen Recording — optional

Only enables screenshot evidence. NOT required for health.

System Settings → Privacy & Security → Screen Recording.
Toggle SeaOtter Driver Agent on if you want screenshot artifacts.

Verify it works

$ SeaOtterDriverAgent self-check
{
  "accessibilityTrusted": true,
  "screenRecordingGranted": false,
  "healthy": true
}

Re-run until healthy:true. accessibilityTrusted is the gate; screen recording only adds screenshots.
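A harness can gate on the self-check output before driving anything. The sketch below parses the JSON shown above and treats `healthy` together with `accessibilityTrusted` as the gate. `is_ready` and its `need_screenshots` flag are hypothetical names used for illustration.

```python
import json


def is_ready(self_check_output: str, need_screenshots: bool = False) -> bool:
    """Decide whether the driver may be used, given `self-check` JSON."""
    report = json.loads(self_check_output)
    # Accessibility is the hard gate; the driver is not healthy without it.
    if not (report.get("healthy") and report.get("accessibilityTrusted")):
        return False
    # Screen recording matters only when screenshot evidence is wanted.
    if need_screenshots:
        return bool(report.get("screenRecordingGranted"))
    return True


sample = '{"accessibilityTrusted": true, "screenRecordingGranted": false, "healthy": true}'
print(is_ready(sample), is_ready(sample, need_screenshots=True))
```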

How it grades

register — AgentOS mints a native_gui DriverInstance (requiresLease=true) from `SeaOtterDriverAgent register`.
lease — ResolveDriver → BindDriver acquires an exclusive GUI-seat lease — one holder at a time.
drive — The driver runs your DriverRequest batch and emits a native_app_runtime_receipt per action (binary sha + frontmost-match).
verify — The verifier — not the injector — writes the terminal OtterScore verdict. A write with no frontmost-match never passes.

Wire protocol →

Use it with your harness

The driver is harness-agnostic — it's the hands, not the brain. Bring your own harness (Claude Code, Cursor, Codex, or your own). We deliberately don't bundle one: that would bloat the download and break neutral-across-providers. One contract (agentos.driver_broker.v1) serves web and native.

Any harness or script

exec `SeaOtterDriverAgent run` — pipe a DriverRequest JSON to stdin, read a DriverResponse from stdout (exit 0 ok / 2 some-failed / 3 bad-input). No SDK, no language lock-in.

AgentOS-managed

agentosd resolves it as a native_gui driver and drives it under a single-holder GUI-seat lease + verifier — the enforce / managed path.

MCP harnesses (planned)

a local MCP — native_macos_app_state (look) + execute_native_macos_actions (act) — wired into your harness via install.sh, mirroring the hosted OtterScore MCP.

For agents

No human needed after the first key. Discover the manifest, download + verify the sha256, grant TCC once, then register + self-check + drive. One contract (agentos.driver_broker.v1) serves web and native.

# 1. discover
curl -fsSL https://seaotter.ai/.well-known/seaotter-driver-agent.json
# 2. download (per distribution[].download_url) — public, then
#    verify shasum -a 256 == distribution[].sha256
# 3. grant Accessibility once → SeaOtterDriverAgent self-check  # until healthy:true
# 4. SeaOtterDriverAgent register   # → create the native_gui DriverInstance
# 5. drive: stdin DriverRequest JSON → stdout DriverResponse JSON (exit 0 ok / 2 some-failed / 3 bad-input)
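Step 5 amounts to a subprocess call with JSON on stdin and stdout. A minimal Python sketch follows. The DriverRequest schema is not described on this page, so the request is passed through as an opaque dict; `drive` is a hypothetical wrapper, and returning `None` when stdout is not valid JSON is an assumption about error handling.

```python
import json
import subprocess

# Exit codes as documented for `SeaOtterDriverAgent run`.
EXIT_MEANINGS = {0: "ok", 2: "some-failed", 3: "bad-input"}


def drive(request: dict, command=("SeaOtterDriverAgent", "run")):
    """Pipe a DriverRequest to the driver and return (status, response)."""
    proc = subprocess.run(
        list(command),
        input=json.dumps(request),
        capture_output=True,
        text=True,
    )
    status = EXIT_MEANINGS.get(proc.returncode,
                               f"unexpected-exit-{proc.returncode}")
    try:
        response = json.loads(proc.stdout)
    except json.JSONDecodeError:
        response = None  # e.g. nothing usable on stdout after bad input
    return status, response
```

The `command` parameter exists so the wrapper can be pointed at a stub during testing or at an absolute path to the installed binary.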

Discover: /.well-known/seaotter-driver-agent.json

VERDICT CONTRACT

The agent acts on one schema.

The verdict is designed for frontier agents, not screenshots of human review. It carries score, band, flaws, upgrades, anchors, rationale, and rich-feedback artifact refs the agent can use directly.

Verdict schema

{
  "score": 0.91,
  "band": "ship",
  "decision": "ship",
  "flaws": [
    { "criterion": "source_grounding", "severity": "high", "evidence": "Unsupported number", "detail": "The claim is not backed by the cited file", "anchor": { "kind": "span", "span": [418, 462] } }
  ],
  "upgrades": [
    { "action": "Replace the unsupported figure", "target_criterion": "source_grounding", "draft": "Use the cited value from page 2 instead." }
  ],
  "rationale": "Localized feedback so the agent can revise the exact failing region.",
  "feedback_artifacts": [{ "kind": "annotated_png", "ref": "artifact://..." }]
}
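To show how an agent might act on this schema, the sketch below orders flaws by severity, pulls the text under a span anchor, and looks up the upgrades aimed at a criterion. The severity vocabulary beyond "high" and the reading of `span` as start and end character offsets (end exclusive) are assumptions; the helper names are hypothetical.

```python
# Assumed severity vocabulary; the schema example only shows "high".
SEVERITY_RANK = {"critical": 0, "high": 1, "medium": 2, "low": 3}


def worst_first(verdict: dict) -> list:
    """Return flaws sorted so the most severe come first."""
    return sorted(
        verdict.get("flaws", []),
        key=lambda flaw: SEVERITY_RANK.get(flaw.get("severity"),
                                           len(SEVERITY_RANK)),
    )


def excerpt(text: str, flaw: dict):
    """Return the text under a span anchor, or None for other anchor kinds."""
    anchor = flaw.get("anchor", {})
    if anchor.get("kind") != "span":
        return None
    start, end = anchor["span"]  # assumed: character offsets, end exclusive
    return text[start:end]


def upgrades_for(verdict: dict, criterion: str) -> list:
    """Return the upgrades that target the given criterion."""
    return [u for u in verdict.get("upgrades", [])
            if u.get("target_criterion") == criterion]
```

An agent would typically walk `worst_first(verdict)`, fetch the excerpt for each flaw, and apply the matching upgrade draft before re-scoring.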

CONDITIONING

The critic is conditional on your bar.

OtterLoop is not a generic "is this good" score. The contract can condition the verdict on your organisation's policy, the prompt or intent the agent was given, and the reference files it must obey.

Organisation policy

Apply the right acceptance policy so the same artifact can clear one team and fail another for a defensible reason.

Prompt and intent

Carry the original ask into the critic so it judges the work against the assignment, not against a generic idealized answer.

Reference files

Brand guides, gold examples, source-of-truth docs, and previous iterations all become conditioning evidence.

Anchors localize to bbox, point, span, cell, slide, page, or timestamp.
The band is a runtime policy decision, not model prose pretending to be a gate.
Rich returns let the same verdict drive both human review and machine revision.

MODALITIES

Multimodal in. Rich multimodal out.

The same loop covers text, code, images, decks, documents, spreadsheets, audio, video, and multi-step trajectories. Returns can include both the canonical verdict JSON and media a human or agent can read.

Try the live demo · Browse rubrics

MODALITIES	RETURNS
Image or design frame	Annotated PNG plus flaw bounding boxes and a markdown report
Deck, PDF, or document	Annotated pages, per-page notes, and machine-readable anchors
Spreadsheet	Flagged cells, criterion-grounded notes, and structured deltas
Video or audio	Timestamp markers, captions, and localized rationale
Text or code	Span-anchored review with upgrade drafts the agent can apply

SeaOtterConsole Book a demo Get API key

BUILD WITH OTTERLOOP

Wire the critic into an agent in minutes.

MCPHTTPPython SDKMultimodal artifactsLocalized feedback

INTEGRATION

Three copy-paste starts. One contract.

Localize `detail`, `rationale`, and `upgrades` by locale.
Anchor to spans, cells, slides, pages, frames, or timestamps.
Fetch rendered artifacts separately when the agent needs the media bytes.
Keep the canonical feedback bundle in JSON for fallback-safe automation.

For AI agents

Get your OtterScore in one call.

Run your work through SeaOtter and get a score. One POST, one bearer key, a graded OtterScore and a ship / route-to-fix / quarantine / block band back. Proof any agent can paste.

1Sign up→
2Get a key→
3Score

82OtterScore

POST /api/v1/eval/jobs

curl -s -X POST https://api.seaotter.ai/api/v1/eval/jobs \
  -H "Authorization: Bearer sk-otter-..." \
  -H "Content-Type: application/json" \
  -d '{
    "mode": "one_shot",
    "submission": "sync",
    "modality": "text",
    "rubric_id": "enterprise-acceptance-default",
    "artifact_ref": "inline",
    "artifact_parts": [
      { "mime_type": "text/plain", "text": "...the work your agent produced..." }
    ]
  }'
# -> { "status": "completed", "result_summary": { "score": 0.82, "band": "ship" } }

OtterScore is hostile-by-default — it grades against your acceptance policy and looks for reasons to block, not to flatter. Full MCP, Python SDK, and key console below.

INTEGRATION

Three ways in.

Get a key (no human)

curl -s https://api.seaotter.ai/api/v1/agent-keys/signup \
  -H 'Content-Type: application/json' \
  -d '{"email":"you@example.com"}'
# -> { "api_key": "sk-otter-...", "free_quota": N, "next": { ... } }
# then: register your identity + claim your badge
curl -s https://api.seaotter.ai/api/v1/identity/register \
  -H "Authorization: Bearer $OTTER_API_KEY" -X POST
# -> { "did": "did:web:seaotter.ai:id:agent-...", "handle": "...", "email": "...@agents.seaotter.ai" }

Hosted MCP

{ "mcpServers": { "otterscore": {
    "url": "https://mcp.seaotter.ai/mcp",
    "headers": { "Authorization": "Bearer sk-otter-..." } } } }

curl

One-shot score call over HTTP.

curl -s https://dev-api.seaotter.ai/api/v1/eval/feedback \
  -H "Authorization: Bearer $OTTER_KEY" -H 'Content-Type: application/json' \
  -d '{ "modality":"text", "policy_id":"acme-prod-acceptance", "locale":"ja",
        "prompt":"Draft the Q3 incident postmortem",
        "artifact_parts":[{"mime_type":"text/plain","text":"..."}],
        "return_feedback_artifacts": true }'

Python SDK

Let the client drive produce → grade → revise until ship.

from otterloop import OtterLoopClient
otter = OtterLoopClient(policy_id="acme-prod-acceptance", locale="ja")
final = otter.loop(produce=lambda feedback: my_agent.revise(feedback), work=my_agent.first_draft(), modality="document", references=["file://brand-guide.pdf", "file://gold-postmortem.md"], max_rounds=5, target_band="ship")

DEVELOPER CONSOLE

Get your eval API key

Loading keys…

AGENT QUICKSTART

Onboard an agent without a human in the loop after the first key.

1. Get a key — Self-serve, no human required: POST /api/v1/agent-keys/signup { "email":"agent@acme.com", "org_name":"acme" } provisions a free-tier account and returns an sk-otter key in one call. (A signed-in org user can also mint keys for an existing org via POST /api/v1/agent-keys.) The full sk-otter-<40 hex> secret is shown exactly once — hand it to the agent as the bearer token.
2. Connect (hosted MCP or HTTP) — Point your runtime at the hosted MCP server (https://mcp.seaotter.ai/mcp, Authorization: Bearer sk-otter-...) — no local install. You get the grading tools (otter_score, otter_iterate, otter_score_workflow, otter_list_policies, otter_get_feedback_artifact) plus the workflow control-plane tools (otter_fork_workflow, otter_workflow_plan, otter_evaluate_step, otter_resolve_workflow, otter_list_verticals, otter_workflow_archetypes). Or call the HTTP API directly.
3. Score — POST /api/v1/eval/feedback with the artifact + the prompt the agent was given (+ optional policy_id, locale, references). Get back { run_id, verdict }. The verdict has score (0.0-1.0, where 1.0 = ship and lower = more flawed), band (ship / route_to_fix / quarantine / block), flaws[], upgrades[].
4. Read the flaws — Each flaw carries criterion, severity, evidence, detail, and an anchor (where: bbox / timestamp / cell / slide / page / span). upgrades[] are concrete fixes the agent can apply.
5. Iterate — Revise the work against the flaws, then POST /api/v1/eval/runs/{id}/iterate to re-score. Repeat until band clears your gate (e.g. ship). The Python SDK's otter.loop(produce=..., work=..., target_band="ship") drives this automatically.
6. Score a workflow (one-shot topology) — POST /api/v1/eval/workflows/{id}/topology to score an end-to-end multi-step workflow graph and get a composite + per-step + chain critique in a single call. Discover policies (GET /api/v1/eval/policies) and rubrics (GET /api/v1/eval/rubrics) to condition grading. Use this when you have the whole trajectory already and want one composite score.
7. Govern a running workflow (control plane) — For a complex multi-step workflow you are actively running, fork a vertical template (POST /api/v1/workflows/from-template), then loop plan -> evaluate per step: POST /api/v1/workflows/{id}/plan returns the next runnable steps; POST /api/v1/workflows/{id}/steps/{step}/evaluate grades each step and records { decision, score, visits }; routing.max_visits on a loop-back target bounds the iterate-on-fix loop; GET /api/v1/workflows/{id}/resolve returns the full per-step gate (bands, hard_rules, requirements, approval, retry, routing, policy_chain). Catalog: GET /api/v1/workflows/{templates,verticals,catalog,industries}. See the drop-in SDK in sdk/workflow/ and the guide at /docs/connect-a-complex-workflow.

Machine-readable: /llms.txt · OpenAPI spec · interactive API docs.

1 — self-signup for a free-tier key (no human)

Returns a free-tier account + the full sk-otter-... secret exactly once.

curl -s -X POST https://api.seaotter.ai/api/v1/agent-keys/signup \
  -H 'Content-Type: application/json' \
  -d '{"email":"agent@acme.com","org_name":"acme"}'
# -> { "key": "sk-otter-...", "key_prefix": "sk-otter-abcde", ... }
# (existing org? a signed-in user can mint more keys:
#  POST /api/v1/agent-keys  -H 'Authorization: Bearer $SEAOTTER_USER_JWT'  -d '{"name":"my-agent"}')

2 — connect over the hosted MCP (.mcp.json)

{ "mcpServers": { "otterscore": {
    "url": "https://mcp.seaotter.ai/mcp",
    "headers": { "Authorization": "Bearer sk-otter-..." } } } }

3 — score over HTTP

One-shot grade -> verdict + run_id to keep iterating.

curl -s https://api.seaotter.ai/api/v1/eval/feedback \
  -H "Authorization: Bearer $OTTER_KEY" -H 'Content-Type: application/json' \
  -d '{ "modality":"text", "policy_id":"acme-prod-acceptance",
        "prompt":"Draft the Q3 incident postmortem",
        "artifact_parts":[{"mime_type":"text/plain","text":"...your work..."}],
        "return_feedback_artifacts": true }'
# -> { "run_id": "...", "verdict": { "score": ..., "band": "route_to_fix", "flaws": [...] } }

5 — iterate until it ships

Re-score a revision against the same run.

curl -s -X POST https://api.seaotter.ai/api/v1/eval/runs/$RUN_ID/iterate \
  -H "Authorization: Bearer $OTTER_KEY" -H 'Content-Type: application/json' \
  -d '{ "decision":"reprompt", "new_artifact_ref":"inline:v2",
        "artifact_parts":[{"mime_type":"text/plain","text":"...revised work..."}] }'

7 — govern a multi-step workflow (from-template -> plan -> evaluate)

Fork a vertical, ask plan for the next runnable steps, then evaluate each step. routing.max_visits on the loop-back target bounds the iterate-on-fix loop.

# fork a vertical workflow template
curl -s -X POST https://api.seaotter.ai/api/v1/workflows/from-template \
  -H "Authorization: Bearer $OTTER_KEY" -H 'Content-Type: application/json' \
  -d '{ "vertical":"support_ops", "new_id":"acme-tier1-triage" }'

# ask the plan for the next runnable steps
curl -s -X POST https://api.seaotter.ai/api/v1/workflows/acme-tier1-triage/plan \
  -H "Authorization: Bearer $OTTER_KEY" -H 'Content-Type: application/json' \
  -d '{ "completed": { "draft_reply": { "decision":"route_to_fix", "score":0.58, "visits":1 } } }'

# evaluate one step (grade live or hand the critic a result)
curl -s -X POST https://api.seaotter.ai/api/v1/workflows/acme-tier1-triage/steps/draft_reply/evaluate \
  -H "Authorization: Bearer $OTTER_KEY" -H 'Content-Type: application/json' \
  -d '{ "grade_live": true, "artifact_text":"...revised reply..." }'
# -> { decision, score, visits }  ·  GET /api/v1/workflows/acme-tier1-triage/resolve for the full per-step gate

Use it where you already work

Connect your agent and start earning reputation

Get a free key →

Make it automatic — build trust on every task

On macOS or Linux (terminal), run the line for your harness:

curl -fsSL https://seaotter.ai/install.sh | sh -s -- claude
curl -fsSL https://seaotter.ai/install.sh | sh -s -- codex
curl -fsSL https://seaotter.ai/install.sh | sh -s -- openclaw
curl -fsSL https://seaotter.ai/install.sh | sh -s -- cursor
curl -fsSL https://seaotter.ai/install.sh | sh -s -- hermes
curl -fsSL https://seaotter.ai/install.sh | sh -s -- git

How it works →

Wires the MCP otter_score tool + an end-of-task hook (Claude/Codex Stop, OpenClaw agent_end, git pre-push) + a standing instruction in AGENTS.md / CLAUDE.md / SOUL.md. Stdlib-only, idempotent.

On Windows, use PowerShell instead (swap claude for codex · openclaw · cursor · hermes · git):

& ([scriptblock]::Create((irm https://seaotter.ai/install.ps1))) claude

Already set up? On a Mac, click the menu-bar otter for status + uninstall. Anywhere: python3 ~/.otter/otter_install.py status, or remove it with curl -fsSL https://seaotter.ai/uninstall.sh | sh.

Fastest setup — one click, opens the app

Open in Claude Open in ChatGPT Add to Cursor Add to VS Code

Download the Claude skill ↓

Claude

Add to Claude

Open in Claude · Open in Claude Code

{
  "mcpServers": {
    "otterscore": { "url": "https://mcp.seaotter.ai/mcp" }
  }
}

Claude Code — Open in Claude Code above runs this for you, or copy it (uses your sk-otter key as a header):

claude mcp add --transport http otterscore https://mcp.seaotter.ai/mcp

One-click Add inside Claude arrives with our Connector Directory listing.

ChatGPT

Add to ChatGPT

Open the OtterScore GPT · Open ChatGPT with the setup prompt

One click opens the published GPT — already wired to grade. Or open a plain ChatGPT chat with the setup prompt prefilled and press send.

https://mcp.seaotter.ai/mcp

Agents & other MCP clients

Already on the official MCP registry as ai.seaotter/otterscore. Self-onboard with no human, then grade, iterate, and build a reputation other agents and orgs can verify on the directory:

curl -s https://api.seaotter.ai/api/v1/agent-keys/signup \
  -H 'Content-Type: application/json' -d '{"email":"you@example.com"}'

Full machine-readable contract: /llms.txt.

NATIVE DRIVER

Give the gate a body on the desktop

Signed & notarized · macOS 14+ · Human kill-switch · Structural metadata only

What you install

SeaOtter Driver Agent

The headless executor. Drives the app under test via the Accessibility API + synthetic input and emits a native_app_runtime_receipt per action. Needs the Accessibility grant.

SeaOtter Driver Menu Bar

Download · macOS 14+ · free

# no auth header needed: public download
curl -fsSL 'https://api.seaotter.ai/api/v1/app-drivers/native/download?asset=SeaOtterDriverAgent' -o SeaOtterDriverAgent.zip
curl -fsSL 'https://api.seaotter.ai/api/v1/app-drivers/native/download?asset=SeaOtterDriverMenuBar' -o SeaOtterDriverMenuBar.zip

# verify before installing: compare to distribution[].sha256 in the manifest
shasum -a 256 SeaOtterDriverAgent.zip

unzip -o SeaOtterDriverAgent.zip -d /Applications
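
An agent can do the shasum comparison itself instead of eyeballing the digest. The Python sketch below assumes the manifest is a JSON object with a `distribution` list whose entries carry `download_url` and `sha256`, as the field names above suggest. Matching an entry by a substring of its `download_url` is an assumption, and `sha256_of` and `matches_manifest` are hypothetical helper names.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 and return the lowercase hex digest."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_manifest(path: Path, manifest: dict, asset_hint: str) -> bool:
    """True if a distribution[] entry for this asset lists the file's digest.

    Assumption: entries are selected by `asset_hint` appearing in download_url.
    """
    actual = sha256_of(path)
    for entry in manifest.get("distribution", []):
        if asset_hint not in entry.get("download_url", ""):
            continue
        if entry.get("sha256", "").lower() == actual:
            return True
    return False
```

Calling `matches_manifest(Path("SeaOtterDriverAgent.zip"), manifest, "SeaOtterDriverAgent")` before unzipping gives a hard stop when the download does not match the published digest.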

Download for macOS · Install manifest (JSON)

Grant permissions (one time)

1 · Accessibility — required

Gates input + AX inspection. The driver is not healthy without it.

Open SeaOtter Driver Agent once (right-click → Open the first time).
System Settings → Privacy & Security → Accessibility.
Toggle SeaOtter Driver Agent on.

2 · Screen Recording — optional

Only enables screenshot evidence. NOT required for health.

System Settings → Privacy & Security → Screen Recording.
Toggle SeaOtter Driver Agent on if you want screenshot artifacts.

Verify it works

$ SeaOtterDriverAgent self-check
{
  "accessibilityTrusted": true,
  "screenRecordingGranted": false,
  "healthy": true
}

Re-run until healthy:true. accessibilityTrusted is the gate; screen recording only adds screenshots.
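
The re-run-until-healthy step can be scripted. This Python sketch assumes `SeaOtterDriverAgent` is on the PATH and that `self-check` prints only the JSON report shown above to stdout. The function names, the retry count, and the delay are hypothetical choices.

```python
import json
import subprocess
import time
from typing import Callable


def is_healthy(report: dict) -> bool:
    """Healthy requires the Accessibility grant; screen recording is ignored."""
    return bool(report.get("accessibilityTrusted")) and bool(report.get("healthy"))


def run_self_check(binary: str = "SeaOtterDriverAgent") -> dict:
    """Run the self-check subcommand and parse its JSON report from stdout."""
    proc = subprocess.run([binary, "self-check"], capture_output=True, text=True)
    return json.loads(proc.stdout)


def wait_until_healthy(check: Callable[[], dict] = run_self_check,
                       attempts: int = 10, delay_s: float = 3.0) -> bool:
    """Poll the self-check until it reports healthy or attempts run out."""
    for attempt in range(attempts):
        if is_healthy(check()):
            return True
        if attempt < attempts - 1:
            time.sleep(delay_s)  # give the user time to toggle the grant
    return False
```

`screenRecordingGranted` is deliberately left out of `is_healthy`, matching the rule that screen recording only adds screenshots.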

How it grades

register — AgentOS mints a native_gui DriverInstance (requiresLease=true) from `SeaOtterDriverAgent register`.
lease — ResolveDriver → BindDriver acquires an exclusive GUI-seat lease — one holder at a time.
drive — The driver runs your DriverRequest batch and emits a native_app_runtime_receipt per action (binary sha + frontmost-match).
verify — The verifier — not the injector — writes the terminal OtterScore verdict. A write with no frontmost-match never passes.

Wire protocol →

Use it with your harness

Any harness or script

Exec `SeaOtterDriverAgent run`: pipe a DriverRequest JSON to stdin and read a DriverResponse from stdout (exit 0 ok / 2 some-failed / 3 bad-input). No SDK, no language lock-in.
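
From Python, the stdin/stdout contract is a few lines of `subprocess`. The sketch below treats the DriverRequest and DriverResponse bodies as opaque JSON, since their fields are not specified here, and maps only the three exit codes listed above. `run_driver` and `EXIT_MEANING` are hypothetical names, and it assumes the binary is on the PATH.

```python
import json
import subprocess
from typing import Optional, Sequence, Tuple

# Exit codes as documented: 0 ok / 2 some-failed / 3 bad-input
EXIT_MEANING = {0: "ok", 2: "some-failed", 3: "bad-input"}


def run_driver(request: dict,
               command: Sequence[str] = ("SeaOtterDriverAgent", "run"),
               ) -> Tuple[str, Optional[dict]]:
    """Pipe a DriverRequest to the driver and return (status, DriverResponse).

    The response is None when the driver wrote nothing to stdout.
    """
    proc = subprocess.run(
        list(command),
        input=json.dumps(request),
        capture_output=True,
        text=True,
    )
    status = EXIT_MEANING.get(proc.returncode, f"unknown({proc.returncode})")
    body = proc.stdout.strip()
    return status, (json.loads(body) if body else None)
```

Returning the status alongside the parsed body lets a harness distinguish a partial failure (`some-failed`) from a malformed request (`bad-input`) without inspecting the response.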

AgentOS-managed

agentosd resolves it as a native_gui driver and drives it under a single-holder GUI-seat lease + verifier — the enforce / managed path.

MCP harnesses (planned)

A local MCP server exposing native_macos_app_state (look) and execute_native_macos_actions (act), wired into your harness via install.sh and mirroring the hosted OtterScore MCP.

For agents

# 1. discover
curl -fsSL https://seaotter.ai/.well-known/seaotter-driver-agent.json
# 2. download (per distribution[].download_url) — public, then
#    verify shasum -a 256 == distribution[].sha256
# 3. grant Accessibility once → SeaOtterDriverAgent self-check  # until healthy:true
# 4. SeaOtterDriverAgent register   # → create the native_gui DriverInstance
# 5. drive: stdin DriverRequest JSON → stdout DriverResponse JSON (exit 0 ok / 2 some-failed / 3 bad-input)

Discover: /.well-known/seaotter-driver-agent.json

VERDICT CONTRACT

The agent acts on one schema.

The verdict is designed for frontier agents, not screenshots of human review. It carries score, band, flaws, upgrades, anchors, rationale, and rich-feedback artifact refs the agent can use directly.

Verdict schema

{
  "score": 0.91,
  "band": "ship",
  "decision": "ship",
  "flaws": [
    { "criterion": "source_grounding", "severity": "high", "evidence": "Unsupported number", "detail": "The claim is not backed by the cited file", "anchor": { "kind": "span", "span": [418, 462] } }
  ],
  "upgrades": [
    { "action": "Replace the unsupported figure", "target_criterion": "source_grounding", "draft": "Use the cited value from page 2 instead." }
  ],
  "rationale": "Localized feedback so the agent can revise the exact failing region.",
  "feedback_artifacts": [{ "kind": "annotated_png", "ref": "artifact://..." }]
}
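
An agent consuming this schema mostly needs two things: a gate check on the band, and a worklist that pairs each flaw with the text it points at and the matching upgrade drafts. The Python sketch below assumes that only the `ship` band clears the gate and that a `span` anchor is a `[start, end)` pair of character offsets into the submitted text. Neither is confirmed by the schema, and `clears_gate` and `revision_tasks` are hypothetical names.

```python
CLEARING_BANDS = {"ship"}  # assumption: only "ship" clears the gate


def clears_gate(verdict: dict) -> bool:
    """True when the verdict's band is one the caller accepts."""
    return verdict.get("band") in CLEARING_BANDS


def revision_tasks(verdict: dict, artifact_text: str) -> list:
    """Pair each flaw with its anchored excerpt and the matching upgrade drafts."""
    # Index upgrades by the criterion they target.
    drafts_by_criterion = {}
    for upgrade in verdict.get("upgrades", []):
        drafts_by_criterion.setdefault(
            upgrade.get("target_criterion"), []
        ).append(upgrade.get("draft"))

    tasks = []
    for flaw in verdict.get("flaws", []):
        anchor = flaw.get("anchor") or {}
        excerpt = None
        if anchor.get("kind") == "span":
            # assumption: span holds [start, end) character offsets
            start, end = anchor["span"]
            excerpt = artifact_text[start:end]
        tasks.append({
            "criterion": flaw.get("criterion"),
            "severity": flaw.get("severity"),
            "detail": flaw.get("detail"),
            "excerpt": excerpt,
            "drafts": drafts_by_criterion.get(flaw.get("criterion"), []),
        })
    return tasks
```

Other anchor kinds (cells, slides, pages, timestamps) leave `excerpt` as `None` here; a fuller client would resolve each kind against its own artifact type.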

CONDITIONING

The critic is conditional on your bar.

OtterLoop is not a generic "is this good" score. The contract can condition the verdict on your organisation's policy, the prompt or intent the agent was given, and the reference files it must obey.

Organisation policy

Apply the right acceptance policy so the same artifact can clear one team and fail another for a defensible reason.

Prompt and intent

Carry the original ask into the critic so it judges the work against the assignment, not against a generic idealized answer.

Reference files

Brand guides, gold examples, source-of-truth docs, and previous iterations all become conditioning evidence.

Anchors localize to bbox, point, span, cell, slide, page, or timestamp.
The band is a runtime policy decision, not model prose pretending to be a gate.
Rich returns let the same verdict drive both human review and machine revision.

MODALITIES

Multimodal in. Rich multimodal out.

Try the live demo Browse rubrics

MODALITIES	RETURNS
Image or design frame	Annotated PNG plus flaw bounding boxes and a markdown report
Deck, PDF, or document	Annotated pages, per-page notes, and machine-readable anchors
Spreadsheet	Flagged cells, criterion-grounded notes, and structured deltas
Video or audio	Timestamp markers, captions, and localized rationale
Text or code	Span-anchored review with upgrade drafts the agent can apply