Evaluate AI-generated code before you merge it

Coding agents ship faster than review can keep up — and they fail in ways green tests don’t catch. Put a hostile code critic in front of them.

Green tests, real bugs

The dangerous failure mode of a coding agent isn’t code that obviously breaks. It’s code that passes the tests and is still wrong. Calls to an API with the wrong signature, or to one that doesn’t exist. Logic that is locally plausible but incorrect. Silent assumption changes, missing error handling, a confidently introduced security hole. The code reads well, which is exactly why a friendly LLM review waves it through.

Grade the code, don’t flatter it

OtterScore is a hostile-by-default critic: it is aligned to find reasons to block, it grades against your acceptance criteria, and it looks at both the diff and the task the diff was meant to satisfy. For a chunk of AI-written code it returns a score, a band (ship / route_to_fix / quarantine / block), the located flaws with line context, and concrete fixes.

Gate it in the loop (before the PR)

Agent-native: a coding agent can self-grade and iterate to a passing band before it ever opens a pull request. Canonical contract: /llms.txt.

# grade a diff against the task it was for
curl -s https://api.seaotter.ai/api/v1/eval/jobs \
  -H "Authorization: Bearer $OTTER_KEY" -H 'Content-Type: application/json' \
  -d '{"submission":"async","modality":"code",
       "user_prompt":"Add idempotent retry to the payment webhook",
       "artifact_parts":[{"mime_type":"text/plain","text":"<the diff>"}]}'
# -> { "job_id":"...", "status":"queued" }

# poll until the job completes (set JOB_ID to the job_id returned above)
curl -s https://api.seaotter.ai/api/v1/eval/jobs/$JOB_ID -H "Authorization: Bearer $OTTER_KEY"
# -> { "status":"completed",
#      "result_summary":{ "band":"route_to_fix", "score":0.45, "flaw_count":1 }, "run_id":"..." }

# fetch the located flaws (criterion / severity / evidence / detail / anchor),
# with RUN_ID set to the run_id from the completed job:
curl -s https://api.seaotter.ai/api/v1/eval/runs/$RUN_ID -H "Authorization: Bearer $OTTER_KEY"

Feed the flaws back, regenerate, re-grade — or wire the same call into CI and branch on the band before merge. Prefer MCP? Connect the hosted server by URL, no install: https://mcp.seaotter.ai/mcp.
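The poll step in the walkthrough above can be sketched as a small shell loop. This is a sketch under stated assumptions, not a documented client: it relies only on the `status` and `band` fields shown in the example responses, treats any status other than `completed` as "keep waiting" (failure statuses are not listed on this page), and gives up after a retry budget. The names `fetch_job`, `json_field`, `wait_for_band`, `MAX_TRIES`, and `POLL_SECS` are hypothetical, and the `sed` extraction is a crude stand-in for a real JSON parser.

```shell
# Fetch the job document. The URL and header are the ones used in the walkthrough.
fetch_job() {
  curl -s "https://api.seaotter.ai/api/v1/eval/jobs/$1" \
    -H "Authorization: Bearer $OTTER_KEY"
}

# Crude extraction of a string-valued field from JSON on stdin.
# Assumption: the key appears once and its value is a plain quoted string.
json_field() {
  sed -n "s/.*\"$1\"[[:space:]]*:[[:space:]]*\"\([^\"]*\)\".*/\1/p" | head -n 1
}

# Poll until the job reports status "completed", then print its band.
# Returns 0 on completion, 1 if the retry budget runs out, 2 if the fetch fails.
wait_for_band() {
  job_id=$1
  tries=0
  while [ "$tries" -lt "${MAX_TRIES:-30}" ]; do
    body=$(fetch_job "$job_id") || return 2
    status=$(printf '%s\n' "$body" | json_field status)
    if [ "$status" = "completed" ]; then
      printf '%s\n' "$body" | json_field band
      return 0
    fi
    tries=$((tries + 1))
    sleep "${POLL_SECS:-2}"
  done
  return 1
}
```

A caller would then write something like `BAND=$(wait_for_band "$JOB_ID")` and decide what to do with the result. Keeping the fetch in its own function makes the loop easy to exercise against a stubbed response without touching the network.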

Frequently asked questions

Why isn't passing tests enough for AI-generated code?

Tests check what you thought to assert. Coding agents produce code that passes the tests yet hides real bugs: calls to APIs that don't exist, logic that's locally plausible but wrong, edge cases the tests never covered. Green CI on AI code is necessary but not sufficient; you also need a hostile critic to evaluate the code against your acceptance criteria.

What flaws are specific to AI-generated code?

Hallucinated or wrong-signature API calls, plausible-but-incorrect logic, silent assumption changes, missing error handling, security issues introduced confidently, and code that looks right but doesn't match the dependency that is actually installed. A friendly LLM review tends to approve these because the code reads well.

How do I evaluate AI-generated code automatically?

Send the diff or file plus the task it was for to a code-aware critic. OtterScore returns a score, a band (ship / route_to_fix / quarantine / block), the located flaws with line context, and concrete fixes. It is available over an HTTP API or a hosted MCP server, so a coding agent can self-grade and iterate before opening a PR.

Can it gate AI code in CI?

Yes. Call the critic in your pipeline after the agent writes code and before merge; branch on the band — accept on ship, loop the flaws back to the agent on route_to_fix, block what fails. Every verdict is recorded as signed audit evidence.
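One way to branch on the band in a pipeline step is a plain `case` statement. The sketch below is illustrative only: the function name `gate_on_band`, the decision words it prints, and the exit codes are hypothetical conventions, not part of the API. Treating both `quarantine` and `block` as a rejection is also an assumption, since this page only says to "block what fails".

```shell
# Map a band to a pipeline decision.
# Exit codes are an illustrative convention: 0 accept, 10 retry, 1 reject, 2 unknown.
gate_on_band() {
  case "$1" in
    ship)             echo "accept";  return 0 ;;   # merge may proceed
    route_to_fix)     echo "retry";   return 10 ;;  # loop the flaws back to the agent
    quarantine|block) echo "reject";  return 1 ;;   # assumption: both stop the merge
    *)                echo "unknown"; return 2 ;;   # fail closed on anything unexpected
  esac
}
```

In a CI job, the last line of the step could be `gate_on_band "$BAND"`, where `BAND` holds the band from the completed job, so any band other than `ship` fails the step. The wildcard branch fails closed, so a typo or a band added later does not silently pass the gate.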
