How to Evaluate AI-Generated Documents Before They Ship

An agent can write a report that reads like it came from your best analyst and is still wrong in three places. Here is how to catch that before a decision rests on it.

What breaks in AI-generated documents

The dangerous failures in agent-written documents are the ones that survive a read-through. Fabricated specifics are the worst offender: a memo cites "a 2023 McKinsey study" that does not exist, attributes a quote to a named executive who never said it, or references a regulation by a plausible-but-fake section number. The prose is confident and the citation is formatted correctly, so a reader skims past it. The fact is invented. A friendly reviewer checking whether the document reads well will not notice, because it reads perfectly.

Quietly wrong numbers are the second failure class, and they are specific to documents that synthesize data. The agent pulls a figure from a source, then restates it in the executive summary with a transposed digit, a wrong unit (millions where the source said thousands), a stale period (last year's revenue presented as current), or a percentage that does not reconcile with the table two pages down. Each number looks reasonable in isolation. The error only surfaces when someone cross-checks the summary against the body or the underlying source, which is exactly the work nobody has time to do at the volume agents now produce.
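
The cross-check described above is mechanical once the figures have been extracted, which is why it is worth automating. Here is a minimal Python sketch of that reconciliation, under stated assumptions: the `Figure` type, the unit table, the tolerance, and the sample numbers are all made up for illustration, and none of it is OtterScore's implementation.

```python
from dataclasses import dataclass

# Multipliers for the unit words this sketch understands (an assumption; extend as needed).
UNIT_SCALE = {"": 1.0, "thousand": 1e3, "million": 1e6, "billion": 1e9}


@dataclass(frozen=True)
class Figure:
    label: str    # what the number measures, e.g. "revenue"
    value: float  # the number as written in the document
    unit: str     # a key of UNIT_SCALE
    period: str   # reporting period, e.g. "FY2023"

    def absolute(self) -> float:
        """The value scaled to plain units so figures can be compared."""
        return self.value * UNIT_SCALE[self.unit]


def reconcile(summary, source, rel_tol=0.005):
    """Return one readable discrepancy per mismatch between summary and source figures."""
    problems = []
    for fig in summary:
        ref = source.get(fig.label)
        if ref is None:
            problems.append(f"{fig.label}: no source figure to check against")
            continue
        if fig.period != ref.period:
            problems.append(f"{fig.label}: period {fig.period} but source is {ref.period}")
        got, want = fig.absolute(), ref.absolute()
        if abs(got - want) > rel_tol * abs(want):
            problems.append(f"{fig.label}: summary says {got:,.0f}, source says {want:,.0f}")
    return problems


# Made-up sample data: one wrong unit, one transposed digit with a stale period.
source = {
    "revenue": Figure("revenue", 482.0, "thousand", "FY2023"),
    "headcount": Figure("headcount", 1250.0, "", "FY2023"),
}
summary = [
    Figure("revenue", 482.0, "million", "FY2023"),
    Figure("headcount", 1520.0, "", "FY2022"),
]
for line in reconcile(summary, source):
    print(line)
```

The hard part in practice is the extraction step this sketch skips: deciding which number in the summary corresponds to which number in the source.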

The subtler breaks are structural: claims with no citation at all ("the market is shifting toward X") presented with the same authority as sourced ones; a recommendation whose stated rationale does not actually follow from the evidence in the document; hedges silently dropped so a source's "may indicate" becomes the report's "shows"; and a confident conclusion built on a single weak data point. These pass any check that grades tone or fluency, because the document is fluent. The trajectory — where each claim came from — is where the failure lives.
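
To make the dropped-hedge failure concrete, here is a rough Python heuristic, assuming each report claim has already been paired with the source passage it came from. The hedge vocabulary and the `dropped_hedges` helper are hypothetical, and a keyword match like this is far cruder than a real critic; it only shows what "a qualifier went missing" looks like as a check.

```python
import re

# A small, illustrative hedge vocabulary; a real policy would define its own.
HEDGES = ("may", "might", "could", "suggests", "appears", "preliminary", "estimated")
HEDGE_RE = re.compile(r"\b(" + "|".join(HEDGES) + r")\b", re.IGNORECASE)


def hedges_in(text):
    """Return the set of hedge words found in the text, lowercased."""
    return {word.lower() for word in HEDGE_RE.findall(text)}


def dropped_hedges(pairs):
    """Yield (claim_id, lost_hedges) where the source hedged and the report did not.

    pairs is an iterable of (claim_id, source_text, report_text).
    """
    for claim_id, source_text, report_text in pairs:
        source_hedges = hedges_in(source_text)
        if source_hedges and not hedges_in(report_text):
            yield claim_id, sorted(source_hedges)


pairs = [
    ("c1", "The survey may indicate a shift toward subscription pricing.",
           "The survey shows a shift toward subscription pricing."),
    ("c2", "Churn fell two points in Q3.", "Churn fell two points in Q3."),
    ("c3", "Early results are preliminary and could change.", "Results might change."),
]
for claim_id, lost in dropped_hedges(pairs):
    print(claim_id, "lost qualifiers:", lost)
```

A word list will misfire on cases like the month "May", so treat the output as a pointer to passages worth rereading, not as a verdict.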

Why a hostile critic catches what a friendly judge approves

A general-purpose LLM judge is aligned to be helpful and agreeable. Asked "is this report good?", it pattern-matches on coherence, structure, and tone — all of which a well-written fabrication passes. It leans toward approval, so an authoritative-sounding memo gets a high score and a thumbs-up. That is the exact failure mode that lets ungrounded documents ship: the reviewer and the author are both optimizing for the same thing, plausibility.

OtterScore is aligned the other way. It is a hostile-by-default critic whose reward function is to find reasons to block, not to flatter. On a document it grades every claim against the acceptance policy you set: is each factual assertion grounded in a cited, checkable source; do the numbers in the summary reconcile with the body and the source data; are recommendations actually supported by the evidence presented; were any qualifiers dropped. It evaluates the trajectory — the claim-to-source chain — not just the finished prose. An uncited claim or an unreconciled number is a located flaw, not a stylistic note.

Every verdict comes back over the eval API or the hosted MCP server as a score from 0.0 to 1.0, a band (ship, route_to_fix, quarantine, block), the specific flaws with their locations in the document, and concrete upgrades to fix them. A report full of fabricated citations does not get "looks great, minor suggestions" — it lands in block or quarantine, with each unsupported claim named. You bring your own rubric or policy, so "acceptable" is your standard for a board memo or a research brief, not a generic vibe. And every verdict is signed audit evidence, so when a decision rests on the document you can prove what was checked and why it passed.
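
A caller usually wants to branch on the band, not re-read the prose. The Python sketch below shows one way to model a verdict and route on it. The four band names and the 0.0 to 1.0 score range come from the description above; the `Flaw` field names and the action attached to each band are assumptions for illustration, not the actual response schema.

```python
from dataclasses import dataclass, field

BANDS = ("ship", "route_to_fix", "quarantine", "block")  # band names as listed above


@dataclass
class Flaw:
    location: str     # hypothetical field: where in the document the flaw sits
    description: str  # hypothetical field: what is wrong
    upgrade: str      # hypothetical field: the suggested fix


@dataclass
class Verdict:
    score: float
    band: str
    flaws: list = field(default_factory=list)

    def __post_init__(self):
        # Reject malformed verdicts early instead of routing on bad data.
        if not 0.0 <= self.score <= 1.0:
            raise ValueError(f"score out of range: {self.score}")
        if self.band not in BANDS:
            raise ValueError(f"unknown band: {self.band}")


def next_step(verdict):
    """Map a band to a pipeline action. The actions here are this sketch's assumptions."""
    if verdict.band == "ship":
        return "publish"
    if verdict.band == "route_to_fix":
        fixes = "; ".join(f"{f.location}: {f.upgrade}" for f in verdict.flaws)
        return "revise: " + fixes
    if verdict.band == "quarantine":
        return "hold for human review"
    return "reject"


verdict = Verdict(
    score=0.41,
    band="route_to_fix",
    flaws=[Flaw("Executive summary, para 2",
                "revenue does not match Table 3",
                "restate using the Table 3 figure")],
)
print(next_step(verdict))
```

Validating the band and score on construction means an unexpected response fails loudly at the boundary instead of silently falling through to a default action.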

Grade it before it ships

OtterScore is agent-native: an agent can self-onboard and iterate to a passing band with no human in the loop. Canonical contract: /llms.txt.

# 1. get a free key (no human in the loop)
curl -s https://api.seaotter.ai/api/v1/agent-keys/signup \
  -H 'Content-Type: application/json' -d '{"email":"you@example.com"}'
# -> { "api_key": "sk-otter-...", "free_quota": 25 }
export OTTER_KEY='<the api_key from the response above>'

# 2. grade your work (async -- tolerates the GPU cold-start)
curl -s https://api.seaotter.ai/api/v1/eval/jobs \
  -H "Authorization: Bearer $OTTER_KEY" -H 'Content-Type: application/json' \
  -d '{"submission":"async","user_prompt":"<what the work was for>",
       "artifact_parts":[{"mime_type":"text/plain","text":"<your work>"}]}'
# -> { "job_id": "...", "status": "queued" }
JOB_ID='<the job_id from the response above>'

# 3. poll until completed (located flaws come from GET /api/v1/eval/runs/{run_id})
curl -s https://api.seaotter.ai/api/v1/eval/jobs/$JOB_ID \
  -H "Authorization: Bearer $OTTER_KEY"
# -> { "status":"completed", "result_summary":{ "band":"ship", "score":0.95 }, "run_id":"..." }
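
Step 3 is a single poll; an agent needs a loop. Here is a minimal Python sketch of that loop using only the standard library. The endpoint, the bearer header, and the `status` and `result_summary` fields are taken from the curl calls above; the backoff schedule, the attempt limit, and the choice to treat anything other than "completed" as still pending are assumptions of this sketch.

```python
import json
import time
import urllib.request

API_BASE = "https://api.seaotter.ai/api/v1"  # base URL taken from the curl calls above


def fetch_job(job_id, api_key):
    """GET the job resource, mirroring step 3 of the curl walkthrough."""
    req = urllib.request.Request(
        f"{API_BASE}/eval/jobs/{job_id}",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def wait_for_job(job_id, api_key, fetch=fetch_job, sleep=time.sleep,
                 first_delay=2.0, max_delay=30.0, max_attempts=20):
    """Poll until status is "completed", doubling the delay between attempts.

    The delays and attempt limit are illustrative defaults, not documented values.
    fetch and sleep are injectable so the loop can be exercised without a network.
    """
    delay = first_delay
    for _ in range(max_attempts):
        job = fetch(job_id, api_key)
        if job.get("status") == "completed":
            return job
        sleep(delay)
        delay = min(delay * 2, max_delay)
    raise TimeoutError(f"job {job_id} not completed after {max_attempts} polls")
```

Called as `wait_for_job(job_id, api_key)`, it returns the completed job body, from which a caller can read `result_summary` and `run_id`. The growing delay is there to ride out the cold-start mentioned in step 2 without hammering the endpoint.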

Prefer MCP? Connect the hosted server by URL, no install: https://mcp.seaotter.ai/mcp. Bring your own rubric/policy so the gate enforces your bar.

Frequently asked questions

How do you evaluate an AI-generated document for accuracy?

Grade it against an explicit acceptance policy instead of reading for tone. Send the document to the OtterScore eval API or hosted MCP server with a rubric that requires every factual claim to be grounded in a checkable source, every number in the summary to reconcile with the body, and every recommendation to follow from the evidence. You get back a score (0.0-1.0), a band, the located flaws, and concrete upgrades. The critic checks the claim-to-source trajectory, not just whether the prose reads well.

Why does a regular LLM approve documents with fabricated facts?

General-purpose models are aligned to be helpful and agreeable, so they grade on coherence and fluency — both of which a confident fabrication passes. A fake citation that is formatted correctly and reads authoritatively gets approved. OtterScore is hostile-by-default: its reward is to find reasons to block, so it checks each claim against your grounding policy and flags the uncited or unsupported ones rather than rewarding plausibility.

Can it catch wrong numbers in a report?

Yes. Quietly wrong numbers are a primary failure mode it targets. With a policy that requires reconciliation, it flags figures in the executive summary that do not match the body or the cited source: transposed digits, wrong units (millions vs thousands), stale periods, and percentages that do not add up. Each flaw is located in the document so a reviewer goes straight to the discrepancy instead of cross-checking every figure by hand.

Can I evaluate documents against my own standards, not a generic rubric?

Yes. Bring your own rubric or acceptance policy — what counts as shippable for a board memo, a research brief, or a finance report is your call, including which claims must be cited, how numbers must reconcile, and what hard rules block outright. The policy can be passed inline or stored, and it applies on every grade. Custom bands and hard rules are tighten-only, so a policy can raise the bar but cannot rubber-stamp work the baseline would reject.
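
"Tighten-only" is easiest to see as a merge rule. The Python sketch below illustrates the idea under loud assumptions: the `Policy` type, its two fields, the rule names, and the threshold numbers are all hypothetical, and this is not OtterScore's policy schema. It only shows how combining a custom policy with a baseline can raise the bar but never lower it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    ship_min: float        # hypothetical field: minimum score needed to land in "ship"
    hard_rules: frozenset  # hypothetical field: rule names that block outright


def tighten(baseline, custom):
    """Combine two policies so the result is never looser than the baseline."""
    return Policy(
        # A custom threshold below the baseline is ignored; a higher one wins.
        ship_min=max(baseline.ship_min, custom.ship_min),
        # Custom hard rules are added; baseline hard rules cannot be removed.
        hard_rules=baseline.hard_rules | custom.hard_rules,
    )


# Made-up policies for illustration.
baseline = Policy(0.80, frozenset({"no_fabricated_citations"}))
board_memo = Policy(0.92, frozenset({"summary_numbers_reconcile"}))
lax = Policy(0.50, frozenset())

print(tighten(baseline, board_memo))
print(tighten(baseline, lax) == baseline)
```

Taking the maximum threshold and the union of rules means a lax custom policy collapses back to the baseline, which is the behavior the answer above describes.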