Grade AI Marketing Copy: Catch Unsupported Claims Before Launch

Score agent-written ad copy, landing pages, and emails against your brand and legal policy — and block the fluent claims no one can back up.

What breaks in AI-written marketing copy

Agent-written copy fails in ways that read perfectly. The model produces a fluent, confident, on-brand-sounding paragraph and buries an unsubstantiated claim inside it: "the #1 platform," "2x faster than competitors," "trusted by thousands," "clinically proven." Nothing in the brief substantiated any of it. The number was invented to fill the rhythm of the sentence, and a comparative or superlative claim you can't cite is a legal exposure, not a tagline. A human skim approves it because it sounds like every ad they've ever read.

Voice and compliance drift are just as invisible. The copy lands off-brand — too casual for a regulated insurer, too hype-heavy for an enterprise security buyer, hedged into mush for a brand that's supposed to be direct — but it's grammatical and well-structured, so a quick read waves it through. Worse, the agent silently drops the disclaimer: the "results may vary," the APR footnote, the "paid partnership" tag, the eligibility caveat. The omission is invisible by definition — you can't see the line that isn't there — and the channel that needed it (financial promotion, health claim, sweepstakes) is exactly the one with regulatory teeth.

Then there's the trajectory. The final copy may look clean while the path that produced it pulled a competitor's claim verbatim, fabricated a statistic to satisfy a "make it punchier" instruction, or ignored the brand guide the brief explicitly attached. Grading only the output misses how the claim got there — and whether it can survive a legal or brand challenge after it ships.

Why a hostile critic catches what a friendly judge approves

A general-purpose LLM judge is aligned to be agreeable. Ask it "is this good marketing copy?" and it pattern-matches the polish — confident voice, clean structure, a strong hook — and returns approval. It is not looking for the load-bearing claim that has no source, because nothing in its training rewards refusing fluent text. That's the exact failure mode that ships an unprovable superlative to a paid channel.

OtterScore is aligned the other way: its reward is to find reasons to block, not reasons to flatter. It grades the copy AND its trajectory against your explicit acceptance policy — your brand voice rules, your claim-substantiation requirements, your mandatory disclaimers for each channel and jurisdiction. It treats every comparative claim, statistic, and superlative as guilty until cited, flags voice drift against your actual guide rather than a generic sense of "good," and checks that required legal lines are present, not just plausible-sounding.

The output is a decision, not a vibe. Each verdict carries four things: a score from 0.0 (must block) to 1.0 (ship); a band (ship, route_to_fix, quarantine, or block); located flaws tied to the specific sentence; and concrete upgrades. "Trusted by thousands" with no citation routes to fix; a missing financial-promotion disclaimer quarantines; a fabricated comparative stat blocks. Every verdict is signed audit evidence, so when legal or brand asks why a piece shipped, you have a record, not a screenshot of a chat that said it looked fine.
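
One way to wire the band into a release pipeline is a small shell gate. This is a sketch, not OtterScore's own tooling: the four band names come from the verdict format described above, while the function name `gate_band`, the action labels, and the exit codes are hypothetical choices for illustration.

```shell
# Hypothetical helper: turn a verdict band into a pipeline action and exit code.
# Band names are from the verdict format; labels and exit codes are assumptions.
gate_band() {
  case "$1" in
    ship)         echo "publish";             return 0 ;;
    route_to_fix) echo "send back for edits"; return 1 ;;
    quarantine)   echo "hold for review";     return 2 ;;
    block)        echo "reject";              return 3 ;;
    *)            echo "unknown band: $1" >&2; return 64 ;;
  esac
}
```

Only `ship` returns 0, so a CI step that calls `gate_band` fails closed on every other band, including one it does not recognize.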

Grade it before it ships

OtterScore is agent-native: an agent can self-onboard and iterate to a passing band with no human in the loop. Canonical contract: /llms.txt.

# 1. get a free key (no human in the loop)
curl -s https://api.seaotter.ai/api/v1/agent-keys/signup \
  -H 'Content-Type: application/json' -d '{"email":"you@example.com"}'
# -> { "api_key": "sk-otter-...", "free_quota": 25 }

# 2. grade your work (async -- tolerates the GPU cold-start)
curl -s https://api.seaotter.ai/api/v1/eval/jobs \
  -H "Authorization: Bearer $OTTER_KEY" -H 'Content-Type: application/json' \
  -d '{"submission":"async","user_prompt":"<what the work was for>",
       "artifact_parts":[{"mime_type":"text/plain","text":"<your work>"}]}'
# -> { "job_id": "...", "status": "queued" }

# 3. poll until completed (located flaws come from GET /api/v1/eval/runs/{run_id})
curl -s https://api.seaotter.ai/api/v1/eval/jobs/$JOB_ID \
  -H "Authorization: Bearer $OTTER_KEY"
# -> { "status":"completed", "result_summary":{ "band":"ship", "score":0.95 }, "run_id":"..." }
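
In practice step 3 runs in a loop. The sketch below polls the job endpoint shown above until `status` is `completed`, then prints the final JSON. The helper names (`fetch_job`, `wait_for_job`), the default retry count and delay, and the return codes are assumptions for illustration; only the endpoint and the `status` field come from the commands above. The status is extracted with `sed` to avoid extra dependencies, which is fragile; a real JSON parser would be sturdier.

```shell
# Hypothetical wrapper around step 3: prints the raw job JSON for one job id.
fetch_job() {
  curl -s "https://api.seaotter.ai/api/v1/eval/jobs/$1" \
    -H "Authorization: Bearer $OTTER_KEY"
}

# Poll until the job reports "completed".
# Usage: wait_for_job JOB_ID [MAX_TRIES] [DELAY_SECONDS]
# Returns 0 and prints the job JSON on completion, 1 if the tries run out.
wait_for_job() {
  local job_id="$1"
  local max_tries="${2:-30}"   # assumed default, not from the API contract
  local delay="${3:-5}"        # assumed default, in seconds
  local i=0 body status
  while [ "$i" -lt "$max_tries" ]; do
    body=$(fetch_job "$job_id")
    # Naive match on the first "status" value; adequate for the flat responses shown.
    status=$(printf '%s\n' "$body" |
      sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1)
    if [ "$status" = "completed" ]; then
      printf '%s\n' "$body"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "job $job_id not completed after $max_tries polls" >&2
  return 1
}
```

With `OTTER_KEY` exported, `wait_for_job "$JOB_ID" 60 5` would keep polling for roughly five minutes before giving up, which leaves room for the GPU cold-start mentioned in step 2.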

Prefer MCP? Connect the hosted server by URL, no install: https://mcp.seaotter.ai/mcp. Bring your own rubric/policy so the gate enforces your bar.

Frequently asked questions

How do you grade AI marketing copy for unsupported claims?

Send the copy to the OtterScore eval API or the hosted MCP server with an acceptance policy that defines your claim-substantiation rules. The critic treats every comparative claim, statistic, and superlative as requiring a citation and flags any that the copy or its provided source material can't support. You get back a score from 0.0 to 1.0, a band, the exact sentence carrying the unsupported claim, and a concrete upgrade to fix it.

Can it check copy against our specific brand voice and disclaimers?

Yes. OtterScore is bring-your-own-policy: you supply a rubric encoding your brand voice rules, mandatory disclaimers per channel and jurisdiction, and prohibited claim types. The critic grades against that policy rather than a generic notion of good copy, so it catches voice drift and missing legal lines that a default model would wave through. Policies can be passed inline per request or stored and reused.

How is this different from asking ChatGPT to review the copy?

A general LLM is aligned to be helpful and agreeable, so it tends to approve fluent, on-brand-sounding text and rarely refuses a confident claim. OtterScore is a hostile-by-default critic — its reward is to find reasons to block, not to flatter — and it grades both the final copy and the trajectory that produced it. That inversion is what catches the invented statistic and the dropped disclaimer instead of complimenting the hook.

What does a verdict include and is it auditable?

Each verdict returns a numeric score from 0.0 (must block) to 1.0 (ship), a band (ship, route_to_fix, quarantine, or block), located flaws tied to specific sentences, and concrete upgrades to fix them. Every verdict is recorded as signed audit evidence, so when legal, brand, or compliance asks why a piece of copy shipped or was blocked, you have a defensible record rather than an informal sign-off.
