
Evaluate AI-Generated Slide Decks Before You Ship Them

A hostile-by-default critic that grades agent-built presentations on narrative, numeric consistency, and claim support — before a customer or board ever sees them.

What breaks in agent-generated decks that a quick read-through misses

The most expensive failure is the deck that looks finished. An agent produces clean layouts, consistent typography, and confident headlines — and underneath, the argument does not hold. The narrative arc is the first thing to go: each slide is locally fluent, but there is no through-line. Slide 4 says the problem is cost, slide 9 pitches a feature that addresses speed, and the ask ties to neither. A friendly reader skims the titles, sees plausible words, and signs off. The room reads it as a non-sequitur with a logo on it.

Numbers drift across slides because the agent assembled them from separate passes with no shared source of truth. The TAM is $4.2B on the market slide and $3.8B in the appendix. A headline reads 'up 40% YoY' while the chart beneath it plots a 28% rise. Pie-chart segments sum to 103%. The revenue projection on slide 11 silently contradicts the unit economics on slide 7. Each number looks plausible where it sits; checked against the others, the deck is incoherent, and the one person in the room who does the arithmetic is the one you most need to convince.
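The cross-slide arithmetic described above can be sketched in a few lines of Python. This is a minimal illustration, not OtterScore's implementation: it assumes the figures have already been extracted into (slide, metric, value) tuples, and the slide numbers, metric names, and tolerances are hypothetical.

```python
from collections import defaultdict

# Hypothetical extracted figures: (slide, metric, value). Slide numbers and
# metric names are made up for illustration; the values echo the examples above.
FIGURES = [
    (5, "tam_usd_billions", 4.2),    # market slide
    (18, "tam_usd_billions", 3.8),   # appendix
    (6, "yoy_growth_pct", 40.0),     # headline
    (6, "yoy_growth_pct", 28.0),     # chart beneath the headline
]


def find_conflicts(figures, rel_tol=0.01):
    """Return {metric: [(slide, value), ...]} for metrics whose values disagree."""
    by_metric = defaultdict(list)
    for slide, metric, value in figures:
        by_metric[metric].append((slide, value))

    conflicts = {}
    for metric, entries in by_metric.items():
        values = [value for _, value in entries]
        spread = max(values) - min(values)
        scale = max(abs(v) for v in values)
        # Flag the metric when the spread exceeds rel_tol of the largest value.
        if spread > rel_tol * scale:
            conflicts[metric] = entries
    return conflicts


def pie_total_ok(segments_pct, tol=0.5):
    """True when pie segments sum to 100% within a rounding tolerance."""
    return abs(sum(segments_pct) - 100.0) <= tol


if __name__ == "__main__":
    for metric, entries in find_conflicts(FIGURES).items():
        print(metric, entries)
    print(pie_total_ok([40.0, 35.0, 28.0]))  # these segments sum to 103
```

The hard part in practice is the extraction step this sketch skips: deciding that two numbers on different slides refer to the same quantity.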

Then there are the claims with nothing behind them. 'Industry-leading,' 'proven at scale,' '10x faster,' a market-size figure with no citation, a competitor comparison with no source. The agent generated text that pattern-matches a strong deck without ever grounding it. On a customer or board deck these are not matters of style: they are the lines that get challenged live, and an unsupported claim that gets caught in the room poisons trust in every other number you presented.

Why a hostile critic catches what a helpful LLM judge waves through

A friendly LLM judge is aligned to be agreeable: hand it a polished deck and it pattern-matches surface quality — clean slides, confident copy, a recognizable structure — and returns approval. That is exactly the failure mode that ships a broken deck, because the deck's problems are not on any single slide. They live in the relationships between slides: the number on 7 versus the number on 11, the problem on 4 versus the ask on 14. A judge that scores slides one at a time, and is looking for reasons to approve, will never see them.

OtterScore is aligned the other way: its reward is to find a reason to block. It grades the deck against an explicit acceptance policy — your rubric for narrative coherence, numeric consistency, and claim support — and reads the deck as a whole rather than slide by slide, cross-referencing figures across slides, tracing the argument from problem to ask, and flagging every claim that has no grounding. It returns a score from 0.0 to 1.0, a band — ship, route_to_fix, quarantine, or block — the located flaws (which slide, which number, which unsupported line), and concrete upgrades. When the deck is the output of a deck-building agent, OtterScore can also grade that agent's trajectory — the steps it took to assemble the deck — not just the final file. Every verdict is signed audit evidence, so 'this deck was reviewed against policy' is a record, not a vibe.

Because it is hostile by default, a clean-looking deck does not earn a pass — it has to survive the critic. A 0.0 on a contradicting projection or an uncited TAM is the point: you would rather the block come from the API than from the prospect across the table.

Grade it before it ships

OtterScore is agent-native: an agent can self-onboard and iterate to a passing band with no human in the loop. The canonical contract is at /llms.txt.

# 1. get a free key (no human in the loop)
curl -s https://api.seaotter.ai/api/v1/agent-keys/signup \
  -H 'Content-Type: application/json' -d '{"email":"you@example.com"}'
# -> { "api_key": "sk-otter-...", "free_quota": 25 }
export OTTER_KEY='<api_key from the response>'

# 2. grade your work (async -- tolerates the GPU cold-start)
curl -s https://api.seaotter.ai/api/v1/eval/jobs \
  -H "Authorization: Bearer $OTTER_KEY" -H 'Content-Type: application/json' \
  -d '{"submission":"async","user_prompt":"<what the work was for>",
       "artifact_parts":[{"mime_type":"text/plain","text":"<your work>"}]}'
# -> { "job_id": "...", "status": "queued" }
JOB_ID='<job_id from the response>'

# 3. poll until completed (located flaws come from GET /api/v1/eval/runs/{run_id})
curl -s https://api.seaotter.ai/api/v1/eval/jobs/$JOB_ID \
  -H "Authorization: Bearer $OTTER_KEY"
# -> { "status":"completed", "result_summary":{ "band":"ship", "score":0.95 }, "run_id":"..." }
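
For agents that call the API from code rather than a shell, the same submit-and-poll flow might look like the following Python sketch. The endpoints, headers, and request fields mirror the curl calls above. The function names, polling interval, and poll cap are illustrative assumptions, and only the queued and completed statuses shown above are handled.

```python
import json
import time
import urllib.request

BASE_URL = "https://api.seaotter.ai/api/v1"


def build_job_payload(user_prompt, deck_text):
    """Build the request body used in step 2 of the curl example."""
    return {
        "submission": "async",
        "user_prompt": user_prompt,
        "artifact_parts": [{"mime_type": "text/plain", "text": deck_text}],
    }


def request_json(method, url, api_key, body=None):
    """Send one JSON request with the bearer key and decode the JSON reply."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    req = urllib.request.Request(
        url,
        data=data,
        method=method,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def grade_deck(user_prompt, deck_text, api_key, send=request_json,
               sleep=time.sleep, interval_s=2.0, max_polls=60):
    """Submit a deck as an async job, then poll until its status is completed.

    `send` and `sleep` are injectable so the loop can be exercised offline.
    Any status other than "completed" is treated as still pending (an assumption).
    """
    job = send("POST", f"{BASE_URL}/eval/jobs", api_key,
               build_job_payload(user_prompt, deck_text))
    job_url = f"{BASE_URL}/eval/jobs/{job['job_id']}"
    for _ in range(max_polls):
        job = send("GET", job_url, api_key)
        if job.get("status") == "completed":
            return job  # carries result_summary (band, score) and run_id
        sleep(interval_s)
    raise TimeoutError(f"job did not complete after {max_polls} polls")
```

The returned job carries `run_id`, which is what the runs endpoint mentioned in step 3 needs to fetch the located flaws.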

Prefer MCP? Connect the hosted server by URL, no install: https://mcp.seaotter.ai/mcp. Bring your own rubric/policy so the gate enforces your bar.

Frequently asked questions

How do I check an AI-generated deck for inconsistent numbers across slides?

Send the deck to OtterScore's eval API or the hosted MCP server with a rubric that requires numeric consistency. The critic cross-references figures across every slide — TAM, growth rates, projections, chart percentages — and flags contradictions like a $4.2B market on one slide and $3.8B on another, or a '40% YoY' headline over a chart that plots 28%. It returns the specific slides and values in conflict, plus a score and band, not just a pass/fail.

Can it tell whether a deck's narrative actually holds together?

Yes. OtterScore reads the deck as a whole rather than scoring slides in isolation, so it traces the argument from problem statement to the ask and flags where the through-line breaks — for example a problem framed as cost on slide 4 and a solution pitched on speed on slide 9. A friendly LLM judge that scores slides one at a time misses this; reading the deck end to end is the point.

Will it flag claims that have no support?

That is a core check. The critic locates unsupported claims — 'industry-leading,' '10x faster,' uncited market sizes, competitor comparisons with no source — and reports each one with the slide it appears on and a concrete upgrade, such as adding a citation or softening to what the evidence supports. These are the lines that get challenged live in front of a board or customer.

How do I run this in an agent workflow that produces decks?

Call the OtterScore eval API or the hosted MCP server (https://mcp.seaotter.ai/mcp) as a gate after your deck-building agent finishes. You get a score (0.0–1.0), a band — ship, route_to_fix, quarantine, or block — located flaws, and upgrades, so the agent can route a failing deck back to be fixed before a human ever sees it. A free self-serve API key gets you started, and you can bring your own rubric or policy to encode your team's acceptance standard.
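
A gate like this might be wired into the agent loop roughly as follows. The sketch assumes hypothetical `grade` and `revise` callables supplied by your own pipeline and reads `band` from the `result_summary` shape shown in the polling example. Treating quarantine and block as stop conditions is an assumption about how those bands should be handled, not documented behavior.

```python
def run_gate(deck, grade, revise, max_rounds=3):
    """Grade a deck, revise it on route_to_fix, and stop on any other failing band.

    grade(deck) -> verdict dict containing result_summary.band (hypothetical wrapper
    around the eval API); revise(deck, verdict) -> new deck (your agent's fix step).
    Returns (deck, verdict, shipped).
    """
    verdict = grade(deck)
    for _ in range(max_rounds):
        band = verdict["result_summary"]["band"]
        if band == "ship":
            return deck, verdict, True
        if band != "route_to_fix":
            # quarantine or block: stop and leave the decision to a human (assumption)
            return deck, verdict, False
        deck = revise(deck, verdict)
        verdict = grade(deck)
    # Revision budget exhausted: report whatever the last verdict says.
    return deck, verdict, verdict["result_summary"]["band"] == "ship"
```

Capping the number of revision rounds keeps an agent from looping indefinitely against a rubric it cannot satisfy.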
