Evaluate AI Customer Support: Grade Agent Replies Pre-Send

A hostile-by-default critic that grades each support agent reply and its trajectory against your refund, account, and compliance policy — before it reaches a customer.

What breaks in AI support replies that still read as fine

The dangerous failures in support automation are fluent. A reply that opens with "I completely understand how frustrating this is" and closes with a clean next step reads like a strong answer — and quietly promises a refund the customer doesn't qualify for, issues a goodwill credit above the agent's authority, or commits to a 24-hour resolution your SLA doesn't cover. Tone is right, policy is wrong, and the customer now has a written promise you have to honor or walk back.

The second failure mode is confident inaccuracy about the account. The agent states a balance, a renewal date, a plan tier, or an order status that doesn't match the system of record — usually because it answered from the ticket text or a stale context window instead of the live account. A friendly reader skims it and approves; the customer acts on a wrong fact. Adjacent to this: compliance and regulatory statements asserted as fact ("you're fully covered," "this is GDPR-compliant," "we can cancel that with no fee") where the agent has no grounding and no authority to make the claim.

The third is the trajectory, not just the text. The final reply can be clean while the path to it was wrong: the agent skipped an identity-verification step before discussing account details, never checked the entitlement before offering the credit, or claimed a tool result it never actually retrieved. Grading the output alone passes these, because the mistake happened upstream and the fluent final reply no longer shows it.

Why a hostile critic catches what a friendly LLM judge approves

A general-purpose LLM judge is aligned to be helpful and agreeable, so it rewards exactly the traits that make a bad support reply dangerous: empathy, confidence, and a tidy resolution. Ask it "is this a good reply?" and an empathetic, well-structured message that grants an off-policy refund scores high — the judge has no reason to look for the policy violation, because nothing in the reply looks like an error. OtterScore is aligned the opposite way: its reward function is to find reasons to block, so it reads the same reply against your refund matrix, your account-fact grounding requirements, and your compliance rules, and surfaces the violation as a located flaw rather than a vibe.

The verdict is bound to an explicit acceptance policy: your refund thresholds, which claims require system-of-record grounding, and which statements an agent is never authorized to make. Against that policy, OtterScore returns a score from 0.0 to 1.0 and a band: ship, route_to_fix, quarantine, or block. An off-policy credit or an ungrounded compliance assertion lands in block or quarantine with the specific flaw and a concrete upgrade, not a soft suggestion. It grades the trajectory too, so a skipped verification step or a claimed-but-unretrieved tool result fails even when the final wording is flawless.
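
As a rough illustration of how a caller might act on the band, here is a minimal Python sketch. The four band names and the 0.0 to 1.0 score range come from the text above; the `Verdict` class, the `route` function, and the action names are hypothetical and would be replaced by whatever your support pipeline actually does.

```python
from dataclasses import dataclass

# Band names are taken from the text; the action names are hypothetical.
ACTIONS = {
    "ship": "send_to_customer",
    "route_to_fix": "return_to_agent_for_revision",
    "quarantine": "hold_for_human_review",
    "block": "discard_and_escalate",
}


@dataclass(frozen=True)
class Verdict:
    score: float  # expected to be within 0.0 and 1.0
    band: str     # expected to be one of the keys in ACTIONS


def route(verdict: Verdict) -> str:
    """Return the pipeline action for a verdict."""
    if not 0.0 <= verdict.score <= 1.0:
        raise ValueError(f"score out of range: {verdict.score}")
    # Fail closed: an unrecognized band is held for review, never sent.
    return ACTIONS.get(verdict.band, "hold_for_human_review")
```

Defaulting unknown bands to a human hold is a design choice in this sketch: a gate in front of customers should fail closed if the response shape ever changes.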

Every verdict is signed audit evidence, which matters in a support org where a refund or a compliance statement is a real liability. You get a record of what was blocked and why, tied to your policy version — defensible to a CX lead, a compliance reviewer, or an auditor. And because you bring your own rubric, the same critic enforces a tier-1 billing queue's rules differently from a regulated-account queue's, without retraining.

Grade it before it ships

OtterScore is agent-native: an agent can self-onboard and iterate to a passing band with no human in the loop. Canonical contract: /llms.txt.

# 1. get a free key (no human in the loop)
curl -s https://api.seaotter.ai/api/v1/agent-keys/signup \
  -H 'Content-Type: application/json' -d '{"email":"you@example.com"}'
# -> { "api_key": "sk-otter-...", "free_quota": 25 }
export OTTER_KEY='sk-otter-...'   # paste the api_key value returned above

# 2. grade your work (async -- tolerates the GPU cold-start)
curl -s https://api.seaotter.ai/api/v1/eval/jobs \
  -H "Authorization: Bearer $OTTER_KEY" -H 'Content-Type: application/json' \
  -d '{"submission":"async","user_prompt":"<what the work was for>",
       "artifact_parts":[{"mime_type":"text/plain","text":"<your work>"}]}'
# -> { "job_id": "...", "status": "queued" }
JOB_ID='...'   # paste the job_id value returned above

# 3. poll until completed (located flaws come from GET /api/v1/eval/runs/{run_id})
curl -s https://api.seaotter.ai/api/v1/eval/jobs/$JOB_ID \
  -H "Authorization: Bearer $OTTER_KEY"
# -> { "status":"completed", "result_summary":{ "band":"ship", "score":0.95 }, "run_id":"..." }
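
The same submit-and-poll flow can be sketched in Python using only the standard library. The URLs, JSON fields, and the "queued" and "completed" statuses are copied from the curl calls above; the function names, the polling interval, and the retry limit are assumptions made for this sketch. Any failure status the API may return is not shown in the examples above, so this version simply times out rather than guessing at one.

```python
import json
import time
import urllib.request
from typing import Callable, Optional

BASE = "https://api.seaotter.ai/api/v1"  # base URL from the curl examples


def build_payload(user_prompt: str, reply_text: str) -> dict:
    """Build the same JSON body as step 2 of the curl examples."""
    return {
        "submission": "async",
        "user_prompt": user_prompt,
        "artifact_parts": [{"mime_type": "text/plain", "text": reply_text}],
    }


def http_json(url: str, api_key: str, body: Optional[dict] = None) -> dict:
    """Send a POST when a body is given, otherwise a GET; return parsed JSON."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    request = urllib.request.Request(url, data=data, headers={
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    })
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def wait_for_job(fetch: Callable[[str], dict], job_id: str,
                 interval: float = 2.0, max_polls: int = 60,
                 sleep: Callable[[float], None] = time.sleep) -> dict:
    """Call fetch(job_id) until its status is 'completed', or time out."""
    for _ in range(max_polls):
        job = fetch(job_id)
        if job.get("status") == "completed":
            return job
        sleep(interval)  # interval and max_polls are assumed defaults
    raise TimeoutError(f"job {job_id} not completed after {max_polls} polls")


# Usage (performs real network calls, so it is left commented out):
# key = "sk-otter-..."
# job = http_json(f"{BASE}/eval/jobs", key, build_payload("<what the work was for>", "<your work>"))
# done = wait_for_job(lambda jid: http_json(f"{BASE}/eval/jobs/{jid}", key), job["job_id"])
# print(done["result_summary"])
```

Passing `fetch` and `sleep` in as parameters keeps the polling loop testable without touching the network.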

Prefer MCP? Connect the hosted server by URL, no install: https://mcp.seaotter.ai/mcp. Bring your own rubric/policy so the gate enforces your bar.

Frequently asked questions

How do I evaluate AI customer support replies before they reach the customer?

Send each draft reply to the OtterScore eval API or the hosted MCP server (https://mcp.seaotter.ai/mcp) with your acceptance policy. You get back a score (0.0-1.0), a band (ship / route_to_fix / quarantine / block), the located flaws, and concrete upgrades. Replies in the ship band go out automatically; off-policy or ungrounded replies are routed to fix or held for human approval, so the gate runs inline before the customer sees anything. A free self-serve API key gets you started.

Can it catch off-policy refunds and credits that the reply itself makes sound reasonable?

Yes — this is the core case. You encode your refund matrix, credit-authority limits, and SLA commitments as the acceptance policy. OtterScore grades the reply against those rules and flags a promised refund the customer doesn't qualify for or a credit above the agent's authority, even when the wording is empathetic and confident. A friendly LLM judge typically approves those replies because they read well; a hostile critic is aligned to find the violation.

Does it check the agent's reasoning, or only the final reply text?

Both. OtterScore grades the output and its trajectory. It catches cases where the final reply is clean but the path was wrong — account details discussed before identity verification, a credit offered without checking entitlement, or a tool result the agent claimed but never actually retrieved. Grading the text alone passes these; grading the trajectory blocks them.
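
To make the idea of a trajectory flaw concrete, the sketch below checks the three cases named above against a recorded list of agent steps. It is an illustration of the kind of ordering rule involved, not a description of how OtterScore grades internally; the `Step` shape, the step names, and `trajectory_flaws` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    kind: str  # "tool_call" or "message" (hypothetical step kinds)
    name: str  # e.g. "verify_identity", "offer_credit" (hypothetical names)


# Each pair means: the first step must have happened before the second.
ORDER_RULES = [
    ("verify_identity", "disclose_account_details"),
    ("check_entitlement", "offer_credit"),
]


def trajectory_flaws(steps, cited_tools=()):
    """Return a list of human-readable flaws found in a trajectory."""
    flaws = []
    seen = set()
    for index, step in enumerate(steps):
        for prerequisite, guarded in ORDER_RULES:
            if step.name == guarded and prerequisite not in seen:
                flaws.append(f"step {index}: {guarded} before {prerequisite}")
        seen.add(step.name)
    # A tool result cited in the reply must match a tool call that ran.
    called = {step.name for step in steps if step.kind == "tool_call"}
    for tool in cited_tools:
        if tool not in called:
            flaws.append(f"reply cites {tool} but no such tool call ran")
    return flaws
```

A trajectory that discloses account details first and verifies identity afterwards produces a flaw at step 0, even if the final reply text would look identical either way.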

How do I encode our own refund, account, and compliance rules?

Bring your own rubric or policy. You define the acceptance criteria — refund thresholds, which account facts require system-of-record grounding, which compliance statements an agent is never authorized to make — and OtterScore grades every reply against it. Different queues can run different policies (a tier-1 billing queue versus a regulated-account queue) without retraining the critic, and each verdict is returned as signed audit evidence tied to your policy version.
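
One way to keep per-queue policies manageable on your side is to store them as data and render the rubric text from that. The sketch below assumes this approach; the `QueuePolicy` fields, the queue names, the dollar limits, and the version strings are all made-up examples, and how the rubric is attached to a request is not covered on this page, so check /llms.txt for the actual contract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueuePolicy:
    version: str             # hypothetical policy version label
    max_refund_usd: float    # example threshold, not a real limit
    max_credit_usd: float    # example agent credit authority
    grounded_facts: tuple    # facts that must match the system of record
    forbidden_claims: tuple  # statements an agent may never make


# Two example queues with different bars; every value here is illustrative.
POLICIES = {
    "tier1_billing": QueuePolicy(
        version="example-v1",
        max_refund_usd=50.0,
        max_credit_usd=10.0,
        grounded_facts=("balance", "renewal date", "plan tier", "order status"),
        forbidden_claims=("this is GDPR-compliant",),
    ),
    "regulated_accounts": QueuePolicy(
        version="example-v2",
        max_refund_usd=0.0,
        max_credit_usd=0.0,
        grounded_facts=("balance", "renewal date", "plan tier", "order status"),
        forbidden_claims=("this is GDPR-compliant", "you're fully covered"),
    ),
}


def render_rubric(queue: str) -> str:
    """Render one queue's policy as plain-text acceptance criteria."""
    policy = POLICIES[queue]  # a KeyError for an unknown queue is intentional
    lines = [
        f"Policy version: {policy.version}",
        f"Refunds above ${policy.max_refund_usd:.2f} are off-policy.",
        f"Credits above ${policy.max_credit_usd:.2f} exceed agent authority.",
        "Must match the system of record: " + ", ".join(policy.grounded_facts),
        "Never authorized to state: " + "; ".join(policy.forbidden_claims),
    ]
    return "\n".join(lines)
```

Including the version on the first line of the rubric makes it easier to match a stored verdict to the exact policy text it was graded against.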