Save rubric-scored golden tests for your agents

Stand up an **eval harness partner**: freeze **golden prompts** plus **shape rules** and **scored rubric dimensions**, then rerun them after brain or catalog moves so regressions show up as **numbers and misses**, not vibes. This extends binary packs ([Catch regressions after you change an agent](/tutorials/catch-regressions-after-you-change-an-agent)) without extra tooling.

Plus, three Admin-Agent passes: promote three messy **production chat** snippets into golden rows with rubric text; **score two pasted answers** blind against the same prompt and pick a winner with receipts; and inventory **suite coverage** across five agents, flagging **MISSING_SUITE**.

Audience: Admins · Developers
Time: ~12 min
Prerequisites: You already rerun prompts after edits ([Catch regressions after you change an agent](/tutorials/catch-regressions-after-you-change-an-agent)), or you intend to. Target agents exist ([Create an agent from scratch](/tutorials/create-an-agent-from-scratch), [Give your agent its job description](/tutorials/give-your-agent-its-job-description)). You are comfortable with **Audit Logs → Jobs** ([View your audit logs](/tutorials/view-your-audit-logs)). Helpful: [Run health checks on your must-not-fail agents](/tutorials/run-health-checks-on-your-must-not-fail-agents) when checks stay binary; this lesson adds scoring layers.
You'll end up with: One **Eval steward** agent charter with markdown table columns **Prompt / Must include / Must not / Rubric / Pass threshold**, plus one **versioned suite file** stored beside your directory ([Build your agent directory](/tutorials/build-your-agent-directory)). **You** run rows in Chat; nothing auto-judges production traffic unattended.

When a tutorial shows italic text in quotation marks, it usually mirrors a label or helper string inside Auxot. Product copy changes between releases — if something reads differently in your workspace, trust what you see on screen.

Callouts with a Worth knowing gold accent are meant as must-read context before you move on. Blockquotes that open with Tip are lighter, optional depth.

Why this matters

Binary pass/fail packs (Catch regressions after you change an agent) catch outright breakage. They struggle when answers stay technically allowed but quality slips: vague tone, missing citations, and soft policy drift.

Golden tests here mean saved prompts plus expected shape: headers you require, phrases that must never appear, and three to five rubric dimensions (for example accuracy vs sources, brevity, and tool discipline) scored after you paste the candidate answer back into eval Chat.
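If you ever want to pre-check the shape rules outside Chat, the Must include / Must not idea can be sketched in a few lines of Python. This is a minimal illustration, not an Auxot feature: the function name `check_shape`, the case-insensitive substring matching, and the sample strings are all assumptions made for the example.

```python
def check_shape(answer: str, must_include: list[str], must_not: list[str]) -> dict[str, list[str]]:
    """Report which required strings are missing and which forbidden strings appear.

    Hypothetical helper: plain case-insensitive substring matching is an
    assumption; a real suite may need stricter rules (exact headers, regexes).
    """
    lowered = answer.lower()
    missing = [s for s in must_include if s.lower() not in lowered]
    forbidden = [s for s in must_not if s.lower() in lowered]
    return {"missing": missing, "forbidden": forbidden}


# Illustrative answer and rules, not taken from a real agent.
sample_answer = "## Summary\nRefunds take five days. As an AI, I cannot confirm."
result = check_shape(
    sample_answer,
    must_include=["## Summary", "## Sources"],
    must_not=["as an AI"],
)
print(result)
```

A check like this only covers shape; the rubric dimensions still need a human reading the scorecard.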

This is not a replacement for adversarial probing (Stress-test an agent before you widen access): rubrics reward intended behavior; stress tests hunt predator-shaped asks.

Auxot does not grade customer threads silently. You execute prompts and paste responses into the scorer; scheduled reminders (Run scheduled canary checks on production agents, Run an agent on a schedule) only nag you to rerun suites you authored.

Measurement belongs in versioned tables — judgment stays human.


Quick start

  1. Pick one tier-one agent — narrow scope; three prompts max for pilot (Run health checks on your must-not-fail agents mindset).
  2. Mint Eval steward — charter forbids inventing scores without pasted answer text; outputs SCORECARD table only when both prompt and answer exist.
  3. Author three golden rows: Must include lists literal header strings or citation patterns; Must not lists forbidden shortcuts (Define a tool policy when tools drift).
  4. Baseline run — execute prompts against target agent; paste answers back; score; record Last scored date beside row (Catch regressions after you change an agent habit).
  5. Attach to migration — checkbox beside provider moves (Migrate agents when models or providers change); rerun suite before widening rollout.

Done? Version note at top of suite file: SUITE v1 → v2 when rubric text changes; stale scores discarded consciously.
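The version note can be pictured as a small data structure. The sketch below assumes a suite is just a version number plus rows, and that bumping the version clears every stored score so nothing from the old rubric text is compared against the new one. `GoldenRow`, `Suite`, and `bump_version` are hypothetical names for illustration; the tutorial itself only asks for a note at the top of your suite file.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class GoldenRow:
    # Hypothetical row shape: only the fields needed for this sketch.
    prompt: str
    last_scored: Optional[date] = None
    scores: Optional[dict[str, int]] = None


@dataclass(frozen=True)
class Suite:
    version: int
    rows: tuple[GoldenRow, ...]


def bump_version(suite: Suite) -> Suite:
    """Return the next suite version with all stale scores cleared."""
    cleared = tuple(replace(row, last_scored=None, scores=None) for row in suite.rows)
    return Suite(version=suite.version + 1, rows=cleared)


# Illustrative data only.
suite_v1 = Suite(
    version=1,
    rows=(GoldenRow("How long do refunds take?", date(2025, 1, 15), {"accuracy": 4}),),
)
suite_v2 = bump_version(suite_v1)
print(f"SUITE v{suite_v1.version} -> v{suite_v2.version}")
```

Returning a new object instead of mutating the old one keeps the v1 scores available for the changelog, which matches the idea of discarding them consciously rather than by accident.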


The agent can do that?

1. Promote chat snippets into golden rows

Chat → Admin Agent:

Redacted chat excerpts (three): […]. Target agent job summary: […]. Emit markdown table — columns Prompt / Must include / Must not / Rubric (numbered dims 1-5 each) / Pass threshold — max three rows — refuse fabrication beyond excerpts.

Why it’s non-obvious: Synthetic prompts lie; production-shaped text anchors rubrics after you paste real noise.

2. Blind score two answers

Same prompt: […]. Answer A: […]. Answer B: […]. Rubric: [paste]. Score both — dimension-by-dimension — then **RECOMMENDATION** line — cite quoted spans — no ties allowed — if within noise band say **HUMAN_DECIDE**.

Why it’s non-obvious: Model shootouts need disciplined comparison (Pick the right model for the job); blind ordering reduces halo bias.
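The RECOMMENDATION rule in the prompt above can be sketched as follows. The tutorial does not define the noise band, so the default of one total point is an assumption, as are the function name `recommend` and the choice to compare summed scores rather than individual dimensions.

```python
def recommend(scores_a: dict[str, int], scores_b: dict[str, int], noise_band: int = 1) -> str:
    """Pick a winner from two scorecards, or defer to a human when the gap is small.

    Assumptions: both scorecards use the same dimensions, totals are compared,
    and a gap of `noise_band` points or fewer counts as noise.
    """
    if scores_a.keys() != scores_b.keys():
        raise ValueError("both answers must be scored on the same dimensions")
    gap = sum(scores_a.values()) - sum(scores_b.values())
    if abs(gap) <= noise_band:
        return "HUMAN_DECIDE"
    return "A" if gap > 0 else "B"


# Illustrative scorecards on a 1-5 scale.
clear_win = recommend(
    {"accuracy": 5, "brevity": 4, "tool discipline": 4},
    {"accuracy": 3, "brevity": 4, "tool discipline": 3},
)
too_close = recommend(
    {"accuracy": 4, "brevity": 3},
    {"accuracy": 3, "brevity": 4},
)
print(clear_win, too_close)
```

The arithmetic only settles the label; the quoted spans behind each score remain the receipts a reviewer should read.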

3. Fleet suite inventory

Agent names (paste directory): […]. For each — output **HAS_SUITE | MISSING_SUITE | STALE_LAST_SCORED** — infer STALE when date blank or >90d placeholder — markdown — no apologies essay.

Why it’s non-obvious: Coverage visibility prevents hero agents with zero tests (Run a quarterly review of your agents).
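The three inventory labels follow a simple decision order, sketched below. The 90-day cutoff and the "blank date means stale" rule come from the prompt above; the function name `suite_status`, its parameters, and the sample dates are hypothetical.

```python
from datetime import date
from typing import Optional


def suite_status(
    has_suite: bool,
    last_scored: Optional[date],
    today: date,
    max_age_days: int = 90,
) -> str:
    """Classify one agent's suite coverage using the labels from the inventory prompt."""
    if not has_suite:
        return "MISSING_SUITE"
    # A blank date is treated as stale, the same as a date older than the cutoff.
    if last_scored is None or (today - last_scored).days > max_age_days:
        return "STALE_LAST_SCORED"
    return "HAS_SUITE"


# Illustrative reference date and inputs.
today = date(2025, 6, 1)
print(suite_status(True, date(2025, 5, 1), today))
print(suite_status(True, date(2025, 1, 1), today))
print(suite_status(True, None, today))
print(suite_status(False, None, today))
```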


Go deeper

Health checks stay

Keep cheap binary monitors (Run health checks on your must-not-fail agents); layer rubrics only where subjectivity hurts revenue or trust.

Canary overlap

Frozen prompts on a clock (Run scheduled canary checks on production agents) reuse same golden rows; change only wrapper instructions; one source of truth.

Stress vs eval

Run the stress test (Stress-test an agent before you widen access) before widening blast radius; run golden suites weekly or on every instruction edit. The two address different fear profiles.

Tools and credentials

When the rubric includes tool discipline, align policies (Define a tool policy, Manage your Credentials); scores point at symptoms; settings fix root causes.


Walkthrough

Step 1: Freeze SUITE v1 header

Note owner, agent id slug, and review cadence; date stamp; store beside wiki or repo doc.

Step 2: Write rubric dimensions plainly

Avoid clever jargon because reviewers rotate. An optional fifth dimension, Brand voice, can point at a context file (Add your first context file).

Step 3: Dry-run scorer alone

Paste a nonsense answer; confirm the steward refuses or scores all ones; if it does neither, tighten the charter.

Step 4: Regression tie-in

After an instruction edit, rerun only the dimensions that failed; log Jobs ids (View your audit logs).
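Picking out what to rerun can be sketched as a filter over the last scorecard. This assumes the Pass threshold column holds a single per-dimension minimum on the same 1-5 scale as the rubric; the tutorial does not specify how thresholds are expressed, and `failed_dimensions` is a hypothetical name.

```python
def failed_dimensions(scores: dict[str, int], pass_threshold: int) -> list[str]:
    """List rubric dimensions that scored below the threshold, in scorecard order."""
    return [dim for dim, score in scores.items() if score < pass_threshold]


# Illustrative scorecard from a previous run.
last_run = {"accuracy": 4, "brevity": 2, "tool discipline": 3}
to_rerun = failed_dimensions(last_run, pass_threshold=3)
print(to_rerun)
```

An empty list means nothing failed, so the regression tie-in has nothing to rerun for that row.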

Step 5: Retire rows

Delete prompts that no longer match product; archive row in changelog; version bump SUITE v2.


What’s next

Reference