Save quality rubrics for your agents
Stand up a **rubric scoring agent**. Freeze saved prompts plus expected-shape rules and scored rubric dimensions. Rerun after a model swap or instruction change. Regressions show up as scores and missed criteria, not as a feeling that something is off. Extends binary regression packs ([Catch regressions after you change an agent](/tutorials/catch-regressions-after-you-change-an-agent)) without extra tooling.
Plus: three Admin-Agent passes: promote three messy **production chat** snippets into scoring rows with rubric text, **score two pasted answers** blind on the same prompt and pick a winner with receipts, and inventory **suite coverage** across five agents (flag **MISSING_SUITE**).
| Audience | Admins · Developers |
|---|---|
| Time | ~12 min |
| Prerequisites | You already rerun prompts after edits ([Catch regressions after you change an agent](/tutorials/catch-regressions-after-you-change-an-agent)), or you intend to. Target agents exist ([Create an agent from scratch](/tutorials/create-an-agent-from-scratch), [Give your agent its job description](/tutorials/give-your-agent-its-job-description)). **Audit Logs → Jobs** comfort ([View your audit logs](/tutorials/view-your-audit-logs)). Helpful: [Run health checks on your must-not-fail agents](/tutorials/run-health-checks-on-your-must-not-fail-agents) when checks stay binary. This lesson adds scoring layers. |
| You'll end up with | One **rubric scoring agent** with markdown-table output: columns **Prompt / Must include / Must not / Rubric / Pass threshold**. Plus one **versioned suite file** stored beside your directory ([Build your agent directory](/tutorials/build-your-agent-directory)). **You** run rows in Chat. Nothing auto-judges production traffic unattended. |
When a tutorial shows italic text in quotation marks, it usually mirrors a label or helper string inside Auxot. Product copy changes between releases — if something reads differently in your workspace, trust what you see on screen.
Callouts with a Worth knowing gold accent are meant as must-read context before you move on. Blockquotes that open with Tip are lighter, optional depth.
Why this matters
Binary pass/fail packs (Catch regressions after you change an agent) catch outright breakage. They struggle when answers stay technically allowed but quality slips: vague tone, missing citations, and soft policy violations that don’t trip a hard rule.
A quality rubric here means saved prompts plus expected shape: headers you require, phrases that must never appear, and three to five rubric dimensions (accuracy vs sources, brevity, and tool discipline) scored after you paste the candidate answer back into Chat.
This is not a replacement for adversarial probing (Stress-test an agent before you widen access). Rubrics measure how close an answer is to the intended behavior. Stress tests probe for adversarial inputs designed to break the agent.
Auxot does not grade customer threads silently. You execute prompts. You paste responses into the scorer. Scheduled reminders (Run scheduled canary checks on production agents, Run an agent on a schedule) only nag you to rerun suites you authored.
Measurement belongs in versioned tables. Judgment stays human.
Quick start
- Pick one tier-one agent. Narrow scope, three prompts max for pilot (Run health checks on your must-not-fail agents mindset).
- Create the rubric scoring agent. Instructions forbid inventing scores without pasted answer text. Outputs a SCORECARD table only when both prompt and answer exist.
- Author three scoring rows. Must include lists literal header strings or citation patterns. Must not lists forbidden shortcuts (Define a tool policy when your tool list changes).
- Baseline run. Execute prompts against target agent, paste answers back, score, record Last scored date beside row (Catch regressions after you change an agent habit).
- Attach to migration. Checkbox beside provider moves (Migrate agents when models or providers change). Rerun suite before widening rollout.
Done? Version note at top of suite file: SUITE v1 → v2 when rubric text changes. Stale scores get tossed on purpose, not by accident.
The agent can do that?
1. Promote chat snippets into scoring rows
Chat → Admin Agent:
Redacted chat excerpts (three): […]. Target agent job summary: […]. Emit markdown table with columns Prompt / Must include / Must not / Rubric (numbered dims 1-5 each) / Pass threshold. Max three rows. Refuse fabrication beyond excerpts.
Why it’s non-obvious: Made-up prompts don’t surface the same failure modes real traffic does. Pasting redacted production-shaped text anchors the rubric in actual usage.
2. Blind score two answers
Same prompt: […]. Answer A: […]. Answer B: […]. Rubric: [paste]. Score both dimension-by-dimension, then a **RECOMMENDATION** line. Cite quoted spans. No ties allowed. If within noise band say **HUMAN_DECIDE**.
Why it’s non-obvious: Comparing two models head-to-head needs a disciplined scoring approach (Pick the right model for the job). Blind ordering reduces halo bias.
3. Suite inventory across your agents
Agent names (paste directory): […]. For each, output **HAS_SUITE | MISSING_SUITE | STALE_LAST_SCORED**. Infer STALE when date blank or >90d placeholder. Markdown, no apologies essay.
Why it’s non-obvious: Coverage visibility prevents the situation where one critical agent has no rubric because nobody noticed (Run a quarterly review of your agents).
Go deeper
Health checks stay
Keep cheap binary monitors (Run health checks on your must-not-fail agents). Layer rubrics only where subjectivity hurts revenue or trust.
Canary overlap
Frozen prompts on a clock (Run scheduled canary checks on production agents) reuse the same scoring rows. Change only wrapper instructions. One source of truth.
Stress vs eval
Run Stress-test an agent before you widen access before widening who can reach the agent. Run scoring suites weekly or after every instruction edit. The two suites cover different risks: adversarial inputs versus steady-state quality.
Tools and credentials
When the rubric includes tool discipline, align policies (Define a tool policy, Manage your Credentials). Scores point at symptoms; settings fix root causes.
Walkthrough
Step 1: Freeze SUITE v1 header
Note owner, agent id slug, and review cadence; date stamp; store beside wiki or repo doc.
Step 2: Write rubric dimensions plainly
Avoid clever jargon; reviewers rotate; fifth dimension optional Brand voice pointing at context (Add your first context file).
Step 3: Dry-run scorer alone
Paste nonsense answer; confirm the agent refuses or scores all ones; tighten instructions.
Step 4: Regression tie-in
After instruction edit, rerun failed dimensions only workflow; log Jobs ids (View your audit logs).
Step 5: Retire rows
Delete prompts that no longer match product; archive row in changelog; version bump SUITE v2.
What’s next
- → Catch regressions after you change an agent. Binary packs first; this lesson layers numeric rubrics; same rerun triggers.
- → Fix when requests go to the wrong agent. When Router probes live in your suite, iterate instructions with matrices instead of gut feel.
- → Run scheduled canary checks on production agents. Automate reruns; keep scoring rows frozen inside cron prompts.
- → Run health checks on your must-not-fail agents. Pair cheap PASS/FAIL probes with deeper scoring where warranted.
- → Migrate agents when models or providers change. Rerun scored suites on migration checklists; score shifts expose capability gaps early.
- → Stress-test an agent before you widen access. Adversarial coverage; run before expanding audience; scoring suites track steady-state quality.
Reference
- Pages in Auxot: Chat, Settings → Agents, Audit Logs → Jobs, and optional Settings → Providers
- See also: Automate weekly checkups on your agents, Fix when requests go to the wrong agent, View your audit logs, Build your agent directory, Update your agents without breaking the team, Pick the right model for the job, Run a quarterly review of your agents, Run an agent on a schedule