Save quality rubrics for your agents

Stand up a **rubric scoring agent**. Freeze saved prompts plus expected-shape rules and scored rubric dimensions. Rerun after a model swap or instruction change. Regressions show up as scores and missed criteria, not as a feeling that something is off. Extends binary regression packs ([Catch regressions after you change an agent](/tutorials/catch-regressions-after-you-change-an-agent)) without extra tooling.

Plus: three Admin-Agent passes: promote three messy **production chat** snippets into scoring rows with rubric text, **score two pasted answers** blind on the same prompt and pick a winner with receipts, and inventory **suite coverage** across five agents (flag **MISSING_SUITE**).

Audience Admins · Developers
Time ~12 min
Prerequisites You already rerun prompts after edits ([Catch regressions after you change an agent](/tutorials/catch-regressions-after-you-change-an-agent)), or you intend to. Target agents exist ([Create an agent from scratch](/tutorials/create-an-agent-from-scratch), [Give your agent its job description](/tutorials/give-your-agent-its-job-description)). **Audit Logs → Jobs** comfort ([View your audit logs](/tutorials/view-your-audit-logs)). Helpful: [Run health checks on your must-not-fail agents](/tutorials/run-health-checks-on-your-must-not-fail-agents) when checks stay binary. This lesson adds scoring layers.
You'll end up with One **rubric scoring agent** with markdown-table output: columns **Prompt / Must include / Must not / Rubric / Pass threshold**. Plus one **versioned suite file** stored beside your directory ([Build your agent directory](/tutorials/build-your-agent-directory)). **You** run rows in Chat. Nothing auto-judges production traffic unattended.

When a tutorial shows italic text in quotation marks, it usually mirrors a label or helper string inside Auxot. Product copy changes between releases — if something reads differently in your workspace, trust what you see on screen.

Callouts with a Worth knowing gold accent are meant as must-read context before you move on. Blockquotes that open with Tip are lighter, optional depth.

Why this matters

Binary pass/fail packs (Catch regressions after you change an agent) catch outright breakage. They struggle when answers stay technically allowed but quality slips: vague tone, missing citations, and soft policy violations that don’t trip a hard rule.

A quality rubric here means saved prompts plus expected shape: headers you require, phrases that must never appear, and three to five rubric dimensions (accuracy vs sources, brevity, and tool discipline) scored after you paste the candidate answer back into Chat.

This is not a replacement for adversarial probing (Stress-test an agent before you widen access). Rubrics measure how close an answer is to the intended behavior. Stress tests probe for adversarial inputs designed to break the agent.

Auxot does not grade customer threads silently. You execute prompts. You paste responses into the scorer. Scheduled reminders (Run scheduled canary checks on production agents, Run an agent on a schedule) only nag you to rerun suites you authored.

Measurement belongs in versioned tables. Judgment stays human.


Quick start

  1. Pick one tier-one agent. Narrow scope, three prompts max for pilot (Run health checks on your must-not-fail agents mindset).
  2. Create the rubric scoring agent. Instructions forbid inventing scores without pasted answer text. Outputs a SCORECARD table only when both prompt and answer exist.
  3. Author three scoring rows. Must include lists literal header strings or citation patterns. Must not lists forbidden shortcuts (Define a tool policy when your tool list changes).
  4. Baseline run. Execute prompts against target agent, paste answers back, score, record Last scored date beside row (Catch regressions after you change an agent habit).
  5. Attach to migration. Checkbox beside provider moves (Migrate agents when models or providers change). Rerun suite before widening rollout.

Done? Version note at top of suite file: SUITE v1 → v2 when rubric text changes. Stale scores get tossed on purpose, not by accident.


The agent can do that?

1. Promote chat snippets into scoring rows

Chat → Admin Agent:

Redacted chat excerpts (three): […]. Target agent job summary: […]. Emit markdown table with columns Prompt / Must include / Must not / Rubric (numbered dims 1-5 each) / Pass threshold. Max three rows. Refuse fabrication beyond excerpts.

Why it’s non-obvious: Made-up prompts don’t surface the same failure modes real traffic does. Pasting redacted production-shaped text anchors the rubric in actual usage.

2. Blind score two answers

Same prompt: […]. Answer A: […]. Answer B: […]. Rubric: [paste]. Score both dimension-by-dimension, then a **RECOMMENDATION** line. Cite quoted spans. No ties allowed. If within noise band say **HUMAN_DECIDE**.

Why it’s non-obvious: Comparing two models head-to-head needs a disciplined scoring approach (Pick the right model for the job). Blind ordering reduces halo bias.

3. Suite inventory across your agents

Agent names (paste directory): […]. For each, output **HAS_SUITE | MISSING_SUITE | STALE_LAST_SCORED**. Infer STALE when date blank or >90d placeholder. Markdown, no apologies essay.

Why it’s non-obvious: Coverage visibility prevents the situation where one critical agent has no rubric because nobody noticed (Run a quarterly review of your agents).


Go deeper

Health checks stay

Keep cheap binary monitors (Run health checks on your must-not-fail agents). Layer rubrics only where subjectivity hurts revenue or trust.

Canary overlap

Frozen prompts on a clock (Run scheduled canary checks on production agents) reuse the same scoring rows. Change only wrapper instructions. One source of truth.

Stress vs eval

Run Stress-test an agent before you widen access before widening who can reach the agent. Run scoring suites weekly or after every instruction edit. The two suites cover different risks: adversarial inputs versus steady-state quality.

Tools and credentials

When the rubric includes tool discipline, align policies (Define a tool policy, Manage your Credentials). Scores point at symptoms; settings fix root causes.


Walkthrough

Step 1: Freeze SUITE v1 header

Note owner, agent id slug, and review cadence; date stamp; store beside wiki or repo doc.

Step 2: Write rubric dimensions plainly

Avoid clever jargon; reviewers rotate; fifth dimension optional Brand voice pointing at context (Add your first context file).

Step 3: Dry-run scorer alone

Paste nonsense answer; confirm the agent refuses or scores all ones; tighten instructions.

Step 4: Regression tie-in

After instruction edit, rerun failed dimensions only workflow; log Jobs ids (View your audit logs).

Step 5: Retire rows

Delete prompts that no longer match product; archive row in changelog; version bump SUITE v2.


What’s next

Reference