Catch regressions after you change an agent
Freeze a **mini regression pack** — a handful of prompts plus plain-English pass rules — and rerun it whenever job descriptions, models, providers, or tool policies shift, so regressions show up as failed checks instead of silent weirdness in prod chat.
Plus: three Admin-Agent passes — generate a pack from an agent’s stated job alone, compare before/after Jobs excerpts after a model swap, and decide what belongs here versus a full stress pass ([Stress-test an agent before you widen access](/tutorials/stress-test-an-agent-before-you-widen-access)).
| Audience | Admins · Developers |
|---|---|
| Time | ~10 min |
| Prerequisites | Agents you maintain ([Create an agent from scratch](/tutorials/create-an-agent-from-scratch), [Give your agent its job description](/tutorials/give-your-agent-its-job-description)). You know where **Audit Logs** → **Jobs** lives ([View your audit logs](/tutorials/view-your-audit-logs)). Helpful: [Update your agents without breaking the team](/tutorials/update-your-agents-without-breaking-the-team) (how you announce brain changes), [Migrate agents when models or providers change](/tutorials/migrate-agents-when-models-or-providers-change), [Pick the right model for the job](/tutorials/pick-the-right-model-for-the-job). |
| You'll end up with | A written regression table (prompt → pass rule → last green date) stored somewhere durable — plus a rerun habit tied to migrations and instruction edits, with receipts checked in **Jobs**. |
When a tutorial shows italic text in quotation marks, it usually mirrors a label or helper string inside Auxot. Product copy changes between releases — if something reads differently in your workspace, trust what you see on screen.
Callouts with a Worth knowing gold accent are meant as must-read context before you move on. Blockquotes that open with Tip are lighter, optional depth.
Why this matters
Agents don’t ship like binaries with green CI lights. You tweak one paragraph in a job description, bump a model preset, or widen a tool policy: and last week’s perfect answer becomes this week’s confident mistake. Most teams notice only when someone screenshots the wrong reply.
A regression pack is the smallest honest defense:
- Five-ish prompts that represent real work: not trivia, not red-team chaos.
- Pass rules a human can judge in seconds: must cite policy X, must refuse Y, must call tool Z.
- A rerun trigger: same pack after every material change (Update your agents without breaking the team, Migrate agents when models or providers change).
Auxot doesn’t ship a dedicated “eval suite” button today: the habit is documentation + Chat + Audit Logs. Jobs remain your receipts (View your audit logs); threading the same prompts keeps comparisons fair.
Nothing reruns because Quality wished it: you open Chat, you tick the table, you log the date when everything still passes.
Quick start
- Pick one agent worth guarding — customer-facing, billing-adjacent, or heavy tool use (Define a tool policy).
- Write five prompts — copy real questions users asked last month (sanitized) or ask Admin Agent to propose them from the job description (power move 1 below).
- Add pass rules — one line each: tone, required refusal, tool usage, and facts that must appear.
- Store the pack — context file, wiki page, or repo markdown: include Last green: YYYY-MM-DD row after each full pass.
- Rerun on change — model migration (Migrate agents when models or providers change), instruction rewrite (Give your agent its job description), provider order experiments (Read provider routing like an operator), and MCP edits (Add an MCP server). Skim Jobs for new failures (Trace a failing job end to end when it’s messy).
Done? One agent has an explicit regression doc: boring insurance against exciting edits.
The agent can do that?
1. Draft prompts from job description alone
Chat → Admin Agent:
Agent "[name]" — job summary (paste from Settings): […]. Propose five regression prompts covering normal happy paths only — no adversarial tricks. For each: prompt text + one-sentence pass rule + note if it expects a tool call.
Why it’s non-obvious: Humans remember dramatic failures, not boring success paths: Admin Agent balances coverage after you paste the role text.
2. Compare Jobs before/after a model change
Paste two anonymized job snippets (before/after timestamps):
Same regression prompt before/after model swap. Before job excerpt: [paste]. After: [paste]. Did latency/tool/error profile worsen in ways that matter for SLA?
Why it’s non-obvious: Cheaper models trade quality: receipts beat hunches (Migrate agents when models or providers change).
3. Boundary between regression vs stress
Classify these scenarios — belong in mini regression pack vs belong in stress testing ([Stress-test an agent before you widen access](/tutorials/stress-test-an-agent-before-you-widen-access)): [list].
Why it’s non-obvious: Regression packs should stay fast enough to run weekly; adversarial suites belong elsewhere.
Go deeper
Automation hooks
- Run prompts manually in Chat: simplest.
- Optional: schedule the same wording via Run an agent on a schedule when Discord delivery fits.
- Optional: drive intake/workflow smoke from CI (Trigger a workflow from GitHub Actions): heavier; only when HTTP orchestration already exists.
Version the pack
When prompts change, bump pack version in the doc: correlates with Connect Discord to your agents changelog rows so nobody rehearses stale questions.
Many agents
For many agents, maintain one pack per tier-one agent: don’t boil the ocean; Run a quarterly review of your agents is where you audit which packs actually exist.
Markdown table template
| ID | Prompt | Pass rule | Expect tools? | Last green |
|---|---|---|---|---|
| R1 | … | … | yes/no | 2026-05-01 |
Walkthrough
Step 1: Choose scope
One agent per pack v1: prove the habit before templating ten.
Step 2: Harvest real prompts
Pull from Slack threads (Connect Slack to your agents), support transcripts, or your own Chat history: realism beats synthetic fluff.
Step 3: Encode pass rules
If a rule needs subjective judgment (“sounds on-brand”), name the reference doc (Add your first context file).
Step 4: Baseline run
Execute every row in Chat with the agent selected; mark Last green when all pass.
Step 5: Attach to change management
Add “Regression pack rerun” as a checkbox beside provider migrations (Migrate agents when models or providers change), major instruction edits (Connect Discord to your agents), and tool-policy changes (Try a new tool before your agents depend on it).
Step 6: When something fails
Open Audit Logs → Jobs, trace layers (Trace a failing job end to end), fix instructions or policy, rerun only failed rows before clearing Last green.
What’s next
- → Save rubric-scored golden tests for your agents. Use it when binary pass/fail hides quality regressions: adds scored dimensions and shape rules with the same rerun habit.
- → Batch spreadsheet rows through a workflow. Same pack discipline at row scale: stable
row_key, intake loops, logs you can grep. - → Migrate agents when models or providers change. Use it when regression packs earn their keep.
- → Update your agents without breaking the team. Announce changes humans feel; packs catch changes humans don’t.
- → View your audit logs. Receipts for each rerun.
- → Run scheduled canary checks on production agents. Rerun pack rows on a schedule: regressions without a tracked Settings edit still surface.
- → Stress-test an agent before you widen access. Adversarial coverage beyond happy-path regression.
- → Run a quarterly review of your agents. Roster-wide audit of which packs exist.
Reference
- Pages in Auxot: Chat (reruns), Settings → Agents (what changed), Audit Logs → Jobs (evidence)
- See also: Trigger a workflow from GitHub Actions, Batch spreadsheet rows through a workflow, Give your agent its job description, Pick the right model for the job, Trace a failing job end to end