Trace a failing job end to end

Start from the symptom — stuck workflow, Slack silence, a red Jobs row — and walk backward through Threads, Jobs, and Events until you know whether routing, tools, or credentials broke.

Plus: three pasted narrations — Admin Agent orders hypotheses from a job id + error snippet, compares webhook vs chat failures, and drafts a factual timeline for stakeholders without inventing facts.

Audience: Admins · Developers
Time: ~10 min
Prerequisites: Permission to open **Audit Logs** ([View your audit logs](/tutorials/view-your-audit-logs): admins see org-wide rows; others see scoped activity). At least one failure or stall you can reproduce or narrow to a time window. Helpful: a workflow ([Run a workflow](/tutorials/run-a-workflow)), intake ([Trigger a workflow with an intake webhook](/tutorials/trigger-a-workflow-with-an-intake-webhook)), or MCP-enabled agent ([Add an MCP server](/tutorials/add-an-mcp-server)) already in use so the walkthrough maps to real shapes.
You'll end up with: A repeatable trace checklist (trigger → thread → jobs → events → provider health), plus language that separates **model errors**, **tool/MCP errors**, and **routing/offline** without guessing.

When a tutorial shows italic text in quotation marks, it usually mirrors a label or helper string inside Auxot. Product copy changes between releases — if something reads differently in your workspace, trust what you see on screen.

Callouts with a Worth knowing gold accent are meant as must-read context before you move on. Blockquotes that open with Tip are lighter, optional depth.

Why this matters

“Something failed” is the least actionable sentence in ops. Auxot already recorded what ran, where it entered, and which provider answered, but those facts live across Audit Logs tabs and System Health, not in one sentence.

Tracing end-to-end means you anchor on an observable symptom (failed job row, angry Slack thread, intake stuck on running), then pull Threads (conversation + source), Jobs (token routing + error body), and Events (cron fired? credential rotated?). You click rows and read detail panes. Admin Agent can narrate pasted snippets because you asked, but it does not substitute for opening the receipt.

Nothing debugs itself: schedules and webhooks only fire because someone configured them. This tutorial is how you read the records Auxot already wrote.


Quick start

  1. Capture the symptom: exact time (timezone), user/channel/workflow name, and whether Chat, Slack, Discord, webhook, or cron started it.
  2. Open Audit Logs: Jobs tab → filter Failed (or Running stuck past SLA) inside the time window.
  3. Click the job row: read status, agent, model + provider, error text. Note the job id if search helps later.
  4. Jump to Threads: filter same window + source (webhook, cron, slack, etc.) → open the thread → confirm first user/tool message matches expectations.
  5. Scan Events: severity Error / Warning, types mentioning credentials, integrations, or cron.fired.
  6. Cross-check System Health: see Take Auxot’s pulse in 10 seconds; an offline provider or an exhausted cloud quota looks different from a tool timeout.

Done? You can name which layer failed (ingress, routing, model, tool/MCP, human step) and what to open next (Manage your Credentials, Define a tool policy, provider settings, workflow handoff).


The agent can do that?

Paste after you have one failing job row or error string (redacted tokens) so replies rank hypotheses instead of astrology.
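
Redacting before you paste can be scripted. The sketch below is only an illustration: the patterns (Bearer headers and `key=value` pairs with secret-looking keys) are generic assumptions, not a list of formats Auxot emits, so extend them for what your error strings actually contain.

```python
import re

# Hypothetical patterns; add the secret formats you actually see in error text.
_PATTERNS = [
    # "Authorization: Bearer abc..." style headers
    (re.compile(r"(?i)\bbearer\s+[A-Za-z0-9._~+/=-]+"), "Bearer [REDACTED]"),
    # key=value or key: value pairs whose key name looks secret
    (re.compile(r"(?i)\b(api[_-]?key|token|secret|password)\b(\s*[=:]\s*)\S+"),
     r"\1\2[REDACTED]"),
]


def redact(text: str) -> str:
    """Replace secret-looking substrings so the excerpt is safe to paste."""
    for pattern, replacement in _PATTERNS:
        text = pattern.sub(replacement, text)
    return text
```

Run it on the copied error text, then skim the result by eye; a regex pass can miss formats it was never told about.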

1. Order hypotheses from a job snippet

Audit Logs → Jobs → failed row for agent "[name]" around [time TZ]. Error excerpt: [paste]. Thread source was [app/slack/webhook/cron]. Rank likely causes (provider offline vs tool vs prompt) and list the next three clicks in Auxot UI to confirm or kill each theory.

Why it’s non-obvious: The same HTTP 500 can come from a quota limit, an MCP timeout, or a workflow handoff; the pasted excerpt collapses the decision tree because you surfaced it.

2. Webhook vs chat: compare ingress shapes

Two failures: (A) intake POST returned 202 then poll stuck running; (B) Slack mention never answered. Same agent. What differs in Threads/Jobs filters I should apply before blaming the model?

Why it’s non-obvious: Webhook threads carry intake naming patterns (Trigger a workflow with an intake webhook), while chat threads show message counts. Mixing the two wastes half an hour blaming the model for what is really a Slack linkage problem.

3. Factual timeline draft

Draft a five-bullet incident timeline for leadership using only facts I'd pull from Audit Logs + Events — no vendor blame names — ending with one verification step owners take tomorrow.

Why it’s non-obvious: Exec updates need the concrete vocabulary Jobs already records. Paste the ask to get the draft; you still attach screenshots if compliance demands it.


Go deeper

Layers map (keep straight)

| Layer | Where it shows up first |
| --- | --- |
| Ingress | Thread source, Events (cron.fired, webhook errors) |
| Routing / provider | Job row model + provider, failover behavior (Providers overview) |
| Model completion | Job detail prompt/response, evaluator flags |
| Tool / MCP | Job error referencing tool name, MCP timeout strings (Add an MCP server) |
| Human / workflow | Workflow column stuck on human step (Run a workflow) |

Common failure signatures

  • Quota / cloud offline: provider errors, repeated retries; System Health cloud cards in an error state.
  • GPU / CLI worker missing: routing skipped local tier; heartbeat gaps (Connect a GPU worker, CLI records in Settings → Providers).
  • Credential revoked: Events around integration/auth; Jobs fail at tool boundary (Manage your Credentials).
  • Handoff typo: workflow tasks idle between columns though Jobs show completed agent steps; inspect workflow graph, not model logs.
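
The signatures above can be turned into a rough first-pass ranking of a pasted error string. This is a minimal sketch under stated assumptions: the keyword hints and the layer names in `SIGNATURES` are hypothetical and hand-picked for illustration, so real error text in your workspace may need different hints.

```python
# Hypothetical keyword hints per layer; tune them against errors you have seen.
SIGNATURES = {
    "routing/provider": ("quota", "rate limit", "provider offline", "no worker", "heartbeat"),
    "tool/MCP": ("tool_use", "mcp", "tool timeout"),
    "credential": ("unauthorized", "revoked", "401", "403"),
}


def rank_layers(error_text: str) -> list[tuple[str, int]]:
    """Count hint matches per layer and return layers with a hit, best first."""
    lowered = error_text.lower()
    scores = {
        layer: sum(hint in lowered for hint in hints)
        for layer, hints in SIGNATURES.items()
    }
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(layer, score) for layer, score in ranked if score > 0]
```

An empty result means none of the hints matched, not that nothing failed. A tie between layers is a cue to open both the Jobs detail and the Events tab before picking one.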

Permissions reality

Scoped users may not see another teammate’s thread; empty lists aren’t proof nothing failed (View your audit logs → Worth knowing on scope). Escalate viewer role or have them paste rows when jointly debugging.


Walkthrough

Step 1: Freeze the window

Set Audit Logs time range to bracket the report (“after lunch EST”); wide beats narrow until you spot anchor job ids.
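
A vague report such as "after lunch EST" becomes easier to filter once it is a concrete window. The sketch below assumes you want the window in UTC and treats EST as a fixed UTC-5 offset; both are assumptions for illustration, and whether your Audit Logs view displays UTC or local time is something to confirm on screen.

```python
from datetime import datetime, timedelta, timezone

# Assumption: EST as a fixed UTC-5 offset (no daylight-saving handling).
EST = timezone(timedelta(hours=-5), "EST")


def bracket_window(reported: datetime, before: timedelta, after: timedelta):
    """Return a (start, end) pair in UTC that brackets a reported local time."""
    if reported.tzinfo is None:
        raise ValueError("reported time needs a timezone")
    start = (reported - before).astimezone(timezone.utc)
    end = (reported + after).astimezone(timezone.utc)
    return start, end


# "After lunch EST": anchor at 13:00 and keep the window deliberately wide.
start, end = bracket_window(
    datetime(2024, 3, 5, 13, 0, tzinfo=EST),
    before=timedelta(hours=1),
    after=timedelta(hours=3),
)
```

Starting wide and narrowing once you have an anchor job id matches the advice in this step; the date used here is a placeholder.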

Step 2: Jobs-first when background work failed

Filter Failed. Click each suspicious row:

  • Does job type match chat vs tool vs workflow runner?
  • Does model + provider show unexpected cloud usage when you expected GPU?
  • Does error cite tool_use / MCP server name / HTTP status from an external API?

Copy verbatim error text into notes: you’ll paste into Admin Agent or tickets.

Step 3: Threads for conversational context

Switch tab → match timestamp + agent, then filter by source (webhook, cron, slack, etc.).

Open detail → read first failing user/tool message; prompts omitting required JSON shape cause loops that look like “AI stupidity.”

Step 4: Events for machinery

Filter Error. Look for credential rotations, integration disconnects, evaluation failures. Pair with Jobs timestamps; lead/lag tells the story (credential revoked → subsequent jobs fail).
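
The lead/lag pairing can be sketched as a small helper over rows you copied by hand. The `(timestamp, label)` tuple shape and the 30-minute lookback are assumptions for illustration, not an Auxot export format.

```python
from datetime import datetime, timedelta


def events_preceding(job_failed_at, events, lookback=timedelta(minutes=30)):
    """Return (lag, label) pairs for events inside the lookback, closest first.

    `events` is an iterable of (timestamp, label) tuples copied from the
    Events tab; events after the failure are ignored.
    """
    window_start = job_failed_at - lookback
    hits = [
        (job_failed_at - ts, label)
        for ts, label in events
        if window_start <= ts <= job_failed_at
    ]
    return sorted(hits, key=lambda pair: pair[0])


failed_at = datetime(2024, 3, 5, 14, 10)
rows = [
    (datetime(2024, 3, 5, 13, 55), "credential rotated"),
    (datetime(2024, 3, 5, 14, 20), "cron.fired"),
    (datetime(2024, 3, 5, 12, 0), "integration disconnected"),
]
suspects = events_preceding(failed_at, rows)
```

An event that lands shortly before the first failed job is a candidate cause, not a proven one; confirm it in the job's error text.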

Step 5: System Health cross-check

Providers offline ⇒ routing falls through to the next link in the failover chain (Providers overview). Fix infrastructure before rewriting prompts.
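
As a mental model only, fall-through can be pictured as "first online link wins". The sketch below illustrates that generic idea and is not Auxot's routing implementation; the provider names are hypothetical.

```python
def next_provider(chain, online):
    """Return the first provider in the failover chain that is online, or None."""
    for name in chain:
        if name in online:
            return name
    return None


# Hypothetical chain: a local tier first, then two cloud providers.
chain = ["local-gpu", "cloud-a", "cloud-b"]
chosen = next_provider(chain, online={"cloud-a", "cloud-b"})
```

If you expected the first link and the job row shows a later one, the earlier link was probably unavailable when the job was routed, which points at infrastructure rather than the prompt.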

Step 6: Close the loop

Document: trigger, root layer, fix owner, verification job. Retry with intentional smaller prompt or isolated workflow task to prove recovery.
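
One way to keep the write-up consistent is a small record with exactly the four fields named above. The class name, the layer list (taken from the Quick start "Done?" line), and the job id format are illustrative assumptions.

```python
from dataclasses import dataclass

# Layers as named in Quick start; adjust if your team uses different labels.
LAYERS = ("ingress", "routing", "model", "tool/MCP", "human step")


@dataclass(frozen=True)
class TraceRecord:
    trigger: str
    root_layer: str
    fix_owner: str
    verification_job: str

    def __post_init__(self):
        # Reject free-form layer names so records stay comparable.
        if self.root_layer not in LAYERS:
            raise ValueError(f"root_layer must be one of {LAYERS}")

    def as_note(self) -> str:
        """Render the four fields as a plain-text note for a ticket."""
        return "\n".join([
            f"Trigger: {self.trigger}",
            f"Root layer: {self.root_layer}",
            f"Fix owner: {self.fix_owner}",
            f"Verification job: {self.verification_job}",
        ])


record = TraceRecord("cron", "tool/MCP", "platform team", "job-123")
```

Restricting `root_layer` to a fixed set is a design choice: it forces the write-up to commit to one layer, which is the point of the trace.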


What’s next

Reference