Sandboxing AI Agents in Production: The Security Checklist Before You Deploy
Five real agent security incidents in one week. The practical checklist for sandboxing, permission scoping, and egress controls before your agents go live.
This week was a stress test. Not a simulation — actual production incidents with actual consequences.
Researchers disclosed that a €0.01 bank transfer could turn a financial AI assistant into a phishing delivery system. An AI agent ran amok on Fedora’s infrastructure, causing unintended system changes. A blog post documented an AI agent that autonomously initiated thousands of API calls during a network scan and bankrupted its operator. And that’s before accounting for the Microsoft open source tooling compromise that handed attackers a path to developer credentials via the AI toolchain.
Five incidents. One week. None of them theoretical.
If your team is evaluating AI agents for production deployment, the question is not whether these failure modes apply to you. It is whether you have built anything to stop them before they happen.
What’s Different About Agent Security
Most AI security conversations focus on model outputs: jailbreaks, hallucinations, inappropriate content. Those are real concerns, but they are the wrong frame for production agents.
The actual risk with agents is that they have real access. They call APIs, write to databases, send emails, execute code, and spawn subprocesses. When a model says something wrong, a human reads it and corrects it. When an agent does something wrong, it may have already transferred money, deleted records, exfiltrated data, or burned through your API budget before anyone notices.
The attack surface is not the model. It is the gap between what the agent is permitted to do and what it should do in any given context.
Prompt injection, where malicious content in the agent’s environment hijacks its behavior, is documented as the primary failure mechanism in production agentic systems. OWASP ranks it first in its Top 10 for LLM Applications. Security researchers at Blue41 published a real proof-of-concept against a financial AI assistant. Cline’s CI/CD postmortem showed how a compromised npm package gave attackers shell access through AI agents running in GitHub Actions. Switching models does not patch prompt injection. What contains it is limiting what the agent can do even when it has been successfully hijacked.
Here is a practical checklist to work through before your next agent goes live.
1. Principle of Least Privilege: Scope the Tools, Not Just the Prompt
Every tool you hand an agent is a potential blast radius. If an agent can read files, it may read files it should not. If it can call APIs, it may call APIs you did not intend.
Before deploying any agent:
- Enumerate every tool the agent has access to. Write this down. If you cannot list them all, the agent has too much access.
- Ask whether each tool is required for the specific task. Cut anything that is not.
- Scope credentials to minimum permissions. If the agent only needs to read from one database table, its credentials should have SELECT on that table — not full schema access.
- Avoid long-lived standing credentials. Issue short-lived, scoped tokens at invocation time rather than embedding permanent OAuth grants in the agent’s configuration.
The bunq incident, the €0.01 bank transfer case described above, makes this concrete: the AI assistant’s tools were not scoped tightly enough to prevent a maliciously crafted bank transfer description from triggering downstream phishing behavior. The agent had the capability. The attacker supplied the instruction via indirect prompt injection. Least privilege breaks that chain.
The same logic applies to file system access, email send permissions, calendar write access, and any other action your agent can take with real-world effects.
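A minimal sketch of tool scoping enforced in code rather than in the prompt. Everything here is hypothetical and for illustration only: the `ToolRegistry` and `ToolScope` names, the `summarize_invoices` task, and the two example tools are assumptions, not part of any particular agent framework.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List


class ToolNotPermitted(Exception):
    """Raised when an agent requests a tool outside its task scope."""


@dataclass(frozen=True)
class ToolScope:
    # Hypothetical task name plus the exact set of tools that task may use.
    task: str
    allowed_tools: FrozenSet[str]


class ToolRegistry:
    """Holds every registered tool and enforces the per-task scope on each call."""

    def __init__(self) -> None:
        self._tools: Dict[str, Callable[..., object]] = {}

    def register(self, name: str, fn: Callable[..., object]) -> None:
        self._tools[name] = fn

    def list_tools(self) -> List[str]:
        # The written enumeration of tools that the checklist asks for.
        return sorted(self._tools)

    def call(self, scope: ToolScope, name: str, **kwargs: object) -> object:
        # Deny by default: unknown or out-of-scope tools never execute.
        if name not in scope.allowed_tools or name not in self._tools:
            raise ToolNotPermitted(f"{name!r} is not allowed for task {scope.task!r}")
        return self._tools[name](**kwargs)


# Illustrative setup: two stand-in tools, one read-only task scope.
registry = ToolRegistry()
registry.register("read_invoice", lambda invoice_id: f"invoice {invoice_id}")
registry.register("send_email", lambda to, body: f"sent to {to}")

summarize = ToolScope(task="summarize_invoices", allowed_tools=frozenset({"read_invoice"}))
result = registry.call(summarize, "read_invoice", invoice_id="42")
```

With this shape, an injected instruction to send email fails at the registry even if the model complies with it, because `send_email` is not in the `summarize_invoices` scope.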
2. Egress Controls: Know Exactly Where the Agent Can Go
An agent that can make outbound HTTP requests to arbitrary URLs is an agent that can exfiltrate data, reach command-and-control endpoints, or be redirected by an injected payload. Cline’s postmortem stated it plainly: giving an LLM shell access in a CI context where it processes untrusted input is functionally equivalent to giving every GitHub user shell access.
Enforce outbound network controls before deployment:
- Allowlist outbound domains explicitly. If your agent needs to call your CRM and your document storage API, it should reach exactly those two domains and nothing else by default.
- Use an enforcement proxy, not just logging. Logging an unauthorized outbound call after the fact does not prevent it. The enforcement point needs to be in-path on every outbound request.
- Block all egress for agents that do not require internet access. An internal document assistant that reads your company knowledge base does not need to reach the public internet. Default-deny is the right posture.
AWS published a pattern for per-domain allowlisting using network policies at the agent egress layer. The principle is straightforward. The discipline to actually implement it before go-live is what most teams skip.
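The allow-or-deny decision an in-path egress proxy makes can be sketched as below. This is an illustration under stated assumptions: the host names in `ALLOWED_HOSTS` are placeholders for the CRM and document storage example, and `egress_allowed` is a hypothetical helper. An application-level check like this illustrates the policy; it is not a substitute for enforcement at the network layer.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist: placeholder hosts for a CRM and a document storage API.
ALLOWED_HOSTS = frozenset({"crm.example.com", "docs.example.com"})


class EgressDenied(Exception):
    """Raised when an outbound request targets a host outside the allowlist."""


def egress_allowed(url: str, allowed_hosts: frozenset = ALLOWED_HOSTS) -> bool:
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    # Exact host match only. Suffix matching would let "crm.example.com.evil.net" through.
    return parts.scheme == "https" and host in allowed_hosts


def check_egress(url: str) -> str:
    # Default-deny: anything not explicitly allowed raises before a request is made.
    if not egress_allowed(url):
        raise EgressDenied(f"outbound request blocked: {url}")
    return url
```

Matching on the parsed hostname rather than on the raw string matters: a URL such as `https://crm.example.com@evil.net/` parses to the host `evil.net` and is denied.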
3. Runtime Sandboxing: Isolate What the Agent Can Touch
If an agent executes code — or runs in an environment where code execution is possible — isolation is not optional. When a Claude Code cleanup script was instructed incorrectly, it wiped a user’s home directory. That was not a model failure. It was an infrastructure failure caused by running agent-generated code directly on the host filesystem.
Sandboxing options, ordered roughly from easiest to most thorough, followed by resource limits that apply at every level:
- Filesystem isolation. The agent sees only its designated working directory. Path traversal attempts are blocked at the filesystem layer, not by trusting the model to avoid them.
- Ephemeral container isolation. Agents run in short-lived containers with a defined set of mounted volumes and no access to host paths or host networking beyond what is explicitly granted.
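- gVisor or MicroVM isolation for high-risk workloads. For agents that execute arbitrary code or process untrusted user input, use a security-hardened runtime that puts an extra boundary between the agent and the host kernel. gVisor intercepts system calls in a user-space kernel, and Firecracker runs the workload in a lightweight virtual machine with its own guest kernel. Either approach limits what a compromised agent can do to the host.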
- Hard resource limits on every agent run. Set explicit ceilings on CPU time, memory, API calls per invocation, and total token spend. The “agent bankrupted its operator” category of failure is almost always a missing resource limit story.
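As a small illustration of the filesystem isolation item above, the sketch below confines agent-requested paths to a working directory. The `resolve_in_workspace` helper and the temporary workspace are assumptions made for the example. A check like this complements container or mount-level isolation and does not replace it, since it cannot protect against races between the check and the actual file access.

```python
import tempfile
from pathlib import Path


class PathEscape(Exception):
    """Raised when a requested path resolves outside the agent's workspace."""


def resolve_in_workspace(workspace: Path, requested: str) -> Path:
    root = workspace.resolve()
    # resolve() collapses ".." segments and follows symlinks, so the
    # comparison below is made on real paths, not on the raw string.
    candidate = (root / requested).resolve()
    if candidate != root and root not in candidate.parents:
        raise PathEscape(f"{requested!r} resolves outside the workspace")
    return candidate


# Illustrative workspace: a fresh temporary directory stands in for the
# agent's designated working directory.
workspace = Path(tempfile.mkdtemp())
inside = resolve_in_workspace(workspace, "notes/summary.txt")
```

Absolute paths are rejected by the same check: joining an absolute path onto the workspace root discards the root, so the resolved result falls outside it.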
4. Circuit Breakers and Human Review Gates
No sandbox is airtight, and there is genuine debate about whether prompt injection is a patchable bug or a structural property of LLM architectures. Your secondary defense layer is stopping runaway or hijacked behavior before it completes its damage.
Implement these controls:
- Budget caps per run and per day. When an agent hits the spend limit, it stops and alerts. It does not continue until a human reviews the situation.
- Confirmation gates for irreversible actions. Any action that is difficult or impossible to undo — sending an email, deleting a record, submitting a form, making a payment — should require explicit confirmation before execution. A human in the loop at the right moment is a safety control, not a workflow bottleneck.
- Rate limits on outbound calls. If an agent is designed to make ten API calls per run, a circuit breaker at fifty is your safety net against runaway loops.
- Anomaly alerting. If an agent’s action log looks dramatically different from its baseline — more calls than expected, unfamiliar domains, unusual file paths — surface that to a human before execution continues.
The Ponytail project that surfaced on Hacker News this week captures the right mental model: make your agent think like the laziest senior engineer in the room. Do less. Ask first. The autonomy ceiling should be set explicitly, not left to whatever the model decides.
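The budget cap and confirmation gate from this section could be sketched as follows. The names `RunBudget`, `guarded_call`, and the contents of `IRREVERSIBLE` are hypothetical, and spend is tracked in integer cents purely to keep the arithmetic exact in the example.

```python
from dataclasses import dataclass
from typing import Callable


class BudgetExceeded(Exception):
    """Raised when a run hits its call or spend ceiling."""


class ConfirmationRequired(Exception):
    """Raised when an irreversible action lacks explicit human approval."""


@dataclass
class RunBudget:
    max_calls: int
    max_spend_cents: int
    calls: int = 0
    spend_cents: int = 0

    def charge(self, cost_cents: int) -> None:
        # Check before recording, so the ceiling is never crossed.
        if self.calls + 1 > self.max_calls:
            raise BudgetExceeded("call limit reached")
        if self.spend_cents + cost_cents > self.max_spend_cents:
            raise BudgetExceeded("spend limit reached")
        self.calls += 1
        self.spend_cents += cost_cents


# Hypothetical set of actions treated as irreversible.
IRREVERSIBLE = frozenset({"send_email", "delete_record", "make_payment"})


def guarded_call(
    budget: RunBudget,
    action: str,
    cost_cents: int,
    run: Callable[[], object],
    approved: bool = False,
) -> object:
    # Confirmation gate first: an unapproved irreversible action never runs
    # and never consumes budget.
    if action in IRREVERSIBLE and not approved:
        raise ConfirmationRequired(action)
    budget.charge(cost_cents)
    return run()
```

The caller is expected to treat both exceptions as a hard stop that alerts a human, rather than catching them and retrying.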
5. Audit Logs: Tamper-Evident and Stored Separately
When something goes wrong — and eventually something will — you need a complete, verifiable record of what the agent did, when, with what inputs, and with what results. This is a debugging requirement, a compliance requirement, and an increasingly standard ask in enterprise procurement conversations.
Your agent action logs should capture:
- Every tool call: name, inputs, output, timestamp
- Which user or system triggered the agent run
- Which model was used and which version
- The context passed to the model (or a content hash for privacy-sensitive workloads)
- Cost per run: token count, model charges, API fees
The log must be append-only and stored outside the agent’s operating environment. An agent with write access to its own logs can modify them. For regulated deployments in healthcare, finance, or legal, the logs should be exportable in a format your compliance team can actually work with.
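One common way to make a log tamper-evident is to chain entries by hash, so that editing any earlier record invalidates every later one. The sketch below is an assumption-laden illustration: `append_entry` and `verify_chain` are hypothetical helpers, and the in-memory list stands in for an append-only store outside the agent's environment. Hash chaining makes tampering detectable. It does not prevent it, which is why the separate storage requirement still applies.

```python
import hashlib
import json
from typing import List

GENESIS = "0" * 64  # Placeholder "previous hash" for the first entry.


def _digest(prev_hash: str, record: dict) -> str:
    # Canonical JSON so the same record always hashes to the same value.
    body = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_entry(log: List[dict], record: dict) -> dict:
    prev_hash = log[-1]["hash"] if log else GENESIS
    entry = {"prev": prev_hash, "record": record, "hash": _digest(prev_hash, record)}
    log.append(entry)
    return entry


def verify_chain(log: List[dict]) -> bool:
    prev_hash = GENESIS
    for entry in log:
        # Each entry must point at its predecessor and match its own contents.
        if entry["prev"] != prev_hash or entry["hash"] != _digest(prev_hash, entry["record"]):
            return False
        prev_hash = entry["hash"]
    return True
```

Each `record` would carry the fields listed above, such as the tool name, inputs, output, timestamp, and the triggering user.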
The Pre-Deployment Checklist
Before your next agent goes live, answer these eight questions:
- Can you list every tool the agent has access to? Are all of them necessary for this specific task?
- What credentials does the agent use? Are they scoped to minimum permissions at the specific resource level?
- Does the agent make outbound network requests? Is there an in-path allowlist enforcing which domains it can reach?
- Does the agent execute code or write files? Is it running in an isolated environment?
- What is the maximum API spend this agent can incur per run? Per day?
- Which agent actions are irreversible? Does each one require a human confirmation step?
- Where are the agent’s action logs stored? Can the agent modify them?
- Who gets alerted when the agent errors, hits a resource limit, or produces anomalous output?
If you cannot answer all eight, the agent is not ready for production.
Why Infrastructure Ownership Matters Here
These controls are significantly easier to implement — and to audit — when you control the infrastructure your agents run on.
When your agent platform lives in someone else’s cloud, you are trusting their sandbox, their network policies, their logging, and their access controls. You cannot inspect them directly. You cannot extend them to fit your specific threat model. You cannot export them for a compliance audit without going through the vendor’s process on the vendor’s timeline.
That is the practical argument for self-hosted agent infrastructure: not that the cloud is inherently insecure, but that you cannot build a serious security posture on top of something you cannot fully inspect or control. The checklist above assumes you have access to network egress configuration, container runtimes, credential management, and log storage. You have that access on infrastructure you own.
Auxot is built for teams who need this level of control. You deploy it on your hardware or your own cloud account. The governance layer — model routing, action logging, access controls, cost limits — runs on your servers. When your auditor asks where the agent action logs are and who can access them, the answer is: on your infrastructure, under your control.
Get started at auxot.com/install. Deployment hardening guides are at auxot.com/tutorials.