How I Sandbox AI Coding Agents

The only real way to run agentic workloads is to run them autonomously. That requires bypass permissions or yolo mode.

Which is dangerous.

The agent has the tools and means to take any action. An agent could be prompt injected and compromised, which of course is a threat surface, but also that the agent might just decide to run a dangerous command.

Subscribe now

So how do you run yolo mode and still keep safe?

I previously wrote a post about the agent security stack and I have been working to make this my day to day harness where I run claude code and pi.

In this post I will explain my choices and where they have landed. I am not sure if this is the best way, but this is what I have right now.

The main point of all this build is to design in tiers. Clean abstractions. Short lived task sandboxes with limited blast radius and no credentials available to the actual underlying agents at all.

Project based LimaVM for host isolation.

The way I work is that I have a main Chief of Staff orchestrator agent which controls per task worktrees. I run all of these inside a lima vm for each project. For me projects are per client. By running them within the lima vm I isolate the projects and my host from any rogue agents and keep client information locked within the vm with no host side mounts.

Inside the vm, I have openshell running individual sandbox per coding agent instance. So it looks something like this.

Host machine
└── Lima VM  (one per project, persistent)
    └── OpenShell sandbox  (one per active task, ephemeral)

The outer tier is a Lima VM, one per project. It runs the OpenShell gateway daemon, which is the piece responsible for network policy enforcement. Creating it takes ~15 seconds and happens once when you register a project. After that it stays running between tasks, so task launch pays none of that startup cost. It also holds no host filesystem access and no host SSH keys. A compromised VM can’t reach back to the machine running it.

The inner tier is an OpenShell sandbox, one per task. It’s backed by a Podman container using a custom image I build per project. It starts at task launch and gets deleted at task end. If a task goes wrong or an agent misbehaves, process goes haywire, the blast radius is one container, not the VM, not the host.

The sandbox image

I have to make a custom image that extends the OpenShell community base image with the tools a coding agent actually needs, which in my case is

Node.js 20 + Claude Code, Codex and Pi : Coding agent binaries
Python 3.12 + uv — for projects that use Python
psql — so agents can run database tests directly
Tirith — a fail-closed shell-command guard, installed at a root-owned path so the agent can’t overwrite it

Two things that aren’t in that list are worth naming. There’s no curl binary. It’s explicitly excluded because it’s a common way to exfiltrate data or pull arbitrary payloads, and nothing the agent legitimately does needs it (node and claude can reach the permitted endpoints). There’s also no host bind-mount, the agent only ever sees the project repo, cloned fresh into the sandbox at task start.

Network policy: no static allowlist

The Privacy Router is what the sandbox uses as its default gateway. Every outbound HTTPS connection goes through it. The router’s job is to decide: should this connection be allowed?

I don’t use a static allowlist. The Privacy Router runs AI inference to make allow/block decisions, using a policy.yaml file as input. That policy describes intent. “Claude Code needs to reach Anthropic’s API,”, “git needs to reach GitHub” and the router’s inference decides whether the actual outbound request matches that intent.

The policies I baked into the image:

| Policy | Binaries | Purpose |
|---|---|---|
| `claude_code` | `claude`, `node` | Anthropic API |
| `github_git_rw` | `git` | GitHub push + pull |
| `github_api` | `claude`, `gh`, `git` | GitHub REST API — PR creation |
| `pypi` | `uv`, `pip` | Package installs |
| `npm` | `npm`, `node` | npm registry |
| `codex` | `codex`, `node` | OpenAI API (reviewer agent) |
| `local_llm` | `pi`, `claude`, `node` | Local LLM at `llm.local` |

One gotcha I hit while setting this up: the AI inference inside the Privacy Router uses its training data to decide if a hostname is legitimate. Custom hostnames that aren’t in the training set, things like inference.local get blocked even if you write a policy entry for them. The workaround is to use OpenShell’s built-in inference routing (more on that below) rather than trying to reach custom hostnames directly through the policy.

Credentials: the placeholder model

This is the part that required the most design work.

The naive approach is to put credentials in the sandbox as environment variables. The agent picks them up, makes API calls, everything works. The problem is that credentials then exist in plaintext inside the container. Container logs, process listings, debug dumps, any of these could leak them.

This is something I have been especially concerned with as I have sent my agents onto the internet because rogue pages could prompt inject your agent to reveal its credentials. Even though it does not happen with the modern and more advanced models but the gap is not closed it could happen anytime. Which means that if it is compromised then the credentials could be lost.

The approach I use: Real credentials NEVER enter the agent’s sandbox.

A provisioner running on the host machine, i.e. my desktop computer, registers them as OpenShell Providers in the gateway before the sandbox is created, and the Privacy Router, which runs in the VM at the gateway tier substitutes them on outbound HTTPS connections. The credentials do live in the VM (the gateway must hold them to do the substitution); the trust boundary is the sandbox container, where the agent only ever sees an opaque placeholder.

Something like this for each credential:

ANTHROPIC_OAUTH_TOKEN=openshell:resolve:env:ANTHROPIC_OAUTH_TOKEN

That string is meaningless to anything except the Privacy Router. When claude makes a request to api.anthropic.com with Authorization: Bearer openshell:resolve:env:..., the router intercepts it, resolves the real token, and substitutes it before the packet leaves the VM. The agent process never sees the real value.

The credential declarations live in cred-providers.yaml, tracked in git with no secrets:

sandbox:
  providers:
    - type: claude_code
      token_env: ANTHROPIC_OAUTH_TOKEN

    - type: pi
      model_id: Qwen3.6-35B-A3B-UD-IQ4_XS.gguf

The real values live in a gitignored .env file. The provisioner reads both at task launch time and registers the providers. If a provider is declared but its *_env key is missing from .env, you get a hard error.

The credential injection gotcha

I use Claude Max plan which is Oauth token based. However, Claude Code doesn’t read credentials from environment variables at API-call time. It validates them locally first, before making any network request, and refuses to proceed if what it sees doesn’t look like a real token. So you can’t just set ANTHROPIC_OAUTH_TOKEN in the sandbox environment and call it done. Claude Code will reject the placeholder before it ever hits the Privacy Router.

The fix is to pass the placeholder via ANTHROPIC_AUTH_TOKEN in the exec environment when launching claude inside the sandbox, not as a static sandbox env var, but at exec time:

# 1. Read the placeholder from sandbox env
rc, placeholder, _ = provisioner.exec(sbx_id,
    ["bash", "-c", "printf '%s' \"$ANTHROPIC_OAUTH_TOKEN\""])

# 2. Pass it as ANTHROPIC_AUTH_TOKEN when running claude
exit_code, stdout, _ = provisioner.exec(
    sbx_id,
    ["claude", "-p", prompt, "--output-format", "text"],
    environment={"ANTHROPIC_AUTH_TOKEN": placeholder},
)

Claude Code skips its local validation for ANTHROPIC_AUTH_TOKEN (it’s designed to work with dynamically resolved tokens), emits Authorization: Bearer <placeholder>, and the Privacy Router substitutes the real OAuth token on the wire. eader.

Two egress layers, for different threat models

The Privacy Router handles HTTPS. But HTTPS isn’t the only way traffic can leave a container. An agent could try to reach an internal IP directly. The host machine, the Lima VM’s internal network, other containers.

The second layer is iptables on the Lima VM, applied to the Podman container network (10.89.0.0/24):

ALLOW  → 192.168.5.2:8443   (llama.cpp HTTPS on host — only raw TCP exception)
DROP   → 10.0.0.0/8
DROP   → 172.16.0.0/12
DROP   → 192.168.0.0/16

The reasoning is threat-model separation. The Privacy Router is very good at policing HTTPS destinations. It’s not involved at all if something tries a raw TCP connection to a private IP. iptables is very good at blocking those. Neither alone covers both attack surfaces.

Local LLM inference

I have recently started using local models heavily. I think they are getting close to performance, at least for sonnet like workloads.

My host machine runs llama.cpp with TLS on port 8443. The obvious thing to do would be to add a policy entry allowing the sandbox to reach 192.168.5.2:8443 directly. That doesn’t work.

The Privacy Router’s AI inference step blocks RFC 1918 destinations. That is private IPs, regardless of explicit policy entries. Even if you write a rule saying “allow connections to 192.168.5.2,” the inference step overrides it because that looks like internal network scanning.

The correct mechanism is OpenShell’s inference routing. You register a provider and tell the router to route inference.local to it:

provisioner.configure_local_llm(model_id="Qwen3.6-35B-A3B-UD-IQ4_XS.gguf")

This calls CreateProvider pointing at http://192.168.5.2:11434(the host IP) and SetClusterInference to route inference.local there.

Inside the sandbox, agents call

https://inference.local/v1/chat/completions.

The Privacy Router intercepts that, sees it’s a recognized inference routing destination, and forwards it to the provider’s backend without the private-IP block applying.

Still had to jump a small hurdle in the backend chain. The OpenShell gateway’s outbound HTTP client uses Rust’s rustls with bundled WebPKI CAs. It can’t verify self-signed TLS certs. llama.cpp runs with TLS and a self-signed cert. So the gateway can’t connect to llama.cpp directly.

The fix is a two-hop bridge. The host runs a socat relay (HTTP:11434 → HTTPS:8443, without verification) that the gateway can connect to over plain HTTP. Inside each Lima VM, a llm-http-bridge service does a similar relay (HTTP:8080 → llama.cpp on the host, this time with TLS verification against the CA cert).

The full path:

sandbox
  → https://inference.local
    → Privacy Router (inference routing)
      → http://192.168.5.2:11434  (host socat relay)
        → https://localhost:8443  (llama.cpp TLS)

This is more hops than I wanted, but each exists for a specific reason: the inference routing avoids the RFC 1918 block, the host socat relay works around rustls’s CA limitations, and the VM-side bridge adds TLS verification for the internal segment.

Tirith: guarding what the agent runs

Tirith inspects every shell command the agent runs and blocks the dangerous ones such as curl … | bash, homograph URLs, obfuscated payloads before they execute.

The role it fills is different from the Privacy Router’s. The Privacy Router sees network traffic and decides what hosts are reachable. It has no view into the shape of a command, a pipe-to-shell, a base64-decode-and-exec, a typosquatted domain. Tirith covers that layer. Both are needed; neither alone is enough.

Wiring it up correctly was challenging and taught me much about where command auditing actually has to live. Tirith’s bash hook is built for interactive terminals as it binds the Enter key or a preexec trap, neither of which fires when commands run as non-interactive bash -c.

So “add it to the shell profile” guards nothing.

For commands my runner dispatches into the sandbox, the mechanism that works is BASH_ENV: bash sources that file on every non-interactive bash -c and exposes the command in $BASH_EXECUTION_STRING, so I point it at a small guard script that runs tirith check and exits before the command runs if Tirith flags it.

That covers what the runner runs but not what the agent runs. Claude Code executes its own Bash tool in a persistent, sanitised shell that BASH_ENV never reaches, so the guard never sees those commands.

I only caught this because a test that ran a malicious command through the agent came back clean when it shouldn’t have!

The fix is Claude Code’s own PreToolUse hook: it fires before every Bash tool call. Even and especially in --dangerously-skip-permissions mode, and a hook that returns “deny” blocks the command. So the same tirith check runs there too. Critically, I register that hook in Claude Code’s managed settings (/etc/claude-code/managed-settings.json), the root-owned tier the agent can’t override — otherwise the agent could simply delete its own guard.

The policy itself is baked into the image and forced via TIRITH_POLICY_ROOT, which outranks any .tirith/policy.yaml the cloned repo might ship so that an untrusted repo can’t weaken its own sandbox. Tirith and both guards live at root-owned paths the non-root agent can’t overwrite None of this is a hard guarantee as a determined agent has other avenues but it closes the easy ones and makes the command layer enforceable, not just observable.

Where this lands

Stepping back, the whole thing is a set of nested boundaries, each one built on the assumption that the boundary inside it might already be compromised.

Two ideas run through all of it. The first is that no credential ever reaches the agent and the trust boundary is the sandbox container, and inside it there are only placeholders. The real tokens live one tier up, in the VM gateway, and get substituted on the wire.

The second is defense in depth that respects threat models: Each layer assumes the others can fail, and no single layer is asked to police an attack surface it isn’t suited for. The Privacy Router is excellent at HTTPS and useless against a raw socket while iptables is the reverse. Tirith sees command shape but not network destinations. Stacked, they cover each other’s blind spots.

None of this makes yolo mode safe in an absolute sense and I want to be honest about that. Writing security software is an exercise in eventual disappointment!

A sufficiently determined, deeply compromised agent still has avenues I haven’t very likely closed. What this setup does do is shrink the blast radius to a throwaway container, keep the keys out of the agent’s reach entirely, and turn “the agent can do anything” into “the agent can do anything inside a smaller box.”

That’s where I’ve landed on for running coding agents autonomously, day to day. I’m still not sure it’s the best way but it’s one I can leave running overnight without staring at the permission prompt.