Tutorial 10

Bootstrap a GPU worker

Run AI on your own hardware — connect a local GPU to Auxot so your agents think on a machine you control, with no data leaving your building.

Plus: three prompts that turn the Admin Agent into your GPU-worker setup partner — picking the right model for your hardware, diagnosing a stuck connection, and recommending a fallback when the GPU is offline.

Audience: Admins · Developers
Time: ~15 min (most of it waiting on a download)
Prerequisites: An Auxot account where you're an org admin. A computer with a GPU and Node.js 20+ installed. Network access between that computer and your Auxot server. (Free tier is fine; GPU workers aren't tier-gated.)
You'll end up with: One GPU worker connected to your Auxot account, downloading and serving an AI model on hardware you own, so your agents now think locally.

Why this matters

Cloud AI providers are convenient, but every prompt your agents send leaves your building. For some businesses that’s no problem. For others — anyone handling client records, legal documents, financials, health information, or anything else with compliance weight — it’s a non-starter.

A GPU worker is the alternative. You run a small piece of software on a computer with a GPU you own. Auxot routes your agents’ work to that machine. The AI model thinks there. The answer comes back. No data ever touches a cloud provider.

Auxot makes this simpler than it sounds. You don’t write any code. You don’t compile anything. You don’t pick a quantization scheme by hand. You install Node.js, run one npx command with a key Auxot gives you, and the worker handles everything else: detecting your GPU, downloading the model, starting up.

The trade-off: you handle the hardware. Auxot doesn’t manage your driver versions, your power costs, or your fan speed. But for the right use case — sensitive data, predictable cost, full control — running locally is exactly the point.

Today, connect one GPU worker. Tomorrow, your agents are thinking on hardware you own, with no data leaving your network.

You generate a one-time GPU key in Auxot. You run a single command on your GPU computer with that key. The worker connects to your Auxot server, downloads the model you picked, and starts serving inference. Your agents start using it within a minute or two.


Quick start

  1. Sign in to Auxot as an org admin.
  2. Open Providers (Settings → Providers).
  3. Add a GPU provider — click + Add Provider, pick GPU, name it (something like “Office GPU”), and create. Auxot generates a one-time key. Copy it — you’ll only see it once.
  4. Pick a model — on the new provider’s detail page, choose a model family, model, and quantization the worker should serve. (Recommendations and VRAM estimates are shown right there.)
  5. Run the worker on your GPU computer — open a terminal on the machine with your GPU and run:
    npx --yes @auxot/worker-cli --gpu-key gpu.YOUR_KEY_HERE --router-url wss://your-auxot-host/ws
    
    Replace gpu.YOUR_KEY_HERE with the key you copied in Step 3, and your-auxot-host with your Auxot server’s address.

Done? Within a minute or two, the worker downloads the model and connects. The provider’s status dot turns green. Your agents can now route work to your GPU.


The agent can do that?

You don’t have to make all the GPU-worker decisions yourself. The Admin Agent can recommend models based on your hardware, diagnose stuck connections, and configure fallback for when the GPU is unavailable. Three prompts.

1. Have the Admin Agent recommend the right model for your hardware

Open chat with the Admin Agent and ask:

I just connected a GPU worker. The hardware is [describe — e.g., "RTX 4090 with 24GB VRAM," "Apple Silicon M2 Max with 64GB unified memory," "two A100s, 80GB each"]. Looking at the work my agents do — [describe one or two agents' jobs] — what model and quantization should I run? Recommend two options: a starting choice that's safe and a more ambitious choice once I'm comfortable.

Why it’s non-obvious: Picking a model is the part of GPU setup most people get wrong by guessing. Pick too big, you’ll get out-of-memory errors. Pick too small, the model can’t handle the work and you’ll be frustrated. The Admin Agent can match your specific hardware to the actual jobs your agents do — small VRAM with a long-context job means a different model than big VRAM with a creative-writing job. Three minutes of conversation, weeks of frustration avoided.

2. Diagnose a worker that won’t connect

When the worker’s terminal output looks healthy but the status dot in Auxot stays gray:

My GPU worker is running on [machine description] but the Worker Connection dot in Auxot stays gray. The worker terminal shows [paste the last 20 lines of output]. What's the most likely cause, and what should I check next?

Why it’s non-obvious: Connection failures usually fall into one of four buckets — wrong URL (the /ws suffix is missing or the host is unreachable), wrong key (rotated, copied wrong, or expired), firewall blocking WebSocket traffic, or the worker process actually crashed but the terminal hasn’t shown the error yet. The Admin Agent can read your output and tell you which bucket it is, instead of you cycling through all four.

3. Set up a cloud fallback so agents keep working when the GPU is offline

Same chat:

I want my agents to keep working even when my GPU is offline (powered down at night, model being switched, machine restarting). Recommend a fallback configuration — which cloud provider should the GPU fall back to, under what conditions, and walk me through setting it up.

Why it’s non-obvious: Auxot’s queue-don’t-fail design means agents wait gracefully when the GPU goes away. But waiting isn’t the same as working. Adding a cloud fallback (covered in Tutorial 09) turns “wait until the GPU comes back” into “use the cloud invisibly until the GPU is back, then route there again.” The Admin Agent can recommend the right rules and walk you through the configuration in one conversation.


Go deeper

What the worker actually does, in plain English

When you run npx --yes @auxot/worker-cli ..., here’s what happens:

  1. Downloads the worker binary for your operating system and CPU architecture (Linux x86, macOS ARM, Windows x86, etc.). Cached after first run.
  2. Downloads llama.cpp, the AI inference engine. Cached after first run.
  3. Connects to your Auxot server over a WebSocket using the key you provided. The server validates the key and tells the worker which model to serve.
  4. Downloads that model from Hugging Face (a public model repository). Cached in ~/.auxot/models/. Re-uses if you already have it.
  5. Starts llama.cpp in server mode, loaded with the model on your GPU.
  6. Reports for duty to your Auxot server. From this point on, when an agent’s work needs inference, your Auxot server sends the request through the WebSocket, llama.cpp processes it, and the answer streams back the same way.

The worker stays connected continuously. It sends a heartbeat every few seconds; if your Auxot server doesn’t see one for 60 seconds, it marks the worker offline. Reconnect and the worker is online again.
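The online/offline rule above is simple enough to sketch in a few lines of shell. This is only an illustration of the 60-second rule as the text describes it, not Auxot's server code; the function name and the use of Unix timestamps are assumptions made for the example.

```shell
# Classify a worker from the time of its last heartbeat.
# Arguments: last_seen and now, both Unix timestamps in seconds.
# (Hypothetical helper; the 60-second threshold comes from the text above.)
worker_status() {
  last_seen=$1
  now=$2
  timeout=60
  if [ $((now - last_seen)) -lt "$timeout" ]; then
    echo online
  else
    echo offline
  fi
}

# Usage: worker_status "$last_heartbeat" "$(date +%s)"
```

A heartbeat every few seconds leaves a wide margin under the 60-second threshold, so a single dropped heartbeat should not flip the worker to offline.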

No part of this involves Auxot or its developers seeing your data. The traffic is between your Auxot server (which you run) and your GPU machine (which you run). The model itself is downloaded from Hugging Face, but only the model file — not your prompts, not your responses, not your context files.

Hardware sizing — what model fits on what

Rough VRAM expectations for common configurations (using Q4_K_M quantization, a good balance of quality and size):

  • 8 GB GPU (consumer card, older Apple Silicon) — small models only: 7B class. Good for testing and lightweight tasks.
  • 12–16 GB GPU (RTX 4060/4070, M1 Pro/Max) — comfortable with 7B models, can run 13B with smaller context windows.
  • 24 GB GPU (RTX 4090, M2 Max 32GB) — comfortable with 30B models, the sweet spot for most use cases.
  • 48–64 GB GPU/system (M3 Max 64GB, A6000) — comfortable with 70B models. This is where the model genuinely starts to feel like the major cloud providers.
  • 80+ GB GPU (A100 80GB, multiple A100s) — 70B at higher quality settings, or experiment with frontier-class models.

The detail page in Auxot shows a VRAM estimate next to each model and quantization combination. Match the estimate to your hardware. A bit of headroom (10–20%) is good — running right at the limit can cause crashes when context grows during a long conversation.
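If you want to check the headroom arithmetic before committing to a download, a small helper like the one below can do it. It is a sketch under simple assumptions: the function name is made up for this example, the estimate is the number you read off the detail page, and all values are whole numbers (gigabytes and percent).

```shell
# Succeed if a model's VRAM estimate fits on the GPU with the given headroom.
# Arguments: estimate_gb vram_gb headroom_percent (whole numbers).
# (Hypothetical helper, not part of the Auxot worker.)
fits_with_headroom() {
  estimate_gb=$1
  vram_gb=$2
  headroom_pct=$3
  # Integer form of: estimate * (1 + headroom/100) <= vram
  [ $((estimate_gb * (100 + headroom_pct))) -le $((vram_gb * 100)) ]
}

# Usage:
#   fits_with_headroom 18 24 20 && echo "fits" || echo "too tight"
```

With 20% headroom, an 18 GB estimate fits on a 24 GB card (18 × 1.2 = 21.6), while a 22 GB estimate does not fit even at 10% (22 × 1.1 = 24.2).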

Troubleshooting

  • The worker says it’s connected but the status dot in Auxot stays gray. The worker is running but the Auxot server isn’t receiving heartbeats. Most common cause: the --router-url is missing the /ws suffix. The full URL must end in /ws (e.g., wss://your-host/ws, not just wss://your-host). Stop the worker (Ctrl-C), fix the URL, re-run.
  • “Authentication failed” or “key rejected” in the worker output. The key you used isn’t valid. Either you copied it wrong, or someone rotated the key after you copied it. Open the provider’s detail page, click Rotate Key, copy the new key, re-run with that.
  • The model download takes forever or fails partway through. Hugging Face occasionally rate-limits, especially for very large model files. The download is resumable — if it fails, just re-run the same npx command and it’ll pick up where it left off. If your disk is filling, check ~/.auxot/models/ and clear out old models you’re not using.
  • “Out of memory” errors when the model loads. The model is too big for your GPU. Pick a smaller quantization (Q4_K_S instead of Q4_K_M, or Q3_K_M) on the provider detail page, or pick a smaller model. The worker will reload with the new choice on its next reconnect.
  • The worker disconnects and reconnects in a loop. Usually a network problem — packet loss between the GPU machine and your Auxot server, or aggressive firewall rules dropping idle WebSocket connections. Check with your network admin.
  • The GPU isn’t being used (CPU is hot, GPU is idle). Drivers aren’t installed properly. The worker fell back to CPU mode. Stop the worker, install your GPU’s drivers, re-run. The worker auto-detects the GPU at startup.
  • Everything looks right but agents aren’t using the GPU. Check Settings → Providers → Auto routing. The default fallback chain prefers GPU first, but if the org default is set to a specific cloud provider, the GPU only gets used when explicitly picked. Set Auto routing to your GPU (or leave it blank to use the default chain).

Variations & edge cases

  • Air-gapped deployments. If the GPU machine has no internet (compliance environments), provide the model file and llama.cpp binary yourself with --model-path and --llama-server-path flags. Skips the Hugging Face download entirely.
  • Multiple GPU workers. You can run more than one worker — same model on different machines (load distribution), or different models on different machines (specialized routing). Each is its own provider in Auxot, each with its own key.
  • Switching models. Stop the worker (Ctrl-C), change the model on the provider detail page, re-run the same npx command. The worker downloads the new model and serves it.
  • Auto-start at boot. Set the worker up as a system service so it survives reboots. Different per OS — search “run npx command as system service [your OS]” for current instructions.
  • CPU-only mode. If no GPU is detected, the worker runs on CPU. Slow but functional for testing or light use. Stick to 7B models in CPU mode.
  • Free tier: GPU workers aren’t tier-gated. Free tier accounts can connect a GPU worker the same way Business and Enterprise can.
  • Multi-team scoping (Business and Enterprise): Assign the GPU provider to specific teams on its detail page if you want only certain teams to be able to route work to it.
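
For the air-gapped variation in the list above, it can help to confirm both local files are in place before starting the worker. The check below is a sketch: the function name and the file paths in the usage comment are placeholders, and only the --model-path and --llama-server-path flags come from this tutorial.

```shell
# Succeed if the model file exists and the llama.cpp server binary is executable.
# Arguments: model_path llama_server_path
# (Hypothetical pre-flight check, not part of the Auxot worker.)
airgap_files_ready() {
  model_path=$1
  llama_server_path=$2
  [ -f "$model_path" ] || { echo "model file not found: $model_path" >&2; return 1; }
  [ -x "$llama_server_path" ] || { echo "llama.cpp server not executable: $llama_server_path" >&2; return 1; }
}

# Usage (paths are placeholders for wherever you copied the files):
#   airgap_files_ready /opt/models/model.gguf /opt/llama/llama-server &&
#     npx --yes @auxot/worker-cli --gpu-key "$AUXOT_GPU_KEY" \
#       --router-url wss://your-auxot-host/ws \
#       --model-path /opt/models/model.gguf \
#       --llama-server-path /opt/llama/llama-server
```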

Walkthrough

Step 1: Confirm your GPU computer is ready

Before you start in Auxot, confirm three things on the machine that has the GPU:

a. Node.js 20 or newer is installed

Open a terminal and run:

node --version

You should see something like v20.10.0 or higher. If not, install Node.js from nodejs.org (download the “LTS” version — Long-Term Support — and run the installer). Re-run node --version after install to confirm.
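
If you would rather script the check than read the version by eye, something like the following works in any POSIX shell. The function names are made up for this example; the only assumption is that node --version prints a string of the form v20.10.0.

```shell
# Print the major version from a string like "v20.10.0".
node_major() {
  version=${1#v}                  # drop the leading "v"
  printf '%s\n' "${version%%.*}"  # keep everything before the first dot
}

# Succeed only if the version string meets the Node.js 20+ requirement.
node_is_new_enough() {
  [ "$(node_major "$1")" -ge 20 ]
}

# Usage on the GPU machine:
#   node_is_new_enough "$(node --version)" && echo "Node.js OK" || echo "Install Node.js 20+"
```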

b. The computer can reach your Auxot server

If your Auxot server is at https://auxot.yourcompany.com, open a browser on the GPU machine and visit it. You should see the Auxot login page. If you don’t, the GPU machine doesn’t have a network path to your server — fix that before proceeding (a firewall rule, a VPN, a DNS issue — your network admin will know).

c. Your GPU is one Auxot supports

Auxot’s worker auto-detects three kinds of GPU acceleration:

  • NVIDIA on Linux or Windows — needs CUDA 12.4+ drivers installed. Check with nvidia-smi in your terminal.
  • AMD on Linux — uses Vulkan. Most recent AMD GPUs (RX 6000-series and newer) work.
  • Apple Silicon Mac (M1, M2, M3, M4) — uses Metal. No driver setup needed.

If none of the above, the worker will still run in CPU-only mode — but it’ll be slow, and you’ll be limited to small models. CPU mode is fine for testing, not for production.

Tip: Not sure which GPU you have? On Linux or Windows, nvidia-smi reports NVIDIA cards; on Linux, lspci | grep VGA lists any GPU. On Mac: Apple menu → About This Mac → Graphics. If you see something with at least 8 GB of VRAM, you can run small models comfortably. 24 GB and up gives you serious flexibility.
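
As a rough guide to which of the three acceleration paths applies to your machine, the sketch below maps a few facts about the system to a label. It is not the worker's own detection logic, which this tutorial does not describe; the function name and the labels are invented for the example, and it cannot tell whether an AMD card is actually present.

```shell
# Guess the acceleration path from the OS name, CPU architecture,
# and whether nvidia-smi is on the PATH ("yes" or "no").
# (Hypothetical helper; labels are for illustration only.)
classify_backend() {
  os=$1
  arch=$2
  has_nvidia_smi=$3
  if [ "$os" = "Darwin" ] && [ "$arch" = "arm64" ]; then
    echo "metal"            # Apple Silicon Mac
  elif [ "$has_nvidia_smi" = "yes" ]; then
    echo "cuda"             # NVIDIA tooling is installed
  elif [ "$os" = "Linux" ]; then
    echo "vulkan-or-cpu"    # AMD via Vulkan if a supported card is present, else CPU
  else
    echo "cpu"
  fi
}

# Usage on the GPU machine:
#   if command -v nvidia-smi >/dev/null 2>&1; then nv=yes; else nv=no; fi
#   classify_backend "$(uname -s)" "$(uname -m)" "$nv"
```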

Step 2: Sign in to Auxot

Open Auxot in your browser and sign in. You need to be an org admin for the next steps; non-admin users can see providers but can’t add or configure them.

Step 3: Open Providers and add a GPU provider

Click Settings in the left menu, then Providers. Click + Add Provider.

In the modal:

  • Provider type — pick GPU.
  • Name — what to call this worker. Something descriptive: “Office GPU,” “Server room A100,” “Mac Studio downstairs.” Names matter when you have several connected.

The modal shows this hint: “A GPU key will be generated — copy it on the next screen to configure your worker.”

Click Create. Auxot drops you on the provider’s detail page.

Step 4: Copy the one-time GPU key

At the top of the detail page, a yellow banner appears with a long key value (it’ll start with gpu. followed by two long base64 strings). The banner says: “Copy your worker key — it won’t be shown again.”

That’s literal. Copy the whole key now using the inline copy button next to it. Paste it somewhere safe that you can get back to in the next two minutes: a password manager, an encrypted note, or a note on the GPU machine itself.

If you lose the key before connecting the worker, no big deal: click Rotate Key on this same page to generate a fresh one. The previous key stops working immediately.

Step 5: Pick a model

Below the worker connection banner, you’ll see the Configuration section. This is where you tell the worker which AI model to serve.

Three cascading fields:

  • Model family — the architecture (e.g., Qwen, Llama, Mistral). Each family has different strengths.
  • Model — the specific model within that family, often by parameter size (7B, 30B, 70B, etc.). Larger = smarter but heavier.
  • Quantization — the compression level. Q4_K_M is a good default — keeps quality high while shrinking the file. The page shows estimated VRAM usage per quantization.

For your first GPU worker, a safe starting choice is something like:

  • Family: Qwen 2.5
  • Model: 7B Instruct (small, fast, low VRAM)
  • Quantization: Q4_K_M

Once you’ve picked, scroll up to copy the worker connection command (Step 6). The worker reads your model choice from Auxot when it starts.

Tip: The detail page shows estimated VRAM next to each quantization. Match it to your GPU’s VRAM. If you have 24 GB and the model needs 18 GB, you’re fine. If it needs 28 GB, the worker will either crash on startup or fail to load the model. Pick smaller until you’re comfortable, then scale up.

Step 6: Run the worker on your GPU computer

Back on the provider’s detail page, find the Worker Connection section. Its instruction reads: “Run this on the computer with your GPU (Node.js required):” — followed by a command with $AUXOT_GPU_KEY as a placeholder.

Open a terminal on your GPU computer (not on your laptop, not in Auxot — on the actual machine with the GPU). Paste this command, with the real key substituted:

npx --yes @auxot/worker-cli --gpu-key gpu.YOUR_KEY_HERE --router-url wss://your-auxot-host/ws

Two substitutions to make:

  • gpu.YOUR_KEY_HERE: replace with the full key you copied in Step 4 (it starts with gpu., followed by two long strings separated by a dot).
  • wss://your-auxot-host/ws — replace your-auxot-host with your Auxot server’s address. Keep the /ws at the end — that’s the WebSocket path; without it, the worker can’t connect.

If your Auxot is local (no HTTPS yet), use ws:// instead of wss://. Production deployments should always be wss://.
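
The two substitutions are easy to get slightly wrong, so here is a small sketch that derives the router URL from the address you open in the browser and does a basic check on the key. The function names are made up for this example, and the key check only looks for the gpu. prefix; it cannot tell whether the key is valid. It also assumes the WebSocket endpoint lives on the same host as the web interface, which is how the command above is written.

```shell
# Turn the address you open in the browser into the worker's WebSocket URL:
# https:// becomes wss://, http:// becomes ws://, and /ws is appended.
router_url_from_site() {
  site=${1%/}   # drop one trailing slash, if any
  case $site in
    https://*) echo "wss://${site#https://}/ws" ;;
    http://*)  echo "ws://${site#http://}/ws" ;;
    *)         echo "unsupported address: $1" >&2; return 1 ;;
  esac
}

# Succeed if the key at least starts with the "gpu." prefix.
looks_like_gpu_key() {
  case $1 in
    gpu.?*) return 0 ;;
    *)      return 1 ;;
  esac
}

# Usage (key and host are placeholders):
#   AUXOT_GPU_KEY='gpu.YOUR_KEY_HERE'
#   looks_like_gpu_key "$AUXOT_GPU_KEY" &&
#     npx --yes @auxot/worker-cli --gpu-key "$AUXOT_GPU_KEY" \
#       --router-url "$(router_url_from_site https://auxot.yourcompany.com)"
```

Keeping the key in a variable also matches the $AUXOT_GPU_KEY placeholder shown on the provider's detail page.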

Press Enter. The worker starts:

  1. First, it downloads the worker binary and llama.cpp (the underlying inference engine). Takes about a minute on a typical connection.
  2. Then it downloads the model you picked in Step 5. Models range from 4 GB (small) to 50+ GB (large). Watch the progress bar — go grab a coffee for the bigger ones.
  3. Then it loads the model into your GPU’s memory and starts serving inference.

You’ll see log output the whole time. Once you see something like “connected to router” and “server listening,” you’re online.

Step 7: Verify the worker is online in Auxot

Switch back to your browser. On the provider’s detail page, the Worker Connection section now shows a green dot with a status of “Online.” If the dot is still gray ten seconds after the worker log reports a connection, see Troubleshooting.

Open chat with any agent, click the model picker at the top, and you’ll see your new worker’s model in the list. Pick it, send a message, watch your local GPU answer.

Tip: Want the worker to start automatically when the GPU machine reboots? Set it up as a system service (systemd on Linux, launchd on macOS, a Windows service). The exact setup is OS-specific — search “run nodejs as system service [your OS]” for current instructions. Until you do this, the worker only runs while the terminal it started in stays open.


Reference

  • Pages in Auxot: Settings → Providers (GPU type), each provider’s detail page
  • Worker command: npx --yes @auxot/worker-cli --gpu-key <key> --router-url <url>/ws
  • Required: Node.js 20+ on the GPU machine, network path to your Auxot server, GPU key (one-time, from Auxot)
  • Auto-detected: GPU type (CUDA / Vulkan / Metal / CPU fallback)
  • Auto-downloaded: Worker binary, llama.cpp, model file (cached in ~/.auxot/)
  • Health check: WebSocket heartbeat; online if the last heartbeat arrived less than 60 seconds ago
  • Permissions: Org admins create/configure providers; all roles can use them
  • See also: Tutorial 09: Connect a cloud AI model, Tutorial 03: Take Auxot’s pulse