Connect a GPU worker

Run AI on your own hardware — connect a local GPU to Auxot so your agents think on a machine you control, with no data leaving your building.

Plus three prompts to paste into Admin Agent chat: sensible model sizing for your hardware, diagnosing a stubborn gray dot from worker logs, and a cloud fallback for when somebody powers the box off nightly.

Audience: Admins · Developers
Time: ~15 min (most of it waiting on a download)
Prerequisites: An Auxot account where you're an org admin. A computer with a GPU and Node.js 20+ installed. Network access between that computer and your Auxot server. (Free tier is fine: GPU workers aren't tier-gated.)
You'll end up with: One GPU worker connected to your Auxot account, downloading and serving an AI model on hardware you own. Pick it in Chat and the prompts you send route there.

When a tutorial shows italic text in quotation marks, it usually mirrors a label or helper string inside Auxot. Product copy changes between releases — if something reads differently in your workspace, trust what you see on screen.

Callouts with a Worth knowing gold accent are meant as must-read context before you move on. Blockquotes that open with Tip are lighter, optional depth.

Why this matters

Cloud AI providers are convenient, but every prompt your agents send leaves your building. For some businesses that’s no problem. For others (anyone handling client records, legal documents, financials, health information, or anything else with compliance weight), it’s a non-starter.

A GPU worker is the alternative. You run a small piece of software on a computer with a GPU you own. When you (or automation you wired) send work that Auxot assigns to your GPU-backed model, inference runs on that machine and the tokens stream back: no trip through OpenAI or Anthropic servers for that slice of traffic.

Auxot makes this simpler than it sounds. You don’t write any code. You don’t compile anything. You don’t work out quantization math by hand: Auxot shows recommendations and VRAM estimates next to each option. You install Node.js, run one npx command with a key Auxot gives you, and the worker handles everything else: detecting your GPU, downloading the model, and starting up.

The trade-off: you handle the hardware. Auxot doesn’t manage your driver versions, your power costs, or your fan speed. But for the right use case (sensitive data, predictable cost, and full control), running locally is exactly the point.

Today, you connect one GPU worker. Once it’s online, the model picker lists the hardware you maintain. Nothing runs on it unprompted: work still waits for your message, or for a cron job or webhook you wired.

You generate a GPU key, run npx once on the machine you own, the connection completes, and the model downloads to disk. After that, someone (usually you, flipping the model picker in Chat) has to choose that route deliberately.

Quick start

1. Sign in to Auxot as an org admin.
2. Open Providers: Settings → Providers.
3. Add a GPU provider: click + Add Provider, pick GPU, name it (something like “Office GPU”), and create. Auxot generates a one-time key. Copy it. You’ll only see it once.
4. Pick a model: on the new provider’s detail page, choose a model family, model, and quantization the worker should serve. (Recommendations and VRAM estimates are shown right there.)
5. Run the worker on your GPU computer: open a terminal on the machine with your GPU and run:
```
npx --yes @auxot/worker-cli --gpu-key gpu.YOUR_KEY_HERE --router-url wss://your-auxot-host/ws
```
Replace gpu.YOUR_KEY_HERE with the key you copied in Step 3, and your-auxot-host with your Auxot server’s address.

Done? Once the model finishes downloading (a few minutes for a small model, longer for large ones), the worker shows green in Auxot. Green means routing is allowed, not that anything runs on its own while you sleep. Prove it: Chat → pick your GPU-hosted model → send hello.

The agent can do that?

Talking through VRAM math in plain English beats guessing in a spreadsheet. Paste these into Admin Agent chat between fiddling with provider toggles: it suggests, you still click Commit.

1. Narrow model and quantization choices to your hardware

I just connected a GPU worker. The hardware is [describe — e.g., "RTX 4090 with 24GB VRAM," "Apple Silicon M2 Max with 64GB unified memory," "two A100s, 80GB each"]. Looking at the work my agents do — [describe one or two agents' jobs] — what model and quantization should I run? Recommend two options: a starting choice that's safe and a more ambitious choice once I'm comfortable.

Why it’s non-obvious: Pick the wrong quantization and your evening is gone: either the model fails to load with out-of-memory errors, or it loads but the answers are noticeably worse. Paste hardware and two agent jobs honestly; the reply pairs a safe pick with a more ambitious one based on wording you volunteered. Installing weights still happens in your shell session.

2. Diagnose a worker that won’t connect

When the worker’s terminal output looks healthy but the status dot in Auxot stays gray:

My GPU worker is running on [machine description] but the Worker Connection dot in Auxot stays gray. The worker terminal shows [paste the last 20 lines of output]. What's the most likely cause, and what should I check next?

Why it’s non-obvious: Connection failures come from a small set of causes: a missing /ws on the router URL, a stale or wrong key, a firewall blocking the WebSocket, or a worker process that still appears running but stopped responding. Paste the log tail after reproducing: Admin Agent ranks those causes by likelihood instead of you working through all four manually at 11pm.

3. Draft cloud fallback for nights the GPU sleeps

Third paste, same Chat thread:

I want my agents to keep working even when my GPU is offline (powered down at night, model being switched, or machine restarting). Recommend a fallback configuration — which cloud provider should the GPU fall back to, under what conditions, and walk me through setting it up.

Why it’s non-obvious: A queue without a fallback means stalled answers when the machine sleeps. A fallback routes work to cloud capacity until the GPU is back online. Paste this prompt, then apply the settings yourself (see Connect a cloud AI model); Auxot won’t silently rent cloud capacity you never wired.

Go deeper

What the worker actually does, in plain English

When you run npx --yes @auxot/worker-cli ..., here’s what happens:

Downloads the worker binary for your operating system and CPU architecture (Linux x86, macOS ARM, Windows x86, etc.). Cached after first run.
Downloads llama.cpp, the AI inference engine. Cached after first run.
Connects to your Auxot server over a WebSocket using the key you provided. The server validates the key and tells the worker which model to serve.
Downloads that model from Hugging Face (a public model repository). Cached in ~/.auxot/models/. Re-uses if you already have it.
Starts llama.cpp in server mode, loaded with the model on your GPU.
Reports for duty to your Auxot server. From this point on, when traffic you routed chooses this provider and model pair, Auxot pipes that job through WebSocket ↔ llama.cpp and streams tokens home.

The worker stays connected continuously. It sends a heartbeat every few seconds; if your Auxot server doesn’t see one for 60 seconds, it marks the worker offline. Reconnect and the worker is online again.
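
The offline rule above is simple enough to sketch. This is an illustration only, not Auxot's server code: the function name worker_status and its timestamp arguments are hypothetical, and the 60-second threshold is the one stated above.

```shell
# Hypothetical sketch of the rule "offline after 60 seconds without a heartbeat".
# Arguments are Unix timestamps in seconds; the names are illustrative, not Auxot's.
worker_status() {
  last_seen="$1"
  now="$2"
  timeout=60
  if [ $((now - last_seen)) -lt "$timeout" ]; then
    echo "online"
  else
    echo "offline"
  fi
}

worker_status 1000 1030   # last heartbeat 30 s ago: prints "online"
worker_status 1000 1075   # last heartbeat 75 s ago: prints "offline"
```

A worker that briefly drops off the network stays online as long as its next heartbeat lands inside the window.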

No part of this involves Auxot or its developers seeing your data. The traffic is between your Auxot server (which you run) and your GPU machine (which you run). The model itself is downloaded from Hugging Face, but only the model file: not your prompts, not your responses, not your context files.

Hardware sizing: what model fits on what

Rough VRAM expectations for common configurations (using Q4_K_M quantization, a good balance of quality and size):

8 GB GPU (consumer card, older Apple Silicon): small models only: 7B class. Good for testing and lightweight tasks.
12–16 GB GPU (RTX 4070, RTX 4060 Ti 16GB, M1 Pro 16GB): comfortable with 7B models, can run 13B with smaller context windows.
24 GB GPU (RTX 4090, M2 Max 32GB): comfortable with 30B models, the sweet spot for most use cases.
48–64 GB GPU/system (M3 Max 64GB, A6000): comfortable with 70B models. This is where the model genuinely starts to feel like the major cloud providers.
80+ GB GPU (A100 80GB, multiple A100s): 70B at higher quality settings, or experiment with frontier-class models.

The detail page in Auxot shows a VRAM estimate next to each model and quantization combination. Match the estimate to your hardware. A bit of headroom (10–20%) is good: running right at the limit can cause crashes when context grows during a long conversation.
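
The headroom rule is plain arithmetic, so here is a minimal sketch of it. The function name fits_with_headroom is hypothetical, the inputs are whole gigabytes you read off the detail page and your GPU's spec, and the 20% default is the cautious end of the range above.

```shell
# Illustrative headroom check: does a model's estimated VRAM fit with spare room?
# Usage: fits_with_headroom ESTIMATE_GB VRAM_GB [HEADROOM_PERCENT]
# Whole gigabytes only, because shell arithmetic is integer-only.
fits_with_headroom() {
  estimate_gb="$1"
  vram_gb="$2"
  headroom_pct="${3:-20}"   # default to the cautious end of the 10-20% range
  # Compare estimate * (100 + headroom) with vram * 100 to avoid fractions.
  if [ $((estimate_gb * (100 + headroom_pct))) -le $((vram_gb * 100)) ]; then
    echo "fits"
  else
    echo "too tight"
  fi
}

fits_with_headroom 18 24      # 18 GB estimate on a 24 GB card: prints "fits"
fits_with_headroom 22 24 10   # 22 GB estimate, 10% headroom: prints "too tight"
```

The second case loads on paper (22 GB is less than 24 GB) but leaves too little room for context growth, which is the situation the headroom advice is meant to catch.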

Troubleshooting

The worker says it’s connected but the status dot in Auxot stays gray. The worker is running but the Auxot server isn’t receiving heartbeats. Most common cause: the --router-url is missing the /ws suffix. The full URL must end in /ws (e.g., wss://your-host/ws, not just wss://your-host). Stop the worker (Ctrl-C), fix the URL, re-run.
“Authentication failed” or “key rejected” in the worker output. The key you used isn’t valid. Either you copied it wrong, or someone rotated the key after you copied it. Open the provider’s detail page, click Rotate Key, copy the new key, re-run with that.
The model download takes forever or fails partway through. Hugging Face occasionally rate-limits, especially for very large model files. The download is resumable: if it fails, just re-run the same npx command and it’ll pick up where it left off. If your disk is filling, check ~/.auxot/models/ and clear out old models you’re not using.
“Out of memory” errors when the model loads. The model is too big for your GPU. Pick a smaller quantization (Q4_K_S instead of Q4_K_M, or Q3_K_M) on the provider detail page, or pick a smaller model. The worker will reload with the new choice on its next reconnect.
The worker disconnects and reconnects in a loop. Usually a network problem: packet loss between the GPU machine and your Auxot server, or aggressive firewall rules dropping idle WebSocket connections. Check with your network admin.
The GPU isn’t being used (CPU is hot, GPU is idle). Drivers aren’t installed properly. The worker fell back to CPU mode. Stop the worker, install your GPU’s drivers, re-run. The worker auto-detects the GPU at startup.
Everything looks right but agents aren’t using the GPU. Check Settings → Providers → Auto routing. The default fallback chain prefers GPU first, but if the org default is set to a specific cloud provider, the GPU only gets used when explicitly picked. Set Auto routing to your GPU (or leave it blank to use the default chain).
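
Two of the checks above are mechanical enough to script. The sketch below is a convenience, not part of the worker: check_router_url is a hypothetical helper that only inspects the string you plan to pass to --router-url, and the cache path is the one named in this tutorial.

```shell
# Checks the shape of a --router-url value: ws:// or wss:// scheme, ending in /ws.
# Prints "ok" or a short reason. It only inspects the string; it does not connect.
check_router_url() {
  url="$1"
  case "$url" in
    ws://*|wss://*) ;;
    *) echo "scheme must be ws:// or wss://"; return 1 ;;
  esac
  case "$url" in
    */ws) echo "ok" ;;
    *) echo "missing /ws suffix"; return 1 ;;
  esac
}

check_router_url "wss://your-auxot-host/ws"        # prints "ok"
check_router_url "wss://your-auxot-host" || true   # prints "missing /ws suffix"

# Disk usage of the model cache, if it exists yet.
if [ -d "$HOME/.auxot/models" ]; then
  du -sh "$HOME/.auxot/models"
fi
```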

Variations & edge cases

Air-gapped deployments. If the GPU machine has no internet (compliance environments), provide the model file and llama.cpp binary yourself with --model-path and --llama-server-path flags. Skips the Hugging Face download entirely.
Multiple GPU workers. You can run more than one worker: same model on different machines (load distribution), or different models on different machines (specialized routing). Each is its own provider in Auxot, each with its own key.
Switching models. Stop the worker (Ctrl-C), change the model on the provider detail page, re-run the same npx command. The worker downloads the new model and serves it.
Auto-start at boot. Set the worker up as a system service so it survives reboots. Different per OS: search “run npx command as system service [your OS]” for current instructions.
CPU-only mode. If no GPU is detected, the worker runs on CPU. Slow but functional for testing or light use. Stick to 7B models in CPU mode.
Free tier: GPU workers aren’t tier-gated. Free tier accounts can connect a GPU worker the same way Business and Enterprise can.
Multi-team scoping (Business and Enterprise): Assign the GPU provider to specific teams on its detail page if you want only certain teams to be able to route work to it.

Walkthrough

Step 1: Confirm your GPU computer is ready

Before you start in Auxot, confirm three things on the machine that has the GPU:

a. Node.js 20 or newer is installed

Open a terminal and run:

node --version

You should see something like v20.10.0 or higher. If not, install Node.js from nodejs.org (download the “LTS” (Long-Term Support) version and run the installer). Re-run node --version after install to confirm.
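
If you prefer a yes-or-no answer to reading the version by eye, a small sketch like this works. The helper name node_is_supported is hypothetical; the only requirement it encodes is the Node.js 20+ minimum stated above.

```shell
# Returns success when a version string such as "v20.10.0" is Node.js 20 or newer.
node_is_supported() {
  version="${1#v}"        # drop the leading "v"
  major="${version%%.*}"  # keep everything before the first dot
  case "$major" in
    ''|*[!0-9]*) return 1 ;;   # not a number: treat as unsupported
  esac
  [ "$major" -ge 20 ]
}

if command -v node >/dev/null 2>&1; then
  if node_is_supported "$(node --version)"; then
    echo "Node.js $(node --version) is new enough"
  else
    echo "Node.js $(node --version) is too old; install 20 or newer"
  fi
else
  echo "Node.js not found; install it from nodejs.org"
fi
```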

b. The computer can reach your Auxot server

If your Auxot server is at https://auxot.yourcompany.com, open a browser on the GPU machine and visit it. You should see the Auxot login page. If you don’t, the GPU machine doesn’t have a network path to your server: fix that before proceeding (a firewall rule, a VPN, a DNS issue: your network admin will know).
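
On a headless GPU machine with no browser, the same check can be approximated from the terminal. This is a sketch using standard curl options; check_reachable is a hypothetical helper, and the address in the usage comment is the example one from above.

```shell
# Prints the HTTP status code returned by the server, or "unreachable".
# A 2xx or 3xx code suggests the GPU machine has a network path to Auxot.
check_reachable() {
  url="$1"
  if ! command -v curl >/dev/null 2>&1; then
    echo "curl not installed"
    return 2
  fi
  if code=$(curl --silent --output /dev/null --max-time 10 \
                 --write-out '%{http_code}' "$url"); then
    echo "$code"
  else
    echo "unreachable"
    return 1
  fi
}

# Example (run on the GPU machine):
#   check_reachable https://auxot.yourcompany.com
```

This only confirms the HTTPS path. The worker itself connects over a WebSocket, so a firewall that allows HTTPS but blocks WebSocket upgrades can still cause trouble later.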

c. Your GPU is one Auxot supports

Auxot’s worker auto-detects three kinds of GPU acceleration:

NVIDIA on Linux or Windows: needs CUDA 12.4+ drivers installed. Check with nvidia-smi in your terminal.
AMD on Linux: uses Vulkan. Most recent AMD GPUs (RX 6000-series and newer) work.
Apple Silicon Mac (M1, M2, M3, M4): uses Metal. No driver setup needed.

If none of the above, the worker will still run in CPU-only mode, but it’ll be slow and you’ll be limited to small models. CPU mode is fine for testing, not for production.

Tip: Not sure which GPU you have? For NVIDIA cards on Linux or Windows, run nvidia-smi. For any GPU on Linux, run lspci | grep -i vga. On Mac: Apple menu → About This Mac (Apple Silicon machines list the chip, such as M2 Max, with its memory). If you see at least 8 GB of VRAM, you can run small (7B-class) models. 24 GB and up gives you serious flexibility.
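
The manual checks from the tip can be bundled into one helper for Linux and macOS. This is a sketch for your own inspection, not the detection logic the Auxot worker uses; describe_gpu is a hypothetical name, and it relies only on tools that may or may not be installed on your machine.

```shell
# Prints whatever GPU information the locally available tools can provide.
# This is a manual check, not the detection logic the Auxot worker uses.
describe_gpu() {
  if command -v nvidia-smi >/dev/null 2>&1; then
    # NVIDIA: card name, total VRAM, and driver version
    nvidia-smi --query-gpu=name,memory.total,driver_version --format=csv,noheader \
      || echo "nvidia-smi is installed but could not query the GPU"
  elif [ "$(uname -s)" = "Darwin" ]; then
    # macOS: chip or graphics details
    system_profiler SPDisplaysDataType
  elif command -v lspci >/dev/null 2>&1; then
    # Linux, any vendor
    lspci | grep -iE 'vga|3d|display' || echo "no display adapter listed by lspci"
  else
    echo "no GPU query tool found (nvidia-smi, system_profiler, lspci)"
  fi
}

describe_gpu
```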

Step 2: Sign in to Auxot as an org admin

Open Auxot in your browser and sign in. You need to be an org admin for the next steps; non-admin users can see providers but can’t add or configure them.

Step 3: Open Providers and add a GPU provider

Click Settings in the left menu, then Providers. Click + Add Provider.

In the modal:

Provider type: pick GPU.
Name: what to call this worker. Something descriptive: “Office GPU,” “Server room A100,” or “Mac Studio downstairs.” Names matter when you have several connected.

The modal shows this hint: “A GPU key will be generated — copy it on the next screen to configure your worker.”

Click Create. Auxot drops you on the provider’s detail page.

Step 4: Copy the one-time GPU key

At the top of the detail page, a yellow banner appears with a long key value (it’ll start with gpu. followed by two long base64 strings). The banner says: “Copy your worker key — it won’t be shown again.”

That’s literal. Copy the whole key now using the inline copy button next to it, then paste it somewhere safe (a password manager, an encrypted note, or a note on the GPU machine itself: anywhere you can retrieve it from in the next two minutes).

If you lose the key before connecting the worker, no big deal: click Rotate Key on this same page to generate a fresh one. The previous key stops working immediately.

Step 5: Pick a model

Below the worker connection banner, you’ll see the Configuration section. This is where you tell the worker which AI model to serve.

Three cascading fields:

Model family: the architecture (e.g., Qwen, Llama, Mistral). Each family has different strengths.
Model: the specific model within that family, often by parameter size (7B, 30B, 70B, etc.). Larger = smarter but heavier.
Quantization: the compression level. Q4_K_M is a good default: keeps quality high while shrinking the file. The page shows estimated VRAM usage per quantization.

For your first GPU worker, a safe starting choice is something like:

Family: Qwen 2.5
Model: 7B Instruct (small, fast, low VRAM)
Quantization: Q4_K_M

Once you’ve picked, scroll up to copy the worker connection command (Step 6). The worker reads your model choice from Auxot when it starts.

Tip: The detail page shows estimated VRAM next to each quantization. Match it to your GPU’s VRAM: if you have 24 GB and the model needs 18 GB, you’re fine. If it needs 28 GB, you’ll either crash on startup or the model won’t load. Pick smaller until you’re comfortable, then scale up.

Step 6: Run the worker on your GPU computer

Back on the provider’s detail page, find the Worker Connection section. Its instruction reads: “Run this on the computer with your GPU (Node.js required):” followed by a command with $AUXOT_GPU_KEY as a placeholder.

Open a terminal on your GPU computer (not on your laptop, not in Auxot: on the actual machine with the GPU). Paste this command, with the real key substituted:

npx --yes @auxot/worker-cli --gpu-key gpu.YOUR_KEY_HERE --router-url wss://your-auxot-host/ws

Two substitutions to make:

gpu.YOUR_KEY_HERE: replace with the full key you copied in Step 4 (starts with gpu., has two parts separated by dots).
wss://your-auxot-host/ws: replace your-auxot-host with your Auxot server’s address. Keep the /ws at the end: that’s the WebSocket path; without it, the worker can’t connect.

If your Auxot is local (no HTTPS yet), use ws:// instead of wss://. Production deployments should always be wss://.
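
A partly copied key is an easy mistake to make, so a rough shape check before launching can save a failed start. This is a sketch under stated assumptions: looks_like_gpu_key is a hypothetical helper, the allowed character set (base64 and URL-safe base64) is a guess based on the key description in Step 4, and AUXOT_GPU_KEY is the placeholder name shown on the provider page.

```shell
# Rough shape check for a GPU key: "gpu." followed by two dot-separated parts.
# The allowed character set is an assumption (base64 and URL-safe base64).
looks_like_gpu_key() {
  printf '%s' "$1" | grep -Eq '^gpu\.[A-Za-z0-9+/=_-]+\.[A-Za-z0-9+/=_-]+$'
}

# Keep the key in an environment variable so it is typed only once.
AUXOT_GPU_KEY="${AUXOT_GPU_KEY:-gpu.YOUR_KEY_HERE}"

if looks_like_gpu_key "$AUXOT_GPU_KEY"; then
  echo "Key shape looks right. Start the worker with:"
  echo 'npx --yes @auxot/worker-cli --gpu-key "$AUXOT_GPU_KEY" --router-url wss://your-auxot-host/ws'
else
  echo "Key does not look complete; copy the whole value, including both parts after gpu."
fi
```

The unedited placeholder gpu.YOUR_KEY_HERE fails the check because it has only one part after the prefix. The check says nothing about whether the server will accept the key.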

Press Enter. The worker starts:

First, it downloads the worker binary and llama.cpp (the underlying inference engine). Takes about a minute on a typical connection.
Then it downloads the model you picked in Step 5. Models range from 4 GB (small) to 50+ GB (large). Watch the progress bar: go grab a coffee for the bigger ones.
Then it loads the model into your GPU’s memory and starts serving inference.

You’ll see log output the whole time. Once you see something like “connected to router” and “server listening,” you’re online.

Step 7: Verify the worker is online in Auxot

Switch back to your browser. On the provider’s detail page, the Worker Connection section now shows a green dot with a status of “Online.” If it’s been ten seconds and the dot is still gray, scroll down to Troubleshooting.

Open Chat with any agent whose model picker includes this route: flip the selector, send a prompt, and you may hear the GPU fans spin up as it serves the reply.

Tip: Want the worker to start automatically when the GPU machine reboots? Set it up as a system service (systemd on Linux, launchd on macOS, a Windows service). The exact setup is OS-specific: search “run nodejs as system service [your OS]” for current instructions. Until you do this, the worker only runs while the terminal it started in stays open.
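
For Linux, a systemd unit might look roughly like the one this sketch writes out. Treat every name in it as an assumption to adapt: the npx location (/usr/bin/npx), the service user (auxot), and the environment file (/etc/auxot/worker.env, expected to define AUXOT_GPU_KEY and AUXOT_ROUTER_URL) are all hypothetical. Only the worker command and its two flags come from this tutorial.

```shell
# Writes an example systemd unit to the given path (hypothetical names throughout).
# Assumptions: npx lives at /usr/bin/npx, a user named "auxot" exists, and
# /etc/auxot/worker.env defines AUXOT_GPU_KEY and AUXOT_ROUTER_URL.
write_worker_unit() {
  cat > "$1" <<'EOF'
[Unit]
Description=Auxot GPU worker
Wants=network-online.target
After=network-online.target

[Service]
User=auxot
EnvironmentFile=/etc/auxot/worker.env
ExecStart=/usr/bin/npx --yes @auxot/worker-cli --gpu-key ${AUXOT_GPU_KEY} --router-url ${AUXOT_ROUTER_URL}
Restart=on-failure
RestartSec=10

[Install]
WantedBy=multi-user.target
EOF
}

# Write to a scratch location for review before installing anything.
write_worker_unit "${TMPDIR:-/tmp}/auxot-worker.service"
```

After reviewing the file, copying it to /etc/systemd/system/ and running sudo systemctl daemon-reload followed by sudo systemctl enable --now auxot-worker.service would start it at boot. Keeping the key in a root-readable environment file, rather than in the unit itself, limits who can read it.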

What’s next

→ Run the open-source inference router. Use the Apache 2-licensed auxot-router and auxot-worker when you need inference routing without Auxot Server’s agent stack (these are different binaries from this tutorial’s npx @auxot/worker-cli).
→ Connect a cloud AI model. Set up cloud as a fallback for when the GPU is offline. Many teams run both.
→ Take Auxot’s pulse in 10 seconds. System Health is where you’ll watch the GPU worker’s load and uptime once it’s running.
→ Create an agent from scratch. Pin specific agents to your GPU model on their detail page so the work you most want kept private always routes locally.

Reference

Pages in Auxot: Settings → Providers (GPU type), each provider’s detail page
Worker command: npx --yes @auxot/worker-cli --gpu-key <key> --router-url <url>/ws
Required: Node.js 20+ on the GPU machine, network path to your Auxot server, GPU key (one-time, from Auxot)
Auto-detected: GPU type (CUDA / Vulkan / Metal / CPU fallback)
Auto-downloaded: Worker binary, llama.cpp, model file (cached in ~/.auxot/)
Health check: WebSocket heartbeat; online if last_seen < 60 seconds
Permissions: Org admins create/configure providers; all roles can use them
See also: Finish first-run onboarding (wizard demo key vs production worker), Run the open-source inference router, Connect a cloud AI model, Take Auxot’s pulse in 10 seconds