How GPU Workers Work
A GPU worker connects to the router via a persistent WebSocket, registers itself, and waits for inference jobs. When a job arrives, the worker streams tokens back to the router, which relays them to the API caller.
The worker handles:
- Downloading the GGUF model from HuggingFace (cached to ~/.auxot/models/)
- Downloading the correct llama.cpp server binary from GitHub Releases (cached to ~/.auxot/llama-server/)
- Auto-detecting GPU hardware (Apple Metal, NVIDIA CUDA, AMD Vulkan, CPU fallback)
- Spawning and managing the llama.cpp subprocess
- Heartbeating to the router; auto-restarting on crash
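Heartbeating and crash recovery are built in, but if you want an extra supervision layer around the worker process itself, a plain restart loop works (a sketch; the environment variables are the same ones used throughout this page):

```shell
# Optional outer supervision loop (sketch). The worker already restarts its
# llama.cpp subprocess on crash; this loop additionally restarts the worker
# process itself if it ever exits.
while true; do
  AUXOT_GPU_KEY=adm_xxx AUXOT_ROUTER_URL=router:8080 ./auxot-worker
  echo "worker exited with status $?; restarting in 5s" >&2
  sleep 5
done
```

In practice the Daemon Install section below is the more robust option, since systemd/launchd handle restarts, logging, and boot-time startup for you.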
Installation
Download the auxot-worker binary from GitHub Releases:
curl -Lo auxot-worker https://github.com/auxothq/auxot/releases/latest/download/auxot-worker-$(uname -s)-$(uname -m)
chmod +x auxot-worker
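The $(uname ...) substitutions select a per-platform release asset. To preview which asset name the command above will request on a given machine (assuming the release assets follow this OS-architecture naming):

```shell
# Print the release asset name the download command above will request
echo "auxot-worker-$(uname -s)-$(uname -m)"
```

On an Apple Silicon Mac this prints auxot-worker-Darwin-arm64; on most Linux x86_64 servers, auxot-worker-Linux-x86_64.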
Running a Worker
AUXOT_GPU_KEY=adm_xxx AUXOT_ROUTER_URL=router:8080 ./auxot-worker
On first run, the worker downloads the default model and llama.cpp. Subsequent starts use the local cache.
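Assuming the cache locations listed in the overview above, you can confirm what was downloaded (and reclaim disk space by deleting entries) directly:

```shell
# Inspect the worker's local caches (paths from the overview above)
du -sh ~/.auxot/models/* ~/.auxot/llama-server/* 2>/dev/null
```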
Model Selection
The router’s AUXOT_MODEL environment variable sets the default model for connected workers. You can also override it per-worker:
AUXOT_MODEL=Llama-3.3-70B-Instruct AUXOT_GPU_KEY=adm_xxx AUXOT_ROUTER_URL=router:8080 ./auxot-worker
List available models from the embedded registry:
./auxot-worker models list
There are 700+ models available. Quantization is auto-selected based on detected VRAM if AUXOT_QUANTIZATION is not set.
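To pin the quantization rather than rely on VRAM-based auto-selection, set AUXOT_QUANTIZATION alongside the other variables (a sketch, assuming it accepts the quantization names from the guide below):

```shell
# Pin the quantization explicitly instead of letting the worker auto-select
AUXOT_QUANTIZATION=Q5_K_M AUXOT_MODEL=Llama-3.3-70B-Instruct \
  AUXOT_GPU_KEY=adm_xxx AUXOT_ROUTER_URL=router:8080 ./auxot-worker
```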
Quantization Guide
| Quantization | VRAM (30B model) | Quality |
|---|---|---|
| Q4_K_S | ~18 GB | Good (recommended for most use cases) |
| Q5_K_M | ~22 GB | Better quality, moderate size |
| Q8_0 | ~32 GB | Near-lossless |
| F16 | ~60 GB | Full precision |
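The VRAM column follows roughly from bits per weight: weight memory ≈ parameter count × bits per weight ÷ 8, plus KV-cache and runtime overhead. A quick check of the table's figures (the bits-per-weight values are rough averages for each GGUF format, not exact constants):

```shell
# Approximate weight memory for a 30B-parameter model at each quantization
awk 'BEGIN {
  params = 30e9
  n = split("Q4_K_S=4.5 Q5_K_M=5.7 Q8_0=8.5 F16=16", rows, " ")
  for (i = 1; i <= n; i++) {
    split(rows[i], kv, "=")
    printf "%-7s ~%.0f GB weights\n", kv[1], params * kv[2] / 8 / 1e9
  }
}'
```

Weight memory alone lands a little under the table's totals; the remainder is KV-cache and runtime overhead.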
Multi-GPU Tensor Parallelism
AUXOT_GPU_DEVICES=0,1 AUXOT_GPU_KEY=adm_xxx AUXOT_ROUTER_URL=router:8080 ./auxot-worker
The worker splits the model across the specified device indices using llama.cpp’s tensor parallelism.
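AUXOT_GPU_DEVICES takes the same device indices your GPU tooling reports. On NVIDIA systems you can list them with nvidia-smi:

```shell
# List CUDA device indices, names, and memory to pick AUXOT_GPU_DEVICES values
nvidia-smi --query-gpu=index,name,memory.total --format=csv
```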
Daemon Install
Install the worker as a system service (systemd on Linux, launchd on macOS):
./auxot-worker install --name qwen --gpu-key adm_xxx --router-url router:8080
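Once installed, manage the worker with the platform's native service tooling. The exact service name the installer registers is not documented here; the unit name below is a hypothetical guess based on the --name flag, so verify what was actually installed:

```shell
# Linux (systemd) -- unit name is hypothetical; check what the installer created
systemctl list-units | grep -i auxot
systemctl status auxot-worker-qwen    # hypothetical unit name from --name qwen
journalctl -u auxot-worker-qwen -f    # follow worker logs

# macOS (launchd)
launchctl list | grep -i auxot
```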
Air-Gapped Deployment
Pre-stage the model and llama.cpp binary, then point the worker at them directly:
./auxot-worker \
--model-path /data/models/Qwen3.5-35B-A3B-Q4_K_S.gguf \
--llama-server-path /data/bin/llama-server \
--gpu-key adm_xxx \
--router-url router:8080
No HuggingFace or GitHub access is required at runtime.
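Before starting the worker offline, it is worth confirming that the staged artifacts are in place and executable (paths from the command above):

```shell
# Sanity-check the pre-staged artifacts before an offline start
test -f /data/models/Qwen3.5-35B-A3B-Q4_K_S.gguf && echo "model: ok"
test -x /data/bin/llama-server && echo "llama-server: ok"
```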