How GPU Workers Work

A GPU worker connects to the router over a persistent WebSocket, registers itself, and waits for inference jobs. When a job arrives, the worker runs it through its local llama.cpp server and streams the generated tokens back to the router, which relays them to the API caller.

The worker handles:

  • Downloading the GGUF model from HuggingFace (cached to ~/.auxot/models/)
  • Downloading the correct llama.cpp server binary from GitHub Releases (cached to ~/.auxot/llama-server/)
  • Auto-detecting GPU hardware (Apple Metal, NVIDIA CUDA, AMD Vulkan, CPU fallback)
  • Spawning and managing the llama.cpp subprocess
  • Heartbeating to the router; auto-restarting on crash
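
The cache locations above can be inspected directly. A minimal sketch (the `cache_status` helper is hypothetical; the paths are the ones the worker uses):

```shell
#!/bin/sh
# Show what the worker has cached locally:
#   ~/.auxot/models/        : GGUF model files from HuggingFace
#   ~/.auxot/llama-server/  : llama.cpp server binaries from GitHub Releases
cache_status() {
  for dir in "$HOME/.auxot/models" "$HOME/.auxot/llama-server"; do
    if [ -d "$dir" ]; then
      echo "$dir:"
      ls -lh "$dir"
    else
      echo "$dir: empty (will be populated on first run)"
    fi
  done
}

cache_status
```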

Installation

Download the auxot-worker binary from GitHub Releases:

curl -Lo auxot-worker https://github.com/auxothq/auxot/releases/latest/download/auxot-worker-$(uname -s)-$(uname -m)
chmod +x auxot-worker
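
The asset name is derived from the host platform. As a sketch, this is the same computation the curl command above performs (asset naming taken directly from that URL):

```shell
#!/bin/sh
# Compute the release asset URL the install command downloads,
# e.g. auxot-worker-Linux-x86_64 or auxot-worker-Darwin-arm64.
asset="auxot-worker-$(uname -s)-$(uname -m)"
url="https://github.com/auxothq/auxot/releases/latest/download/$asset"
echo "$url"
```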

Running a Worker

AUXOT_GPU_KEY=adm_xxx AUXOT_ROUTER_URL=router:8080 ./auxot-worker

On first run, the worker downloads the default model and llama.cpp. Subsequent starts use the local cache.
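
If you run the worker in the foreground rather than as a service, a simple supervision loop keeps it alive across crashes. A minimal sketch (the restart policy and delay are illustrative, not part of the worker itself):

```shell
#!/bin/sh
# Restart a command whenever it exits nonzero, with a short backoff.
run_with_restart() {
  until "$@"; do
    echo "worker exited with status $?; restarting in 5s" >&2
    sleep 5
  done
}
```

Usage: export AUXOT_GPU_KEY and AUXOT_ROUTER_URL first, then run `run_with_restart ./auxot-worker`.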


Model Selection

The router’s AUXOT_MODEL environment variable sets the default model for connected workers. You can also override it per-worker:

AUXOT_MODEL=Llama-3.3-70B-Instruct AUXOT_GPU_KEY=adm_xxx AUXOT_ROUTER_URL=router:8080 ./auxot-worker

List available models from the embedded registry:

./auxot-worker models list

There are 700+ models available. Quantization is auto-selected based on detected VRAM if AUXOT_QUANTIZATION is not set.


Quantization Guide

Quantization | VRAM (30B model) | Quality
Q4_K_S       | ~18 GB           | Good — recommended for most use cases
Q5_K_M       | ~22 GB           | Better quality, moderate size
Q8_0         | ~32 GB           | Near-lossless
F16          | ~60 GB           | Full precision
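
As a rough illustration of how VRAM-based auto-selection could map onto this table for a ~30B model (thresholds taken from the table; the worker's actual selection logic may differ):

```shell
#!/bin/sh
# Pick the highest-quality quantization from the table that fits in
# the given VRAM (in GB), for a ~30B model.
pick_quant() {
  vram_gb=$1
  if   [ "$vram_gb" -ge 60 ]; then echo "F16"
  elif [ "$vram_gb" -ge 32 ]; then echo "Q8_0"
  elif [ "$vram_gb" -ge 22 ]; then echo "Q5_K_M"
  elif [ "$vram_gb" -ge 18 ]; then echo "Q4_K_S"
  else echo "no quantization fits in ${vram_gb} GB" >&2; return 1
  fi
}

pick_quant 24   # → Q5_K_M
```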

Multi-GPU Tensor Parallelism

AUXOT_GPU_DEVICES=0,1 AUXOT_GPU_KEY=adm_xxx AUXOT_ROUTER_URL=router:8080 ./auxot-worker

The worker splits the model across the specified device indices using llama.cpp’s tensor parallelism.
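
AUXOT_GPU_DEVICES expects a comma-separated list of device indices. A small validation sketch (the helper is hypothetical; the format is from the example above):

```shell
#!/bin/sh
# Check that a device list looks like "0" or "0,1,3":
# comma-separated non-negative integers, no spaces, no trailing comma.
valid_device_list() {
  echo "$1" | grep -Eq '^[0-9]+(,[0-9]+)*$'
}
```

Usage: `valid_device_list "$AUXOT_GPU_DEVICES" || echo "bad device list" >&2`.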


Daemon Install

Install the worker as a system service (systemd on Linux, launchd on macOS):

./auxot-worker install --name qwen --gpu-key adm_xxx --router-url router:8080
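
The install subcommand generates the service definition for you. For reference, a hand-written systemd unit would look roughly like this (unit name, paths, and restart policy are illustrative, not necessarily what `install` emits):

```ini
[Unit]
Description=Auxot GPU worker (qwen)
After=network-online.target
Wants=network-online.target

[Service]
Environment=AUXOT_GPU_KEY=adm_xxx
Environment=AUXOT_ROUTER_URL=router:8080
ExecStart=/usr/local/bin/auxot-worker
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
```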

Air-Gapped Deployment

Pre-stage the model and llama.cpp binary, then point the worker at them directly:

./auxot-worker \
  --model-path /data/models/Qwen3.5-35B-A3B-Q4_K_S.gguf \
  --llama-server-path /data/bin/llama-server \
  --gpu-key adm_xxx \
  --router-url router:8080

No HuggingFace or GitHub access is required at runtime.
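
Before launching offline, it can be worth verifying that the pre-staged files are present and usable. A minimal preflight sketch (the helper is hypothetical; the paths match the example above):

```shell
#!/bin/sh
# Verify pre-staged artifacts before starting the worker air-gapped.
preflight() {
  model=$1
  server=$2
  [ -f "$model" ]  || { echo "missing model: $model" >&2; return 1; }
  [ -x "$server" ] || { echo "llama-server not executable: $server" >&2; return 1; }
  echo "ok"
}
```

Usage: `preflight /data/models/Qwen3.5-35B-A3B-Q4_K_S.gguf /data/bin/llama-server` before invoking the worker with `--model-path` and `--llama-server-path`.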