How GPU Workers Work
A GPU worker connects to the router via a persistent WebSocket, registers itself, and waits for inference jobs. When a job arrives, the worker streams tokens back to the router, which relays them to the API caller.
The worker handles:
- Downloading the GGUF model from HuggingFace (cached to ~/.auxot/models/)
- Downloading the correct llama.cpp server binary from GitHub Releases (cached to ~/.auxot/llama-server/)
- Auto-detecting GPU hardware (Apple Metal, NVIDIA CUDA, AMD Vulkan, CPU fallback)
- Spawning and managing the llama.cpp subprocess
- Heartbeating to the router; auto-restarting on crash
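Heartbeating and crash recovery are built in, but if you want an extra supervision layer around the worker process itself, a plain restart loop works (a sketch; the environment variables are the same ones used throughout this page):

```shell
# Optional outer supervision loop (sketch). The worker already restarts its
# llama.cpp subprocess on crash; this loop additionally restarts the worker
# process itself if it ever exits.
while true; do
  AUXOT_GPU_KEY=adm_xxx AUXOT_ROUTER_URL=router:8080 ./auxot-worker
  echo "worker exited with status $?; restarting in 5s" >&2
  sleep 5
done
```

In practice the Daemon Install section below is the more robust option, since systemd/launchd handle restarts, logging, and boot-time startup for you.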
Installation
Download the auxot-worker binary from GitHub Releases:
curl -Lo auxot-worker https://github.com/auxothq/auxot/releases/latest/download/auxot-worker-$(uname -s)-$(uname -m)
chmod +x auxot-worker
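The $(uname ...) substitutions select a per-platform release asset. To preview which asset name the command above will request on a given machine (assuming the release assets follow this OS-architecture naming):

```shell
# Print the release asset name the download command above will request
echo "auxot-worker-$(uname -s)-$(uname -m)"
```

On an Apple Silicon Mac this prints auxot-worker-Darwin-arm64; on most Linux x86_64 servers, auxot-worker-Linux-x86_64.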
Running a Worker
AUXOT_GPU_KEY=adm_xxx AUXOT_ROUTER_URL=router:8080 ./auxot-worker
On first run, the worker downloads the default model and llama.cpp. Subsequent starts use the local cache.
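Assuming the cache locations listed in the overview above, you can confirm what was downloaded (and reclaim disk space by deleting entries) directly:

```shell
# Inspect the worker's local caches (paths from the overview above)
du -sh ~/.auxot/models/* ~/.auxot/llama-server/* 2>/dev/null
```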
Model Selection
The router’s AUXOT_MODEL environment variable sets the default model for connected workers. You can also override it per-worker:
AUXOT_MODEL=Llama-3.3-70B-Instruct AUXOT_GPU_KEY=adm_xxx AUXOT_ROUTER_URL=router:8080 ./auxot-worker
List available models from the embedded registry:
./auxot-worker models list
There are 700+ models available. Quantization is auto-selected based on detected VRAM if AUXOT_QUANTIZATION is not set.
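To pin the quantization rather than rely on VRAM-based auto-selection, set AUXOT_QUANTIZATION alongside the other variables (a sketch, assuming it accepts the quantization names from the guide below):

```shell
# Pin the quantization explicitly instead of letting the worker auto-select
AUXOT_QUANTIZATION=Q5_K_M AUXOT_MODEL=Llama-3.3-70B-Instruct \
  AUXOT_GPU_KEY=adm_xxx AUXOT_ROUTER_URL=router:8080 ./auxot-worker
```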
Quantization Guide
| Quantization | VRAM (30B model) | Quality |
|---|---|---|
| Q4_K_S | ~18 GB | Good (recommended for most use cases) |
| Q5_K_M | ~22 GB | Better quality, moderate size |
| Q8_0 | ~32 GB | Near-lossless |
| F16 | ~60 GB | Full precision |
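The VRAM column follows roughly from bits per weight: weight memory ≈ parameter count × bits per weight ÷ 8, plus KV-cache and runtime overhead. A quick check of the table's figures (the bits-per-weight values are rough averages for each GGUF format, not exact constants):

```shell
# Approximate weight memory for a 30B-parameter model at each quantization
awk 'BEGIN {
  params = 30e9
  n = split("Q4_K_S=4.5 Q5_K_M=5.7 Q8_0=8.5 F16=16", rows, " ")
  for (i = 1; i <= n; i++) {
    split(rows[i], kv, "=")
    printf "%-7s ~%.0f GB weights\n", kv[1], params * kv[2] / 8 / 1e9
  }
}'
```

Weight memory alone lands a little under the table's totals; the remainder is KV-cache and runtime overhead.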
Multi-GPU Tensor Parallelism
AUXOT_GPU_DEVICES=0,1 AUXOT_GPU_KEY=adm_xxx AUXOT_ROUTER_URL=router:8080 ./auxot-worker
The worker splits the model across the specified device indices using llama.cpp’s tensor parallelism.
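AUXOT_GPU_DEVICES takes the same device indices your GPU tooling reports. On NVIDIA systems you can list them with nvidia-smi:

```shell
# List CUDA device indices, names, and memory to pick AUXOT_GPU_DEVICES values
nvidia-smi --query-gpu=index,name,memory.total --format=csv
```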
Daemon Install
Install the worker as a system service (systemd on Linux, launchd on macOS):
./auxot-worker install --name qwen --gpu-key adm_xxx --router-url router:8080
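Once installed, manage the worker with the platform's native service tooling. The exact service name the installer registers is not documented here; the unit name below is a hypothetical guess based on the --name flag, so verify what was actually installed:

```shell
# Linux (systemd) -- unit name is hypothetical; check what the installer created
systemctl list-units | grep -i auxot
systemctl status auxot-worker-qwen    # hypothetical unit name from --name qwen
journalctl -u auxot-worker-qwen -f    # follow worker logs

# macOS (launchd)
launchctl list | grep -i auxot
```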
Air-Gapped Deployment
Pre-stage the model and llama.cpp binary, then point the worker at them directly:
./auxot-worker \
--model-path /data/models/Qwen3.5-35B-A3B-Q4_K_S.gguf \
--llama-server-path /data/bin/llama-server \
--gpu-key adm_xxx \
--router-url router:8080
No HuggingFace or GitHub access is required at runtime.
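Before starting the worker offline, it is worth confirming that the staged artifacts are in place and executable (paths from the command above):

```shell
# Sanity-check the pre-staged artifacts before an offline start
test -f /data/models/Qwen3.5-35B-A3B-Q4_K_S.gguf && echo "model: ok"
test -x /data/bin/llama-server && echo "llama-server: ok"
```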