What GPU Workers Do

GPU workers run open-weight GGUF models using llama.cpp as the inference backend. They connect outbound to your Auxot router and serve completions at zero marginal cost — you only pay for hardware and electricity. GPU workers are the highest-priority provider type, so when one is online and supports the requested model, it handles the request before CLI or cloud providers are considered.

Setup

1. Create a Worker Key

Generate a worker key through Settings → Providers or ask the admin agent (Create a new GPU worker key).

2. Install and Run the Worker

Auxot Server uses the same open-source auxot-worker binary as the OSS distribution. See GPU Workers (OSS) → for installation, CLI flags, model selection, quantization, multi-GPU setup, and air-gapped deployment.

Point the worker at your server with the key from step 1:

export AUXOT_ROUTER_URL=https://ai.yourcompany.com
export AUXOT_GPU_KEY=<worker-key-from-admin>

auxot-worker --type gpu

On startup, the worker:

  1. Connects to the router and registers itself.
  2. Reports available VRAM, loaded models, and concurrency capacity.
  3. Begins sending heartbeats every 30 seconds.
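The registration report can be pictured as a small structured payload. The sketch below shows one plausible shape; the field names and the 30-second interval constant are illustrative assumptions, not Auxot's actual wire format.

```python
import time

HEARTBEAT_INTERVAL_SECONDS = 30  # interval documented above; name is illustrative

def build_heartbeat(vram_free_bytes, loaded_models, max_concurrency):
    """Illustrative heartbeat payload covering the three things the
    worker reports: VRAM, loaded models, and concurrency capacity.
    Field names are assumptions, not Auxot's real schema."""
    return {
        "ts": int(time.time()),
        "vram_free_bytes": vram_free_bytes,
        "loaded_models": loaded_models,
        "max_concurrency": max_concurrency,
    }

# A worker would send something like this to the router every 30 seconds.
hb = build_heartbeat(24 * 1024**3, ["llama-3.3-70b-q4_k_m"], 4)
```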

Auto-Loading Models

Configure models to load automatically on worker startup:

export AUXOT_GPU_AUTOLOAD_MODELS=llama-3.3-70b-q4_k_m,qwen-2.5-coder-32b-q6_k
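The variable holds a comma-separated list of model names. A worker-side parse would typically trim whitespace and drop empty entries, sketched here in Python as an assumption about the implementation:

```python
import os

def parse_autoload(value):
    # Split on commas, trimming whitespace and ignoring empty entries
    # (so trailing commas or stray spaces in the env var are harmless).
    return [m.strip() for m in value.split(",") if m.strip()]

models = parse_autoload(os.environ.get("AUXOT_GPU_AUTOLOAD_MODELS", ""))
```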

Hardware Requirements

Minimum viable configurations:

| GPU | VRAM | Recommended Models |
| --- | --- | --- |
| NVIDIA RTX 4090 | 24 GB | 7B–14B at q8, 32B at q4 |
| NVIDIA A100 (40GB) | 40 GB | 70B at q4 |
| NVIDIA A100 (80GB) | 80 GB | 70B at q8 |
| NVIDIA H100 | 80 GB | 70B at q8 with high throughput |
| 2× RTX 4090 | 48 GB | 70B at q4 (tensor parallel) |
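The pairings above follow a simple rule of thumb: weight memory is parameter count times bits per weight, plus headroom for the KV cache and runtime buffers. A back-of-envelope estimator, where the ~10% overhead factor and the bits-per-weight figures are rough assumptions rather than llama.cpp guarantees:

```python
def estimate_vram_gb(params_billions, bits_per_weight, overhead=1.10):
    """Rough VRAM estimate: weights plus ~10% for KV cache and buffers.
    q4_k_m averages roughly 4.5 bits/weight; q8_0 roughly 8.5."""
    weights_gb = params_billions * bits_per_weight / 8
    return weights_gb * overhead

# 70B at q4_k_m lands around 43 GB, which is why it needs
# an A100 40GB (tight) or 2x RTX 4090 (48 GB total).
print(round(estimate_vram_gb(70, 4.5), 1))  # → 43.3
```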

AMD ROCm GPUs are supported on Linux. Apple Metal is supported on macOS (M1 Pro and above recommended).

For multi-GPU tensor parallelism and quantization guidance, see GPU Workers (OSS) →.

Monitoring

Worker Status

Go to Settings → Providers to check GPU worker health, last heartbeat time, and current status. Or ask the admin agent:

Show me GPU worker status

Dead Worker Detection

If a GPU worker stops sending heartbeats for longer than AUXOT_WORKER_DEAD_THRESHOLD (default: 90 seconds), the router marks it as dead and stops routing requests to it. When the worker comes back online and resumes heartbeats, it’s automatically re-added to the routing pool.
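The router-side check reduces to comparing the time since the last heartbeat against the threshold. A minimal sketch, with illustrative names rather than Auxot's actual code:

```python
import time

DEAD_THRESHOLD_SECONDS = 90  # AUXOT_WORKER_DEAD_THRESHOLD default

def is_dead(last_heartbeat_ts, now=None, threshold=DEAD_THRESHOLD_SECONDS):
    """True once the worker has been silent longer than the threshold."""
    now = time.time() if now is None else now
    return (now - last_heartbeat_ts) > threshold

# Silent for 120 s -> dead; silent for 60 s -> still healthy.
assert is_dead(1000, now=1120)
assert not is_dead(1000, now=1060)
```

Because the check is purely timestamp-based, a worker that resumes heartbeats immediately passes it again, which is what allows automatic re-addition to the routing pool.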

Metrics

GPU workers expose Prometheus-compatible metrics on port 9090 (configurable via AUXOT_METRICS_PORT):

  • auxot_worker_inference_duration_seconds — histogram of inference latency
  • auxot_worker_tokens_generated_total — counter of tokens produced
  • auxot_worker_vram_used_bytes — current VRAM utilization
  • auxot_worker_model_loaded — gauge indicating which models are loaded
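These are served in the Prometheus plain-text exposition format. A minimal parse of a scraped response might look like the following sketch; the sample values are invented for illustration:

```python
def parse_metrics(text):
    """Parse simple Prometheus text-format samples into {name: value}.
    Skips blank lines and # HELP / # TYPE comments; any labels stay
    part of the key. Only handles single-value sample lines."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.rpartition(" ")
        try:
            samples[name] = float(value)
        except ValueError:
            pass
    return samples

sample = """\
# TYPE auxot_worker_tokens_generated_total counter
auxot_worker_tokens_generated_total 48213
auxot_worker_vram_used_bytes 2.1e+10
"""
m = parse_metrics(sample)
print(m["auxot_worker_tokens_generated_total"])  # → 48213.0
```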