What GPU Workers Do
GPU workers run open-weight GGUF models using llama.cpp as the inference backend. They connect outbound to your Auxot router and serve completions at zero marginal cost — you only pay for hardware and electricity. GPU workers are the highest-priority provider type, so when one is online and supports the requested model, it handles the request before CLI or cloud providers are considered.
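The priority rule above can be sketched as follows. This is a minimal illustration of the stated ordering, not the router's actual selection logic; the provider fields (`type`, `online`, `models`) are assumptions for the example.

```python
PRIORITY = ["gpu", "cli", "cloud"]  # GPU workers are considered first

def pick_provider(providers, model):
    """Return the first online provider, in priority order, that supports
    the requested model, or None if nothing can serve it."""
    for ptype in PRIORITY:
        for p in providers:
            if p["type"] == ptype and p["online"] and model in p["models"]:
                return p
    return None
```

With a GPU worker and a cloud provider both online and serving the same model, the GPU worker wins.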
Setup
1. Create a Worker Key
Generate a worker key through Settings → Providers, or ask the admin agent: "Create a new GPU worker key".
2. Install and Run the Worker
Auxot Server uses the same open-source auxot-worker binary. See GPU Workers (OSS) → for installation, CLI flags, model selection, quantization, multi-GPU setup, and air-gapped deployment.
Point the worker at your server with the key from step 1:
```bash
export AUXOT_ROUTER_URL=https://ai.yourcompany.com
export AUXOT_GPU_KEY=<worker-key-from-admin>
auxot-worker --type gpu
```
On startup, the worker:
- Connects to the router and registers itself.
- Reports available VRAM, loaded models, and concurrency capacity.
- Begins sending heartbeats every 30 seconds.
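The registration-and-heartbeat cycle above can be sketched as follows. The payload fields and loop shape are illustrative assumptions, not the worker's actual wire format.

```python
def registration_payload(vram_bytes, models, max_concurrency):
    """Capability report sent at registration (field names are illustrative)."""
    return {
        "type": "gpu",
        "vram_bytes": vram_bytes,
        "loaded_models": models,
        "max_concurrency": max_concurrency,
    }

def heartbeat_loop(send, interval=30, stop_after=None):
    """Call send() every `interval` seconds; `stop_after` bounds iterations
    so the loop can be exercised in tests."""
    import time
    sent = 0
    while stop_after is None or sent < stop_after:
        send()
        sent += 1
        if stop_after is None or sent < stop_after:
            time.sleep(interval)
    return sent
```

In the real worker, `send` would POST the heartbeat to the router over the outbound connection established at startup.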
Auto-Loading Models
Configure models to load automatically on worker startup:
```bash
export AUXOT_GPU_AUTOLOAD_MODELS=llama-3.3-70b-q4_k_m,qwen-2.5-coder-32b-q6_k
```
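Presumably the worker splits this value on commas; a minimal sketch of that parsing (illustrative, not the actual implementation):

```python
import os

def autoload_models(env=os.environ):
    """Split the comma-separated AUXOT_GPU_AUTOLOAD_MODELS value into
    model names, ignoring empty entries and surrounding whitespace."""
    raw = env.get("AUXOT_GPU_AUTOLOAD_MODELS", "")
    return [m.strip() for m in raw.split(",") if m.strip()]
```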
Hardware Requirements
Minimum viable configurations:
| GPU | VRAM | Recommended Models |
|---|---|---|
| NVIDIA RTX 4090 | 24 GB | 7B–14B at q8, 32B at q4 |
| NVIDIA A100 (40GB) | 40 GB | 70B at q4 |
| NVIDIA A100 (80GB) | 80 GB | 70B at q8 |
| NVIDIA H100 | 80 GB | 70B at q8 with high throughput |
| 2× RTX 4090 | 48 GB | 70B at q4 (tensor parallel) |
AMD ROCm GPUs are supported on Linux. Apple Metal is supported on macOS (M1 Pro and above recommended).
For multi-GPU tensor parallelism and quantization guidance, see GPU Workers (OSS) →.
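The pairings in the table follow from simple arithmetic on parameter count and bits per weight. A back-of-envelope sketch, assuming roughly 4.5 bits/weight for a q4 k-quant (an approximation; exact sizes vary by quantization scheme):

```python
def approx_weight_gb(params_billion, bits_per_weight):
    """Back-of-envelope GGUF weight size: parameters x bits per weight / 8.
    Real VRAM usage is higher once the KV cache and activations are added,
    and grows with context length."""
    return params_billion * bits_per_weight / 8

# 70 * 4.5 / 8 = 39.375 GB of weights, which is why the table pairs
# 70B at q4 with 40 GB or more of VRAM.
```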
Monitoring
Worker Status
Go to Settings → Providers to check GPU worker health, last heartbeat time, and current status. Or ask the admin agent:
Show me GPU worker status
Dead Worker Detection
If a GPU worker stops sending heartbeats for longer than AUXOT_WORKER_DEAD_THRESHOLD (default: 90 seconds), the router marks it as dead and stops routing requests to it. When the worker comes back online and resumes heartbeats, it’s automatically re-added to the routing pool.
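A minimal sketch of this liveness rule, assuming the router tracks the timestamp of each worker's last heartbeat (the data structures here are illustrative):

```python
DEAD_THRESHOLD_S = 90  # AUXOT_WORKER_DEAD_THRESHOLD default

def is_dead(last_heartbeat_ts, now, threshold=DEAD_THRESHOLD_S):
    """A worker is dead once its last heartbeat is older than the threshold."""
    return (now - last_heartbeat_ts) > threshold

def routable(workers, now):
    """Keep only workers with a fresh heartbeat. A revived worker's next
    heartbeat updates its timestamp, so it re-enters the pool automatically."""
    return [w for w, ts in workers.items() if not is_dead(ts, now)]
```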
Metrics
GPU workers expose Prometheus-compatible metrics on port 9090 (configurable via AUXOT_METRICS_PORT):
- `auxot_worker_inference_duration_seconds` - histogram of inference latency
- `auxot_worker_tokens_generated_total` - counter of tokens produced
- `auxot_worker_vram_used_bytes` - current VRAM utilization
- `auxot_worker_model_loaded` - gauge indicating which models are loaded
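Prometheus-compatible means the endpoint serves the standard text exposition format, which is easy to inspect by hand. A minimal parse for illustration (the sample values below are made up):

```python
SAMPLE = """\
# HELP auxot_worker_tokens_generated_total counter of tokens produced
# TYPE auxot_worker_tokens_generated_total counter
auxot_worker_tokens_generated_total 18234
auxot_worker_vram_used_bytes 2.1e+10
"""

def parse_metrics(text):
    """Minimal parse of the Prometheus text format: skip comment lines,
    split each sample into metric name and numeric value."""
    out = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        name, value = line.rsplit(" ", 1)
        out[name] = float(value)
    return out
```

With a Prometheus server scraping the worker, p95 inference latency is the usual histogram query, e.g. `histogram_quantile(0.95, rate(auxot_worker_inference_duration_seconds_bucket[5m]))`.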