What GPU Workers Do

GPU workers run open-weight GGUF models using llama.cpp as the inference backend. They connect outbound to your Auxot router and serve completions at zero marginal cost — you only pay for hardware and electricity. GPU workers are the highest-priority provider type, so when one is online and supports the requested model, it handles the request before CLI or cloud providers are considered.

Setup

1. Create a Worker Key

Generate a worker key through Settings → Providers or ask the admin agent (Create a new GPU worker key).

2. Install and Run the Worker

Auxot Server uses the same open-source auxot-worker binary as the OSS distribution. See GPU Workers (OSS) → for installation, CLI flags, model selection, quantization, multi-GPU setup, and air-gapped deployment.

Point the worker at your server with the key from step 1:

export AUXOT_ROUTER_URL=https://ai.yourcompany.com
export AUXOT_GPU_KEY=<worker-key-from-admin>

auxot-worker --type gpu

On startup, the worker:

  1. Connects to the router and registers itself.
  2. Reports available VRAM, loaded models, and concurrency capacity.
  3. Begins sending heartbeats every 30 seconds.
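The registration report can be pictured as a small structured payload. The sketch below shows one plausible shape; the field names and the 30-second interval constant are illustrative assumptions, not Auxot's actual wire format.

```python
import time

HEARTBEAT_INTERVAL_SECONDS = 30  # interval documented above; name is illustrative

def build_heartbeat(vram_free_bytes, loaded_models, max_concurrency):
    """Illustrative heartbeat payload covering the three things the
    worker reports: VRAM, loaded models, and concurrency capacity.
    Field names are assumptions, not Auxot's real schema."""
    return {
        "ts": int(time.time()),
        "vram_free_bytes": vram_free_bytes,
        "loaded_models": loaded_models,
        "max_concurrency": max_concurrency,
    }

# A worker would send something like this to the router every 30 seconds.
hb = build_heartbeat(24 * 1024**3, ["llama-3.3-70b-q4_k_m"], 4)
```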

Auto-Loading Models

Configure models to load automatically on worker startup:

export AUXOT_GPU_AUTOLOAD_MODELS=llama-3.3-70b-q4_k_m,qwen-2.5-coder-32b-q6_k
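The variable holds a comma-separated list of model names. A worker-side parse would typically trim whitespace and drop empty entries, sketched here in Python as an assumption about the implementation:

```python
import os

def parse_autoload(value):
    # Split on commas, trimming whitespace and ignoring empty entries
    # (so trailing commas or stray spaces in the env var are harmless).
    return [m.strip() for m in value.split(",") if m.strip()]

models = parse_autoload(os.environ.get("AUXOT_GPU_AUTOLOAD_MODELS", ""))
```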

Hardware Requirements

Minimum viable configurations:

| GPU | VRAM | Recommended Models |
| --- | --- | --- |
| NVIDIA RTX 4090 | 24 GB | 7B–14B at q8, 32B at q4 |
| NVIDIA A100 (40GB) | 40 GB | 70B at q4 |
| NVIDIA A100 (80GB) | 80 GB | 70B at q8 |
| NVIDIA H100 | 80 GB | 70B at q8 with high throughput |
| 2× RTX 4090 | 48 GB | 70B at q4 (tensor parallel) |
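The pairings above follow a simple rule of thumb: weight memory is parameter count times bits per weight, plus headroom for the KV cache and runtime buffers. A back-of-envelope estimator, where the ~10% overhead factor and the bits-per-weight figures are rough assumptions rather than llama.cpp guarantees:

```python
def estimate_vram_gb(params_billions, bits_per_weight, overhead=1.10):
    """Rough VRAM estimate: weights plus ~10% for KV cache and buffers.
    q4_k_m averages roughly 4.5 bits/weight; q8_0 roughly 8.5."""
    weights_gb = params_billions * bits_per_weight / 8
    return weights_gb * overhead

# 70B at q4_k_m lands around 43 GB, which is why it needs
# an A100 40GB (tight) or 2x RTX 4090 (48 GB total).
print(round(estimate_vram_gb(70, 4.5), 1))  # → 43.3
```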

AMD ROCm GPUs are supported on Linux. Apple Metal is supported on macOS (M1 Pro and above recommended).

For multi-GPU tensor parallelism and quantization guidance, see GPU Workers (OSS) →.

Monitoring

Worker Status

Go to Settings → Providers to check GPU worker health, last heartbeat time, and current status. Or ask the admin agent:

Show me GPU worker status

Dead Worker Detection

If a GPU worker stops sending heartbeats for longer than AUXOT_WORKER_DEAD_THRESHOLD (default: 90 seconds), the router marks it as dead and stops routing requests to it. When the worker comes back online and resumes heartbeats, it’s automatically re-added to the routing pool.
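The router-side check reduces to comparing the time since the last heartbeat against the threshold. A minimal sketch, with illustrative names rather than Auxot's actual code:

```python
import time

DEAD_THRESHOLD_SECONDS = 90  # AUXOT_WORKER_DEAD_THRESHOLD default

def is_dead(last_heartbeat_ts, now=None, threshold=DEAD_THRESHOLD_SECONDS):
    """True once the worker has been silent longer than the threshold."""
    now = time.time() if now is None else now
    return (now - last_heartbeat_ts) > threshold

# Silent for 120 s -> dead; silent for 60 s -> still healthy.
assert is_dead(1000, now=1120)
assert not is_dead(1000, now=1060)
```

Because the check is purely timestamp-based, a worker that resumes heartbeats immediately passes it again, which is what allows automatic re-addition to the routing pool.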

Metrics

GPU workers expose Prometheus-compatible metrics on port 9090 (configurable via AUXOT_METRICS_PORT):

  • auxot_worker_inference_duration_seconds — histogram of inference latency
  • auxot_worker_tokens_generated_total — counter of tokens produced
  • auxot_worker_vram_used_bytes — current VRAM utilization
  • auxot_worker_model_loaded — gauge indicating which models are loaded
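These are served in the Prometheus plain-text exposition format. A minimal parse of a scraped response might look like the following sketch; the sample values are invented for illustration:

```python
def parse_metrics(text):
    """Parse simple Prometheus text-format samples into {name: value}.
    Skips blank lines and # HELP / # TYPE comments; any labels stay
    part of the key. Only handles single-value sample lines."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.rpartition(" ")
        try:
            samples[name] = float(value)
        except ValueError:
            pass
    return samples

sample = """\
# TYPE auxot_worker_tokens_generated_total counter
auxot_worker_tokens_generated_total 48213
auxot_worker_vram_used_bytes 2.1e+10
"""
m = parse_metrics(sample)
print(m["auxot_worker_tokens_generated_total"])  # → 48213.0
```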