Configuration

Environment Variables

| Variable | Required | Default | Description |
|---|---|---|---|
| GCP_PROJECT | Yes | — | Google Cloud project ID |
| API_KEYS | Yes* | — | Comma-separated API keys for authentication. *Not required if using Secret Manager. |
| PORT | No | 8080 | Port the service listens on |
| RATE_LIMIT_PER_SECOND | No | 10 | Token bucket refill rate (requests per second) |
| RATE_LIMIT_BURST | No | 100 | Maximum burst capacity for the rate limiter |
| MAX_CONCURRENT_GPU | No | 2 | Maximum concurrent GPU inference operations |
| GPU_TIMEOUT | No | 120 | Seconds to wait for a GPU slot before returning 503 |

Authentication

All endpoints except GET /health and GET /metrics require an API key via the X-API-Key header.
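For example, assuming a hypothetical /v1/infer endpoint (substitute one of the service's real inference routes) and a deployed service URL:

```bash
curl -X POST "https://SERVICE_URL/v1/infer" \
  -H "X-API-Key: key1" \
  -H "Content-Type: application/json" \
  -d '{"input": "example"}'
```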

Option A: Environment Variable

Set the API_KEYS environment variable with one or more comma-separated keys:

Because gcloud splits the --set-env-vars value on commas, use its alternate-delimiter syntax (see gcloud topic escaping) when a value itself contains commas:

--set-env-vars "^@^API_KEYS=key1,key2,key3"

Option B: Google Secret Manager

Store keys in a secret named inference-api-keys in your GCP project, with one key per line. The service automatically reads from Secret Manager when GCP_PROJECT is set.
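For example, using the gcloud CLI (keys.txt is a placeholder for a local file containing one key per line):

```bash
# Create the secret with one key per line
gcloud secrets create inference-api-keys \
  --replication-policy="automatic" \
  --data-file=keys.txt

# Later, rotate keys by adding a new version
gcloud secrets versions add inference-api-keys --data-file=keys.txt
```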

Fallback behavior: The service tries Secret Manager first. If the secret is not found or inaccessible, it falls back to the API_KEYS environment variable.
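A minimal sketch of that lookup order, assuming the service is written in Python and uses the google-cloud-secret-manager client (the load_api_keys name is illustrative, and reading the latest secret version is an assumption):

```python
import os

from google.api_core import exceptions as gcp_exceptions
from google.cloud import secretmanager


def load_api_keys() -> set[str]:
    """Try Secret Manager first; fall back to the API_KEYS env var."""
    project = os.environ.get("GCP_PROJECT")
    if project:
        try:
            client = secretmanager.SecretManagerServiceClient()
            # Assumes the service reads the latest version of the secret.
            name = f"projects/{project}/secrets/inference-api-keys/versions/latest"
            data = client.access_secret_version(request={"name": name}).payload.data
            return {k.strip() for k in data.decode("utf-8").splitlines() if k.strip()}
        except gcp_exceptions.GoogleAPIError:
            pass  # Secret not found or inaccessible: fall back to the env var.
    return {k.strip() for k in os.environ.get("API_KEYS", "").split(",") if k.strip()}
```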

Rate Limiting

The service uses a token bucket algorithm for rate limiting, combined with GPU concurrency limiting.

Token Bucket

  • RATE_LIMIT_PER_SECOND controls how many tokens are added to the bucket each second.
  • RATE_LIMIT_BURST controls the maximum number of tokens in the bucket.
  • Each request consumes one token. When the bucket is empty, requests receive a 429 Too Many Requests response with a Retry-After header.
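A minimal sketch of the algorithm (single-threaded for clarity; a real implementation would guard the bucket state with a lock):

```python
import time


class TokenBucket:
    """Refills at `rate` tokens per second, up to `burst` capacity."""

    def __init__(self, rate: float, burst: float):
        self.rate = rate              # RATE_LIMIT_PER_SECOND
        self.burst = burst            # RATE_LIMIT_BURST
        self.tokens = float(burst)    # bucket starts full
        self.updated = time.monotonic()

    def allow(self) -> bool:
        """Consume one token if available; a False result maps to HTTP 429."""
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```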

GPU Concurrency

  • MAX_CONCURRENT_GPU limits how many inference operations can run on the GPU simultaneously.
  • When all GPU slots are occupied and the timeout (GPU_TIMEOUT) is reached, requests receive a 503 Service Unavailable response.

This prevents GPU out-of-memory errors under high load.
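One common way to implement such a limit is a counting semaphore with an acquire timeout; a sketch under that assumption (run_inference is a stand-in for the actual model call):

```python
import threading

MAX_CONCURRENT_GPU = 2  # mirrors the MAX_CONCURRENT_GPU env var
GPU_TIMEOUT = 120       # seconds, mirrors GPU_TIMEOUT

gpu_slots = threading.Semaphore(MAX_CONCURRENT_GPU)


def run_inference(request):
    """Stand-in for the actual GPU inference call."""
    return {"result": "ok"}


def infer_with_limit(request):
    # Wait up to GPU_TIMEOUT seconds for one of the GPU slots to free up.
    if not gpu_slots.acquire(timeout=GPU_TIMEOUT):
        raise TimeoutError("no GPU slot free")  # handler maps this to HTTP 503
    try:
        return run_inference(request)
    finally:
        gpu_slots.release()
```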

Cloud Run Deployment Flags

Recommended deployment configuration:

| Flag | Value | Notes |
|---|---|---|
| --gpu | 1 | One GPU per instance |
| --gpu-type | nvidia-l4 | L4 GPU (required) |
| --cpu | 4 | 4 vCPUs |
| --memory | 16Gi | 16 GB RAM |
| --min-instances | 0 | Scale to zero when idle |
| --max-instances | 1 | Adjust based on expected load |
| --timeout | 300 | 5-minute request timeout |
| --concurrency | 4 | Max concurrent requests per instance |
| --execution-environment | gen2 | Required for GPU support |
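Putting the flags together, a deploy command might look like this (SERVICE_NAME, IMAGE_URL, and the region are placeholders; check that your chosen region offers L4 GPUs):

```bash
gcloud run deploy SERVICE_NAME \
  --image IMAGE_URL \
  --region us-central1 \
  --gpu 1 \
  --gpu-type nvidia-l4 \
  --cpu 4 \
  --memory 16Gi \
  --min-instances 0 \
  --max-instances 1 \
  --timeout 300 \
  --concurrency 4 \
  --execution-environment gen2 \
  --set-env-vars "GCP_PROJECT=my-project"
```

This example supplies only GCP_PROJECT and relies on Secret Manager (Option B) for API keys, which avoids the comma-escaping issue noted above.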