# Configuration

## Environment Variables
| Variable | Required | Default | Description |
|---|---|---|---|
| `GCP_PROJECT` | Yes | — | Google Cloud project ID |
| `API_KEYS` | Yes* | — | Comma-separated API keys for authentication. *Not required if using Secret Manager. |
| `PORT` | No | `8080` | Port the service listens on |
| `RATE_LIMIT_PER_SECOND` | No | `10` | Token bucket refill rate (requests per second) |
| `RATE_LIMIT_BURST` | No | `100` | Maximum burst capacity for the rate limiter |
| `MAX_CONCURRENT_GPU` | No | `2` | Maximum concurrent GPU inference operations |
| `GPU_TIMEOUT` | No | `120` | Seconds to wait for a GPU slot before returning 503 |
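As a sketch of how these settings might be read at startup (defaults taken from the table above; the actual service code may differ):

```python
import os

# Sketch only: read the documented settings, applying the defaults from the table.
GCP_PROJECT = os.environ["GCP_PROJECT"]                               # required
API_KEYS = os.environ.get("API_KEYS", "")                             # required unless Secret Manager is used
PORT = int(os.environ.get("PORT", "8080"))
RATE_LIMIT_PER_SECOND = float(os.environ.get("RATE_LIMIT_PER_SECOND", "10"))
RATE_LIMIT_BURST = int(os.environ.get("RATE_LIMIT_BURST", "100"))
MAX_CONCURRENT_GPU = int(os.environ.get("MAX_CONCURRENT_GPU", "2"))
GPU_TIMEOUT = float(os.environ.get("GPU_TIMEOUT", "120"))
```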
## Authentication

All endpoints except `GET /health` and `GET /metrics` require an API key, supplied via the `X-API-Key` header.
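For example, a client might call the service like this (the endpoint path and payload below are placeholders; only the `X-API-Key` header is prescribed by the service):

```python
import requests

resp = requests.post(
    "https://YOUR-SERVICE-URL/v1/infer",    # placeholder URL and path
    headers={"X-API-Key": "your-api-key"},  # required authentication header
    json={"input": "example"},              # placeholder payload
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```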
### Option A: Environment Variable (recommended for simple deployments)

Set the `API_KEYS` environment variable to one or more comma-separated keys, for example `API_KEYS=key-one,key-two`.
### Option B: Google Secret Manager

Store keys in a secret named `inference-api-keys` in your GCP project, with one key per line. The service automatically reads from Secret Manager when `GCP_PROJECT` is set.

**Fallback behavior:** The service tries Secret Manager first. If the secret is not found or inaccessible, it falls back to the `API_KEYS` environment variable.
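A rough sketch of that lookup order, assuming the `google-cloud-secret-manager` client library (the service's real implementation may differ):

```python
import os
from google.cloud import secretmanager

def load_api_keys(project_id: str) -> set[str]:
    """Try the inference-api-keys secret first, then fall back to the API_KEYS env var."""
    try:
        client = secretmanager.SecretManagerServiceClient()
        name = f"projects/{project_id}/secrets/inference-api-keys/versions/latest"
        payload = client.access_secret_version(name=name).payload.data.decode("utf-8")
        return {line.strip() for line in payload.splitlines() if line.strip()}
    except Exception:
        # Secret missing or inaccessible: fall back to the environment variable.
        return {k.strip() for k in os.environ.get("API_KEYS", "").split(",") if k.strip()}
```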
## Rate Limiting
The service uses a token bucket algorithm for rate limiting, combined with GPU concurrency limiting.
### Token Bucket

- `RATE_LIMIT_PER_SECOND` controls how many tokens are added to the bucket each second.
- `RATE_LIMIT_BURST` controls the maximum number of tokens the bucket can hold.
- Each request consumes one token. When the bucket is empty, requests receive a `429 Too Many Requests` response with a `Retry-After` header.
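A minimal sketch of the token-bucket idea (illustrative only, not the service's actual code):

```python
import time

class TokenBucket:
    """Refills at `rate` tokens per second up to `burst`; each request spends one token."""

    def __init__(self, rate: float, burst: float):
        self.rate = rate          # RATE_LIMIT_PER_SECOND
        self.burst = burst        # RATE_LIMIT_BURST
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False              # caller responds 429 with a Retry-After header
```

With the defaults, the bucket refills at 10 tokens per second and holds at most 100, so an idle client can burst up to 100 requests before being throttled.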
### GPU Concurrency

- `MAX_CONCURRENT_GPU` limits how many inference operations can run on the GPU simultaneously.
- When all GPU slots are occupied and the timeout (`GPU_TIMEOUT`) is reached, requests receive a `503 Service Unavailable` response.
This prevents GPU out-of-memory errors under high load.
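One way to picture the GPU gate is a semaphore acquired with a timeout. The sketch below assumes an asyncio-based service and uses illustrative names:

```python
import asyncio

MAX_CONCURRENT_GPU = 2   # from MAX_CONCURRENT_GPU
GPU_TIMEOUT = 120.0      # from GPU_TIMEOUT

gpu_slots = asyncio.Semaphore(MAX_CONCURRENT_GPU)

async def with_gpu_slot(run_inference):
    """Wait up to GPU_TIMEOUT seconds for a free GPU slot, otherwise signal a 503."""
    try:
        await asyncio.wait_for(gpu_slots.acquire(), timeout=GPU_TIMEOUT)
    except asyncio.TimeoutError:
        return 503, "Service Unavailable: all GPU slots are busy"
    try:
        return 200, await run_inference()   # placeholder for the actual inference call
    finally:
        gpu_slots.release()
```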
## Cloud Run Deployment Flags
Recommended deployment configuration:
| Flag | Value | Notes |
|---|---|---|
| `--gpu` | `1` | One GPU per instance |
| `--gpu-type` | `nvidia-l4` | L4 GPU (required) |
| `--cpu` | `4` | 4 vCPUs |
| `--memory` | `16Gi` | 16 GB RAM |
| `--min-instances` | `0` | Scale to zero when idle |
| `--max-instances` | `1` | Adjust based on expected load |
| `--timeout` | `300` | 5-minute request timeout |
| `--concurrency` | `4` | Max concurrent requests per instance |
| `--execution-environment` | `gen2` | Required for GPU support |
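Put together, a deployment command might look like the following. The service name, image, and region are placeholders, and depending on your `gcloud` version the GPU flags may require the `beta` command group:

```bash
gcloud run deploy inference-service \
  --image=REGION-docker.pkg.dev/PROJECT/REPO/inference:latest \
  --region=us-central1 \
  --gpu=1 \
  --gpu-type=nvidia-l4 \
  --cpu=4 \
  --memory=16Gi \
  --min-instances=0 \
  --max-instances=1 \
  --timeout=300 \
  --concurrency=4 \
  --execution-environment=gen2 \
  --set-env-vars=GCP_PROJECT=PROJECT
```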