# Configuration

## Environment Variables
| Variable | Required | Default | Description |
|---|---|---|---|
| `GCP_PROJECT` | Yes | — | Google Cloud project ID |
| `API_KEYS` | Yes* | — | Comma-separated API keys for authentication. *Not required if using Secret Manager. |
| `PORT` | No | `8080` | Port the service listens on |
| `RATE_LIMIT_PER_SECOND` | No | `10` | Token bucket refill rate (requests per second) |
| `RATE_LIMIT_BURST` | No | `100` | Maximum burst capacity for the rate limiter |
| `MAX_CONCURRENT_GPU` | No | `2` | Maximum concurrent GPU inference operations |
| `GPU_TIMEOUT` | No | `120` | Seconds to wait for a GPU slot before returning 503 |
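As a sketch of how these settings might be read at startup (defaults taken from the table above; the actual service code may differ):

```python
import os

# Sketch only: read the documented settings, applying the defaults from the table.
GCP_PROJECT = os.environ["GCP_PROJECT"]                               # required
API_KEYS = os.environ.get("API_KEYS", "")                             # required unless Secret Manager is used
PORT = int(os.environ.get("PORT", "8080"))
RATE_LIMIT_PER_SECOND = float(os.environ.get("RATE_LIMIT_PER_SECOND", "10"))
RATE_LIMIT_BURST = int(os.environ.get("RATE_LIMIT_BURST", "100"))
MAX_CONCURRENT_GPU = int(os.environ.get("MAX_CONCURRENT_GPU", "2"))
GPU_TIMEOUT = float(os.environ.get("GPU_TIMEOUT", "120"))
```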
## Authentication

All endpoints except `GET /health` and `GET /metrics` require an API key, supplied via the `X-API-Key` header.
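For example, a client might call the service like this (the endpoint path and payload below are placeholders; only the `X-API-Key` header is prescribed by the service):

```python
import requests

resp = requests.post(
    "https://YOUR-SERVICE-URL/v1/infer",    # placeholder URL and path
    headers={"X-API-Key": "your-api-key"},  # required authentication header
    json={"input": "example"},              # placeholder payload
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```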
### Option A: Environment Variable (recommended for simple deployments)

Set the `API_KEYS` environment variable to one or more comma-separated keys, for example `API_KEYS=key-one,key-two`.
### Option B: Google Secret Manager

Store keys in a secret named `inference-api-keys` in your GCP project, with one key per line. The service automatically reads from Secret Manager when `GCP_PROJECT` is set.

**Fallback behavior:** The service tries Secret Manager first. If the secret is not found or inaccessible, it falls back to the `API_KEYS` environment variable.
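A rough sketch of that lookup order, assuming the `google-cloud-secret-manager` client library (the service's real implementation may differ):

```python
import os
from google.cloud import secretmanager

def load_api_keys(project_id: str) -> set[str]:
    """Try the inference-api-keys secret first, then fall back to the API_KEYS env var."""
    try:
        client = secretmanager.SecretManagerServiceClient()
        name = f"projects/{project_id}/secrets/inference-api-keys/versions/latest"
        payload = client.access_secret_version(name=name).payload.data.decode("utf-8")
        return {line.strip() for line in payload.splitlines() if line.strip()}
    except Exception:
        # Secret missing or inaccessible: fall back to the environment variable.
        return {k.strip() for k in os.environ.get("API_KEYS", "").split(",") if k.strip()}
```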
## Rate Limiting
The service uses a token bucket algorithm for rate limiting, combined with GPU concurrency limiting.
### Token Bucket

- `RATE_LIMIT_PER_SECOND` controls how many tokens are added to the bucket each second.
- `RATE_LIMIT_BURST` controls the maximum number of tokens the bucket can hold.
- Each request consumes one token. When the bucket is empty, requests receive a `429 Too Many Requests` response with a `Retry-After` header.
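A minimal sketch of the token-bucket idea (illustrative only, not the service's actual code):

```python
import time

class TokenBucket:
    """Refills at `rate` tokens per second up to `burst`; each request spends one token."""

    def __init__(self, rate: float, burst: float):
        self.rate = rate          # RATE_LIMIT_PER_SECOND
        self.burst = burst        # RATE_LIMIT_BURST
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False              # caller responds 429 with a Retry-After header
```

With the defaults, the bucket refills at 10 tokens per second and holds at most 100, so an idle client can burst up to 100 requests before being throttled.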
### GPU Concurrency

- `MAX_CONCURRENT_GPU` limits how many inference operations can run on the GPU simultaneously.
- When all GPU slots are occupied and the timeout (`GPU_TIMEOUT`) is reached, requests receive a `503 Service Unavailable` response.
This prevents GPU out-of-memory errors under high load.
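One way to picture the GPU gate is a semaphore acquired with a timeout. The sketch below assumes an asyncio-based service and uses illustrative names:

```python
import asyncio

MAX_CONCURRENT_GPU = 2   # from MAX_CONCURRENT_GPU
GPU_TIMEOUT = 120.0      # from GPU_TIMEOUT

gpu_slots = asyncio.Semaphore(MAX_CONCURRENT_GPU)

async def with_gpu_slot(run_inference):
    """Wait up to GPU_TIMEOUT seconds for a free GPU slot, otherwise signal a 503."""
    try:
        await asyncio.wait_for(gpu_slots.acquire(), timeout=GPU_TIMEOUT)
    except asyncio.TimeoutError:
        return 503, "Service Unavailable: all GPU slots are busy"
    try:
        return 200, await run_inference()   # placeholder for the actual inference call
    finally:
        gpu_slots.release()
```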
## Cloud Run Deployment Flags
Recommended deployment configuration:
| Flag | Value | Notes |
|---|---|---|
| `--gpu` | `1` | One GPU per instance |
| `--gpu-type` | `nvidia-l4` | L4 GPU (required) |
| `--cpu` | `4` | 4 vCPUs |
| `--memory` | `16Gi` | 16 GB RAM |
| `--min-instances` | `0` | Scale to zero when idle |
| `--max-instances` | `1` | Adjust based on expected load |
| `--timeout` | `300` | 5-minute request timeout |
| `--concurrency` | `4` | Max concurrent requests per instance |
| `--execution-environment` | `gen2` | Required for GPU support |
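Put together, a deployment command might look like the following. The service name, image, and region are placeholders, and depending on your `gcloud` version the GPU flags may require the `beta` command group:

```bash
gcloud run deploy inference-service \
  --image=REGION-docker.pkg.dev/PROJECT/REPO/inference:latest \
  --region=us-central1 \
  --gpu=1 \
  --gpu-type=nvidia-l4 \
  --cpu=4 \
  --memory=16Gi \
  --min-instances=0 \
  --max-instances=1 \
  --timeout=300 \
  --concurrency=4 \
  --execution-environment=gen2 \
  --set-env-vars=GCP_PROJECT=PROJECT
```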