Rate limit AI requests BETA
Rate limit requests through Solo Enterprise for agentgateway.
Global rate limits can be based on requests or large language model (LLM) token usage. You can also apply rate limits at different levels, such as to all routes on agentgateway or particular routes for LLM providers. This approach gives you more flexibility in how you protect your AI environment.
About
Traditional API gateways typically implement rate limiting based on the number of requests per time period, which works well for standard REST APIs where each request has similar computational cost. However, LLM applications present a unique challenge: the computational cost varies dramatically based on the number of tokens processed.
LLM providers charge based on the number of input tokens (user prompts and system prompts) and output tokens (responses from the model), which can make uncontrolled usage very expensive. For example, a simple "Hello" prompt might consume 10 tokens, while a complex analysis request like "Produce a report with charts based on the template but using last quarter's sales data" could consume thousands.
Token-based rate limiting ensures fair resource allocation by accounting for the actual computational load of each request. It prevents users from overwhelming the system with token-heavy requests while allowing reasonable usage of lighter requests. With rate limits in place, your teams can stick to their LLM budgets and make sure their AI usage stays predictably within bounds. As such, token-based rate limiting is a key way to manage costs, ensure service stability, and secure your AI environment.
For more details about how rate limiting works in agentgateway enterprise or request-based rate limiting, see the Rate limiting docs.
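To make the difference between the two counting modes concrete, the following shell sketch simulates a single rate limit window. It is an illustration only, not agentgateway code: the function names are made up for this example, and it assumes a simplified model in which a request's token cost is charged to the window only after the request completes, so a request is admitted whenever the tokens already counted are still below the limit.

```shell
#!/usr/bin/env bash
# Illustrative simulation only; this is not how agentgateway is implemented.
# Each COST argument is the token usage of one request in the same window.

# admitted_by_request_limit LIMIT COST...
# Prints how many requests a request-count limit admits; cost is ignored.
admitted_by_request_limit() {
  local limit="$1"; shift
  local admitted=0 _cost
  for _cost in "$@"; do
    if (( admitted < limit )); then
      admitted=$(( admitted + 1 ))
    fi
  done
  echo "$admitted"
}

# admitted_by_token_limit LIMIT COST...
# Prints how many requests a token limit admits. Assumption: a request is
# admitted while the tokens charged so far are below the limit, and its own
# cost is charged only after it completes.
admitted_by_token_limit() {
  local limit="$1"; shift
  local used=0 admitted=0 cost
  for cost in "$@"; do
    if (( used < limit )); then
      admitted=$(( admitted + 1 ))
      used=$(( used + cost ))
    fi
  done
  echo "$admitted"
}

# One token-heavy request followed by four light ones (sample values).
costs=(256 10 10 10 10)
echo "request limit of 5: $(admitted_by_request_limit 5 "${costs[@]}") admitted"
echo "token limit of 100: $(admitted_by_token_limit 100 "${costs[@]}") admitted"
```

With these sample costs, the request limit admits all five requests regardless of their size, while the token limit admits only the first, token-heavy request and rejects the rest of the window.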
Before you begin
Agentgateway rate limit
Set up a global rate limit for the number of requests to any route through agentgateway.
Create a RateLimitConfig with your rate limit rules. To indicate that the rate limit counts requests and not tokens, include the type: REQUEST field. The following example sets a global limit of 5 requests per minute.

    kubectl apply -f- <<EOF
    apiVersion: ratelimit.solo.io/v1alpha1
    kind: RateLimitConfig
    metadata:
      name: global-rate-limit
      namespace: gloo-system
    spec:
      raw:
        descriptors:
        - key: generic_key
          value: counter
          rateLimit:
            requestsPerUnit: 5
            unit: MINUTE
        rateLimits:
        - actions:
          - genericKey:
              descriptorValue: counter
          type: REQUEST
    EOF

Apply the rate limit by using a GlooTrafficPolicy resource. The following example targets the agentgateway Gateway that you set up before you began.

    kubectl apply -f- <<EOF
    apiVersion: gloo.solo.io/v1alpha1
    kind: GlooTrafficPolicy
    metadata:
      name: global-rate-limit
      namespace: gloo-system
    spec:
      targetRefs:
      - group: gateway.networking.k8s.io
        kind: Gateway
        name: agentgateway
      glooRateLimit:
        global:
          rateLimitConfigRef:
            name: global-rate-limit
    EOF

Send a simple request to a route on agentgateway. Verify that the request succeeds.
Repeat the request. Verify that the sixth request is rate limited and returns a 429 HTTP response code, because only 5 requests per minute are allowed through your agentgateway.
To test the rate limit by running the request multiple times, you can use a loop:
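A loop along the following lines is one way to do that. This is a minimal sketch, not the guide's original command: GATEWAY_URL and the send_requests helper are placeholders introduced for this example, and it assumes that a plain GET request to your route is enough to be counted by the rate limit. It prints only the HTTP status code of each request, whereas the example output that follows shows the verbose response headers of a rate-limited request.

```shell
#!/usr/bin/env bash
# Sketch only. GATEWAY_URL is a placeholder for the address of a route
# on your agentgateway; set it before you run the script.

# send_requests COUNT URL
# Sends COUNT GET requests and prints "request N: <HTTP status>" for each.
send_requests() {
  local count="$1" url="$2" i code
  for (( i = 1; i <= count; i++ )); do
    # -s hides the progress meter, -o /dev/null discards the body,
    # and -w '%{http_code}' prints only the response status code.
    code="$(curl -s -o /dev/null -w '%{http_code}' "$url")"
    echo "request $i: $code"
  done
}

if [ -n "${GATEWAY_URL:-}" ]; then
  # Six requests: with the limit of 5 requests per minute from this guide,
  # the sixth request is the one expected to return 429.
  send_requests 6 "$GATEWAY_URL"
else
  echo "Set GATEWAY_URL to a route on your agentgateway, then rerun." >&2
fi
```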
Example output:
    < HTTP/1.1 429 Too Many Requests
    < x-envoy-ratelimited: true
    < date: Tue, 18 Jun 2024 05:15:13 GMT
    < server: envoy
    < content-length: 0
LLM provider rate limit
If you have routes to multiple LLM providers, you can enforce a rate limit for each provider.
Create a RateLimitConfig with the rate limit rules for the LLM provider. To indicate that the rate limit counts requests and not tokens, include the type: REQUEST field. The following example sets a limit of 2 requests per minute.

    kubectl apply -f- <<EOF
    apiVersion: ratelimit.solo.io/v1alpha1
    kind: RateLimitConfig
    metadata:
      name: openai-rate-limit
      namespace: gloo-system
    spec:
      raw:
        descriptors:
        - key: generic_key
          value: counter
          rateLimit:
            requestsPerUnit: 2
            unit: MINUTE
        rateLimits:
        - actions:
          - genericKey:
              descriptorValue: counter
          type: REQUEST
    EOF

Apply the rate limit by using a GlooTrafficPolicy resource. The following example targets the openai HTTPRoute that you set up before you began.

    kubectl apply -f- <<EOF
    apiVersion: gloo.solo.io/v1alpha1
    kind: GlooTrafficPolicy
    metadata:
      name: openai-rate-limit
      namespace: gloo-system
    spec:
      targetRefs:
      - group: gateway.networking.k8s.io
        kind: HTTPRoute
        name: openai
      glooRateLimit:
        global:
          rateLimitConfigRef:
            name: openai-rate-limit
    EOF

Send a simple request to the OpenAI API. Verify that the request succeeds.
Example output:
    {
      "id": "chatcmpl-9bLT1ofadlXEMpo53LcGjHsv3S5Ry",
      "object": "chat.completion",
      "created": 1718687683,
      "model": "gpt-3.5-turbo-0125",
      "choices": [
        {
          "index": 0,
          "message": {
            "role": "assistant",
            "content": "In the realm of code, a concept so divine,\nRecursion weaves patterns, like nature's design.\nA function that calls itself, with purpose and grace,\nIt solves problems complex, with elegance and pace.\n\nLike a mirror reflecting its own reflection,\nRecursion repeats with boundless affection.\nEach iteration holds a story untold,\nUnraveling mysteries, a journey unfold.\n\nInfinite loops, a dangerous abyss,\nRecursion beckons with a siren's sweet kiss.\nBase case in"
          },
          "logprobs": null,
          "finish_reason": "length"
        }
      ],
      "usage": {
        "prompt_tokens": 39,
        "completion_tokens": 100,
        "total_tokens": 139
      },
      "system_fingerprint": null
    }

Repeat the request two more times. Verify that the third request is rate limited and that you get back a 429 HTTP response code, because only 2 requests per minute are allowed for OpenAI.
Example output:

    < HTTP/1.1 429 Too Many Requests
    < x-envoy-ratelimited: true
    < date: Tue, 18 Jun 2024 05:15:13 GMT
    < server: envoy
    < content-length: 0
Token-based rate limit
Instead of request-based rate limiting, you can apply a rate limit based on the number of tokens used. This approach helps make your costs and AI usage more predictable.
Update your OpenAI RateLimitConfig with your rate limit for tokens. Add a rate limit descriptor and action pair that sets type: TOKEN. The following example adds a per-user limit of 100 tokens per minute, keyed on the X-User-ID request header. With type: TOKEN, the requestsPerUnit value counts tokens instead of requests.

    kubectl apply -f- <<EOF
    apiVersion: ratelimit.solo.io/v1alpha1
    kind: RateLimitConfig
    metadata:
      name: openai-rate-limit
      namespace: gloo-system
    spec:
      raw:
        descriptors:
        - key: X-User-ID
          rateLimit:
            unit: MINUTE
            requestsPerUnit: 100
        rateLimits:
        - actions:
          - requestHeaders:
              descriptorKey: "X-User-ID"
              headerName: "X-User-ID"
          type: TOKEN
    EOF

Check that the rate limit is still applied by the GlooTrafficPolicy that selects your updated RateLimitConfig.
    kubectl get GlooTrafficPolicy openai-rate-limit -n gloo-system -o yaml

Example output:

    status:
      ancestors:
      - ancestorRef:
          group: gateway.networking.k8s.io
          kind: HTTPRoute
          name: openai
          namespace: gloo-system
        conditions:
        - lastTransitionTime: "2025-09-29T16:58:37Z"
          message: Policy accepted
          reason: Valid
          status: "True"
          type: Accepted
        - lastTransitionTime: "2025-09-29T16:58:37Z"
          message: Attached to all targets
          reason: Attached
          status: "True"
          type: Attached
        controllerName: solo.io/agentgateway
      - ancestorRef:
          group: gateway.networking.k8s.io
          kind: HTTPRoute
          name: openai
          namespace: gloo-system
        conditions:
        - lastTransitionTime: "2025-09-29T16:58:37Z"
          message: Policy accepted
          reason: Accepted
          status: "True"
          type: Accepted
        controllerName: solo.io/agentgateway

Send a simple request to the OpenAI API. Include the X-User-ID request header with the value user123. Verify that the request succeeds.

In the output, note that the usage section shows how many total tokens the request uses, such as 256. Because this is the first request, no tokens had been counted against the limit yet, so the request succeeds even though its token count exceeds the limit of 100 tokens per minute.

    {
      "id": "chatcmpl-9bLT1ofadlXEMpo53LcGjHsv3S5Ry",
      "object": "chat.completion",
      "created": 1718687683,
      "model": "gpt-3.5-turbo-0125",
      "choices": [
        {
          "index": 0,
          "message": {
            "role": "assistant",
            "content": "In the realm of code, a concept so divine,\nRecursion weaves patterns, like nature's design.\nA function that calls itself, with purpose and grace,\nIt solves problems complex, with elegance and pace.\n\nLike a mirror reflecting its own reflection,\nRecursion repeats with boundless affection.\nEach iteration holds a story untold,\nUnraveling mysteries, a journey unfold.\n\nInfinite loops, a dangerous abyss,\nRecursion beckons with a siren's sweet kiss.\nBase case in"
          },
          "logprobs": null,
          "finish_reason": "length"
        }
      ],
      "usage": {
        "prompt_tokens": 48,
        "completion_tokens": 208,
        "total_tokens": 256
      },
      "system_fingerprint": null
    }

Repeat the request. Verify that the request is now rate limited and that you get back a 429 HTTP response code.
Example output:

    < HTTP/1.1 429 Too Many Requests
    < x-envoy-ratelimited: true
    < date: Tue, 18 Jun 2024 05:15:13 GMT
    < server: envoy
    < content-length: 0
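The token-based check in this section can be scripted roughly as follows. Treat it as a hedged sketch rather than the guide's own commands: OPENAI_ROUTE_URL and both helper functions are placeholders introduced for this example, the request body assumes the OpenAI chat completions format with a model name chosen for illustration, and it assumes that the gateway supplies the provider credentials so that no Authorization header is needed.

```shell
#!/usr/bin/env bash
# Sketch only. OPENAI_ROUTE_URL is a placeholder for the full URL of your
# openai route on agentgateway; set it before you run the script.

# chat_as_user USER_ID PROMPT
# Sends one chat request with the X-User-ID header and prints the response.
# PROMPT is embedded as-is, so it must not contain double quotes.
chat_as_user() {
  local user_id="$1" prompt="$2"
  curl -s "$OPENAI_ROUTE_URL" \
    -H "content-type: application/json" \
    -H "X-User-ID: $user_id" \
    -d "{\"model\": \"gpt-3.5-turbo\", \"messages\": [{\"role\": \"user\", \"content\": \"$prompt\"}]}"
}

# total_tokens
# Reads a JSON response on stdin and prints the usage.total_tokens number.
total_tokens() {
  grep -o '"total_tokens": *[0-9]*' | grep -o '[0-9]*$'
}

if [ -n "${OPENAI_ROUTE_URL:-}" ]; then
  # The first request is expected to succeed and report its token usage.
  chat_as_user user123 "Write a short poem about recursion." | total_tokens
  # The second request is expected to return 429 once the reported usage
  # is above the limit of 100 tokens per minute.
  curl -s -o /dev/null -w '%{http_code}\n' "$OPENAI_ROUTE_URL" \
    -H "content-type: application/json" \
    -H "X-User-ID: user123" \
    -d '{"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "Hello"}]}'
else
  echo "Set OPENAI_ROUTE_URL to your openai route, then rerun." >&2
fi
```

Because the limit is keyed on the X-User-ID header, repeating the test with a different header value should be counted against a separate budget.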
Cleanup
You can remove the resources that you created in this guide.
kubectl delete RateLimitConfig global-rate-limit -n gloo-system
kubectl delete RateLimitConfig openai-rate-limit -n gloo-system
kubectl delete GlooTrafficPolicy global-rate-limit -n gloo-system
kubectl delete GlooTrafficPolicy openai-rate-limit -n gloo-system