Global rate limits can be based on requests or large language model (LLM) token usage. You can also apply rate limits at different levels, such as to all routes on agentgateway or particular routes for LLM providers. This approach gives you more flexibility in how you protect your AI environment.

About

Traditional API gateways typically implement rate limiting based on the number of requests per time period, which works well for standard REST APIs where each request has similar computational cost. However, LLM applications present a unique challenge: the computational cost varies dramatically based on the number of tokens processed.

LLM providers charge based on the number of input tokens (user prompts and system prompts) and output tokens (responses from the model), which can make uncontrolled usage very expensive. For example, a simple "Hello" prompt might consume 10 tokens, while a complex analysis request like "Produce a report with charts based on the template but using last quarter's sales data" could consume thousands.

Token-based rate limiting ensures fair resource allocation by accounting for the actual computational load of each request. It prevents users from overwhelming the system with token-heavy requests while allowing reasonable usage of lighter requests. With rate limits in place, your teams can stick to their LLM budgets and make sure their AI usage stays predictably within bounds. As such, token-based rate limiting is a key way to manage costs, ensure service stability, and secure your AI environment.

For more details about how rate limiting works in agentgateway enterprise or request-based rate limiting, see the Rate limiting docs.

Before you begin

  1. Set up an agentgateway proxy.
  2. Set up access to the OpenAI LLM provider.

Agentgateway rate limit

Set up a global rate limit for the number of requests to any route through agentgateway.

  1. Create a RateLimitConfig with your rate limit rules. To indicate that the rate limit counts requests and not tokens, include the type: REQUEST field. The following example sets a global limit of 5 requests per minute.

    kubectl apply -f- <<EOF
    apiVersion: ratelimit.solo.io/v1alpha1
    kind: RateLimitConfig
    metadata:
      name: global-rate-limit
      namespace: gloo-system
    spec:
      raw:
        descriptors:
        - key: generic_key
          value: counter
          rateLimit:
            requestsPerUnit: 5
            unit: MINUTE
        rateLimits:
        - actions:
          - genericKey:
              descriptorValue: counter
          type: REQUEST
    EOF
      
  2. Apply the rate limit by using a GlooTrafficPolicy resource. The following example targets the agentgateway Gateway that you set up before you began, so that the limit applies to all routes on that gateway.

    kubectl apply -f- <<EOF
    apiVersion: gloo.solo.io/v1alpha1
    kind: GlooTrafficPolicy
    metadata:
      name: global-rate-limit
      namespace: gloo-system
    spec:
      targetRefs:
        - group: gateway.networking.k8s.io
          kind: Gateway
          name: agentgateway
      glooRateLimit:
        global:
          rateLimitConfigRef:
            name: global-rate-limit
    EOF
      
  3. Send a simple request to a route on agentgateway. Verify that the request succeeds.

  4. Repeat the request five more times. Verify that the sixth request is rate limited and that you get back a 429 HTTP response code, because only 5 requests per minute are allowed through your agentgateway.

    To test the rate limit by running the request multiple times, you can use a loop:
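
A minimal sketch of such a loop follows. The GATEWAY_URL default, the request body, and the send_requests helper name are assumptions for illustration and do not come from this guide; replace them with the address and route of your own agentgateway setup.

```shell
# GATEWAY_URL is a hypothetical placeholder, not a value from this guide.
GATEWAY_URL="${GATEWAY_URL:-http://localhost:8080/openai}"

# send_requests COUNT URL
# Sends the same chat request COUNT times and prints the HTTP status of each try.
send_requests() {
  count="$1"
  url="$2"
  i=1
  while [ "$i" -le "$count" ]; do
    # -s hides progress output, -o /dev/null discards the response body,
    # and -w '%{http_code}' prints only the status code.
    status=$(curl -s -o /dev/null -w '%{http_code}' "$url" \
      -H 'content-type: application/json' \
      -d '{"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "Hello"}]}')
    echo "request $i: HTTP $status"
    i=$((i + 1))
  done
}
```

For example, running send_requests 6 "$GATEWAY_URL" sends six requests in a row. With the limit of 5 requests per minute from this section, the sixth request would be expected to report HTTP 429.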

    Example output:

    < HTTP/1.1 429 Too Many Requests
    < x-envoy-ratelimited: true
    < date: Tue, 18 Jun 2024 05:15:13 GMT
    < server: envoy
    < content-length: 0
      

LLM provider rate limit

If you have routes to multiple LLM providers, you can enforce a rate limit for each provider.

  1. Create a RateLimitConfig with the rate limit rules for the LLM provider. To indicate that the rate limit counts requests and not tokens, include the type: REQUEST field. The following example sets a limit of 2 requests per minute.

    kubectl apply -f- <<EOF
    apiVersion: ratelimit.solo.io/v1alpha1
    kind: RateLimitConfig
    metadata:
      name: openai-rate-limit
      namespace: gloo-system
    spec:
      raw:
        descriptors:
        - key: generic_key
          value: counter
          rateLimit:
            requestsPerUnit: 2
            unit: MINUTE
        rateLimits:
        - actions:
          - genericKey:
              descriptorValue: counter
          type: REQUEST
    EOF
      
  2. Apply the rate limit by using a GlooTrafficPolicy resource. The following example targets the openai HTTPRoute that you set up before you began.

    kubectl apply -f- <<EOF
    apiVersion: gloo.solo.io/v1alpha1
    kind: GlooTrafficPolicy
    metadata:
      name: openai-rate-limit
      namespace: gloo-system
    spec:
      targetRefs:
        - group: gateway.networking.k8s.io
          kind: HTTPRoute
          name: openai
      glooRateLimit:
        global:
          rateLimitConfigRef:
            name: openai-rate-limit
    EOF
      
  3. Send a simple request to the OpenAI API. Verify that the request succeeds.

    Example output:

    {
      "id": "chatcmpl-9bLT1ofadlXEMpo53LcGjHsv3S5Ry",
      "object": "chat.completion",
      "created": 1718687683,
      "model": "gpt-3.5-turbo-0125",
      "choices": [
        {
          "index": 0,
          "message": {
            "role": "assistant",
            "content": "In the realm of code, a concept so divine,\nRecursion weaves patterns, like nature's design.\nA function that calls itself, with purpose and grace,\nIt solves problems complex, with elegance and pace.\n\nLike a mirror reflecting its own reflection,\nRecursion repeats with boundless affection.\nEach iteration holds a story untold,\nUnraveling mysteries, a journey unfold.\n\nInfinite loops, a dangerous abyss,\nRecursion beckons with a siren's sweet kiss.\nBase case in"
          },
          "logprobs": null,
          "finish_reason": "length"
        }
      ],
      "usage": {
        "prompt_tokens": 39,
        "completion_tokens": 100,
        "total_tokens": 139
      },
      "system_fingerprint": null
    }
      
  4. Repeat the request two more times. Verify that the third request is rate limited and that you get back a 429 HTTP response code, because only 2 requests per minute are allowed for OpenAI.

    Example output:

    < HTTP/1.1 429 Too Many Requests
    < x-envoy-ratelimited: true
    < date: Tue, 18 Jun 2024 05:15:13 GMT
    < server: envoy
    < content-length: 0
      

Token-based rate limit

Instead of request-based rate limiting, you can apply a rate limit based on the number of tokens used. This approach helps make your costs and AI usage more predictable.

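  1. Update your OpenAI RateLimitConfig with your rate limit for tokens. Add a rate limit descriptor and action pair that sets type: TOKEN, so that the requestsPerUnit value counts tokens instead of requests. The following example adds a limit of 100 tokens per minute for each user, as identified by the X-User-ID request header.

    kubectl apply -f- <<EOF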
    apiVersion: ratelimit.solo.io/v1alpha1
    kind: RateLimitConfig
    metadata:
      name: openai-rate-limit
      namespace: gloo-system
    spec:
      raw:
        descriptors:
        - key: X-User-ID
          rateLimit:
            unit: MINUTE
            requestsPerUnit: 100
        rateLimits:
        - actions:
          - requestHeaders:
              descriptorKey: "X-User-ID"
              headerName: "X-User-ID"
          type: TOKEN
    EOF
      
  2. Check that the rate limit is still applied by the GlooTrafficPolicy that selects your updated RateLimitConfig.

    kubectl get GlooTrafficPolicy openai-rate-limit -n gloo-system -o yaml
      

    Example output:

      status:
        ancestors:
        - ancestorRef:
            group: gateway.networking.k8s.io
            kind: HTTPRoute
            name: openai
            namespace: gloo-system
          conditions:
          - lastTransitionTime: "2025-09-29T16:58:37Z"
            message: Policy accepted
            reason: Valid
            status: "True"
            type: Accepted
          - lastTransitionTime: "2025-09-29T16:58:37Z"
            message: Attached to all targets
            reason: Attached
            status: "True"
            type: Attached
          controllerName: solo.io/agentgateway
        - ancestorRef:
            group: gateway.networking.k8s.io
            kind: HTTPRoute
            name: openai
            namespace: gloo-system
          conditions:
          - lastTransitionTime: "2025-09-29T16:58:37Z"
            message: Policy accepted
            reason: Accepted
            status: "True"
            type: Accepted
          controllerName: solo.io/agentgateway
      
  3. Send a simple request to the OpenAI API. Verify that the request succeeds. Include the X-User-ID request header with the value user123.

    In the output, note that the usage section shows how many total tokens the request used, such as 256. The request succeeds even though 256 tokens is more than the limit of 100 tokens per minute, because no tokens had been counted against the limit yet when this first request was sent.

    {
      "id": "chatcmpl-9bLT1ofadlXEMpo53LcGjHsv3S5Ry",
      "object": "chat.completion",
      "created": 1718687683,
      "model": "gpt-3.5-turbo-0125",
      "choices": [
        {
          "index": 0,
          "message": {
            "role": "assistant",
            "content": "In the realm of code, a concept so divine,\nRecursion weaves patterns, like nature's design.\nA function that calls itself, with purpose and grace,\nIt solves problems complex, with elegance and pace.\n\nLike a mirror reflecting its own reflection,\nRecursion repeats with boundless affection.\nEach iteration holds a story untold,\nUnraveling mysteries, a journey unfold.\n\nInfinite loops, a dangerous abyss,\nRecursion beckons with a siren's sweet kiss.\nBase case in"
          },
          "logprobs": null,
          "finish_reason": "length"
        }
      ],
      "usage": {
        "prompt_tokens": 48,
        "completion_tokens": 208,
        "total_tokens": 256
      },
      "system_fingerprint": null
    }
      
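  4. Repeat the request with the same X-User-ID header. Verify that the request is now rate limited and that you get back a 429 HTTP response code, because the first request already used more than the 100 tokens per minute that are allowed for the user.

    Example output:

    < HTTP/1.1 429 Too Many Requests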
    < x-envoy-ratelimited: true
    < date: Tue, 18 Jun 2024 05:15:13 GMT
    < server: envoy
    < content-length: 0
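
The behavior in steps 3 and 4 can be modeled with a small simulation. This is a simplified sketch under stated assumptions, not agentgateway's actual implementation: it assumes that a request is admitted while the counter for the current minute is below the limit, and that the tokens reported in the response are charged afterwards. The names TOKEN_LIMIT, used_tokens, last_status, total_tokens, and handle_request are hypothetical, and the sketch tracks a single user and never resets the window.

```shell
TOKEN_LIMIT=100   # tokens per minute, mirroring the example RateLimitConfig
used_tokens=0     # tokens charged so far in the current window (hypothetical counter)
last_status=""    # status of the most recent simulated request

# Extract usage.total_tokens from a chat completion response read on stdin.
# grep keeps the sketch dependency-free; a JSON tool would be more robust.
total_tokens() {
  grep -o '"total_tokens": *[0-9][0-9]*' | grep -o '[0-9][0-9]*$'
}

# Reject the request if the window's budget is already used up; otherwise
# admit it and charge the tokens that the response reports afterwards.
handle_request() {
  if [ "$used_tokens" -ge "$TOKEN_LIMIT" ]; then
    last_status=429
    return 0
  fi
  tokens=$(printf '%s' "$1" | total_tokens)
  used_tokens=$((used_tokens + ${tokens:-0}))
  last_status=200
}

# Usage values copied from the example response in step 3.
response='{"usage": {"prompt_tokens": 48, "completion_tokens": 208, "total_tokens": 256}}'

handle_request "$response"
echo "first request:  HTTP $last_status, tokens used: $used_tokens"
handle_request "$response"
echo "second request: HTTP $last_status, tokens used: $used_tokens"
```

In this model, the first request is admitted because the counter starts at 0, and it pushes the counter to 256. The second request is rejected with a 429 because the counter is already above the limit of 100, which matches the outcome described in steps 3 and 4.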
      

Cleanup

You can remove the resources that you created in this guide.

    kubectl delete RateLimitConfig global-rate-limit -n gloo-system
    kubectl delete RateLimitConfig openai-rate-limit -n gloo-system
    kubectl delete GlooTrafficPolicy global-rate-limit -n gloo-system
    kubectl delete GlooTrafficPolicy openai-rate-limit -n gloo-system