Rate limit requests

Limit the number of requests that can be sent to the LLM provider.

About rate limiting

Rate limiting LLM token usage helps with cost management, security, and service stability. LLM providers charge based on the number of input tokens (user prompts and system prompts) and output tokens (responses from the model), so uncontrolled usage can become very expensive. With Gloo AI Gateway, you can configure rate limits based on LLM token usage so that organizations can enforce budget constraints across groups, teams, departments, and individuals, and keep usage within predictable bounds. That way, you can avoid unexpected costs and help prevent malicious attacks against your LLM provider.

In the following tutorial, you extract claims from the JWT tokens for Alice and Bob that you created in the Control access tutorial. Then, you enforce rate limits based on the values of their JWT token claims.

Before you begin

Complete the Control access tutorial.

Set up rate limiting

Create a RateLimitConfig with your rate limit rules. The following example sets a user limit of 70 tokens per hour. The user ID is extracted from the sub claim in the JWT token.

kubectl apply -f- <<EOF
apiVersion: ratelimit.solo.io/v1alpha1
kind: RateLimitConfig
metadata:
  name: per-user-counter
  namespace: gloo-system
spec:
  raw:
    descriptors:
    - key: user-id
      rateLimit:
        requestsPerUnit: 70
        unit: HOUR
    rateLimits:
    - actions:
      - metadata:
          descriptorKey: user-id
          source: DYNAMIC
          default: unknown
          metadataKey:
            key: "envoy.filters.http.jwt_authn"
            path:
            - key: principal
            - key: sub
EOF
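
To see which value the `sub` claim would contribute to the `user-id` descriptor, you can decode a token's payload locally. The following is a minimal sketch, not part of the gateway itself: the helper functions `b64url_encode`, `b64url_decode`, and `jwt_sub` and the unsigned `SAMPLE_TOKEN` are hypothetical names made up for illustration, and the sketch assumes a `base64` command that accepts `-d`. In the tutorial, you would pass `$ALICE_TOKEN` instead of the sample token.

```shell
#!/bin/sh
# Assumption: base64 supports -d for decoding (GNU coreutils, recent macOS).

b64url_encode() {
  # Base64-encode stdin, drop padding and newlines, switch to the URL-safe alphabet.
  base64 | tr -d '=\n' | tr '+/' '-_'
}

b64url_decode() {
  # Restore the standard alphabet and the padding, then decode.
  s=$(printf '%s' "$1" | tr '_-' '/+')
  case $(( ${#s} % 4 )) in
    2) s="$s==" ;;
    3) s="$s=" ;;
  esac
  printf '%s' "$s" | base64 -d
}

jwt_sub() {
  # The claims are the second dot-separated segment of the token.
  payload=$(printf '%s' "$1" | cut -d. -f2)
  b64url_decode "$payload" | sed -n 's/.*"sub" *: *"\([^"]*\)".*/\1/p'
}

# Hypothetical, unsigned sample token so that the sketch runs without a cluster.
header=$(printf '%s' '{"alg":"none"}' | b64url_encode)
claims=$(printf '%s' '{"sub":"alice","team":"dev"}' | b64url_encode)
SAMPLE_TOKEN="$header.$claims.signature"

# Print the sub claim of the sample token.
jwt_sub "$SAMPLE_TOKEN"
```

This only inspects the claim. It does not verify the token signature, which the gateway's JWT policy handles before the rate limit is applied.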

Add the RateLimitConfig to your RouteOption resource by using the spec.options.rateLimitConfigs section.

kubectl apply -f- <<EOF
apiVersion: gateway.solo.io/v1
kind: RouteOption
metadata:
  name: openai-opt
  namespace: gloo-system
spec:
  targetRefs:
  - group: gateway.networking.k8s.io
    kind: HTTPRoute
    name: openai
  options:
    rateLimitConfigs:
      refs:
      - name: per-user-counter
        namespace: gloo-system
EOF

Send a request to the AI API and include the JWT token for Alice. Verify that the request succeeds.

curl -v "$INGRESS_GW_ADDRESS:8080/openai" -H "Authorization: Bearer $ALICE_TOKEN" -H content-type:application/json -d '{
 "model": "gpt-3.5-turbo",
 "messages": [
   {
     "role": "system",
     "content": "You are a poetic assistant, skilled in explaining complex programming concepts with creative flair."
   },
   {
     "role": "user",
     "content": "Compose a poem that explains the concept of recursion in programming."
   }
 ]
}'

In the output, note that the usage section shows how many tokens the request uses:

{
  "id": "chatcmpl-9bLT1ofadlXEMpo53LcGjHsv3S5Ry",
  "object": "chat.completion",
  "created": 1718687683,
  "model": "gpt-3.5-turbo-0125",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "In the realm of code, a concept so divine,\nRecursion weaves patterns, like nature's design.\nA function that calls itself, with purpose and grace,\nIt solves problems complex, with elegance and pace.\n\nLike a mirror reflecting its own reflection,\nRecursion repeats with boundless affection.\nEach iteration holds a story untold,\nUnraveling mysteries, a journey unfold.\n\nInfinite loops, a dangerous abyss,\nRecursion beckons with a siren's sweet kiss.\nBase case in"
      },
      "logprobs": null,
      "finish_reason": "length"
    }
  ],
  "usage": {
    "prompt_tokens": 39,
    "completion_tokens": 100,
    "total_tokens": 139
  },
  "system_fingerprint": null
}
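
To reason about how much of the hourly budget a response consumed, you can compare the `total_tokens` value against the configured limit. The following sketch models the budget as a simple counter and assumes that the full `total_tokens` value is deducted from it, which is a simplification of how the gateway accounts for tokens. The function names `total_tokens` and `remaining_budget` are hypothetical.

```shell
#!/bin/sh
LIMIT=70   # requestsPerUnit from the RateLimitConfig in this tutorial

# Trimmed copy of the usage section from the example response.
response='{"usage": {"prompt_tokens": 39, "completion_tokens": 100, "total_tokens": 139}}'

total_tokens() {
  # Extract the numeric total_tokens value from a JSON string.
  printf '%s\n' "$1" | sed -n 's/.*"total_tokens": *\([0-9][0-9]*\).*/\1/p'
}

remaining_budget() {
  # Limit minus tokens used. Zero or less means that the budget is exhausted.
  echo $(( $1 - $2 ))
}

used=$(total_tokens "$response")
left=$(remaining_budget "$LIMIT" "$used")

if [ "$left" -le 0 ]; then
  echo "used $used of $LIMIT tokens: budget exhausted, expect a 429 on the next request"
else
  echo "used $used of $LIMIT tokens: $left remaining"
fi
```

With the example response, the 139 tokens used are already more than the 70 tokens that are allowed per hour.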

Repeat the request. Verify that the request is now rate limited and that you get back a 429 HTTP response code. The first request used 139 tokens, which already exceeds the limit of 70 tokens per hour that is allowed for a particular user.

curl -v "$INGRESS_GW_ADDRESS:8080/openai" -H "Authorization: Bearer $ALICE_TOKEN" -H content-type:application/json -d '{
 "model": "gpt-3.5-turbo",
 "messages": [
   {
     "role": "system",
     "content": "You are a poetic assistant, skilled in explaining complex programming concepts with creative flair."
   },
   {
     "role": "user",
     "content": "Compose a poem that explains the concept of recursion in programming."
   }
 ]
}'

Example output:

< HTTP/1.1 429 Too Many Requests
< x-envoy-ratelimited: true
< date: Tue, 18 Jun 2024 05:15:13 GMT
< server: envoy
< content-length: 0
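
If you want to check the result in a script instead of reading the verbose output, you can capture only the HTTP status code. The following sketch uses the standard curl options `-s`, `-o`, and `-w '%{http_code}'`. The helper names `request_status` and `describe_status` and the shortened request body are hypothetical, and `request_status` requires the gateway from this tutorial to be reachable.

```shell
#!/bin/sh
describe_status() {
  # Map an HTTP status code to what it means in this tutorial.
  case "$1" in
    200) echo "allowed" ;;
    429) echo "rate limited" ;;
    *)   echo "unexpected status: $1" ;;
  esac
}

request_status() {
  # Send a request with the given JWT and print only the HTTP status code.
  curl -s -o /dev/null -w '%{http_code}' "$INGRESS_GW_ADDRESS:8080/openai" \
    -H "Authorization: Bearer $1" \
    -H content-type:application/json \
    -d '{"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "Hello"}]}'
}

# Usage against the tutorial gateway (requires a running cluster):
#   describe_status "$(request_status "$ALICE_TOKEN")"

# Offline demonstration with the status code from the example output.
describe_status 429
```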

Cleanup

Before continuing with the next tutorial, you can clean up the JWT authentication and rate limiting resources that you created in this tutorial and the previous tutorial. Note that if you do not remove the resources, you must include JWT tokens with the correct access in all subsequent curl requests to the AI API.

kubectl delete VirtualHostOption jwt-provider -n gloo-system
kubectl delete RouteOption openai-opt -n gloo-system
kubectl delete RateLimitConfig per-user-counter -n gloo-system

Next

You can now explore how to effectively manage your LLM prompts.