Rate limit requests
Limit the number of requests that can be sent to the LLM provider.
About rate limiting
Traditional API gateways typically implement rate limiting based on the number of requests per time period, which works well for standard REST APIs where each request has similar computational cost. However, LLM applications present a unique challenge: the computational cost varies dramatically based on the number of tokens processed.
LLM providers charge based on the number of input tokens (user prompts and system prompts) and output tokens (responses from the model), which can make uncontrolled usage very expensive. For example, a simple "Hello" prompt might consume 10 tokens, while a complex analysis request like "Produce a report with charts based on the template but using last quarter's sales data" could consume thousands.
Token-based rate limiting ensures fair resource allocation by accounting for the actual computational load of each request. It prevents users from overwhelming the system with token-heavy requests while allowing reasonable usage of lighter requests. With rate limits in place, your teams can stick to their LLM budgets and make sure their AI usage stays predictably within bounds. As such, token-based rate limiting is a key way to manage costs, ensure service stability, and secure your AI environment.
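As a rough illustration of the difference, the following Python sketch tallies the same traffic by request count and by token count. The user names and the figure of 3000 tokens are assumptions made for the example (standing in for "thousands"); only the 10-token "Hello" figure comes from the text above.

```python
# Illustrative traffic: (user, tokens consumed by the request).
# 10 tokens matches the "Hello" example; 3000 is an assumed stand-in
# for a token-heavy analysis request.
requests = [
    ("alice", 10),
    ("alice", 10),
    ("bob", 3000),
    ("bob", 3000),
]


def usage_by_user(requests):
    """Return (request counts, token totals) keyed by user."""
    counts, tokens = {}, {}
    for user, used in requests:
        counts[user] = counts.get(user, 0) + 1
        tokens[user] = tokens.get(user, 0) + used
    return counts, tokens


counts, tokens = usage_by_user(requests)
print(counts)  # {'alice': 2, 'bob': 2}
print(tokens)  # {'alice': 20, 'bob': 6000}
```

A request-based limit sees both users as identical (two requests each), while the token totals differ by a factor of 300. That gap is what a token-based limit is meant to capture.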
In the following tutorial, you extract claims from the JWT token for Alice that you created in the Control access tutorial. Then, you enforce rate limits based on the values of those claims.
Before you begin
Complete the Control access tutorial.
Set up rate limiting
Create a RateLimitConfig with your rate limit rules. The following example sets a user limit of 70 tokens per hour. The user ID is extracted from the sub claim in the JWT token.
kubectl apply -f- <<EOF
apiVersion: ratelimit.solo.io/v1alpha1
kind: RateLimitConfig
metadata:
  name: per-user-counter
  namespace: gloo-system
spec:
  raw:
    descriptors:
    - key: user-id
      rateLimit:
        requestsPerUnit: 70
        unit: HOUR
    rateLimits:
    - actions:
      - metadata:
          descriptorKey: user-id
          source: DYNAMIC
          default: unknown
          metadataKey:
            key: "envoy.filters.http.jwt_authn"
            path:
            - key: principal
            - key: sub
EOF
Add the RateLimitConfig to your RouteOption resource by using the spec.options.rateLimitConfigs section.
kubectl apply -f- <<EOF
apiVersion: gateway.solo.io/v1
kind: RouteOption
metadata:
  name: openai-opt
  namespace: gloo-system
spec:
  targetRefs:
  - group: gateway.networking.k8s.io
    kind: HTTPRoute
    name: openai
  options:
    rateLimitConfigs:
      refs:
      - name: per-user-counter
        namespace: gloo-system
EOF
Send a request to the AI API and include the JWT token for Alice. Verify that the request succeeds.
curl -v "${INGRESS_GW_ADDRESS}:8080/openai" \
  -H "Authorization: Bearer $ALICE_TOKEN" \
  -H content-type:application/json \
  -d '{
    "model": "gpt-3.5-turbo",
    "messages": [
      {
        "role": "system",
        "content": "You are a poetic assistant, skilled in explaining complex programming concepts with creative flair."
      },
      {
        "role": "user",
        "content": "Compose a poem that explains the concept of recursion in programming."
      }
    ]
  }'
In the output, note that the usage section shows how many tokens the request uses. Because this is the first request, the limit was not yet exceeded, so the request succeeds even though it uses more than the limit of 70 tokens per hour.
{
  "id": "chatcmpl-9bLT1ofadlXEMpo53LcGjHsv3S5Ry",
  "object": "chat.completion",
  "created": 1718687683,
  "model": "gpt-3.5-turbo-0125",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "In the realm of code, a concept so divine,\nRecursion weaves patterns, like nature's design.\nA function that calls itself, with purpose and grace,\nIt solves problems complex, with elegance and pace.\n\nLike a mirror reflecting its own reflection,\nRecursion repeats with boundless affection.\nEach iteration holds a story untold,\nUnraveling mysteries, a journey unfold.\n\nInfinite loops, a dangerous abyss,\nRecursion beckons with a siren's sweet kiss.\nBase case in"
      },
      "logprobs": null,
      "finish_reason": "length"
    }
  ],
  "usage": {
    "prompt_tokens": 39,
    "completion_tokens": 100,
    "total_tokens": 139
  },
  "system_fingerprint": null
}
Repeat the request. Verify that the request is now rate limited and that you get back a 429 HTTP response code, because only 70 tokens per hour are allowed for a particular user.
curl -v "${INGRESS_GW_ADDRESS}:8080/openai" \
  -H "Authorization: Bearer $ALICE_TOKEN" \
  -H content-type:application/json \
  -d '{
    "model": "gpt-3.5-turbo",
    "messages": [
      {
        "role": "system",
        "content": "You are a poetic assistant, skilled in explaining complex programming concepts with creative flair."
      },
      {
        "role": "user",
        "content": "Compose a poem that explains the concept of recursion in programming."
      }
    ]
  }'
Example output:
< HTTP/1.1 429 Too Many Requests
< x-envoy-ratelimited: true
< date: Tue, 18 Jun 2024 05:15:13 GMT
< server: envoy
< content-length: 0
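The sequence observed in these steps (the first request succeeds although it uses 139 tokens, the second is rejected) can be modeled with a small Python sketch. This is a simplified mental model under stated assumptions, not the gateway's implementation: it assumes a per-user counter that is checked before each request and updated only after the response reports its token usage, and it omits the hourly window reset. The class and function names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class TokenWindowLimiter:
    """Per-user token counter for one window (illustrative model only)."""
    limit: int                                 # tokens allowed per window
    used: dict = field(default_factory=dict)   # user id -> tokens consumed

    def allow(self, user: str) -> bool:
        # Admit the request while recorded usage is still below the limit.
        return self.used.get(user, 0) < self.limit

    def record(self, user: str, total_tokens: int) -> None:
        # Assumption: usage is only known once the model has responded.
        self.used[user] = self.used.get(user, 0) + total_tokens


def send(limiter: TokenWindowLimiter, user: str, total_tokens: int) -> int:
    """Return the HTTP status that this model would produce."""
    if not limiter.allow(user):
        return 429
    limiter.record(user, total_tokens)
    return 200


limiter = TokenWindowLimiter(limit=70)          # 70 tokens per window, as configured
statuses = [send(limiter, "alice", 139) for _ in range(2)]
print(statuses)  # [200, 429]
```

In this model a different user ID starts with its own empty counter, which corresponds to keying the limit on the sub claim.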
Cleanup
Before continuing with the next tutorial, you can clean up the JWT authentication resources that you created in this tutorial and the previous tutorial. Note that if you do not remove the resources, you must include JWT tokens with the correct access in all subsequent curl requests to the AI API.
kubectl delete VirtualHostOption jwt-provider -n gloo-system
kubectl delete RouteOption openai-opt -n gloo-system
kubectl delete RateLimitConfig per-user-counter -n gloo-system
Next
You can now explore how to effectively manage your LLM prompts.