Semantic caching
Reduce LLM provider costs and response times by caching responses for semantically similar prompts.
About semantic caching
Semantic caching stores data based on its meaning. If two prompts that are sent to the LLM provider are semantically similar, the LLM response to the first prompt can be reused for the second prompt, without sending another request to the LLM provider. This reduces the number of requests to the LLM provider, improves response times, and reduces costs.
In the following tutorial, you configure Redis as the caching datastore and enable semantic caching in Gloo Gateway.
Before you begin
Complete the Authenticate with API keys tutorial.
Set up semantic caching
Deploy a Redis instance to use as the datastore to cache semantically similar requests.
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis-cache
  namespace: gloo-system
  labels:
    app: redis-cache
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redis-cache
  template:
    metadata:
      labels:
        app: redis-cache
    spec:
      containers:
      - name: redis
        image: redis/redis-stack-server:latest
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 6379
---
apiVersion: v1
kind: Service
metadata:
  name: redis-cache
  namespace: gloo-system
spec:
  selector:
    app: redis-cache
  ports:
  - protocol: TCP
    port: 6379
    targetPort: 6379
EOF
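Optionally, wait until the Redis deployment reports ready before you continue.

kubectl rollout status deployment/redis-cache -n gloo-system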
Configure semantic caching with the RouteOption resource and reference the Redis datastore.
kubectl apply -f - <<EOF
apiVersion: gateway.solo.io/v1
kind: RouteOption
metadata:
  name: openai-opt
  namespace: gloo-system
spec:
  targetRefs:
  - group: gateway.networking.k8s.io
    kind: HTTPRoute
    name: openai
  options:
    ai:
      semanticCache:
        datastore:
          redis:
            connectionString: redis://redis-cache:6379
        embedding:
          openai:
            authToken:
              secretRef:
                name: openai-secret
                namespace: gloo-system
EOF
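To optionally verify that the datastore is reachable inside the cluster, you can ping Redis directly. This assumes that the redis-cli binary is available in the redis/redis-stack-server image, which is typically the case.

kubectl exec -n gloo-system deploy/redis-cache -- redis-cli ping

Example output:

PONG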
Send a request to the OpenAI endpoint. Verify the x-envoy-upstream-service-time header value for the first request. The value indicates the time in milliseconds that it took to process the request.

curl -v "$INGRESS_GW_ADDRESS:8080/openai" -H content-type:application/json -d '{
  "model": "gpt-3.5-turbo",
  "messages": [
    {
      "role": "user",
      "content": "How many varieties of cheeses are in France?"
    }
  ]
}'

Example output:

...
< x-envoy-upstream-service-time: 1748
...
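If you want to script this check instead of scanning the verbose output, you can print only the timing header. For example:

curl -s -o /dev/null -D - "$INGRESS_GW_ADDRESS:8080/openai" \
  -H content-type:application/json \
  -d '{"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "How many varieties of cheeses are in France?"}]}' \
  | grep -i x-envoy-upstream-service-time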
Repeat the request. Verify that the response now includes the x-gloo-semantic-cache: hit header, which indicates that semantic caching was used to respond to the request. Also, notice that the response time is reduced.

curl -v "$INGRESS_GW_ADDRESS:8080/openai" -H content-type:application/json -d '{
  "model": "gpt-3.5-turbo",
  "messages": [
    {
      "role": "user",
      "content": "How many varieties of cheeses are in France?"
    }
  ]
}'

Example output:

...
< x-gloo-semantic-cache: hit
...
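Because the cache matches on meaning rather than exact text, a reworded prompt can also be served from the cache. For example, the following request uses a hypothetical rephrasing of the earlier prompt and may also return the x-gloo-semantic-cache: hit header, depending on how similar the computed embeddings are:

curl -v "$INGRESS_GW_ADDRESS:8080/openai" -H content-type:application/json -d '{
  "model": "gpt-3.5-turbo",
  "messages": [
    {
      "role": "user",
      "content": "What is the number of cheese varieties in France?"
    }
  ]
}'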
Manually control the cache
In the previous example, you learned how cache entries are added automatically for semantically similar requests. Because semantic matching can produce false cache hits, you might want to control the cache manually and add or remove entries for specific requests.
Gloo Gateway comes with a built-in ai-extension-apiserver component that exposes a REST API that you can use to add or remove cache entries for specific requests. The ai-extension-apiserver is deployed by using the Gloo Gateway Helm chart.
Get the Helm values file for your current Gloo Gateway installation.

helm get values gloo-gateway -n gloo-system -o yaml > gloo-gateway.yaml
open gloo-gateway.yaml
Add the following section to your Helm values file to include the ai-extension-apiserver component. You use this component to manually control cache entries.

global:
  extensions:
    aiExtension:
      apiServer:
        enabled: true
Upgrade your release.

helm repo update

helm upgrade -i gloo-gateway glooe/gloo-ee \
  --namespace gloo-system \
  -f gloo-gateway.yaml
Check that the ai-extension-apiserver component is running.

kubectl get deploy -n gloo-system ai-extension-apiserver

Example output:

NAME                     READY   UP-TO-DATE   AVAILABLE   AGE
ai-extension-apiserver   1/1     1            1           3h54m
Update the semantic caching configuration in the RouteOption resource to enable READ_ONLY mode. In this mode, the gateway still serves cached responses, but no longer adds new cache entries automatically.

kubectl apply -f - <<EOF
apiVersion: gateway.solo.io/v1
kind: RouteOption
metadata:
  name: openai-opt
  namespace: gloo-system
spec:
  targetRefs:
  - group: gateway.networking.k8s.io
    kind: HTTPRoute
    name: openai
  options:
    ai:
      semanticCache:
        mode: READ_ONLY
        datastore:
          redis:
            connectionString: redis://redis-cache:6379
        embedding:
          openai:
            authToken:
              secretRef:
                name: openai-secret
                namespace: gloo-system
EOF
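You can optionally confirm that the mode field was applied by inspecting the updated resource:

kubectl get routeoption openai-opt -n gloo-system -o yaml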
Create an HTTPRoute resource to expose the ai-extension-apiserver component.

kubectl apply -f - <<EOF
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: cache-service
  namespace: gloo-system
spec:
  parentRefs:
  - name: ai-gateway
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /cache
    filters:
    - type: URLRewrite
      urlRewrite:
        path:
          replacePrefixMatch: /
          type: ReplacePrefixMatch
    backendRefs:
    - name: ai-extension-apiserver
      namespace: gloo-system
      port: 8000
EOF
Optional: View the Swagger documentation for the ai-extension-apiserver component by navigating to the $INGRESS_GW_ADDRESS:8080/cache/docs endpoint in your web browser.
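If you prefer the command line, the API server likely also exposes a machine-readable OpenAPI spec alongside the Swagger UI. The /openapi.json path below follows the common FastAPI-style convention and is an assumption, not a documented endpoint:

curl "$INGRESS_GW_ADDRESS:8080/cache/openapi.json"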
Send a request to the ai-extension-apiserver component to clear the cached request from the previous tutorial. When you interact with the ai-extension-apiserver REST API, all endpoints must include the cache_id path parameter. The cache_id parameter represents the namespace.name of the RouteOption resource that is used to cache the request. In this example, the cache_id is gloo-system.openai-opt.

curl -X DELETE "$INGRESS_GW_ADDRESS:8080/cache/semantic-cache/gloo-system.openai-opt/contents" \
  -F "model=gpt-3.5-turbo" \
  -F "stream=false"
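You can optionally inspect the datastore to confirm that the entry was removed. The key layout is internal to the AI extension, so treat this only as a rough sanity check:

kubectl exec -n gloo-system deploy/redis-cache -- redis-cli --scan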
Repeat the request from the previous tutorial. Verify that the x-gloo-semantic-cache: hit header is no longer present in the response.

curl -v "$INGRESS_GW_ADDRESS:8080/openai" -H content-type:application/json -d '{
  "model": "gpt-3.5-turbo",
  "messages": [
    {
      "role": "user",
      "content": "How many varieties of cheeses are in France?"
    }
  ]
}'
Example output:
Note that the response might be slightly different depending on the LLM provider that you use.

...
{
  "id": "chatcmpl-A1YvB0dmwVem3gsTmpkvnl3QZUIb7",
  "object": "chat.completion",
  "created": 1724935929,
  "model": "gpt-3.5-turbo-0125",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "There are over 1,200 different varieties of cheeses in France.",
        "refusal": null
      },
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 16,
    "completion_tokens": 14,
    "total_tokens": 30
  },
  "system_fingerprint": null
}
...
Repeat the same request a few times. Verify that the request is not automatically cached and the x-gloo-semantic-cache: hit header is not returned.

curl -v "$INGRESS_GW_ADDRESS:8080/openai" -H content-type:application/json -d '{
  "model": "gpt-3.5-turbo",
  "messages": [
    {
      "role": "user",
      "content": "How many varieties of cheeses are in France?"
    }
  ]
}'
Example output:
...
{
  "id": "chatcmpl-A1YxsmOhZodbcKwoLMy4dqrtk9l7g",
  "object": "chat.completion",
  "created": 1724936096,
  "model": "gpt-3.5-turbo-0125",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "It is estimated that there are over 1,200 varieties of cheeses produced in France.",
        "refusal": null
      },
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 16,
    "completion_tokens": 18,
    "total_tokens": 34
  },
  "system_fingerprint": null
}
...
Send a request to the ai-extension-apiserver component to add a cache entry for the request that you previously sent.

echo '{
  "model": "gpt-3.5-turbo",
  "messages": [
    {
      "role": "user",
      "content": "How many varieties of cheeses are in France?"
    }
  ]
}' > request.json

echo '{
  "id": "fake",
  "object": "chat.completion",
  "created": 1722966273,
  "model": "gpt-3.5-turbo",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "There are many varieties of cheeses in France. Some of the most popular ones include Brie, Camembert, Roquefort, and Comté. Each of these cheeses has a unique flavor and texture, making them a delight for cheese lovers around the world.",
        "refusal": null
      },
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 11,
    "completion_tokens": 310,
    "total_tokens": 321
  },
  "system_fingerprint": "fp_48196bc67a"
}' > response.json

curl -X PUT "$INGRESS_GW_ADDRESS:8080/cache/semantic-cache/gloo-system.openai-opt/contents" \
  -F "req=@request.json" \
  -F "data=@response.json"
Send another request to the LLM. Verify that you get back the exact response from the response.json file that you manually added to the cache.

curl -v "$INGRESS_GW_ADDRESS:8080/openai" -H content-type:application/json -d '{
  "model": "gpt-3.5-turbo",
  "messages": [
    {
      "role": "user",
      "content": "How many varieties of cheeses are in France?"
    }
  ]
}'
Example output:
...
{
  "id": "fake",
  "object": "chat.completion",
  "created": 1722966273,
  "model": "gpt-3.5-turbo",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "There are many varieties of cheeses in France. Some of the most popular ones include Brie, Camembert, Roquefort, and Comté. Each of these cheeses has a unique flavor and texture, making them a delight for cheese lovers around the world.",
        "refusal": null
      },
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 11,
    "completion_tokens": 310,
    "total_tokens": 321
  },
  "system_fingerprint": "fp_48196bc67a"
}
...
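To check this programmatically, you can extract the response ID, which is fake for the manually added entry. The following example assumes that jq is installed:

curl -s "$INGRESS_GW_ADDRESS:8080/openai" \
  -H content-type:application/json \
  -d '{"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "How many varieties of cheeses are in France?"}]}' \
  | jq -r '.id'

Example output:

fake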
Next
You can optionally remove the resources that you set up as part of this guide. For steps, see Cleanup.