Semantic caching
Reduce request latency and LLM provider costs by reusing cached responses for semantically similar prompts.
About semantic caching
Semantic caching stores data based on its meaning. If two user prompts sent to the LLM provider are semantically similar, the LLM response for the first prompt can be reused for the second prompt, without sending a request to the LLM. This reduces the number of requests to the LLM provider, improves the response time, and reduces costs.
In the following tutorial, you configure Redis as the caching datastore and enable semantic caching in Gloo Gateway.
Before you begin
Complete the Authenticate with API keys tutorial.
Set up semantic caching
Deploy either a Redis or Weaviate instance to use as the datastore to cache semantically similar requests.
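For the Redis option, a minimal in-cluster instance is enough for this tutorial. The following manifest is a sketch: the redis name, the gloo-system namespace, and the image tag are assumptions you can adapt to your environment.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis
  namespace: gloo-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
      - name: redis
        image: redis:7
        ports:
        - containerPort: 6379
---
apiVersion: v1
kind: Service
metadata:
  name: redis
  namespace: gloo-system
spec:
  selector:
    app: redis
  ports:
  - port: 6379
    targetPort: 6379
```

Apply the manifest with kubectl apply -f, and note the resulting in-cluster address (redis.gloo-system.svc.cluster.local:6379 for this sketch), which you reference in the semantic caching configuration.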
Send a request to the AI API. Note the x-envoy-upstream-service-time header value in the response to this first request, which indicates the time that it took to process the request.
Example output:
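The request from the step above might look like the following. The route path, model name, and prompt are assumptions carried over from the prerequisite API-key tutorial; adjust them for your setup.

```sh
# Route path and model are illustrative; use the route you configured
# in the "Authenticate with API keys" tutorial.
curl -v "$INGRESS_GW_ADDRESS:8080/openai" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [
      {"role": "user", "content": "What is Kubernetes?"}
    ]
  }'
```

In the verbose output, look for the x-envoy-upstream-service-time response header, which reports the upstream processing time in milliseconds.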
In your RouteOption resource, add the following spec.options.ai.semanticCache section to configure semantic caching by using the Redis datastore.
Repeat the request. Verify that the response now includes the x-gloo-semantic-cache: hit header, which indicates that semantic caching was used to respond to the request. Also, notice that the response time is reduced.
Example output:
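The semanticCache section described above might look like the following sketch. The resource name and namespace (openai-opt in gloo-system) follow the cache_id used later in this tutorial; the exact field names under semanticCache can vary by Gloo Gateway version, so check the RouteOption API reference for your release.

```yaml
apiVersion: gateway.solo.io/v1
kind: RouteOption
metadata:
  name: openai-opt
  namespace: gloo-system
spec:
  options:
    ai:
      semanticCache:
        # Redis instance deployed earlier in this tutorial.
        datastore:
          redis:
            connectionString: redis://redis.gloo-system.svc.cluster.local:6379
        # Embedding settings are an assumption; semantic caching needs an
        # embedding model to compare prompt similarity.
        embedding:
          openai:
            authToken:
              secretRef:
                name: openai-secret
                namespace: gloo-system
```

By default, the cache operates in read-write mode, so semantically similar requests are both looked up and stored automatically.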
Manually control the cache
In the previous example, you learned how cache entries are automatically added for semantically similar requests. Because semantic similarity matching can lead to false cache hits, you might want to manually control the cache and add or remove a cache entry for a specific request.
Gloo Gateway comes with a built-in ai-extension-apiserver component that exposes a REST API for adding and removing cache entries for specific requests. The ai-extension-apiserver is deployed through the Gloo Gateway Helm chart.
Get the Helm values files for your current Gloo Gateway installation.
Add the following section to your Helm values file to include the ai-extension-apiserver component, which you use to manually control cache entries.
Upgrade your release.
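The steps above might look like the following. The release name, namespace, chart reference, and the values key that enables the ai-extension-apiserver are all assumptions; check the values reference for your chart version for the actual key.

```sh
# Export the current values of your installation (release name and
# namespace are assumptions).
helm get values gloo-gateway -n gloo-system -o yaml > gloo-gateway-values.yaml

# Add the ai-extension-apiserver setting to the values file. The exact
# key is hypothetical; consult your chart version's values reference.
# Then upgrade the release with the updated values:
helm upgrade gloo-gateway glooe/gloo-ee \
  -n gloo-system \
  -f gloo-gateway-values.yaml
```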
Check that the ai-extension-apiserver component is running.
Example output:
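You can verify the component with a standard pod listing. The namespace and label selector are assumptions; adjust them to match how the component is deployed in your cluster.

```sh
# Namespace and label are assumptions; a plain
# "kubectl get pods -n gloo-system" also works if you grep for the name.
kubectl get pods -n gloo-system -l app=ai-extension-apiserver
```

The pod should report a Running status before you continue.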
Update the semantic caching configuration in the RouteOption resource to enable READ_ONLY mode.
Create an HTTPRoute resource to expose the ai-extension-apiserver component.
Optional: View the Swagger documentation for the ai-extension-apiserver component by navigating to the $INGRESS_GW_ADDRESS:8080/cache/docs endpoint in your web browser.
Send a request to the ai-extension-apiserver component to clear the cached request from the previous tutorial. When you interact with the ai-extension-apiserver REST API, all endpoints must include the cache_id path parameter. The cache_id parameter represents the namespace.name of the RouteOption resource that is used to cache the request. In this example, the cache_id is gloo-system.openai-opt.
Repeat the request from the previous tutorial. Verify that the x-gloo-semantic-cache: hit header is no longer present in the response.
Example output:
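A request to clear the cached entry might look like the following. The endpoint path and HTTP method here are illustrative only; the authoritative paths are in the Swagger documentation at $INGRESS_GW_ADDRESS:8080/cache/docs. Only the cache_id value, gloo-system.openai-opt, is taken from this tutorial.

```sh
# Illustrative endpoint shape; verify the actual path and method in the
# Swagger docs. The cache_id is the namespace.name of the RouteOption.
curl -X DELETE "$INGRESS_GW_ADDRESS:8080/cache/gloo-system.openai-opt/entries"
```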
Note that the response might be slightly different depending on the LLM provider that you use.
Repeat the same request a few times. Verify that the request is not automatically cached and that the x-gloo-semantic-cache: hit header is not returned.
Example output:
Send a request to the ai-extension-apiserver component to add a cache entry for the request that you previously sent.
Send another request to the LLM. Verify that you get back the exact response from the response.json file that you manually added to the cache.
Example output:
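Adding an entry might look like the following sketch. As before, the endpoint path and request body shape are illustrative assumptions; check the Swagger documentation for the actual API. The response.json file name is taken from this tutorial, and the cache_id is gloo-system.openai-opt.

```sh
# Illustrative endpoint shape; verify the actual path and payload format
# in the Swagger docs. response.json holds the response to cache.
curl -X POST "$INGRESS_GW_ADDRESS:8080/cache/gloo-system.openai-opt/entries" \
  -H "Content-Type: application/json" \
  -d @response.json
```

Because the cache is in READ_ONLY mode, this manually added entry is served for matching requests, but no new entries are created automatically.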
Next
You can optionally remove the resources that you set up as part of this guide. For steps, see Cleanup.