Semantic caching

Reduce requests to the LLM provider by caching LLM responses and reusing them for semantically similar prompts.

About semantic caching

Semantic caching stores data based on its meaning. If two prompts sent to the LLM provider are semantically similar, the LLM response for the first prompt can be reused for the second prompt, without sending a request to the LLM. This reduces the number of requests to the LLM provider, improves the response time, and reduces costs.
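To make "semantically similar" concrete: one common way to compare prompts is to convert each prompt into an embedding vector and measure the cosine similarity between the vectors. The following is a minimal sketch that assumes that approach. The toy vectors, the `cosine_similarity` and `is_cache_hit` helpers, and the 0.9 threshold are illustrative assumptions, not values or logic taken from Gloo Gateway.

```shell
# Hypothetical helper: cosine similarity of two space-separated vectors.
cosine_similarity() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    n = split(a, x, " ")
    m = split(b, y, " ")
    if (n == 0 || n != m) exit 1          # vectors must have equal, non-zero length
    dot = 0; na = 0; nb = 0
    for (i = 1; i <= n; i++) {
      dot += x[i] * y[i]
      na  += x[i] * x[i]
      nb  += y[i] * y[i]
    }
    if (na == 0 || nb == 0) exit 1        # similarity is undefined for zero vectors
    printf "%.4f\n", dot / (sqrt(na) * sqrt(nb))
  }'
}

# Hypothetical helper: exit 0 when the similarity reaches the threshold
# (third argument, assumed default 0.9), otherwise exit non-zero.
is_cache_hit() {
  score=$(cosine_similarity "$1" "$2") || return 1
  awk -v s="$score" -v t="${3:-0.9}" 'BEGIN { exit !(s >= t) }'
}
```

For example, `is_cache_hit "0.9 0.1" "0.8 0.2"` succeeds because the two toy vectors point in nearly the same direction, whereas `is_cache_hit "1 0" "0 1"` fails because the vectors are orthogonal.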

In the following tutorial, you configure Redis as the caching datastore and enable semantic caching in Gloo Gateway.

Before you begin

Complete the Authenticate with API keys tutorial.

Set up semantic caching

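Deploy either a Redis or a Weaviate instance to use as the datastore that caches semantically similar requests. The first manifest deploys Redis and the second deploys Weaviate. Apply only the one that you want to use.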

  kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis-cache
  namespace: gloo-system
  labels:
    app: redis-cache
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redis-cache
  template:
    metadata:
      labels:
        app: redis-cache
    spec:
      containers:
      - name: redis
        image: redis/redis-stack-server:latest
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 6379
---
apiVersion: v1
kind: Service
metadata:
  name: redis-cache
  namespace: gloo-system
spec:
  selector:
    app: redis-cache
  ports:
  - protocol: TCP
    port: 6379
    targetPort: 6379
EOF

  kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: weaviate
  namespace: gloo-system
  labels:
    app: weaviate
spec:
  replicas: 1
  selector:
    matchLabels:
      app: weaviate
  template:
    metadata:
      labels:
        app: weaviate
    spec:
      containers:
        - name: weaviate
          image: cr.weaviate.io/semitechnologies/weaviate:1.26.4
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 8080
            - containerPort: 50051
---
apiVersion: v1
kind: Service
metadata:
  name: weaviate
  namespace: gloo-system
  labels:
    app: weaviate
spec:
  selector:
    app: weaviate
  ports:
  - name: http
    port: 8080
  - name: grpc
    port: 50051
EOF
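Before you continue, you might want to confirm that the datastore is up. The following sketch wraps `kubectl rollout status` in a small helper. The `wait_for_deployment` name and the 120s default timeout are assumptions for illustration.

```shell
# Hypothetical helper: block until a Deployment in gloo-system is rolled out.
# $1 = Deployment name, $2 = optional timeout (assumed default: 120s).
wait_for_deployment() {
  name="$1"
  timeout="${2:-120s}"
  kubectl rollout status "deployment/${name}" -n gloo-system --timeout="${timeout}"
}

# Usage (run the one that matches the datastore you deployed):
#   wait_for_deployment redis-cache
#   wait_for_deployment weaviate
```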

Send a request to the AI API. In the output, note the value of the x-envoy-upstream-service-time header, which indicates the time in milliseconds that the upstream took to process this first request.

  curl -v "$INGRESS_GW_ADDRESS:8080/openai" -H content-type:application/json -d '{
 "model": "gpt-3.5-turbo",
 "messages": [
   {
     "role": "user",
     "content": "How many varieties of cheeses are in France?"
   }
 ]
}'

Example output:

  ...
< x-envoy-upstream-service-time: 842
...

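In your RouteOption resource, add the following spec.options.ai.semanticCache section to configure semantic caching. The first example uses the Redis datastore and the second uses the Weaviate datastore. Apply the one that matches the datastore that you deployed.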

  kubectl apply -f - <<EOF
apiVersion: gateway.solo.io/v1
kind: RouteOption
metadata:
  name: openai-opt
  namespace: gloo-system
spec:
  targetRefs:
  - group: gateway.networking.k8s.io
    kind: HTTPRoute
    name: openai
  options:
    ai:
      semanticCache:
        datastore:
          redis:
            connectionString: redis://redis-cache:6379
        embedding:
          openai:
            authToken:
              secretRef:
                name: openai-secret
                namespace: gloo-system
EOF

  kubectl apply -f - <<EOF
apiVersion: gateway.solo.io/v1
kind: RouteOption
metadata:
  name: openai-opt
  namespace: gloo-system
spec:
  targetRefs:
  - group: gateway.networking.k8s.io
    kind: HTTPRoute
    name: openai
  options:
    ai:
      semanticCache:
        datastore:
          weaviate:
            host: weaviate.gloo-system.svc.cluster.local
            httpPort: 8080
            grpcPort: 50051
            insecure: true
        embedding:
          openai:
            authToken:
              secretRef:
                name: openai-secret
                namespace: gloo-system
EOF

Repeat the request. Verify that the response now includes the header x-gloo-semantic-cache: hit, which indicates that semantic caching was used to respond to the request. Also, notice that the response time is reduced.

  curl -v "$INGRESS_GW_ADDRESS:8080/openai" -H content-type:application/json -d '{
 "model": "gpt-3.5-turbo",
 "messages": [
   {
     "role": "user",
     "content": "How many varieties of cheeses are in France?"
   }
 ]
}'

Example output:

  ...
< x-gloo-semantic-cache: hit
< x-envoy-upstream-service-time: 614
...
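To compare the two requests without scanning the verbose output by eye, you can extract individual response headers. The following is a minimal sketch. The `header_value` helper is a hypothetical name, and it only assumes that headers arrive one per line, optionally prefixed with `< ` as in the `curl -v` output.

```shell
# Hypothetical helper: print the value of one response header read from stdin.
# The header name is matched case-insensitively. A leading "< " (curl -v style)
# and a trailing carriage return are stripped.
header_value() {
  awk -v name="$1" '
    {
      line = $0
      sub(/\r$/, "", line)
      sub(/^< /, "", line)
      idx = index(line, ":")
      if (idx == 0) next
      key = tolower(substr(line, 1, idx - 1))
      if (key == tolower(name)) {
        value = substr(line, idx + 1)
        sub(/^[ \t]+/, "", value)
        print value
        exit
      }
    }'
}

# Usage: dump only the response headers to stdout, then pick the ones you need.
#   curl -s -o /dev/null -D - "$INGRESS_GW_ADDRESS:8080/openai" \
#     -H content-type:application/json -d @request.json > headers.txt
#   header_value x-gloo-semantic-cache < headers.txt
#   header_value x-envoy-upstream-service-time < headers.txt
```

The usage comment assumes that the request body is saved in a local `request.json` file. Because `curl -v` writes headers to stderr, the sketch uses `-D -` to send them to stdout instead.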

Manually control the cache

In the previous example, you learned how to automatically add cache entries for semantically similar requests. Because semantic matching can return a cached response for a prompt that is similar in wording but different in intent, you might want to manually control the cache, and add or remove a cache entry for a specific request.

Gloo Gateway comes with a built-in ai-extension-apiserver component that exposes a REST API that you can use to add or remove cache entries for specific requests. The ai-extension-apiserver is deployed by using the Gloo Gateway Helm chart.

Get the Helm values file for your current Gloo Gateway installation.

  helm get values gloo -n gloo-system -o yaml > gloo-gateway.yaml
open gloo-gateway.yaml

Add the following section to your Helm values file to include the ai-extension-apiserver component. You use this component to manually control cache entries.
```
  
global:
  extensions:
    aiExtension:
      apiServer:
        enabled: true
  
```

Upgrade your release.

  helm repo update
helm upgrade -i gloo glooe/gloo-ee \
  --namespace gloo-system \
  -f gloo-gateway.yaml \
  --version=1.19.0-beta2

Check that the ai-extension-apiserver component is running.

  kubectl get deploy -n gloo-system ai-extension-apiserver

Example output:

  NAME                     READY   UP-TO-DATE   AVAILABLE   AGE
ai-extension-apiserver   1/1     1            1           3h54m

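Update the semantic caching configuration in the RouteOption resource to enable READ_ONLY mode. In this mode, the gateway still serves responses from the cache, but does not add new cache entries automatically.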

  kubectl apply -f - <<EOF
apiVersion: gateway.solo.io/v1
kind: RouteOption
metadata:
  name: openai-opt
  namespace: gloo-system
spec:
  targetRefs:
  - group: gateway.networking.k8s.io
    kind: HTTPRoute
    name: openai
  options:
    ai:
      semanticCache:
        mode: READ_ONLY
        datastore:
          redis:
            connectionString: redis://redis-cache:6379
        embedding:
          openai:
            authToken:
              secretRef:
                name: openai-secret
                namespace: gloo-system
EOF

Create an HTTPRoute resource to expose the ai-extension-apiserver component.

  kubectl apply -f - <<EOF
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: cache-service
  namespace: gloo-system
spec:
  parentRefs:
    - name: ai-gateway
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /cache
    filters:
      - type: URLRewrite
        urlRewrite:
          path:
            replacePrefixMatch: /
            type: ReplacePrefixMatch
    backendRefs:
    - name: ai-extension-apiserver
      namespace: gloo-system
      port: 8000
EOF

Optional: View the Swagger documentation for the ai-extension-apiserver component by navigating to the $INGRESS_GW_ADDRESS:8080/cache/docs endpoint in your web browser.
```
  open http://$INGRESS_GW_ADDRESS:8080/cache/docs
  
```
Send a request to the ai-extension-apiserver component to clear the cached request from the previous section.
When you interact with the ai-extension-apiserver REST API, all endpoints must include the cache_id path parameter. The cache_id parameter represents the namespace.name of the RouteOption resource that is used to cache the request. In this example, the cache_id is gloo-system.openai-opt.
```
  curl -X DELETE "$INGRESS_GW_ADDRESS:8080/cache/semantic-cache/gloo-system.openai-opt/contents" \
    -F "model=gpt-3.5-turbo" \
    -F "stream=false"
  
```
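If you manage more than one RouteOption, it can help to derive the cache_id and the endpoint URL instead of typing them by hand. The following sketch follows the namespace.name rule described above. The `cache_id` and `cache_contents_url` helper names are assumptions for illustration, and the URL layout is copied from the request in this guide.

```shell
# Hypothetical helper: build the cache_id (<namespace>.<name>) for a RouteOption.
cache_id() {
  printf '%s.%s\n' "$1" "$2"
}

# Hypothetical helper: build the contents endpoint that this guide uses
# for the DELETE and PUT requests. Requires INGRESS_GW_ADDRESS to be set.
cache_contents_url() {
  printf '%s:8080/cache/semantic-cache/%s/contents\n' \
    "$INGRESS_GW_ADDRESS" "$(cache_id "$1" "$2")"
}

# Usage:
#   curl -X DELETE "$(cache_contents_url gloo-system openai-opt)" \
#     -F "model=gpt-3.5-turbo" -F "stream=false"
```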

Repeat the request from the previous section. Verify that the x-gloo-semantic-cache: hit header is no longer present in the response.

  curl -v "$INGRESS_GW_ADDRESS:8080/openai" -H content-type:application/json -d '{
  "model": "gpt-3.5-turbo",
  "messages": [
    {
      "role": "user",
      "content": "How many varieties of cheeses are in France?"
    }
  ]
}'

Example output:

Note that the response might be slightly different depending on the LLM provider that you use.

     ...
   {
     "id": "chatcmpl-A1YvB0dmwVem3gsTmpkvnl3QZUIb7",
     "object": "chat.completion",
     "created": 1724935929,
     "model": "gpt-3.5-turbo-0125",
     "choices": [
       {
         "index": 0,
         "message": {
           "role": "assistant",
           "content": "There are over 1,200 different varieties of cheeses in France.",
           "refusal": null
         },
         "logprobs": null,
         "finish_reason": "stop"
       }
     ],
     "usage": {
       "prompt_tokens": 16,
       "completion_tokens": 14,
       "total_tokens": 30
     },
     "system_fingerprint": null
   }
   ...

Repeat the same request a few times. Verify that the request is not automatically cached and the x-gloo-semantic-cache: hit header is not returned.

  curl -v "$INGRESS_GW_ADDRESS:8080/openai" -H content-type:application/json -d '{
  "model": "gpt-3.5-turbo",
  "messages": [
    {
      "role": "user",
      "content": "How many varieties of cheeses are in France?"
    }
  ]
}'

Example output:

     ...
   {
     "id": "chatcmpl-A1YxsmOhZodbcKwoLMy4dqrtk9l7g",
     "object": "chat.completion",
     "created": 1724936096,
     "model": "gpt-3.5-turbo-0125",
     "choices": [
       {
         "index": 0,
         "message": {
           "role": "assistant",
           "content": "It is estimated that there are over 1,200 varieties of cheeses produced in France.",
           "refusal": null
         },
         "logprobs": null,
         "finish_reason": "stop"
       }
     ],
     "usage": {
       "prompt_tokens": 16,
       "completion_tokens": 18,
       "total_tokens": 34
     },
     "system_fingerprint": null
   }
   ...
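Instead of repeating the request by hand, you can loop over it and count how many responses carry the cache-hit header. The following is a minimal sketch. The `count_cache_hits` helper and the `REQUEST_BODY` variable are hypothetical names, and the request body is the same one used throughout this guide.

```shell
# Request body from this guide, stored in a variable for reuse.
REQUEST_BODY='{
  "model": "gpt-3.5-turbo",
  "messages": [
    {"role": "user", "content": "How many varieties of cheeses are in France?"}
  ]
}'

# Hypothetical helper: send the request $1 times and print how many
# responses include the "x-gloo-semantic-cache: hit" header.
count_cache_hits() {
  attempts="$1"
  hits=0
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # -D - writes the response headers to stdout; the body is discarded.
    if curl -s -o /dev/null -D - "$INGRESS_GW_ADDRESS:8080/openai" \
         -H content-type:application/json -d "$REQUEST_BODY" \
       | grep -qi '^x-gloo-semantic-cache: hit'; then
      hits=$((hits + 1))
    fi
    i=$((i + 1))
  done
  echo "$hits"
}

# Usage:
#   count_cache_hits 3
```

With READ_ONLY mode enabled and the cache entry removed, the count is expected to stay at 0.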

Send a request to the ai-extension-apiserver to add a cache entry for the request that you previously sent.

  echo '{
    "model": "gpt-3.5-turbo",
    "messages": [
      {
        "role": "user",
        "content": "How many varieties of cheeses are in France?"
      }
    ]
  }' > request.json

echo '{
  "id": "fake",
  "object": "chat.completion",
  "created": 1722966273,
  "model": "gpt-3.5-turbo",
  "choices": [
      {
          "index": 0,
          "message": {
              "role": "assistant",
              "content": "There are many varieties of cheeses in France. Some of the most popular ones include Brie, Camembert, Roquefort, and Comté. Each of these cheeses has a unique flavor and texture, making them a delight for cheese lovers around the world.",
              "refusal": null
          },
          "logprobs": null,
          "finish_reason": "stop"
      }
  ],
  "usage": {
      "prompt_tokens": 11,
      "completion_tokens": 310,
      "total_tokens": 321
  },
  "system_fingerprint": "fp_48196bc67a"
}' > response.json

curl -X PUT "$INGRESS_GW_ADDRESS:8080/cache/semantic-cache/gloo-system.openai-opt/contents" -F "req=@request.json" -F "data=@response.json"
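If either file is missing or empty, the upload cannot add a usable cache entry. The following sketch guards the PUT request with a simple file check. The `put_cache_entry` helper is a hypothetical name. The URL and the `req` and `data` form fields are copied from the command above.

```shell
# Hypothetical helper: upload a request/response pair as a cache entry,
# but only if both files exist and are non-empty.
# $1 = request JSON file, $2 = response JSON file.
put_cache_entry() {
  req_file="$1"
  resp_file="$2"
  for f in "$req_file" "$resp_file"; do
    if [ ! -s "$f" ]; then
      echo "missing or empty file: $f" >&2
      return 1
    fi
  done
  curl -X PUT "$INGRESS_GW_ADDRESS:8080/cache/semantic-cache/gloo-system.openai-opt/contents" \
    -F "req=@${req_file}" -F "data=@${resp_file}"
}

# Usage:
#   put_cache_entry request.json response.json
```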

Send another request to the LLM. Verify that you get back the exact response from the response.json file that you manually added to the cache.

  curl -v "$INGRESS_GW_ADDRESS:8080/openai" -H content-type:application/json -d '{
    "model": "gpt-3.5-turbo",
    "messages": [
      {
        "role": "user",
        "content": "How many varieties of cheeses are in France?"
      }
    ]
  }'

Example output:

     ...
   {
     "id": "fake",
     "object": "chat.completion",
     "created": 1722966273,
     "model": "gpt-3.5-turbo",
     "choices": [
         {
             "index": 0,
             "message": {
                 "role": "assistant",
                 "content": "There are many varieties of cheeses in France. Some of the most popular ones include Brie, Camembert, Roquefort, and Comté. Each of these cheeses has a unique flavor and texture, making them a delight for cheese lovers around the world.",
                 "refusal": null
             },
             "logprobs": null,
             "finish_reason": "stop"
         }
     ],
     "usage": {
         "prompt_tokens": 11,
         "completion_tokens": 310,
         "total_tokens": 321
     },
     "system_fingerprint": "fp_48196bc67a"
   }
   ...

Next

You can optionally remove the resources that you set up as part of this guide. For steps, see Cleanup.
