About semantic caching

Semantic caching stores LLM responses keyed by the meaning of the prompt rather than by its exact text. If two prompts sent to the LLM provider are semantically similar, the response to the first prompt can be reused for the second prompt without sending another request to the LLM. This reduces the number of requests to the LLM provider, shortens response times, and lowers cost.
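
The lookup idea can be illustrated with a minimal sketch. It is not how Gloo Gateway implements semantic caching: the `embed` function is a toy bag-of-words stand-in for a real embedding model, and the `SemanticCache` class, its method names, and the 0.8 similarity threshold are all assumptions made for illustration.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (stand-in for a real embedding model)."""
    return Counter(text.lower().replace("?", "").split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    """Hypothetical cache that returns a stored response for similar prompts."""

    def __init__(self, threshold=0.8):  # threshold value is an assumption
        self.threshold = threshold
        self.entries = []  # list of (embedding, response) pairs

    def store(self, prompt, response):
        self.entries.append((embed(prompt), response))

    def lookup(self, prompt):
        """Return the most similar cached response, or None on a miss."""
        query = embed(prompt)
        best_score, best_response = 0.0, None
        for vector, response in self.entries:
            score = cosine(query, vector)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None
```

With this sketch, a prompt that is reworded slightly still resolves to the stored response, while an unrelated prompt misses and would be forwarded to the LLM.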

In the following tutorial, you configure Redis as the caching datastore and enable semantic caching in Gloo Gateway.

Before you begin

Complete the Authenticate with API keys tutorial.

Set up semantic caching

  1. Deploy a Redis instance to use as the datastore to cache semantically similar requests.

    kubectl apply -f - <<EOF
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: redis-cache
      namespace: gloo-system
      labels:
        app: redis-cache
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: redis-cache
      template:
        metadata:
          labels:
            app: redis-cache
        spec:
          containers:
          - name: redis
            image: redis/redis-stack-server:latest
            imagePullPolicy: IfNotPresent
            ports:
            - containerPort: 6379
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: redis-cache
      namespace: gloo-system
    spec:
      selector:
        app: redis-cache
      ports:
      - protocol: TCP
        port: 6379
        targetPort: 6379
    EOF
      
  2. Configure semantic caching with the RouteOption resource and reference the Redis datastore.

    kubectl apply -f - <<EOF
    apiVersion: gateway.solo.io/v1
    kind: RouteOption
    metadata:
      name: openai-opt
      namespace: gloo-system
    spec:
      targetRefs:
      - group: gateway.networking.k8s.io
        kind: HTTPRoute
        name: openai
      options:
        ai:
          semanticCache:
            datastore:
              redis:
                connectionString: redis://redis-cache:6379
            embedding:
              openai:
                authToken:
                  secretRef:
                    name: openai-secret
                    namespace: gloo-system
    EOF
      
  3. Send a request to the OpenAI endpoint. Note the x-envoy-upstream-service-time header value in the response to this first request. The value indicates the time, in milliseconds, that the upstream took to process the request.

    curl -v "$INGRESS_GW_ADDRESS:8080/openai" -H content-type:application/json -d '{
      "model": "gpt-3.5-turbo",
      "messages": [
        {
          "role": "user",
          "content": "How many varieties of cheeses are in France?"
        }
      ]
    }'
      

    Example output:

    ...
    < x-envoy-upstream-service-time: 1748
    ...
      
  4. Repeat the request. Verify that the response now includes the header x-gloo-semantic-cache: hit, which indicates that semantic caching was used to respond to the request. Also, notice that the response time is reduced.

    curl -v "$INGRESS_GW_ADDRESS:8080/openai" -H content-type:application/json -d '{
      "model": "gpt-3.5-turbo",
      "messages": [
        {
          "role": "user",
          "content": "How many varieties of cheeses are in France?"
        }
      ]
    }'
      

    Example output:

    ...
    < x-gloo-semantic-cache: hit
    ...
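
If you want to check the two responses programmatically instead of reading the verbose output by eye, a small parser along the following lines could help. This is only a sketch: the function names are hypothetical, and it assumes that you captured the `curl -v` output as text (curl writes the `<` header lines to stderr, so redirect with `2>&1` when capturing).

```python
def response_headers(curl_verbose_output):
    """Collect response headers from `curl -v` output (lines that start with '< ')."""
    headers = {}
    for line in curl_verbose_output.splitlines():
        line = line.strip()
        # Skip request lines, body lines, and the status line (which has no colon).
        if not line.startswith("< ") or ":" not in line:
            continue
        name, _, value = line[2:].partition(":")
        headers[name.strip().lower()] = value.strip()
    return headers


def is_cache_hit(headers):
    """True if the response was served from the semantic cache."""
    return headers.get("x-gloo-semantic-cache") == "hit"


def upstream_time_ms(headers):
    """Upstream processing time in milliseconds, or None if the header is absent."""
    value = headers.get("x-envoy-upstream-service-time")
    return int(value) if value is not None else None
```

For the first request you would expect `is_cache_hit` to return `False` and `upstream_time_ms` to return a value such as 1748. For the repeated request, `is_cache_hit` should return `True`.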
      

Manually control the cache

In the previous example, cache entries were added automatically for semantically similar requests. Because semantic matching can produce false cache hits, where a cached response is returned for a prompt that is similar in wording but different in intent, you might want to control the cache manually and add or remove the cache entry for a specific request.

Gloo Gateway comes with a built-in ai-extension-apiserver component that exposes a REST API that you can use to add or remove cache entries for specific requests. The ai-extension-apiserver is deployed by using the Gloo Gateway Helm chart.
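
The difference between automatic and manual cache population can be pictured with a toy model. The sketch below is an illustration only and does not reflect the gateway's internals: the `ToyCache` class and its methods are hypothetical, and it matches prompts by exact text rather than by semantic similarity to keep the example short. The `read_only` flag loosely mirrors the READ_ONLY mode that the steps below enable.

```python
class ToyCache:
    """Hypothetical model of a cache that can run in read-only mode."""

    def __init__(self, read_only=False):
        self.read_only = read_only
        self.entries = {}   # prompt -> cached response
        self.llm_calls = 0  # how many requests reached the LLM

    def handle(self, prompt, call_llm):
        """Return (response, cache_hit) for a prompt."""
        if prompt in self.entries:
            return self.entries[prompt], True
        self.llm_calls += 1
        response = call_llm(prompt)
        # In read-only mode, misses are never written back automatically.
        if not self.read_only:
            self.entries[prompt] = response
        return response, False

    def put(self, prompt, response):
        """Manually add an entry, as an operator would through a cache API."""
        self.entries[prompt] = response

    def delete(self, prompt):
        """Manually remove an entry if it exists."""
        self.entries.pop(prompt, None)
```

In read-only mode, repeated requests keep reaching the LLM until an entry is added explicitly with `put`, after which the pinned response is returned for that prompt.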

  1. Get the Helm values file for your current Gloo Gateway installation.

    helm get values gloo-gateway -n gloo-system -o yaml > gloo-gateway.yaml
    open gloo-gateway.yaml
      
  2. Add the following section to your Helm values file to include the ai-extension-apiserver component. You use this component to manually control cache entries.

    global:
      extensions:
        aiExtension:
          apiServer:
            enabled: true
      
  3. Upgrade your release.

    helm repo update
    helm upgrade -i gloo-gateway glooe/gloo-ee \
      --namespace gloo-system \
      -f gloo-gateway.yaml
      
  4. Check that the ai-extension-apiserver component is running.

    kubectl get deploy -n gloo-system ai-extension-apiserver
      

    Example output:

    NAME                     READY   UP-TO-DATE   AVAILABLE   AGE
    ai-extension-apiserver   1/1     1            1           3h54m
      
  5. Update the semantic caching configuration in the RouteOption resource to enable READ_ONLY mode.

    kubectl apply -f - <<EOF
    apiVersion: gateway.solo.io/v1
    kind: RouteOption
    metadata:
      name: openai-opt
      namespace: gloo-system
    spec:
      targetRefs:
      - group: gateway.networking.k8s.io
        kind: HTTPRoute
        name: openai
      options:
        ai:
          semanticCache:
            mode: READ_ONLY
            datastore:
              redis:
                connectionString: redis://redis-cache:6379
            embedding:
              openai:
                authToken:
                  secretRef:
                    name: openai-secret
                    namespace: gloo-system
    EOF
      
  6. Create an HTTPRoute resource to expose the ai-extension-apiserver component.

     kubectl apply -f - <<EOF
     apiVersion: gateway.networking.k8s.io/v1
     kind: HTTPRoute
     metadata:
       name: cache-service
       namespace: gloo-system
     spec:
       parentRefs:
         - name: ai-gateway
       rules:
       - matches:
         - path:
             type: PathPrefix
             value: /cache
         filters:
           - type: URLRewrite
             urlRewrite:
               path:
                 replacePrefixMatch: /
                 type: ReplacePrefixMatch
         backendRefs:
         - name: ai-extension-apiserver
           namespace: gloo-system
           port: 8000
     EOF
      
  7. Optional: View the Swagger documentation for the ai-extension-apiserver component by navigating to the $INGRESS_GW_ADDRESS:8080/cache/docs endpoint in your web browser.

  8. Send a request to the ai-extension-apiserver component to clear the cached request from the previous section.

    When you interact with the ai-extension-apiserver REST API, all endpoints must include the cache_id path parameter. The cache_id parameter represents the namespace.name of the RouteOption resource that is used to cache the request. In this example, the cache_id is gloo-system.openai-opt.

    curl -X DELETE "$INGRESS_GW_ADDRESS:8080/cache/semantic-cache/gloo-system.openai-opt/contents" \
      -F "model=gpt-3.5-turbo" \
      -F "stream=false"
      
  9. Repeat the request from the previous section. Verify that the x-gloo-semantic-cache: hit header is no longer present in the response.

    curl -v "$INGRESS_GW_ADDRESS:8080/openai" -H content-type:application/json -d '{
      "model": "gpt-3.5-turbo",
      "messages": [
        {
          "role": "user",
          "content": "How many varieties of cheeses are in France?"
        }
      ]
    }'
      

    Example output:

       ...
       {
         "id": "chatcmpl-A1YvB0dmwVem3gsTmpkvnl3QZUIb7",
         "object": "chat.completion",
         "created": 1724935929,
         "model": "gpt-3.5-turbo-0125",
         "choices": [
           {
             "index": 0,
             "message": {
               "role": "assistant",
               "content": "There are over 1,200 different varieties of cheeses in France.",
               "refusal": null
             },
             "logprobs": null,
             "finish_reason": "stop"
           }
         ],
         "usage": {
           "prompt_tokens": 16,
           "completion_tokens": 14,
           "total_tokens": 30
         },
         "system_fingerprint": null
       }
       ...
      
  10. Repeat the same request a few times. Verify that the request is not automatically cached and the x-gloo-semantic-cache: hit header is not returned.

    curl -v "$INGRESS_GW_ADDRESS:8080/openai" -H content-type:application/json -d '{
      "model": "gpt-3.5-turbo",
      "messages": [
        {
          "role": "user",
          "content": "How many varieties of cheeses are in France?"
        }
      ]
    }'
      

    Example output:

       ...
       {
         "id": "chatcmpl-A1YxsmOhZodbcKwoLMy4dqrtk9l7g",
         "object": "chat.completion",
         "created": 1724936096,
         "model": "gpt-3.5-turbo-0125",
         "choices": [
           {
             "index": 0,
             "message": {
               "role": "assistant",
               "content": "It is estimated that there are over 1,200 varieties of cheeses produced in France.",
               "refusal": null
             },
             "logprobs": null,
             "finish_reason": "stop"
           }
         ],
         "usage": {
           "prompt_tokens": 16,
           "completion_tokens": 18,
           "total_tokens": 34
         },
         "system_fingerprint": null
       }
       ...
      
  11. Send a request to the ai-extension-apiserver to add a cache entry for the request that you previously sent.

    echo '{
      "model": "gpt-3.5-turbo",
      "messages": [
        {
          "role": "user",
          "content": "How many varieties of cheeses are in France?"
        }
      ]
    }' > request.json

    echo '{
      "id": "fake",
      "object": "chat.completion",
      "created": 1722966273,
      "model": "gpt-3.5-turbo",
      "choices": [
        {
          "index": 0,
          "message": {
            "role": "assistant",
            "content": "There are many varieties of cheeses in France. Some of the most popular ones include Brie, Camembert, Roquefort, and Comté. Each of these cheeses has a unique flavor and texture, making them a delight for cheese lovers around the world.",
            "refusal": null
          },
          "logprobs": null,
          "finish_reason": "stop"
        }
      ],
      "usage": {
        "prompt_tokens": 11,
        "completion_tokens": 310,
        "total_tokens": 321
      },
      "system_fingerprint": "fp_48196bc67a"
    }' > response.json

    curl -X PUT "$INGRESS_GW_ADDRESS:8080/cache/semantic-cache/gloo-system.openai-opt/contents" -F "req=@request.json" -F "data=@response.json"
      
  12. Send another request to the LLM. Verify that you get back the exact response from the response.json file that you manually added to the cache.

      curl -v "$INGRESS_GW_ADDRESS:8080/openai" -H content-type:application/json -d '{
        "model": "gpt-3.5-turbo",
        "messages": [
          {
            "role": "user",
            "content": "How many varieties of cheeses are in France?"
          }
        ]
      }'
      

    Example output:

       ...
       {
         "id": "fake",
         "object": "chat.completion",
         "created": 1722966273,
         "model": "gpt-3.5-turbo",
         "choices": [
             {
                 "index": 0,
                 "message": {
                     "role": "assistant",
                     "content": "There are many varieties of cheeses in France. Some of the most popular ones include Brie, Camembert, Roquefort, and Comté. Each of these cheeses has a unique flavor and texture, making them a delight for cheese lovers around the world.",
                     "refusal": null
                 },
                 "logprobs": null,
                 "finish_reason": "stop"
             }
         ],
         "usage": {
             "prompt_tokens": 11,
             "completion_tokens": 310,
             "total_tokens": 321
         },
         "system_fingerprint": "fp_48196bc67a"
       }
       ...
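
The DELETE and PUT requests in the steps above both target a URL that embeds the cache_id. As a rough sketch of how that URL is assembled, the helpers below derive it from the RouteOption namespace and name. The function names are hypothetical, the `/cache` prefix assumes the HTTPRoute from this guide, and the `http://` scheme is an assumption based on the plain-HTTP curl commands.

```python
def cache_id(namespace, name):
    """Build the cache_id path parameter: <namespace>.<name> of the RouteOption."""
    return f"{namespace}.{name}"


def contents_url(gateway_address, namespace, name, route_prefix="/cache"):
    """URL of the cache contents endpoint as used in the curl commands of this guide.

    gateway_address is host:port, for example the value of
    $INGRESS_GW_ADDRESS followed by :8080.
    """
    return (
        f"http://{gateway_address}{route_prefix}"
        f"/semantic-cache/{cache_id(namespace, name)}/contents"
    )
```

For the RouteOption in this guide, `cache_id("gloo-system", "openai-opt")` yields `gloo-system.openai-opt`, which matches the path segment in the curl commands.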
      

Next

You can optionally remove the resources that you set up as part of this guide. For steps, see Cleanup.