About semantic caching

Semantic caching stores data based on its meaning. If two prompts sent to the LLM provider are semantically similar, the LLM response for the first prompt can be reused for the second prompt, without sending a request to the LLM. This reduces the number of requests to the LLM provider, improves the response time, and reduces costs.
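
As a rough illustration of what "semantically similar" can mean in practice, the following sketch compares two toy embedding vectors with cosine similarity and treats a score at or above a threshold as a cache hit. The vectors, the 0.8 threshold, and the function names are made up for this example; this is not a description of how Gloo Gateway computes similarity internally.

```shell
# Toy illustration only: cosine similarity between two embedding vectors.
# $1 and $2 are comma-separated lists of numbers with the same length.
cosine_similarity() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    n = split(a, x, ","); split(b, y, ",")
    for (i = 1; i <= n; i++) {
      dot += x[i] * y[i]; na += x[i] * x[i]; nb += y[i] * y[i]
    }
    d = sqrt(na) * sqrt(nb)
    # Guard against zero-length vectors to avoid dividing by zero.
    printf "%.3f\n", (d > 0 ? dot / d : 0)
  }'
}

# Succeeds (exit code 0) when the similarity meets the threshold.
# $3 is an optional, hypothetical threshold that defaults to 0.8.
is_cache_hit() {
  awk -v s="$(cosine_similarity "$1" "$2")" -v t="${3:-0.8}" \
    'BEGIN { exit !(s >= t) }'
}

# Two prompts with nearby (made-up) embeddings would reuse the cached response.
if is_cache_hit "0.9,0.1,0.3" "0.8,0.2,0.4"; then
  echo "similar enough: serve the cached response"
else
  echo "not similar: forward the prompt to the LLM"
fi
```

A higher threshold makes the cache stricter and reduces the chance that a prompt is answered with a response that was cached for a different question.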

In the following tutorial, you configure Redis as the caching datastore and enable semantic caching in Gloo Gateway.

Before you begin

Complete the Authenticate with API keys tutorial.

Set up semantic caching

  1. Deploy either a Redis or Weaviate instance to use as the datastore to cache semantically similar requests.

  2. Send a request to the AI API. Note the x-envoy-upstream-service-time header value in the output of this first request, which indicates the time taken to process the request.

      curl -v "$INGRESS_GW_ADDRESS:8080/openai" -H content-type:application/json -d '{
     "model": "gpt-3.5-turbo",
     "messages": [
       {
         "role": "user",
         "content": "How many varieties of cheeses are in France?"
       }
     ]
    }'
      

    Example output:

      ...
    < x-envoy-upstream-service-time: 842
    ...
      
  3. In your RouteOption resource, add a spec.options.ai.semanticCache section that configures semantic caching with the Redis datastore.

  4. Repeat the request. Verify that the response now includes the header x-gloo-semantic-cache: hit, which indicates that semantic caching was used to respond to the request. Also, notice that the response time is reduced.

      curl -v "$INGRESS_GW_ADDRESS:8080/openai" -H content-type:application/json -d '{
     "model": "gpt-3.5-turbo",
     "messages": [
       {
         "role": "user",
         "content": "How many varieties of cheeses are in France?"
       }
     ]
    }'
      

    Example output:

      ...
    < x-gloo-semantic-cache: hit
    < x-envoy-upstream-service-time: 614
    ...
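
The semanticCache section from step 3 might look like the following sketch. It reuses the resource names, Redis connection string, and secret reference from the READ_ONLY RouteOption shown later in this guide and leaves out the mode field, on the assumption that the default mode both reads from and writes to the cache. The file name routeoption-semantic-cache.yaml is hypothetical.

```shell
# Write the RouteOption manifest to a local file (hypothetical file name).
# Field values are copied from the READ_ONLY example in this guide;
# omitting "mode" is assumed to keep automatic caching enabled.
cat > routeoption-semantic-cache.yaml <<'EOF'
apiVersion: gateway.solo.io/v1
kind: RouteOption
metadata:
  name: openai-opt
  namespace: gloo-system
spec:
  targetRefs:
  - group: gateway.networking.k8s.io
    kind: HTTPRoute
    name: openai
  options:
    ai:
      semanticCache:
        datastore:
          redis:
            connectionString: redis://redis-cache:6379
        embedding:
          openai:
            authToken:
              secretRef:
                name: openai-secret
                namespace: gloo-system
EOF
echo "Wrote routeoption-semantic-cache.yaml"
```

After reviewing the file, you can apply it with kubectl apply -f routeoption-semantic-cache.yaml.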
      

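To compare the two requests without reading the verbose output by eye, a small helper like the following can pull header values out of saved curl -v output. This is a sketch: the header_value function and the file names are hypothetical, and the sample files only replay the example output from the steps above.

```shell
# Print the value of a response header from saved `curl -v` output.
# $1 = header name, $2 = file that contains the verbose output.
header_value() {
  grep -i "^< $1:" "$2" | tail -n 1 | sed 's/^[^:]*:[[:space:]]*//' | tr -d '\r'
}

# Sample data that mirrors the example output in this guide.
tmp=$(mktemp -d)
printf '< x-envoy-upstream-service-time: 842\r\n' > "$tmp/first.txt"
printf '< x-gloo-semantic-cache: hit\r\n< x-envoy-upstream-service-time: 614\r\n' > "$tmp/second.txt"

first=$(header_value x-envoy-upstream-service-time "$tmp/first.txt")
second=$(header_value x-envoy-upstream-service-time "$tmp/second.txt")
cache=$(header_value x-gloo-semantic-cache "$tmp/second.txt")

# Report both timings and whether the second response came from the cache.
echo "first=${first} second=${second} cache=${cache:-miss}"
```

Against a live gateway, you could capture the verbose output by appending 2> first.txt to the curl command, because curl -v writes the response headers to stderr.
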
Manually control the cache

In the previous example, you learned how cache entries are added automatically for semantically similar requests. Because semantic matching can return a cached response for a prompt that the response does not actually answer, you might want to control the cache manually by adding or removing the cache entry for a specific request.

Gloo Gateway comes with a built-in ai-extension-apiserver component that exposes a REST API that you can use to add or remove cache entries for specific requests. The ai-extension-apiserver is deployed by using the Gloo Gateway Helm chart.

  1. Get the Helm values file for your current Gloo Gateway installation.

      helm get values gloo-gateway -n gloo-system -o yaml > gloo-gateway.yaml
    open gloo-gateway.yaml
      
  2. Add the following section to your Helm values file to include the ai-extension-apiserver component. You use this component to manually control cache entries.

      
    global:
      extensions:
        aiExtension:
          apiServer:
            enabled: true
      
  3. Upgrade your release.

      helm repo update
    helm upgrade -i gloo-gateway glooe/gloo-ee \
      --namespace gloo-system \
      -f gloo-gateway.yaml \
      --version=1.18.0-beta2
      
  4. Check that the ai-extension-apiserver component is running.

      kubectl get deploy -n gloo-system ai-extension-apiserver
      

    Example output:

      NAME                     READY   UP-TO-DATE   AVAILABLE   AGE
    ai-extension-apiserver   1/1     1            1           3h54m
      
  5. Update the semantic caching configuration in the RouteOption resource to enable READ_ONLY mode.

      kubectl apply -f - <<EOF
    apiVersion: gateway.solo.io/v1
    kind: RouteOption
    metadata:
      name: openai-opt
      namespace: gloo-system
    spec:
      targetRefs:
      - group: gateway.networking.k8s.io
        kind: HTTPRoute
        name: openai
      options:
        ai:
          semanticCache:
            mode: READ_ONLY
            datastore:
              redis:
                connectionString: redis://redis-cache:6379
            embedding:
              openai:
                authToken:
                  secretRef:
                    name: openai-secret
                    namespace: gloo-system
    EOF
      
  6. Create an HTTPRoute resource to expose the ai-extension-apiserver component.

      kubectl apply -f - <<EOF
    apiVersion: gateway.networking.k8s.io/v1
    kind: HTTPRoute
    metadata:
      name: cache-service
      namespace: gloo-system
    spec:
      parentRefs:
        - name: ai-gateway
      rules:
      - matches:
        - path:
            type: PathPrefix
            value: /cache
        filters:
          - type: URLRewrite
            urlRewrite:
              path:
                replacePrefixMatch: /
                type: ReplacePrefixMatch
        backendRefs:
        - name: ai-extension-apiserver
          namespace: gloo-system
          port: 8000
    EOF
      
  7. Optional: View the Swagger documentation for the ai-extension-apiserver component by navigating to the $INGRESS_GW_ADDRESS:8080/cache/docs endpoint in your web browser.

      open http://$INGRESS_GW_ADDRESS:8080/cache/docs
      
  8. Send a request to the ai-extension-apiserver component to clear the request that was cached in the previous section.

    When you interact with the ai-extension-apiserver REST API, all endpoints must include the cache_id path parameter. The cache_id parameter is the namespace.name of the RouteOption resource that is used to cache the request. In this example, the cache_id is gloo-system.openai-opt.

      curl -X DELETE "$INGRESS_GW_ADDRESS:8080/cache/semantic-cache/gloo-system.openai-opt/contents" \
        -F "model=gpt-3.5-turbo" \
        -F "stream=false"
      
  9. Repeat the request from the previous section. Verify that the x-gloo-semantic-cache: hit header is no longer present in the response.

      curl -v "$INGRESS_GW_ADDRESS:8080/openai" -H content-type:application/json -d '{
      "model": "gpt-3.5-turbo",
      "messages": [
        {
          "role": "user",
          "content": "How many varieties of cheeses are in France?"
        }
      ]
    }'
      

    Example output:

         ...
       {
         "id": "chatcmpl-A1YvB0dmwVem3gsTmpkvnl3QZUIb7",
         "object": "chat.completion",
         "created": 1724935929,
         "model": "gpt-3.5-turbo-0125",
         "choices": [
           {
             "index": 0,
             "message": {
               "role": "assistant",
               "content": "There are over 1,200 different varieties of cheeses in France.",
               "refusal": null
             },
             "logprobs": null,
             "finish_reason": "stop"
           }
         ],
         "usage": {
           "prompt_tokens": 16,
           "completion_tokens": 14,
           "total_tokens": 30
         },
         "system_fingerprint": null
       }
       ...
      
  10. Repeat the same request a few times. Because the cache is now in READ_ONLY mode, verify that the response is not automatically cached and that the x-gloo-semantic-cache: hit header is not returned.

      curl -v "$INGRESS_GW_ADDRESS:8080/openai" -H content-type:application/json -d '{
      "model": "gpt-3.5-turbo",
      "messages": [
        {
          "role": "user",
          "content": "How many varieties of cheeses are in France?"
        }
      ]
    }'
      

    Example output:

         ...
       {
         "id": "chatcmpl-A1YxsmOhZodbcKwoLMy4dqrtk9l7g",
         "object": "chat.completion",
         "created": 1724936096,
         "model": "gpt-3.5-turbo-0125",
         "choices": [
           {
             "index": 0,
             "message": {
               "role": "assistant",
               "content": "It is estimated that there are over 1,200 varieties of cheeses produced in France.",
               "refusal": null
             },
             "logprobs": null,
             "finish_reason": "stop"
           }
         ],
         "usage": {
           "prompt_tokens": 16,
           "completion_tokens": 18,
           "total_tokens": 34
         },
         "system_fingerprint": null
       }
       ...
      
  11. Send a request to the ai-extension-apiserver to add a cache entry for the request that you previously sent.

      echo '{
        "model": "gpt-3.5-turbo",
        "messages": [
          {
            "role": "user",
            "content": "How many varieties of cheeses are in France?"
          }
        ]
      }' > request.json
    
    echo '{
      "id": "fake",
      "object": "chat.completion",
      "created": 1722966273,
      "model": "gpt-3.5-turbo",
      "choices": [
          {
              "index": 0,
              "message": {
                  "role": "assistant",
                  "content": "There are many varieties of cheeses in France. Some of the most popular ones include Brie, Camembert, Roquefort, and Comté. Each of these cheeses has a unique flavor and texture, making them a delight for cheese lovers around the world.",
                  "refusal": null
              },
              "logprobs": null,
              "finish_reason": "stop"
          }
      ],
      "usage": {
          "prompt_tokens": 11,
          "completion_tokens": 310,
          "total_tokens": 321
      },
      "system_fingerprint": "fp_48196bc67a"
    }' > response.json
    
    curl -X PUT "$INGRESS_GW_ADDRESS:8080/cache/semantic-cache/gloo-system.openai-opt/contents" -F "req=@request.json" -F "data=@response.json"
      
  12. Send another request to the LLM. Verify that you get back the exact response from the response.json file that you manually added to the cache.

      curl -v "$INGRESS_GW_ADDRESS:8080/openai" -H content-type:application/json -d '{
        "model": "gpt-3.5-turbo",
        "messages": [
          {
            "role": "user",
            "content": "How many varieties of cheeses are in France?"
          }
        ]
      }'
      

    Example output:

         ...
       {
         "id": "fake",
         "object": "chat.completion",
         "created": 1722966273,
         "model": "gpt-3.5-turbo",
         "choices": [
             {
                 "index": 0,
                 "message": {
                     "role": "assistant",
                     "content": "There are many varieties of cheeses in France. Some of the most popular ones include Brie, Camembert, Roquefort, and Comté. Each of these cheeses has a unique flavor and texture, making them a delight for cheese lovers around the world.",
                     "refusal": null
                 },
                 "logprobs": null,
                 "finish_reason": "stop"
             }
         ],
         "usage": {
             "prompt_tokens": 11,
             "completion_tokens": 310,
             "total_tokens": 321
         },
         "system_fingerprint": "fp_48196bc67a"
       }
       ...
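
If you clear and add cache entries often, the DELETE and PUT calls from the steps above can be wrapped in small shell functions. The following is a sketch: the function names and the GW variable are hypothetical, the localhost fallback is an assumption for when INGRESS_GW_ADDRESS is unset, and the endpoint paths and form fields are copied from the curl commands in this guide.

```shell
# Base address of the gateway; falls back to localhost if the variable is unset.
GW="${INGRESS_GW_ADDRESS:-localhost}:8080"

# Build the cache_id (namespace.name) of a RouteOption resource.
# $1 = namespace, $2 = RouteOption name.
cache_id() {
  printf '%s.%s' "$1" "$2"
}

# Build the contents URL that is exposed through the /cache route.
cache_contents_url() {
  printf '%s/cache/semantic-cache/%s/contents' "$GW" "$(cache_id "$1" "$2")"
}

# Clear cached entries for a model. $3 = model name.
cache_clear() {
  curl -X DELETE "$(cache_contents_url "$1" "$2")" -F "model=$3" -F "stream=false"
}

# Add a cache entry. $3 = request JSON file, $4 = response JSON file.
cache_put() {
  curl -X PUT "$(cache_contents_url "$1" "$2")" -F "req=@$3" -F "data=@$4"
}

# Show the URL that the helpers target for the RouteOption in this guide.
cache_contents_url gloo-system openai-opt; echo
```

For example, cache_clear gloo-system openai-opt gpt-3.5-turbo followed by cache_put gloo-system openai-opt request.json response.json repeats steps 8 and 11.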
      

Next

You can optionally remove the resources that you set up as part of this guide. For steps, see Cleanup.