About retrieval augmented generation (RAG)

Retrieval augmented generation (RAG) is a technique for providing relevant context to a large language model (LLM) by retrieving data from one or more datasets and augmenting the prompt with the retrieved information. This approach helps LLMs generate more accurate and relevant responses and, to a certain extent, prevents hallucinations.
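
At request time, the gateway retrieves the documents that are most similar to the user's question from the datastore and prepends them to the prompt before forwarding the request to the LLM. The following snippet is a simplified sketch of that transformation, not the exact message format that Gloo AI Gateway produces internally:

    # Original user message
    {"role": "user", "content": "How many varieties of cheeses are in France?"}

    # Augmented message that the LLM receives (illustrative only)
    {"role": "user", "content": "Context: <documents retrieved from the vector datastore>\n\nQuestion: How many varieties of cheeses are in France?"}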

In the following tutorial, you configure the vector datastore used for RAG and see how it helps LLMs generate more accurate responses.

Before you begin

Complete the Authenticate with API keys tutorial.
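
The curl commands in this tutorial send requests to the gateway at $INGRESS_GW_ADDRESS. If that environment variable is not still set from the previous tutorial, you can retrieve the address again. The following sketch assumes that the gateway proxy service is named gloo-proxy-ai-gateway in the gloo-system namespace; adjust the service name and namespace to match your environment.

    # Save the externally reachable address of the AI Gateway proxy.
    # The service name and namespace here are assumptions; change them as needed.
    export INGRESS_GW_ADDRESS=$(kubectl get svc gloo-proxy-ai-gateway -n gloo-system \
      -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
    echo $INGRESS_GW_ADDRESS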

Set up a RAG datastore

  1. Deploy a vector database that includes data and embeddings from a website that provides information about French cheeses.

      kubectl apply -f - <<EOF
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: vector-db
      labels:
        app: vector-db
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: vector-db
      template:
        metadata:
          labels:
            app: vector-db
        spec:
          containers:
          - name: db
            image: gcr.io/field-engineering-eu/vector-db
            imagePullPolicy: IfNotPresent
            ports:
            - containerPort: 5432
            env:
            - name: POSTGRES_DB
              value: gloo
            - name: POSTGRES_USER
              value: gloo
            - name: POSTGRES_PASSWORD
              value: gloo
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: vector-db
    spec:
      selector:
        app: vector-db
      ports:
        - protocol: TCP
          port: 5432
          targetPort: 5432
    EOF
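
    Optionally, verify that the database is up and reachable before you continue. These checks use standard kubectl commands:

      # Wait for the vector database deployment to become ready
      kubectl rollout status deployment/vector-db
      # Confirm that the pod is running and that the service has endpoints
      kubectl get pods -l app=vector-db
      kubectl get endpoints vector-db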
      
  2. Send a request without using RAG.

      curl "$INGRESS_GW_ADDRESS:8080/openai" -H content-type:application/json -d '{
       "model": "gpt-4o",
       "messages": [
         {
           "role": "user",
           "content": "How many varieties of cheeses are in France?"
         }
       ]
     }'
      

    Note that the response is verbose and not as accurate as expected. You might get a response similar to the following:

      {
      "id": "chatcmpl-AEJFJIavD5NkGwyduU82sHbpj2fS7",
      "object": "chat.completion",
      "created": 1727973937,
      "model": "gpt-4o-2024-08-06",
      "choices": [
        {
          "index": 0,
          "message": {
            "role": "assistant",
            "content": "France is famous for its vast variety of cheeses, and it's often said that there are over 1,000 different types. This number can vary depending on how cheeses are classified, considering factors like regional variations, aging processes, and even seasonal differences. Charles de Gaulle famously remarked about the difficulty of governing a country with \"246 varieties of cheese,\" but the actual number is considerably higher when all local and artisanal varieties are counted.",
            "refusal": null
          },
          ...
      
  3. In your RouteOption resource, add the following spec.options.ai.rag section to configure the OpenAI route to use the vector database for RAG. The connection string in the datastore settings matches the credentials and service address of the vector database that you deployed earlier. Note that this RouteOption also disables the default 15-second Envoy route timeout. This setting is required to prevent timeout errors when sending requests to an LLM. Alternatively, you can set a timeout that is higher than 15 seconds.

      kubectl apply -f - <<EOF
    apiVersion: gateway.solo.io/v1
    kind: RouteOption
    metadata:
      name: openai-opt
      namespace: gloo-system
    spec:
      targetRefs:
      - group: gateway.networking.k8s.io
        kind: HTTPRoute
        name: openai
      options:
        ai:
          rag:
            datastore:
              postgres:
                connectionString: postgresql+psycopg://gloo:gloo@vector-db.default.svc.cluster.local:5432/gloo
                collectionName: default
            embedding:
              openai:
                authToken:
                  secretRef:
                    name: openai-secret
                    namespace: gloo-system
        timeout: "0"
    EOF
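
    Optionally, confirm that the resource was created. This check uses a standard kubectl command against the RouteOption custom resource:

      # Inspect the RouteOption to verify that the RAG settings were applied
      kubectl get routeoption openai-opt -n gloo-system -o yaml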
      
  4. Repeat the request and verify that the response is now concise and accurate. This time, Gloo AI Gateway uses the RAG options that you set up to automatically attach additional context to the query, which improves the response.

      curl "$INGRESS_GW_ADDRESS:8080/openai" -H content-type:application/json -d '{
       "model": "gpt-4o",
       "messages": [
         {
           "role": "user",
           "content": "How many varieties of cheeses are in France?"
         }
       ]
     }'
      

    Example output:

      {
      "id": "chatcmpl-AGsLfbPZY6Ld2u9PtX473jgdj4KA4",
      "object": "chat.completion",
      "created": 1728585527,
      "model": "gpt-4o-2024-08-06",
      "choices": [
        {
          "index": 0,
          "message": {
            "role": "assistant",
            "content": "France has between 1,000-1,600 varieties of cheese.",
            "refusal": null
          },
    ...
      

Next

Reduce the number of requests sent to the LLM provider, improve the response time, and reduce costs by using semantic caching.