About retrieval augmented generation (RAG)

Retrieval augmented generation (RAG) is a technique that retrieves relevant data from one or more datasets and augments the prompt with the retrieved information. This approach helps LLMs generate more accurate and relevant responses, and to some extent prevents hallucinations.
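Conceptually, the flow looks like this. The following is a minimal, self-contained sketch: the word-count "embedding", toy corpus, and prompt template are stand-ins for illustration only, not what Gloo AI Gateway uses internally.

```python
import math
from collections import Counter

# Toy "embedding": a word-count vector. Real systems use an embedding model,
# such as the OpenAI embedding endpoint configured later in this tutorial.
def embed(text):
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Documents that would live in the vector datastore.
corpus = [
    "France has between 1,000 and 1,600 varieties of cheese.",
    "The Eiffel Tower is 330 metres tall.",
]

def augment(prompt, k=1):
    # Retrieve the k most similar documents and prepend them as context.
    ranked = sorted(corpus, key=lambda d: cosine(embed(prompt), embed(d)), reverse=True)
    context = "\n".join(ranked[:k])
    return f"Context:\n{context}\n\nQuestion: {prompt}"

print(augment("How many varieties of cheeses are in France?"))
```

The LLM then answers the augmented prompt instead of the raw question, so the response is grounded in the retrieved data.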

In this tutorial, you configure a vector datastore for RAG and see how it helps the LLM generate more accurate responses.

Before you begin

Complete the Authenticate with API keys tutorial.

Set up a RAG datastore

  1. Deploy a vector database that includes data and embeddings from a website about French cheeses.

      kubectl apply -f - <<EOF
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: vector-db
      labels:
        app: vector-db
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: vector-db
      template:
        metadata:
          labels:
            app: vector-db
        spec:
          containers:
          - name: db
            image: gcr.io/solo-public/docs/vector-db
            imagePullPolicy: IfNotPresent
            ports:
            - containerPort: 5432
            env:
            - name: POSTGRES_DB
              value: gloo
            - name: POSTGRES_USER
              value: gloo
            - name: POSTGRES_PASSWORD
              value: gloo
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: vector-db
    spec:
      selector:
        app: vector-db
      ports:
        - protocol: TCP
          port: 5432
          targetPort: 5432
    EOF
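The Service that you just created is what the gateway reaches over the cluster network. Because the manifest does not set a namespace, the Service lands in the default namespace and is resolvable in-cluster at vector-db.default.svc.cluster.local. A quick sketch of how that Kubernetes DNS name is composed:

```python
def service_dns(name, namespace="default", domain="cluster.local"):
    # Kubernetes in-cluster DNS: <service>.<namespace>.svc.<cluster-domain>
    return f"{name}.{namespace}.svc.{domain}"

# The host that the RAG connection string uses later in this tutorial.
print(service_dns("vector-db"))  # vector-db.default.svc.cluster.local
```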
      
  2. Send a request without using RAG. Note that the response is verbose and not as accurate as expected.

      curl "$INGRESS_GW_ADDRESS:8080/openai" -H content-type:application/json -d '{
       "model": "gpt-4o",
       "messages": [
         {
           "role": "user",
           "content": "How many varieties of cheeses are in France?"
         }
       ]
     }'
      

    Example output:

      ...
    "France is renowned for its rich cheese-making tradition, and the exact number of cheese varieties can vary depending on how one counts them. Generally, it is often cited that France boasts around 1,000 distinct varieties of cheese. This includes a wide range of types categorized by factors such as their region of origin, milk type (cow, goat, sheep), and production methods. Some of the most famous French cheeses include Brie, Camembert, Roquefort, and Comté, but the diversity extends far beyond these well-known examples."
    ...
      
  3. Configure the OpenAI route to use the vector database for RAG. In the RouteOption resource, add an ai.rag section to spec.options that points to the Postgres datastore and configures the embedding provider.

      kubectl apply -f - <<EOF
    apiVersion: gateway.solo.io/v1
    kind: RouteOption
    metadata:
      name: openai-opt
      namespace: gloo-system
    spec:
      targetRefs:
      - group: gateway.networking.k8s.io
        kind: HTTPRoute
        name: openai
      options:
        ai:
          rag:
            datastore:
              postgres:
                connectionString: postgresql+psycopg://gloo:gloo@vector-db.default.svc.cluster.local:5432/gloo
                collectionName: default
            embedding:
              openai:
                authTokenRef: openai-secret
    EOF
      
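The connectionString follows the standard URL form driver://user:password@host:port/database, with the credentials and database name from the Deployment in step 1. A quick way to sanity-check its parts (Python's urllib.parse handles the postgresql+psycopg scheme like any other URL):

```python
from urllib.parse import urlparse

conn = "postgresql+psycopg://gloo:gloo@vector-db.default.svc.cluster.local:5432/gloo"
url = urlparse(conn)

print(url.username)          # gloo
print(url.password)          # gloo
print(url.hostname)          # vector-db.default.svc.cluster.local (the Service from step 1)
print(url.port)              # 5432
print(url.path.lstrip("/"))  # gloo (the database name)
```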
  4. Repeat the request and verify that the response is now concise and accurate. This time, Gloo AI Gateway uses the RAG options that you set up to automatically attach additional context to the query that improves the response.

      curl "$INGRESS_GW_ADDRESS:8080/openai" -H content-type:application/json -d '{
       "model": "gpt-4o",
       "messages": [
         {
           "role": "user",
           "content": "How many varieties of cheeses are in France?"
         }
       ]
     }'
      

    Example output:

      ...
    "France has between 1,000 and 1,600 varieties of cheese."
    ...
      

Next

Reduce the number of requests sent to the LLM provider, improve the response time, and reduce costs by using semantic caching.