About retrieval augmented generation (RAG)

Retrieval augmented generation (RAG) is a technique for providing relevant context to a large language model (LLM) by retrieving data from one or more datasets and augmenting the prompt with the retrieved information. This approach helps LLMs generate more accurate and relevant responses and, to a certain extent, prevents hallucinations.
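
At request time, the gateway retrieves the documents that are most similar to the user's question from the datastore and prepends them to the prompt before forwarding the request to the LLM. The following snippet is a simplified sketch of that transformation, not the exact message format that Gloo AI Gateway produces internally:

    # Original user message
    {"role": "user", "content": "How many varieties of cheeses are in France?"}

    # Augmented message that the LLM receives (illustrative only)
    {"role": "user", "content": "Context: <documents retrieved from the vector datastore>\n\nQuestion: How many varieties of cheeses are in France?"}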

In the following tutorial, you configure the vector datastore used for RAG and see how it helps LLMs generate more accurate responses.

Before you begin

Complete the Authenticate with API keys tutorial.
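
The curl commands in this tutorial send requests to the gateway at $INGRESS_GW_ADDRESS. If that environment variable is not still set from the previous tutorial, you can retrieve the address again. The following sketch assumes that the gateway proxy service is named gloo-proxy-ai-gateway in the gloo-system namespace; adjust the service name and namespace to match your environment.

    # Save the externally reachable address of the AI Gateway proxy.
    # The service name and namespace here are assumptions; change them as needed.
    export INGRESS_GW_ADDRESS=$(kubectl get svc gloo-proxy-ai-gateway -n gloo-system \
      -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
    echo $INGRESS_GW_ADDRESS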

Set up a RAG datastore

  1. Deploy a vector database that includes data and embeddings from a website that provides information about French cheeses.

      kubectl apply -f - <<EOF
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: vector-db
      labels:
        app: vector-db
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: vector-db
      template:
        metadata:
          labels:
            app: vector-db
        spec:
          containers:
          - name: db
            image: gcr.io/field-engineering-eu/vector-db
            imagePullPolicy: IfNotPresent
            ports:
            - containerPort: 5432
            env:
            - name: POSTGRES_DB
              value: gloo
            - name: POSTGRES_USER
              value: gloo
            - name: POSTGRES_PASSWORD
              value: gloo
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: vector-db
    spec:
      selector:
        app: vector-db
      ports:
        - protocol: TCP
          port: 5432
          targetPort: 5432
    EOF
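
    Optionally, verify that the database is up and reachable before you continue. These checks use standard kubectl commands:

      # Wait for the vector database deployment to become ready
      kubectl rollout status deployment/vector-db
      # Confirm that the pod is running and that the service has endpoints
      kubectl get pods -l app=vector-db
      kubectl get endpoints vector-db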
      
  2. Send a request without using RAG.

      curl "$INGRESS_GW_ADDRESS:8080/openai" -H content-type:application/json -d '{
       "model": "gpt-4o",
       "messages": [
         {
           "role": "user",
           "content": "How many varieties of cheeses are in France?"
         }
       ]
     }'
      

    Note that the response is verbose and not as accurate as expected. You might get a response similar to the following:

      {
      "id": "chatcmpl-AEJFJIavD5NkGwyduU82sHbpj2fS7",
      "object": "chat.completion",
      "created": 1727973937,
      "model": "gpt-4o-2024-08-06",
      "choices": [
        {
          "index": 0,
          "message": {
            "role": "assistant",
            "content": "France is famous for its vast variety of cheeses, and it's often said that there are over 1,000 different types. This number can vary depending on how cheeses are classified, considering factors like regional variations, aging processes, and even seasonal differences. Charles de Gaulle famously remarked about the difficulty of governing a country with \"246 varieties of cheese,\" but the actual number is considerably higher when all local and artisanal varieties are counted.",
            "refusal": null
          },
          ...
      
  3. In your RouteOption resource, add the following spec.options.ai.rag section to configure the OpenAI route to use the vector database for RAG. The connection string in the datastore settings matches the credentials and service address of the vector database that you deployed earlier. Note that this RouteOption also disables the default 15-second Envoy route timeout. This setting is required to prevent timeout errors when sending requests to an LLM. Alternatively, you can set a timeout that is higher than 15 seconds.

      kubectl apply -f - <<EOF
    apiVersion: gateway.solo.io/v1
    kind: RouteOption
    metadata:
      name: openai-opt
      namespace: gloo-system
    spec:
      targetRefs:
      - group: gateway.networking.k8s.io
        kind: HTTPRoute
        name: openai
      options:
        ai:
          rag:
            datastore:
              postgres:
                connectionString: postgresql+psycopg://gloo:gloo@vector-db.default.svc.cluster.local:5432/gloo
                collectionName: default
            embedding:
              openai:
                authToken:
                  secretRef:
                    name: openai-secret
                    namespace: gloo-system
        timeout: "0"
    EOF
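
    Optionally, confirm that the resource was created. This check uses a standard kubectl command against the RouteOption custom resource:

      # Inspect the RouteOption to verify that the RAG settings were applied
      kubectl get routeoption openai-opt -n gloo-system -o yaml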
      
  4. Repeat the request and verify that the response is now concise and accurate. This time, Gloo AI Gateway uses the RAG options that you set up to automatically attach additional context to the query, which improves the response.

      curl "$INGRESS_GW_ADDRESS:8080/openai" -H content-type:application/json -d '{
       "model": "gpt-4o",
       "messages": [
         {
           "role": "user",
           "content": "How many varieties of cheeses are in France?"
         }
       ]
     }'
      

    Example output:

      {
      "id": "chatcmpl-AGsLfbPZY6Ld2u9PtX473jgdj4KA4",
      "object": "chat.completion",
      "created": 1728585527,
      "model": "gpt-4o-2024-08-06",
      "choices": [
        {
          "index": 0,
          "message": {
            "role": "assistant",
            "content": "France has between 1,000-1,600 varieties of cheese.",
            "refusal": null
          },
    ...
      

Next

Reduce the number of requests sent to the LLM provider, improve the response time, and reduce costs by using semantic caching.