Retrieval augmented generation (RAG)
Provide relevant context to the LLM provider by retrieving data from one or more datasets.
About retrieval augmented generation (RAG)
Retrieval augmented generation (RAG) is a technique that retrieves relevant data from one or more datasets and augments the prompt with the retrieved information. This added context helps LLMs generate more accurate and relevant responses and, to a certain extent, prevents hallucinations.
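For example, if the datastore returns a passage about how many cheese varieties France produces, the prompt that is forwarded to the LLM provider might conceptually look like the following. This is an illustration of the technique only; the exact prompt format that the gateway builds internally can differ.

{
  "model": "gpt-4o",
  "messages": [
    {
      "role": "user",
      "content": "Context: France produces between 1,000 and 1,600 varieties of cheese. Question: How many varieties of cheeses are in France?"
    }
  ]
}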
In this tutorial, you configure the vector datastore that is used for RAG and see how it helps the LLM generate more accurate responses.
Before you begin
Complete the Authenticate with API keys tutorial.
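The requests in this tutorial use the $INGRESS_GW_ADDRESS environment variable for the address of the AI Gateway. If you did not save it in the previous tutorial, you can look it up from the gateway service, similar to the following sketch. The service name gloo-proxy-ai-gateway and the gloo-system namespace are assumptions based on a typical setup; adjust them to match your environment.

# Assumed service name and namespace; change these to match your gateway.
export INGRESS_GW_ADDRESS=$(kubectl get svc -n gloo-system gloo-proxy-ai-gateway \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
echo $INGRESS_GW_ADDRESS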
Set up a RAG datastore
Deploy a vector database that includes data and embeddings from a website that provides information about French cheeses.
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vector-db
  labels:
    app: vector-db
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vector-db
  template:
    metadata:
      labels:
        app: vector-db
    spec:
      containers:
      - name: db
        image: gcr.io/field-engineering-eu/vector-db
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 5432
        env:
        - name: POSTGRES_DB
          value: gloo
        - name: POSTGRES_USER
          value: gloo
        - name: POSTGRES_PASSWORD
          value: gloo
---
apiVersion: v1
kind: Service
metadata:
  name: vector-db
spec:
  selector:
    app: vector-db
  ports:
  - protocol: TCP
    port: 5432
    targetPort: 5432
EOF
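The deployment can take a few moments to pull the image and start. Optionally, verify that the database pod is running before you continue. These standard kubectl commands assume that you applied the manifest to the default namespace, as in the example above.

kubectl rollout status deployment/vector-db
kubectl get pods -l app=vector-db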
Send a request without using RAG.
curl "$INGRESS_GW_ADDRESS:8080/openai" -H content-type:application/json -d '{ "model": "gpt-4o", "messages": [ { "role": "user", "content": "How many varieties of cheeses are in France?" } ] }'
Note that the response is verbose and not as accurate as expected. You might get a response similar to the following:
{ "id": "chatcmpl-AEJFJIavD5NkGwyduU82sHbpj2fS7", "object": "chat.completion", "created": 1727973937, "model": "gpt-4o-2024-08-06", "choices": [ { "index": 0, "message": { "role": "assistant", "content": "France is famous for its vast variety of cheeses, and it's often said that there are over 1,000 different types. This number can vary depending on how cheeses are classified, considering factors like regional variations, aging processes, and even seasonal differences. Charles de Gaulle famously remarked about the difficulty of governing a country with \"246 varieties of cheese,\" but the actual number is considerably higher when all local and artisanal varieties are counted.", "refusal": null }, ...
In your RouteOption resource, add the following spec.options.ai.rag section to configure the OpenAI route to use the vector database for RAG.

kubectl apply -f - <<EOF
apiVersion: gateway.solo.io/v1
kind: RouteOption
metadata:
  name: openai-opt
  namespace: gloo-system
spec:
  targetRefs:
  - group: gateway.networking.k8s.io
    kind: HTTPRoute
    name: openai
  options:
    ai:
      rag:
        datastore:
          postgres:
            connectionString: postgresql+psycopg://gloo:gloo@vector-db.default.svc.cluster.local:5432/gloo
            collectionName: default
        embedding:
          openai:
            authToken:
              secretRef:
                name: openai-secret
                namespace: gloo-system
EOF
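Optionally, verify that the RouteOption resource was created with the RAG settings that you applied:

kubectl get routeoption openai-opt -n gloo-system -o yaml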
Repeat the request and verify that the response is now concise and accurate. This time, Gloo AI Gateway uses the RAG options that you set up to automatically attach additional context to the query that improves the response.
curl "$INGRESS_GW_ADDRESS:8080/openai" -H content-type:application/json -d '{ "model": "gpt-4o", "messages": [ { "role": "user", "content": "How many varieties of cheeses are in France?" } ] }'
Example output:
{ "id": "chatcmpl-AGsLfbPZY6Ld2u9PtX473jgdj4KA4", "object": "chat.completion", "created": 1728585527, "model": "gpt-4o-2024-08-06", "choices": [ { "index": 0, "message": { "role": "assistant", "content": "France has between 1,000-1,600 varieties of cheese.", "refusal": null }, ...
Next
Reduce the number of requests sent to the LLM provider, improve the response time, and reduce costs by using semantic caching.