Failover
Prioritize the failover of requests across different models from an LLM provider.
About failover
Failover is a way to keep services running smoothly by automatically switching to a backup system when the main one fails or becomes unavailable.
For AI gateways, you can set up failover for the models of the LLM providers that you want to prioritize. If the main model from one provider becomes unavailable, slows down, or returns errors, the gateway switches to a backup model from that same provider. This keeps the service running without interruptions.
This approach increases the resiliency of your network environment, because apps that call LLMs keep working even if one model has issues.
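The following Bash sketch illustrates priority-ordered failover: models are tried in order, and the next model is used only when the previous one returns a 429 response. It is a simplified, hypothetical simulation. `call_model` and `failover` are made-up names, the status codes are hard-coded, and the script does not reflect how Gloo Gateway implements failover internally.

```shell
# Hypothetical stand-in for a model call: prints the HTTP status code it "returns".
# In this simulation, only gpt-3.5-turbo answers successfully; the others are rate limited (429).
call_model() {
  case "$1" in
    gpt-3.5-turbo) echo 200 ;;
    *) echo 429 ;;
  esac
}

# Try each model in priority order and stop at the first response that is not a 429.
# Returns 1 if every model in the list returned 429.
failover() {
  local model status
  for model in "$@"; do
    status=$(call_model "$model")
    echo "tried $model -> $status"
    if [ "$status" != "429" ]; then
      return 0
    fi
  done
  return 1
}

failover gpt-4o gpt-4.0-turbo gpt-3.5-turbo
```

This prints one line per attempt: `gpt-4o` and `gpt-4.0-turbo` return 429, and `gpt-3.5-turbo` returns 200. Unlike the example app in this guide, which fails all three models, the sketch lets the last model succeed so that the stopping condition is visible.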
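Before you begin
Set up an AI gateway so that the `ai-gateway` Gateway and the `openai-secret` Secret, which this guide references, exist in the `gloo-system` namespace.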
Fail over to other LLM models
In this example, you deploy an example `model-failover` app to your cluster. The app simulates a failure scenario for three models from the OpenAI LLM provider.
Deploy the example `model-failover` app so that you can check that the request fails over to each model in turn.
```shell
kubectl apply -f- <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-failover
  namespace: gloo-system
spec:
  selector:
    matchLabels:
      app: model-failover
  replicas: 1
  template:
    metadata:
      labels:
        app: model-failover
    spec:
      containers:
      - name: model-failover
        image: gcr.io/field-engineering-eu/model-failover:latest
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: model-failover
  namespace: gloo-system
  labels:
    app: model-failover
spec:
  ports:
  - port: 80
    targetPort: 8080
    protocol: TCP
  selector:
    app: model-failover
EOF
```
Verify that the `model-failover` app is running.
```shell
kubectl -n gloo-system rollout status deploy model-failover
```
Create or update the Upstream for your LLM providers. The following example uses the `spec.ai.multi.priorities` setting to configure three pools. Each pool represents a specific model from the LLM provider, and requests fail over in the following order of priority:
- OpenAI `gpt-4o` model
- OpenAI `gpt-4.0-turbo` model
- OpenAI `gpt-3.5-turbo` model

By default, each request is tried 3 times before it is marked as failed. Instead of the actual OpenAI API endpoint, the Upstream uses the `model-failover` app as the destination for requests. For more information, see the MultiPool API reference in the Gloo Edge docs.
```shell
kubectl apply -f - <<EOF
apiVersion: gloo.solo.io/v1
kind: Upstream
metadata:
  labels:
    app: gloo
  name: model-failover
  namespace: gloo-system
spec:
  ai:
    multi:
      priorities:
      - pool:
        - openai:
            model: "gpt-4o"
            customHost:
              host: model-failover.gloo-system.svc.cluster.local
              port: 80
            authToken:
              secretRef:
                name: openai-secret
                namespace: gloo-system
      - pool:
        - openai:
            model: "gpt-4.0-turbo"
            customHost:
              host: model-failover.gloo-system.svc.cluster.local
              port: 80
            authToken:
              secretRef:
                name: openai-secret
                namespace: gloo-system
      - pool:
        - openai:
            model: "gpt-3.5-turbo"
            customHost:
              host: model-failover.gloo-system.svc.cluster.local
              port: 80
            authToken:
              secretRef:
                name: openai-secret
                namespace: gloo-system
EOF
```
Create an HTTPRoute resource that routes incoming traffic on the `/model` path to the Upstream backend that you created in the previous step. In this example, the URLRewrite filter rewrites the path from `/model` to the path of the API in the LLM provider that you want to use, such as `/v1/chat/completions` for OpenAI.
```shell
kubectl apply -f- <<EOF
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: model-failover
  namespace: gloo-system
spec:
  parentRefs:
  - name: ai-gateway
    namespace: gloo-system
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /model
    filters:
    - type: URLRewrite
      urlRewrite:
        path:
          type: ReplaceFullPath
          replaceFullPath: /v1/chat/completions
    backendRefs:
    - name: model-failover
      namespace: gloo-system
      group: gloo.solo.io
      kind: Upstream
EOF
```
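Optionally, confirm that the gateway accepted the route before you continue. This sketch relies on the standard Gateway API route status, in which each parent gateway reports an `Accepted` condition; `route_accepted` is only an illustrative variable name.

```shell
# Read the status of the Accepted condition that each parent gateway reports for the route.
# Expect "True" once ai-gateway has accepted the route.
# '|| true' keeps a script running if kubectl fails, so the (empty) result can still be inspected.
route_accepted=$(kubectl get httproute model-failover -n gloo-system \
  -o jsonpath='{.status.parents[*].conditions[?(@.type=="Accepted")].status}') || true
echo "Accepted: $route_accepted"
```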
Create a RouteOption that applies a retry policy to the `model-failover` route. The policy sets the retry trigger to `retriable-status-codes`, treats `429` responses as retriable, and allows up to `3` retries. The `previousPriorities` setting with `updateFrequency: 1` excludes the priority pools that were already tried, so each retry is sent to the next model in the priority order.
```shell
kubectl apply -f - <<EOF
apiVersion: gateway.solo.io/v1
kind: RouteOption
metadata:
  name: model-failover
  namespace: gloo-system
spec:
  targetRefs:
  - group: gateway.networking.k8s.io
    kind: HTTPRoute
    name: model-failover
  options:
    retries:
      retryOn: 'retriable-status-codes'
      retriableStatusCodes:
      - 429
      numRetries: 3
      previousPriorities:
        updateFrequency: 1
EOF
```
Get the external address of the gateway and save it in an environment variable.
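One way to save the address, assuming that the gateway is exposed through a LoadBalancer service named `gloo-proxy-ai-gateway`. That is the name Gloo Gateway typically generates for a Gateway called `ai-gateway`, but treat it as an assumption and check it with `kubectl get svc -n gloo-system`.

```shell
# Assumption: the gateway's LoadBalancer service is gloo-proxy-ai-gateway in gloo-system.
# The JSONPath reads the hostname or the IP address that the load balancer reports.
export INGRESS_GW_ADDRESS=$(kubectl get svc -n gloo-system gloo-proxy-ai-gateway \
  -o jsonpath="{.status.loadBalancer.ingress[0]['hostname','ip']}")
echo "$INGRESS_GW_ADDRESS"
```

If your cluster has no external load balancer (for example, a local test cluster), you can instead port-forward the gateway service, for example with `kubectl port-forward -n gloo-system svc/gloo-proxy-ai-gateway 8080:8080` (assuming the service exposes port 8080), and set `INGRESS_GW_ADDRESS=localhost`.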
Send a request to observe the failover.
```shell
curl -v "$INGRESS_GW_ADDRESS:8080/model" -H content-type:application/json -d '{
  "model": "gpt-4o",
  "messages": [
    {
      "role": "system",
      "content": "You are a poetic assistant, skilled in explaining complex programming concepts with creative flair."
    },
    {
      "role": "user",
      "content": "Compose a poem that explains the concept of recursion in programming."
    }
  ]
}'
```
Example output: The example `model-failover` app is configured to return a 429 response to simulate a model failure.
```
...
< HTTP/1.1 429 Too Many Requests
```
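When you script this check, it can be simpler to capture only the status code. A minimal sketch, assuming `INGRESS_GW_ADDRESS` is set as described earlier; the shorter request body is an illustrative simplification of the request above, and `status_code` is a hypothetical variable name.

```shell
# -s hides progress, -o /dev/null discards the body, and -w prints only the status code.
# Without --fail, curl exits 0 for HTTP error responses such as 429. It exits non-zero only
# when no HTTP response arrives (it then reports 000); '|| true' keeps the captured value.
status_code=$(curl -s -o /dev/null -w '%{http_code}' \
  "$INGRESS_GW_ADDRESS:8080/model" \
  -H 'content-type: application/json' \
  -d '{"model": "gpt-4o", "messages": [{"role": "user", "content": "Compose a poem that explains the concept of recursion in programming."}]}') || true
echo "status: $status_code"
```

With the example app, the captured status is expected to be 429, matching the verbose output.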
Check the logs of the `model-failover` app to verify that the requests were received in the order of priority, starting with the `gpt-4o` model.
```shell
kubectl logs deploy/model-failover -n gloo-system
```
Example output: Notice the three log lines that correspond to the initial request (sent to model `gpt-4o`) and the two failover requests (sent to models `gpt-4.0-turbo` and `gpt-3.5-turbo`, respectively).
```
{"time":"2024-07-01T17:11:23.994822887Z","level":"INFO","msg":"Request received","msg":"{\"messages\":[{\"content\":\"You are a poetic assistant, skilled in explaining complex programming concepts with creative flair.\",\"role\":\"system\"},{\"content\":\"Compose a poem that explains the concept of recursion in programming.\",\"role\":\"user\"}],\"model\":\"gpt-4o\"}"}
{"time":"2024-07-01T17:11:24.006768184Z","level":"INFO","msg":"Request received","msg":"{\"messages\":[{\"content\":\"You are a poetic assistant, skilled in explaining complex programming concepts with creative flair.\",\"role\":\"system\"},{\"content\":\"Compose a poem that explains the concept of recursion in programming.\",\"role\":\"user\"}],\"model\":\"gpt-4.0-turbo\"}"}
{"time":"2024-07-01T17:11:24.012805385Z","level":"INFO","msg":"Request received","msg":"{\"messages\":[{\"content\":\"You are a poetic assistant, skilled in explaining complex programming concepts with creative flair.\",\"role\":\"system\"},{\"content\":\"Compose a poem that explains the concept of recursion in programming.\",\"role\":\"user\"}],\"model\":\"gpt-3.5-turbo\"}"}
```
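To compare the order of the models at a glance, you can reduce each log line to the requested model. A minimal sketch that assumes the escaped-JSON log format shown in the example output; `extract_models` is a hypothetical helper name, and the `grep` and `sed` patterns would need to change if the app's log format changes.

```shell
# Print the model from each "Request received" log line, in the order the lines appear.
# In these logs the model sits inside an escaped JSON string, as \"model\":\"<name>\".
extract_models() {
  # grep -o keeps only the \"model\":\"<name> part; [^\\]* stops at the next backslash.
  # sed then strips everything up to and including the :\" separator.
  grep -o '\\"model\\":\\"[^\\]*' | sed 's/.*:\\"//'
}
```

Usage: `kubectl logs deploy/model-failover -n gloo-system | extract_models`. For the example logs, this prints `gpt-4o`, `gpt-4.0-turbo`, and `gpt-3.5-turbo` on separate lines.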
Cleanup
You can optionally remove the resources that you set up as part of this guide.
```shell
kubectl delete secret -n gloo-system openai-secret
kubectl delete deployment -n gloo-system model-failover
kubectl delete service -n gloo-system model-failover
kubectl delete upstream -n gloo-system model-failover
kubectl delete httproute -n gloo-system model-failover
kubectl delete routeoption -n gloo-system model-failover
```