Prioritize the failover of requests across different models from an LLM provider.

About failover

Failover is a way to keep services running smoothly by automatically switching to a backup system when the main one fails or becomes unavailable.

For AI gateways, you can set up failover for the models of the LLM providers that you want to prioritize. If the main model from a provider goes down, slows down, or has any other issue, the system quickly switches to a backup model from the same provider. This keeps the service running without interruptions.

This approach increases the resiliency of your network environment by ensuring that apps that call LLMs keep working, even if one model has issues.
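
In Gloo AI Gateway, you express this priority order through pools on an Upstream resource. The following sketch is abbreviated from the full example later in this guide: the first pool has the highest priority, and each later pool is tried only when the pools before it fail.

  spec:
    ai:
      multi:
        priorities:
        # Highest priority: requests go to this pool first.
        - pool:
          - openai:
              model: "gpt-4o"
        # Fallback: used only when the first pool fails.
        - pool:
          - openai:
              model: "gpt-4.0-turbo"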

Before you begin

  1. Set up Gloo AI Gateway.
  2. Authenticate to the LLM provider.
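
     The Upstream in this guide reads the OpenAI API key from a Secret named openai-secret in the gloo-system namespace. If you don't have one yet, the following is a minimal sketch; the Authorization key format is an assumption, so check the authentication guide for the exact format that your setup expects.

      # Sketch: store the OpenAI API key in the Secret that the Upstream
      # references. The "Authorization=Bearer ..." format is an assumption;
      # verify it against the authentication guide.
      kubectl create secret generic openai-secret -n gloo-system \
        --from-literal="Authorization=Bearer $OPENAI_API_KEY"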

Fail over to other LLM models

In this example, you deploy an example model-failover app to your cluster that simulates a failure scenario for three models from the OpenAI LLM provider.

  1. Deploy the example model-failover app. The app returns a 429 response to every request, which lets you verify that requests fail over to each model in turn.

      kubectl apply -f- <<EOF
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: model-failover
      namespace: gloo-system
    spec:
      selector:
        matchLabels:
          app: model-failover
      replicas: 1
      template:
        metadata:
          labels:
            app: model-failover
        spec:
          containers:
            - name: model-failover
              image: gcr.io/field-engineering-eu/model-failover:latest
              imagePullPolicy: IfNotPresent
              ports:
                - containerPort: 8080
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: model-failover
      namespace: gloo-system
      labels:
        app: model-failover
    spec:
      ports:
      - port: 80
        targetPort: 8080
        protocol: TCP
      selector:
        app: model-failover
    EOF
      
  2. Verify that the model-failover app is running.

      kubectl -n gloo-system rollout status deploy model-failover
      
  3. Create or update the Upstream for your LLM providers. The following example uses the spec.ai.multi.priorities setting to configure three pools. Each pool represents a specific model from the LLM provider, and the models fail over in the following order of priority. By default, each request is tried 3 times before it is marked as failed. The Upstream uses the model-failover app as the destination for requests instead of the actual OpenAI API endpoint. For more information, see the MultiPool API reference in the Gloo Edge docs.

    1. OpenAI gpt-4o model
    2. OpenAI gpt-4.0-turbo model
    3. OpenAI gpt-3.5-turbo model

      kubectl apply -f - <<EOF
    apiVersion: gloo.solo.io/v1
    kind: Upstream
    metadata:
      labels:
        app: gloo
      name: model-failover
      namespace: gloo-system
    spec:
      ai:
        multi:
          priorities:
          - pool:
            - openai:
                model: "gpt-4o"
                customHost:
                  host: model-failover.gloo-system.svc.cluster.local
                  port: 80
                authToken:
                  secretRef:
                    name: openai-secret
                    namespace: gloo-system
          - pool:
            - openai:
                model: "gpt-4.0-turbo"
                customHost:
                  host: model-failover.gloo-system.svc.cluster.local
                  port: 80
                authToken:
                  secretRef:
                    name: openai-secret
                    namespace: gloo-system
          - pool:
            - openai:
                model: "gpt-3.5-turbo"
                customHost:
                  host: model-failover.gloo-system.svc.cluster.local
                  port: 80
                authToken:
                  secretRef:
                    name: openai-secret
                    namespace: gloo-system
    EOF
      
  4. Create an HTTPRoute resource that routes incoming traffic on the /model path to the Upstream backend that you created in the previous step. In this example, the URLRewrite filter rewrites the path from /model to the path of the API in the LLM provider that you want to use, such as /v1/chat/completions for OpenAI.

      kubectl apply -f- <<EOF
    apiVersion: gateway.networking.k8s.io/v1
    kind: HTTPRoute
    metadata:
      name: model-failover
      namespace: gloo-system
    spec:
      parentRefs:
        - name: ai-gateway
          namespace: gloo-system
      rules:
      - matches:
        - path:
            type: PathPrefix
            value: /model
        filters:
        - type: URLRewrite
          urlRewrite:
            path:
              type: ReplaceFullPath
              replaceFullPath: /v1/chat/completions
        backendRefs:
        - name: model-failover
          namespace: gloo-system
          group: gloo.solo.io
          kind: Upstream
    EOF   
      
  5. Get the external address of the gateway and save it in an environment variable.
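
     For example, if you created the gateway from a Gateway resource named ai-gateway, the gateway's LoadBalancer Service often follows a gloo-proxy-<gateway name> naming convention. The Service name below is an assumption; adjust it to match your setup.

      # Sketch: look up the external address of the gateway's LoadBalancer
      # Service. The Service name gloo-proxy-ai-gateway is an assumption.
      export INGRESS_GW_ADDRESS=$(kubectl get svc -n gloo-system gloo-proxy-ai-gateway \
        -o jsonpath="{.status.loadBalancer.ingress[0]['hostname','ip']}")
      echo $INGRESS_GW_ADDRESS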

  6. Send a request to observe the failover.

      curl -v "$INGRESS_GW_ADDRESS:8080/model" -H content-type:application/json -d '{
      "model": "gpt-4o",
      "messages": [
        {
          "role": "system",
          "content": "You are a poetic assistant, skilled in explaining complex programming concepts with creative flair."
        },
        {
          "role": "user",
          "content": "Compose a poem that explains the concept of recursion in programming."
        }
      ]
    }'
      

    Example output: Note that the example model-failover app is configured to return a 429 response to simulate a model failure.

      ...
    < HTTP/1.1 429 Too Many Requests
      
  7. Check the logs of the model-failover app to verify that the requests were received in the order of priority, starting with the gpt-4o model.

      kubectl logs deploy/model-failover -n gloo-system
      

    Example output: Notice the three log lines that correspond to the initial request (sent to the gpt-4o model) and the two failover requests (sent to the gpt-4.0-turbo and gpt-3.5-turbo models, respectively).

      {"time":"2024-07-01T17:11:23.994822887Z","level":"INFO","msg":"Request received","msg":"{\"messages\":[{\"content\":\"You are a poetic assistant, skilled in explaining complex programming concepts with creative flair.\",\"role\":\"system\"},{\"content\":\"Compose a poem that explains the concept of recursion in programming.\",\"role\":\"user\"}],\"model\":\"gpt-4o\"}"}
    {"time":"2024-07-01T17:11:24.006768184Z","level":"INFO","msg":"Request received","msg":"{\"messages\":[{\"content\":\"You are a poetic assistant, skilled in explaining complex programming concepts with creative flair.\",\"role\":\"system\"},{\"content\":\"Compose a poem that explains the concept of recursion in programming.\",\"role\":\"user\"}],\"model\":\"gpt-4.0-turbo\"}"}
    {"time":"2024-07-01T17:11:24.012805385Z","level":"INFO","msg":"Request received","msg":"{\"messages\":[{\"content\":\"You are a poetic assistant, skilled in explaining complex programming concepts with creative flair.\",\"role\":\"system\"},{\"content\":\"Compose a poem that explains the concept of recursion in programming.\",\"role\":\"user\"}],\"model\":\"gpt-3.5-turbo\"}"}
      

Cleanup

You can optionally remove the resources that you set up as part of this guide.
  kubectl delete secret -n gloo-system openai-secret
  kubectl delete deployment -n gloo-system model-failover
  kubectl delete service -n gloo-system model-failover
  kubectl delete upstream -n gloo-system model-failover
  kubectl delete httproute -n gloo-system model-failover