Failover
Use a failover policy to determine where to reroute traffic in case of failure.
About
Failover is an important part of building resilient apps in multicluster environments. You set up locality-aware failover by specifying regions, zones, and subzones to reroute traffic. In the event of a failure in the closest locality, responses can be served from the next closest locality.
Failover with other policies
You can use failover, outlier detection, and retry timeout policies together to build a more resilient application network. For example, an outlier detection policy can remove unhealthy destinations, a failover policy can redirect traffic to healthy destinations, and a retry policy can retry requests in case of failure. Review the following table to understand what each policy does.
Policy | Purpose |
---|---|
Failover | Choose destinations to re-route traffic to, based on the closest locality. |
Outlier detection | Determine when and for how long to remove unhealthy destinations from the pool of healthy destinations. |
Retry timeout | Decide how many times to retry requests before the outlier detection policy counts the request as failed and removes the destination from the pool of healthy destinations. |
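As a minimal sketch of how the first two policies pair up, the following resources (drawn from the examples later on this page) apply outlier detection and locality-based failover to the same virtual destinations. A retry timeout policy is configured separately on your routes and is omitted here; the names and localities are placeholders.

```yaml
# Eject destinations that return consecutive errors so that failover can kick in.
apiVersion: resilience.policy.gloo.solo.io/v2
kind: OutlierDetectionPolicy
metadata:
  name: outlier-detection
  namespace: bookinfo
spec:
  applyToDestinations:
  - kind: VIRTUAL_DESTINATION
    selector: {}
  config:
    consecutiveErrors: 2
    interval: 1s
    baseEjectionTime: 30s
    maxEjectionPercent: 100
---
# Reroute traffic that targeted us-east to us-west when us-east becomes unhealthy.
apiVersion: resilience.policy.gloo.solo.io/v2
kind: FailoverPolicy
metadata:
  name: locality-based-failover
  namespace: bookinfo
spec:
  applyToDestinations:
  - kind: VIRTUAL_DESTINATION
    selector: {}
  config:
    localityMappings:
    - from:
        region: us-east
      to:
      - region: us-west
```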
Locality settings
To fine-tune traffic flow according to your operational needs, Istio has several locality settings that you can use. Review the following table to understand these settings and how to use them. These settings are mutually exclusive, so choose the best one for your use case. For more information, see the Istio docs for LocalityLoadBalancerSetting and locality failover.
Istio locality setting | Description | Use case | Configuration in Gloo FailoverPolicy |
---|---|---|---|
distribute | Controls the percentage of traffic sent to endpoints in specific localities. You can set specific load balancing weights across different zones and regions. | Distribute traffic based on locality to enhance user experience in terms of performance, or to manage the load across resources more efficiently. Note that this setting does not fail over traffic, but instead distributes traffic. | In the localityMappings.to section, specify a region or zone to distribute traffic to and set a weight field. The weights must add up to 100 across all the localities included in the mapping. For more information, see the example. |
failover | Specifies an alternative region to redirect traffic to when local endpoints become unhealthy. | Increase high availability and resilience by ensuring traffic is served from the next best locality on failure. Note that this setting does not distribute traffic, but instead redirects traffic in case of failure. | Use the localityMappings section to specify the regions to fail over traffic to. Do not include a weight field, which would change the setting to the Istio distribute load balancing setting. Apply an OutlierDetectionPolicy along with the FailoverPolicy so that unhealthy destinations are removed from the pool of available destinations. For more information, see the example. |
failoverPriority | Sorts endpoints based on labels to prioritize traffic routing in case of multiple zonal, regional, or other failures. | Use for more complex setups to prioritize traffic across multiple clusters or regions based on specific labels. | Use the priorityLabels section to specify the matching labels. You can include only the label key, or the full key-value pair for more exact matching. Apply an OutlierDetectionPolicy along with the FailoverPolicy so that unhealthy destinations are removed from the pool of available destinations. For more information, see the example. |
Failover priority
In Gloo Mesh Enterprise 2.5.5 or 2.6 and later, you can set up failover priority. Failover priority is an Istio locality setting that you configure in a FailoverPolicy with the priorityLabels option.
When a destination becomes unhealthy, the traffic fails over to the next healthy destination based on the priorities that you set. If that destination becomes unhealthy, the traffic fails over to the next prioritized healthy destination, and so on. If one of the previously unhealthy destinations becomes healthy again, traffic resumes to that destination.
However, failover priority has the following limitations:
- Prioritizing with label key-value pairs (such as topology.kubernetes.io/region=us-west) is not available in Istio 1.17 or earlier. You can use the label key instead (such as topology.kubernetes.io/region).
- Prioritizing workloads in remote clusters: Prioritization for remote endpoints is not done via individual workload labels, but rather by common workload labels across all backing workloads and topology labels on the east-west gateway.
- Removing unhealthy workloads in remote clusters: Outlier detection can eject the east-west gateway in some failover situations, leading to failover not working as expected.
Priority limitation for remote clusters
You cannot prioritize failover to specific workloads in remote clusters. This limitation is due to how failover is prioritized via labels and how east-west traffic routing works.
You configure failover priority by using key-value labels that select the virtual destination for failover traffic. During translation of the virtual destination, Gloo Mesh Enterprise creates an Istio ServiceEntry for workloads that are in the same cluster or WorkloadEntries for workloads in remote clusters. These translated Istio resources get the same failover labels as the virtual destination.
This setup works for failover priority in the same cluster, where traffic is failed over to the ServiceEntry based on the labels that you set. However, for traffic that fails over to a remote cluster, traffic passes through the east-west gateway before reaching the backing WorkloadEntry. Although the WorkloadEntries have the failover labels, the east-west gateway typically does not. Therefore, if you set a priority level such as zero (0, the top priority level) for a certain key-value label, the east-west gateway does not match the label and so does not get the prioritized failover traffic.
To avoid this scenario, you can use simple topology labels. If the east-west gateway shares the same topology labels as the backing workload in the remote cluster, and the topology labels are used for failover priority, then the failover priority works as expected.
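For example, Istio derives a pod’s locality from the topology labels of the node that it runs on. If the nodes that host the east-west gateway and the backing workloads in the remote cluster carry the same topology labels, as in the following sketch (the node name is hypothetical), then a failover priority that uses those topology labels matches the gateway as well as the workloads.

```yaml
# Node in the remote cluster. Both the east-west gateway pod and the backing
# workload scheduled onto nodes like this inherit the same region and zone,
# so topology-based priority labels match the gateway and the workloads alike.
apiVersion: v1
kind: Node
metadata:
  name: remote-worker-1   # hypothetical node name
  labels:
    topology.kubernetes.io/region: us-west
    topology.kubernetes.io/zone: us-west-1
```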
Unhealthy workloads limitation in remote clusters
Outlier detection operates at Layer 7, looking for HTTP status codes to find outliers. For client requests to services in remote clusters, the traffic passes through the east-west gateway. However, east-west gateways perform TLS passthrough at Layer 4 by default. L4 traffic does not return HTTP status codes, only simple L4 errors such as connection refused. Therefore, the client gets back an error and detects an outlier coming from the east-west gateway instead of the remote cluster service. In this case, the east-west gateway gets ejected because the client proxy cannot differentiate between the east-west gateway and the remote cluster service. When the east-west gateway is ejected, all of its backing services in the remote cluster can no longer be reached, even if the others are healthy.

As such, apply an outlier detection policy to services in the same cluster or to external services. This way, the services are removed from the pool of healthy destinations more predictably.
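One way to scope the policy this way is to select in-cluster Kubernetes services instead of virtual destinations. The following sketch assumes that your policy API version supports a kind: SERVICE destination selector with label matching; adjust the labels to your own services.

```yaml
apiVersion: resilience.policy.gloo.solo.io/v2
kind: OutlierDetectionPolicy
metadata:
  name: outlier-detection-local
  namespace: bookinfo
spec:
  applyToDestinations:
  # Select same-cluster services by label so that the east-west gateway
  # is not part of the destinations that can be ejected.
  - kind: SERVICE
    selector:
      labels:
        app: reviews
  config:
    consecutiveErrors: 2
    interval: 1s
    baseEjectionTime: 30s
    maxEjectionPercent: 100
```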
Failover priority scenario
Review the following diagrams to learn how failover priority works.
1. Initial setup
To begin with, you have the following setup:
- You have two clusters.
- Each cluster has two nodes in different availability zones (AZ 1 and AZ 2).
- The VirtualDestination in Cluster 1 spans both clusters.
- Each cluster has a backing workload for the virtual destination in each node, for a total of 4 backing workloads.
- The client proxy that makes requests is in Cluster 1, AZ 1. Therefore, Cluster 1 is the local cluster and Cluster 2 is the remote cluster accessed via the east-west gateway.
2. Priority labels
The following scenario shows the policies that you create for the virtual destination. These include an outlier detection policy and a failover policy. In the failover policy, you configure the following failover priorities via topology labels. Both clusters have the same region label. The zone labels map to each cluster. The subzone labels map to the availability zone of each node in the clusters.
```yaml
...
priorityLabels:
  labels:
  - topology.kubernetes.io/region
  - topology.kubernetes.io/zone
  - topology.istio.io/subzone
```
By default, the failover policy routes all traffic to the backing workload that matches the most labels. In this scenario, the workload is in Cluster 1, AZ 1.
3. Zonal failure
The following scenario shows a failure in the client’s availability zone, AZ 1. As such, traffic fails over to the other zone in the same cluster: Cluster 1, AZ 2.
4. Cluster failure
In the following scenario, both availability zones in the client’s cluster experience a failure: Cluster 1, AZ 1 and AZ 2. As such, traffic fails over to Cluster 2, AZ 1.
5. Cluster and zonal failure
The following scenario shows failures both across the client’s cluster and across the client’s availability zone: all of Cluster 1 is unhealthy, and AZ 1 is unhealthy in both clusters.
Now, you might expect that the traffic would fail over to the healthy workload in Cluster 2, AZ 2. However, due to the failover and outlier detection limitation, that scenario does not happen.
Instead, the east-west gateway is ejected because the client in Cluster 1, AZ 1 gets a failure reported back. The client proxy cannot differentiate between the east-west gateway and the unhealthy backing service, and so it ejects the east-west gateway and all its backing destinations. Traffic does not fail over to the healthy workload in Cluster 2, AZ 2.
Before you begin
This guide assumes that you use the same names for components like clusters, workspaces, and namespaces as in the getting started guide. If you use different names, make sure to update the sample configuration files in this guide.
Complete the multicluster getting started guide to set up the following testing environment.
- Three clusters along with environment variables for the clusters and their Kubernetes contexts.
- The Gloo meshctl CLI, along with other CLI tools such as kubectl and istioctl.
- The Gloo management server in the management cluster, and the Gloo agents in the workload clusters.
- Istio installed in the workload clusters.
- A simple Gloo workspace setup.
- Bookinfo and other sample apps installed.
If you import or export resources across workspaces, your policies might not apply. For more information, see Import and export policies.
Configure failover policies
You can apply a failover policy at the destination level. For more information, see Applying policies. Note that for one destination, you cannot apply both a failover policy that specifies zones and subzones and a failover policy that only specifies regions. For one destination, you can specify multiple failover policies that specify zones and subzones, or multiple that specify regions. However, ensure that the configuration does not overlap between multiple policies.
For example, if one failover policy reroutes traffic from us-east-1 to us-east-2, and another reroutes traffic from us-east-2 to eu-west-1, the configurations do not overlap. But if one failover policy reroutes traffic from us-east-1 to us-east-2, and another reroutes traffic from us-east-1 to eu-west-1, then the configurations overlap, and traffic might not be correctly rerouted.
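For example, the non-overlapping case could be expressed as two policies like the following sketch, where the policy names and regions are illustrative.

```yaml
# Policy 1: traffic originally destined for us-east-1 fails over to us-east-2.
apiVersion: resilience.policy.gloo.solo.io/v2
kind: FailoverPolicy
metadata:
  name: failover-us-east-1
  namespace: bookinfo
spec:
  applyToDestinations:
  - kind: VIRTUAL_DESTINATION
    selector: {}
  config:
    localityMappings:
    - from:
        region: us-east-1
      to:
      - region: us-east-2
---
# Policy 2: traffic originally destined for us-east-2 fails over to eu-west-1.
# The 'from' localities differ, so the two policies do not overlap.
apiVersion: resilience.policy.gloo.solo.io/v2
kind: FailoverPolicy
metadata:
  name: failover-us-east-2
  namespace: bookinfo
spec:
  applyToDestinations:
  - kind: VIRTUAL_DESTINATION
    selector: {}
  config:
    localityMappings:
    - from:
        region: us-east-2
      to:
      - region: eu-west-1
```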
This policy currently does not support selecting ExternalServices as a destination.
Distribute with weight example
Review the following example that distributes traffic evenly across the us-east and us-west regions.
```yaml
apiVersion: resilience.policy.gloo.solo.io/v2
kind: FailoverPolicy
metadata:
  annotations:
    cluster.solo.io/cluster: ""
  name: locality-based-failover
  namespace: bookinfo
spec:
  applyToDestinations:
  - kind: VIRTUAL_DESTINATION
    selector: {}
  config:
    localityMappings:
    - from:
        region: us-east
      to:
      - region: us-west
        weight: 50
      - region: us-east
        weight: 50
```
Review the following table to understand this configuration. For more information, see the API docs and the Istio docs.
Setting | Description |
---|---|
applyToDestinations | Use labels to apply the policy to destinations. Destinations might be a Kubernetes service, VirtualDestination, or ExternalService (if supported by the policy). If you do not specify any destinations or routes, the policy applies to all destinations in the workspace by default. If you do not specify any destinations but you do specify a route, the policy applies to the route but to no destinations. |
localityMappings | Map the localities to fail over traffic from one region, zone, or subzone to another in case of failure. The locality is determined by the Kubernetes labels on the node where the destination’s app runs. For more information, see the Istio docs. |
from | The locality of the destination where Gloo Mesh Enterprise originally tried to fulfill the request. In this example, the policy distributes traffic for the us-east region. |
to | The localities of the destination where Gloo Mesh Enterprise can distribute requests. You must specify the region, and optionally the zone and subzone. Include the original region to keep the region in the distribution. In this example, the policy distributes traffic in an equal 50/50 split between the us-east and us-west regions. |
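The to entries can also combine weights with zone-level localities. The following sketch, with placeholder region, zone, and weight values, keeps most traffic in a specific zone of the local region and sends the rest to another region. As in the example above, the weights across all entries in the mapping should add up to 100.

```yaml
apiVersion: resilience.policy.gloo.solo.io/v2
kind: FailoverPolicy
metadata:
  name: weighted-distribution
  namespace: bookinfo
spec:
  applyToDestinations:
  - kind: VIRTUAL_DESTINATION
    selector: {}
  config:
    localityMappings:
    - from:
        region: us-east
      to:
      # Keep 80% of the traffic in a specific zone of the local region.
      - region: us-east
        zone: us-east-1
        weight: 80
      # Send the remaining 20% to us-west.
      - region: us-west
        weight: 20
```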
Failover example
Review the following example that fails over traffic from the us-east region to the us-west region in case of a failure.
```yaml
apiVersion: resilience.policy.gloo.solo.io/v2
kind: FailoverPolicy
metadata:
  annotations:
    cluster.solo.io/cluster: ""
  name: locality-based-failover
  namespace: bookinfo
spec:
  applyToDestinations:
  - kind: VIRTUAL_DESTINATION
    selector: {}
  config:
    localityMappings:
    - from:
        region: us-east
      to:
      - region: us-west
```
Review the following table to understand this configuration. For more information, see the API docs and the Istio docs.
Setting | Description |
---|---|
applyToDestinations | Use labels to apply the policy to destinations. Destinations might be a Kubernetes service, VirtualDestination, or ExternalService (if supported by the policy). If you do not specify any destinations or routes, the policy applies to all destinations in the workspace by default. If you do not specify any destinations but you do specify a route, the policy applies to the route but to no destinations. |
localityMappings | Map the localities to fail over traffic from one region, zone, or subzone to another in case of failure. The locality is determined by the Kubernetes labels on the node where the destination’s app runs. For more information, see the Istio docs. |
from | The locality of the destination where Gloo Mesh Enterprise originally tried to fulfill the request. In this example, the policy fails over traffic from any destinations served in the us-east region. |
to | The localities of the destination where Gloo Mesh Enterprise can reroute requests. You must specify the region, and optionally the zone and subzone. In this example, the policy reroutes traffic to any matching destinations only in the us-west region. |
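Because localityMappings can also include zones and subzones, a more granular failover mapping might look like the following sketch. The zone values are placeholders and must match the topology labels on your nodes.

```yaml
apiVersion: resilience.policy.gloo.solo.io/v2
kind: FailoverPolicy
metadata:
  name: zone-based-failover
  namespace: bookinfo
spec:
  applyToDestinations:
  - kind: VIRTUAL_DESTINATION
    selector: {}
  config:
    localityMappings:
    # Traffic that targeted zone us-east-1 in the us-east region fails over
    # to zone us-east-2 in the same region.
    - from:
        region: us-east
        zone: us-east-1
      to:
      - region: us-east
        zone: us-east-2
```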
Failover priority example
Review the following example that prioritizes destinations for failover traffic based on key-value labels. Destinations that match the most labels are prioritized first.
Based on the example, the order of prioritized traffic could be:
- All labels match: Traffic is sent first to an app that has labels to denote compatibility with both v1 and v4, as well as a label that says it runs in the a subzone rack in your data center.
- Two labels match: In case of failure or no match, traffic is sent next to a v1 or v4 app in the a subzone rack in your data center.
- One label matches: In case of failure or no match, traffic is sent next to an available destination with the v1, v4, or subzone a label.
This priority order is provided as an example for illustrative purposes. In practice, your configuration might vary. In particular, for multicluster use cases, failover uses the labels of the east-west gateway, which often does not have the same labels as the app’s pod. Review the failover priority limitations and compare your scenario with the Istio docs.
```yaml
apiVersion: resilience.policy.gloo.solo.io/v2
kind: FailoverPolicy
metadata:
  annotations:
    cluster.solo.io/cluster: ""
  name: locality-based-failover-with-priority
  namespace: bookinfo
spec:
  applyToDestinations:
  - kind: VIRTUAL_DESTINATION
    selector: {}
  config:
    priorityLabels:
      labels:
      - version=v1
      - version=v4
      - topology.istio.io/subzone=a
```
Review the following table to understand this configuration. For more information, see the API docs and the Istio docs.
Setting | Description |
---|---|
applyToDestinations | Use labels to apply the policy to destinations. Destinations might be a Kubernetes service, VirtualDestination, or ExternalService (if supported by the policy). If you do not specify any destinations or routes, the policy applies to all destinations in the workspace by default. If you do not specify any destinations but you do specify a route, the policy applies to the route but to no destinations. |
priorityLabels | Prioritize destinations to fail over traffic to by configuring priority labels. In general, destinations that match the most labels have higher priority during failover. For more information about priority rules, see the failoverPriority setting in the Istio docs. When using priority labels, you must specify either an ordered list of label keys or an ordered list of label key-value pairs. You cannot have an ordered list that includes both label keys and label key-value pairs. Note: In Istio 1.17 or earlier, you cannot use label key-value pairs (such as version=v1) for failover priority, only the label key (such as version). |
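For Istio 1.17 or earlier, where key-value pairs are not supported, the same kind of policy can list only label keys, as in the following sketch. The keys are illustrative; endpoints whose values match the client’s values for the most keys, in order, get the highest priority.

```yaml
apiVersion: resilience.policy.gloo.solo.io/v2
kind: FailoverPolicy
metadata:
  name: locality-based-failover-with-priority
  namespace: bookinfo
spec:
  applyToDestinations:
  - kind: VIRTUAL_DESTINATION
    selector: {}
  config:
    priorityLabels:
      labels:
      # Label keys only, no values (required for Istio 1.17 or earlier).
      - version
      - topology.istio.io/subzone
```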
Verify failover policies
You can test how failover works by opening the Bookinfo app in your browser and observing the reviews app behavior after applying various resources.
Verify that your clusters have topology.kubernetes.io/region locality labels. If not, see Configure the locality labels for nodes.

```sh
kubectl get nodes --context $REMOTE_CONTEXT1 -o jsonpath='{.items[*].metadata.labels}'
kubectl get nodes --context $REMOTE_CONTEXT2 -o jsonpath='{.items[*].metadata.labels}'
```
Create a virtual destination for the reviews app. The virtual destination enables multicluster traffic routing.
```sh
kubectl --context ${REMOTE_CONTEXT1} apply -f - <<EOF
apiVersion: networking.gloo.solo.io/v2
kind: VirtualDestination
metadata:
  annotations:
    cluster.solo.io/cluster: ""
  name: reviews-global
  namespace: bookinfo
spec:
  hosts:
  - reviews.global
  ports:
  - number: 80
    protocol: HTTP
    targetPort:
      name: http
  services:
  - labels:
      app: reviews
EOF
```
Create an outlier detection policy to use with the failover policy so that unhealthy destinations are removed. The outlier detection policy also ensures that requests are routed to the closest locality.
```sh
kubectl --context ${REMOTE_CONTEXT1} apply -f - <<EOF
apiVersion: resilience.policy.gloo.solo.io/v2
kind: OutlierDetectionPolicy
metadata:
  annotations:
    cluster.solo.io/cluster: ""
  name: outlier-detection
  namespace: bookinfo
spec:
  applyToDestinations:
  - kind: VIRTUAL_DESTINATION
    selector: {}
  config:
    baseEjectionTime: 30s
    consecutiveErrors: 2
    interval: 1s
    maxEjectionPercent: 100
EOF
```
Create your failover policy. The following policy uses the failover example to redirect traffic from us-east to us-west in case of failure. If your clusters have different region labels than us-east and us-west, update those values accordingly.

```sh
kubectl --context ${REMOTE_CONTEXT1} apply -f - <<EOF
apiVersion: resilience.policy.gloo.solo.io/v2
kind: FailoverPolicy
metadata:
  annotations:
    cluster.solo.io/cluster: ""
  name: locality-based-failover
  namespace: bookinfo
spec:
  applyToDestinations:
  - kind: VIRTUAL_DESTINATION
    selector: {}
  config:
    localityMappings:
    - from:
        region: us-east
      to:
      - region: us-west
EOF
```
Send a request to the reviews app from the ratings app several times. Notice that although the virtual destination serves all 3 reviews versions, you only get responses with no stars (v1) and black stars (v2) from the cluster-1 cluster because the outlier detection forces all requests to be routed to the closest locality.

```sh
kubectl exec $(kubectl get pod -l app=ratings -n bookinfo -o jsonpath='{.items[].metadata.name}' --context ${REMOTE_CONTEXT1}) -n bookinfo -c ratings --context ${REMOTE_CONTEXT1} -- curl -sS reviews.global:80/reviews/1 -v
```
Send the reviews v1 and v2 apps in cluster-1 to sleep, to mimic an app failure in a locality.

```sh
kubectl --context ${REMOTE_CONTEXT1} -n bookinfo patch deploy reviews-v1 --patch '{"spec":{"template":{"spec":{"containers":[{"name":"reviews","command":["sleep","20h"]}]}}}}'
kubectl --context ${REMOTE_CONTEXT1} -n bookinfo patch deploy reviews-v2 --patch '{"spec":{"template":{"spec":{"containers":[{"name":"reviews","command":["sleep","20h"]}]}}}}'
```
Repeat the request to the reviews app. Notice that you get responses with only red stars (v3). The unhealthy reviews v1 and v2 apps are removed, and the traffic fails over to v3 in the locality that the failover policy specifies.
```sh
kubectl exec $(kubectl get pod -l app=ratings -n bookinfo -o jsonpath='{.items[].metadata.name}' --context ${REMOTE_CONTEXT1}) -n bookinfo -c ratings --context ${REMOTE_CONTEXT1} -- curl -sS reviews.global:80/reviews/1 -v
```
Cleanup
You can optionally remove the resources that you set up as part of this guide.

- Remove the sleep command from the reviews apps to restore normal behavior.

  ```sh
  kubectl --context ${REMOTE_CONTEXT1} -n bookinfo patch deploy reviews-v1 --patch '{"spec":{"template":{"spec":{"containers":[{"name":"reviews","command":[]}]}}}}'
  kubectl --context ${REMOTE_CONTEXT1} -n bookinfo patch deploy reviews-v2 --patch '{"spec":{"template":{"spec":{"containers":[{"name":"reviews","command":[]}]}}}}'
  ```

- Clean up the Gloo resources that you created.

  ```sh
  kubectl --context $REMOTE_CONTEXT1 -n bookinfo delete VirtualDestination reviews-global
  kubectl --context $REMOTE_CONTEXT1 -n bookinfo delete OutlierDetectionPolicy outlier-detection
  kubectl --context $REMOTE_CONTEXT1 -n bookinfo delete FailoverPolicy locality-based-failover
  ```