Multicluster zone and region failover

Enterprise

Configure zone and region-aware failover for global services in a multicluster ambient mesh.

About this guide

In a multicluster ambient mesh, locality (region, zone, network) drives load balancing and failover decisions. This guide shows you how to configure zone and region-aware failover for global services, covering both L4 failover with ztunnel and L7 failover with waypoints using explicit failover priority.

For conceptual information about multicluster load balancing and failover, see the load balancing and failover overview.

Before you begin

  1. To demonstrate zone and region-aware failover, this guide requires the following cluster topology:

    • Two clusters deployed in different regions (for example, us-east-1 and us-west-2)
    • Multiple nodes per cluster in different zones within each region (for example, us-east-1a and us-east-1b in cluster 1)

    This topology allows you to test both cross-zone failover within a region and cross-region failover between clusters.

  2. Ensure that the following locality labels are set on nodes in each cluster. For guidance on setting locality labels, see the Kubernetes topology and Istio locality documentation.

    • topology.kubernetes.io/region
    • topology.kubernetes.io/zone
  3. Save the kubeconfig context of each cluster in the multicluster mesh as an environment variable.

    export context1=<cluster1_context>
    export context2=<cluster2_context>
  4. Set up a multicluster ambient mesh with the Gloo Operator or Helm.
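
Managed Kubernetes services usually set the locality labels from step 2 automatically. On local or self-managed clusters you might need to set them yourself. The following is a minimal sketch under that assumption; the helper name `label_node_locality` and the example node, region, and zone values are hypothetical and not part of the original guide.

```shell
# Hypothetical helper: set both locality labels on a single node.
# Usage: label_node_locality <context> <node> <region> <zone>
label_node_locality() {
  kubectl --context="$1" label node "$2" \
    "topology.kubernetes.io/region=$3" \
    "topology.kubernetes.io/zone=$4" \
    --overwrite
}
```

For example, `label_node_locality "${context1}" node-1 us-east-1 us-east-1a` labels one node in cluster 1. The `--overwrite` flag replaces any existing values, so check the current labels before you run it on a shared cluster.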

Step 1: Verify locality labels

Verify that locality labels are set on nodes and that endpoints inherit locality correctly.

  1. Check node labels in each cluster.

    for ctx in ${context1} ${context2}; do
      echo "=== Cluster: $ctx ==="
      kubectl --context=$ctx get nodes -o custom-columns=\
    'NAME:.metadata.name,REGION:.metadata.labels.topology\.kubernetes\.io/region,ZONE:.metadata.labels.topology\.kubernetes\.io/zone'
    done

    Example output:

    === Cluster: cluster1 ===
    NAME                  REGION      ZONE
    node-1                us-east-1   us-east-1a
    node-2                us-east-1   us-east-1b
    === Cluster: cluster2 ===
    NAME                  REGION      ZONE
    node-1                us-west-2   us-west-2a
    node-2                us-west-2   us-west-2b
  2. Review the remote peering gateways to verify that they have locality labels. In an east-west gateway setup, the mesh uses these gateway labels to determine the locality of all remote endpoints exposed through that gateway.

    kubectl --context=${context1} get deploy -n istio-system istio-eastwestgateway -o yaml | grep -A3 "topology.istio.io"
    kubectl --context=${context2} get deploy -n istio-system istio-eastwestgateway -o yaml | grep -A3 "topology.istio.io"

    Example output showing locality labels on the gateway in cluster 1:

    topology.istio.io/network: cluster1-network
    topology.kubernetes.io/region: us-east-1
    topology.kubernetes.io/zone: us-east-1a

    Example output showing locality labels on the gateway in cluster 2:

    topology.istio.io/network: cluster2-network
    topology.kubernetes.io/region: us-west-2
    topology.kubernetes.io/zone: us-west-2a
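
When a cluster has many nodes, scanning the table from step 1 by eye is error-prone. The following sketch filters that output for nodes that lack a region or zone label. It relies on kubectl printing `<none>` in a custom column when the label is not set; the helper name `find_unlabeled_nodes` is hypothetical.

```shell
# Hypothetical helper: print the names of nodes that lack a region or zone label.
# Reads the output of the node check in step 1 on stdin. Header rows and the
# "=== Cluster: ... ===" separator lines are skipped.
find_unlabeled_nodes() {
  awk '$1 != "NAME" && $1 != "===" && ($2 == "<none>" || $3 == "<none>") { print $1 }'
}
```

Pipe the loop from step 1 into `find_unlabeled_nodes`. Empty output means that every listed node has both labels.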

Step 2: Deploy global services across clusters

Deploy the httpbin sample app as a global service in both clusters to test multicluster load balancing and failover.

Multicluster topology: The following diagram shows the multicluster setup with global services. Each cluster has local endpoints, and the global service hostname (in-ambient.httpbin.mesh.internal) is accessible from both clusters. Locality labels (region, zone) drive load balancing and failover decisions.

    graph LR
    subgraph Cluster2["Cluster 2 (us-west-2)"]
        direction TB
        Client2[client-in-ambient]
        Ztunnel2[Client ztunnel]
        Backend2["in-ambient pod<br/>(us-west-2a)"]
        Client2 -->|"Global hostname<br/>(in-ambient.httpbin.<br/>mesh.internal)"| Ztunnel2
        Ztunnel2 --> Backend2
    end

    subgraph Cluster1["Cluster 1 (us-east-1)"]
        direction TB
        Client1[client-in-ambient]
        Ztunnel1[Client ztunnel]
        Backend1["in-ambient pod<br/>(us-east-1a)"]
        Client1 -->|"Global hostname<br/>(in-ambient.httpbin.<br/>mesh.internal)"| Ztunnel1
        Ztunnel1 --> Backend1
    end

    Ztunnel2 -.->|Failover when<br/>local unavailable| Backend1
    Ztunnel1 -.->|Failover when<br/>local unavailable| Backend2
  
  1. Deploy the in-ambient httpbin sample app in both clusters. This manifest creates the httpbin namespace with an in-ambient backend service.

    for ctx in ${context1} ${context2}; do
      kubectl --context=$ctx apply -f https://raw.githubusercontent.com/solo-io/doc-examples/main/istio/sample-apps/in-ambient.yaml
    done
  2. Deploy the client-in-ambient client app in both clusters.

    for ctx in ${context1} ${context2}; do
      kubectl --context=$ctx apply -f https://raw.githubusercontent.com/solo-io/doc-examples/main/istio/sample-apps/client-in-ambient.yaml
    done
  3. Verify that the pods are running in both clusters.

    for ctx in ${context1} ${context2}; do
      echo "=== Cluster: $ctx ==="
      kubectl --context=$ctx get pods -n httpbin
    done

    Example output:

    === Cluster: cluster1 ===
    NAME                                 READY   STATUS    RESTARTS   AGE
    client-in-ambient-6b5c96c4f8-x2j9k   1/1     Running   0          30s
    in-ambient-7d8f9b6c54-abc12          1/1     Running   0          45s
    === Cluster: cluster2 ===
    NAME                                 READY   STATUS    RESTARTS   AGE
    client-in-ambient-6b5c96c4f8-y3k0l   1/1     Running   0          30s
    in-ambient-7d8f9b6c54-def34          1/1     Running   0          45s
  4. Label the httpbin namespace to add the apps to the ambient mesh.

    for ctx in ${context1} ${context2}; do
      kubectl --context=$ctx label ns httpbin istio.io/dataplane-mode=ambient
    done
  5. Label the in-ambient service with solo.io/service-scope=global to expose it as a global service across clusters.

    for ctx in ${context1} ${context2}; do
      kubectl --context=$ctx label service in-ambient -n httpbin solo.io/service-scope=global
    done
  6. Verify that the global ServiceEntry with a hostname in the format in-ambient.httpbin.mesh.internal is created in the istio-system namespace for the labeled services. This default mesh.internal hostname makes the endpoint for your service available across the multicluster mesh.

    for ctx in ${context1} ${context2}; do
      echo "=== Cluster: $ctx ==="
      kubectl --context=$ctx get serviceentry -n istio-system | grep in-ambient
    done

    Example output:

    === Cluster: cluster1 ===
    autogen.httpbin.in-ambient   ["in-ambient.httpbin.mesh.internal"]   STATIC   30s
    === Cluster: cluster2 ===
    autogen.httpbin.in-ambient   ["in-ambient.httpbin.mesh.internal"]   STATIC   30s
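
The ServiceEntry might not exist the instant after you label the service. The following sketch polls for the global hostname instead of relying on a single check. The helper name `wait_for_global_host` and the default of 30 attempts at 2-second intervals are assumptions chosen for illustration.

```shell
# Hypothetical helper: poll until a ServiceEntry in istio-system lists the
# given host, or give up after a number of attempts.
# Usage: wait_for_global_host <context> <hostname> [attempts] [sleep_seconds]
wait_for_global_host() {
  wfg_attempts="${3:-30}"
  wfg_delay="${4:-2}"
  wfg_i=0
  while [ "$wfg_i" -lt "$wfg_attempts" ]; do
    # jsonpath prints every host of every ServiceEntry, separated by spaces.
    wfg_hosts=$(kubectl --context="$1" get serviceentry -n istio-system \
      -o jsonpath='{.items[*].spec.hosts[*]}' 2>/dev/null)
    for wfg_h in $wfg_hosts; do
      if [ "$wfg_h" = "$2" ]; then
        return 0
      fi
    done
    wfg_i=$((wfg_i + 1))
    sleep "$wfg_delay"
  done
  echo "timed out waiting for $2 in context $1" >&2
  return 1
}
```

For example, `wait_for_global_host "${context1}" in-ambient.httpbin.mesh.internal` returns successfully once the hostname appears in cluster 1.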

Step 3: Test default multicluster L4 failover with ztunnel

Test the default load balancing behavior, which uses PreferNetwork mode to prefer local endpoints.

Default failover behavior (PreferNetwork mode): The default PreferNetwork mode prioritizes endpoints in the same network, routing all traffic to local endpoints when they are healthy. The following diagram shows failover behavior when local endpoints in cluster 1 become unavailable: ztunnel automatically fails over to endpoints in cluster 2, even though they are in a different region and network.

    graph LR
    subgraph Cluster1["Cluster 1 (us-east-1)"]
        Client1[client-in-ambient]
        Ztunnel1[Client ztunnel<br/>PreferNetwork mode]
        Backend1["in-ambient pod<br/>(us-east-1)<br/>✗ Unavailable"]
    end

    subgraph Cluster2["Cluster 2 (us-west-2)"]
        Backend2["in-ambient pod<br/>(us-west-2)<br/>✓ Healthy"]
    end

    Client1 -->|Request to global hostname| Ztunnel1
    Ztunnel1 -.->|Local unavailable| Backend1
    Ztunnel1 -->|Failover to remote| Backend2


    linkStyle 0 stroke:#2068F3,stroke-width:2px
    linkStyle 1 stroke:#999,stroke-width:2px
    linkStyle 2 stroke:#2068F3,stroke-width:2px
  
  1. Send requests from the client in cluster 1 to the global service. With the default PreferNetwork mode, traffic prefers endpoints in the same cluster network before routing to remote cluster networks.

    kubectl --context=${context1} exec -n httpbin deploy/client-in-ambient -- sh -c "
    for i in \$(seq 1 10); do
      curl -s in-ambient.httpbin.mesh.internal:8000/hostname
    done"

    Example output. The /hostname endpoint returns the name of the pod that handled the request, which tells you which cluster served it. With PreferNetwork traffic distribution, responses come from the local cluster as long as its endpoints are healthy.

    in-ambient-7d8f9b6c54-abc12
    in-ambient-7d8f9b6c54-abc12
    in-ambient-7d8f9b6c54-abc12
    in-ambient-7d8f9b6c54-abc12
    ...
  2. Simulate a failure by scaling the in-ambient deployment in cluster 1 down to zero replicas.

    kubectl --context=${context1} scale deployment in-ambient -n httpbin --replicas=0
  3. Send requests again from the client in cluster 1 to the global service, and verify that traffic now fails over to cluster 2.

    kubectl --context=${context1} exec -n httpbin deploy/client-in-ambient -- sh -c "
    for i in \$(seq 1 5); do
      curl -s in-ambient.httpbin.mesh.internal:8000/hostname
    done"

    Example output, in which responses now come from the cluster 2 pod after failover:

    in-ambient-8e9f0c7d65-xyz98
    in-ambient-8e9f0c7d65-xyz98
    in-ambient-8e9f0c7d65-xyz98
    in-ambient-8e9f0c7d65-xyz98
    in-ambient-8e9f0c7d65-xyz98
  4. Scale the in-ambient deployment back up in cluster 1.

    kubectl --context=${context1} scale deployment in-ambient -n httpbin --replicas=1
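
Pod names alone do not make it obvious which cluster answered. The following sketch classifies the responses from the steps above as local or remote by comparing them with the pod names in the client's own cluster. The helper name `tally_local_remote` is hypothetical, and the sketch assumes that /hostname prints one pod name per line, as in the example output.

```shell
# Hypothetical helper: count responses served by local pods versus pods in
# another cluster.
# Usage: <responses on stdin> | tally_local_remote "<space-separated local pod names>"
tally_local_remote() {
  awk -v pods="$1" '
    BEGIN {
      n = split(pods, names, " ")
      for (i = 1; i <= n; i++) is_local[names[i]] = 1
    }
    NF == 0 { next }                         # ignore blank lines
    ($1 in is_local) { local_count++; next } # response from a local pod
    { remote_count++ }                       # anything else is remote
    END { printf "local=%d remote=%d\n", local_count, remote_count }
  '
}

# Example usage against cluster 1 (commented out so that the block only
# defines the helper):
# LOCAL_PODS=$(kubectl --context="${context1}" get pods -n httpbin -l app=in-ambient \
#   -o jsonpath='{.items[*].metadata.name}')
# kubectl --context="${context1}" exec -n httpbin deploy/client-in-ambient -- sh -c "
# for i in \$(seq 1 10); do
#   curl -s in-ambient.httpbin.mesh.internal:8000/hostname
# done" | tally_local_remote "$LOCAL_PODS"
```

While the local deployment is scaled to zero, the list of local pod names is empty, so every response counts as remote.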

Step 4: Configure zone-aware traffic distribution

Configure traffic distribution to prefer endpoints in the same zone, then same region. To demonstrate zone-aware failover, you need multiple replicas spread across different zones within cluster 1.

Zone-aware traffic distribution (PreferClose mode): The following diagram illustrates how the PreferClose mode prioritizes endpoints. Traffic prefers endpoints in the same zone first, then the same region, and only fails over to other regions when no closer endpoints are available.

    graph TB
    subgraph Cluster1["Cluster 1 (us-east-1)"]
        Client1[client-in-ambient<br/>us-east-1a]
        Ztunnel1[Client ztunnel<br/>PreferClose mode]
        Backend1a["in-ambient pod<br/>(us-east-1a)<br/>Priority 1: Same zone"]
        Backend1b["in-ambient pod<br/>(us-east-1b)<br/>Priority 2: Same region"]
        Client1 -->|Request to global hostname| Ztunnel1
        Ztunnel1 -->|Prefer| Backend1a
        Ztunnel1 -.->|Fallback| Backend1b
    end

    subgraph Cluster2["Cluster 2 (us-west-2)"]
        Backend2["in-ambient pod<br/>(us-west-2a)<br/>Priority 3: Different region"]
    end

    Ztunnel1 -.->|Last resort| Backend2

    style Backend1a fill:#2068F3,color:#fff
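
The priority order in the diagram can be written as a small comparison of locality values. The following sketch only illustrates the ordering described above; it is not how ztunnel implements PreferClose, and the helper name `locality_priority` is hypothetical.

```shell
# Hypothetical helper: rank an endpoint relative to a client.
# 1 = same region and zone, 2 = same region only, 3 = different region.
# Usage: locality_priority <client_region> <client_zone> <endpoint_region> <endpoint_zone>
locality_priority() {
  if [ "$1" = "$3" ] && [ "$2" = "$4" ]; then
    echo 1
  elif [ "$1" = "$3" ]; then
    echo 2
  else
    echo 3
  fi
}
```

For a client in us-east-1a, `locality_priority us-east-1 us-east-1a us-east-1 us-east-1b` prints 2, and an endpoint in us-west-2a ranks 3.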
  
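  1. Scale the in-ambient deployment in cluster 1 to 2 replicas so that pods can be scheduled in different zones (us-east-1a and us-east-1b). The scheduler does not guarantee this spread unless the deployment defines topology spread constraints or pod anti-affinity, so confirm the placement in the next step.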

    kubectl --context=${context1} scale deployment in-ambient -n httpbin --replicas=2
  2. Verify that the pods are running in different zones.

    kubectl --context=${context1} get pods -n httpbin -l app=in-ambient -o wide

    Example output showing pods in different zones:

    NAME                          READY   STATUS    RESTARTS   AGE   IP           NODE
    in-ambient-7d8f9b6c54-abc12   1/1     Running   0          30s   10.0.1.5     node-us-east-1a
    in-ambient-7d8f9b6c54-def34   1/1     Running   0          30s   10.0.2.8     node-us-east-1b
  3. Annotate the in-ambient service with the networking.istio.io/traffic-distribution=PreferClose annotation. The PreferClose mode prioritizes endpoints in the same zone first, then the same region, and only fails over to other regions when no closer endpoints are available.

    for ctx in ${context1} ${context2}; do
      kubectl --context=$ctx annotate service in-ambient -n httpbin \
        networking.istio.io/traffic-distribution=PreferClose --overwrite
    done
  4. Send requests and verify that traffic prefers endpoints in the same zone as the client.

    kubectl --context=${context1} exec -n httpbin deploy/client-in-ambient -- sh -c "
    for i in \$(seq 1 10); do
      curl -s in-ambient.httpbin.mesh.internal:8000/hostname
    done" | sort | uniq -c

    Example output showing all traffic going to the pod in the same zone (us-east-1a) as the client:

    10 in-ambient-7d8f9b6c54-abc12
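  5. Delete the pod in the same zone as the client to simulate a zone-level failure. The following command selects the first in-ambient pod in the list, which is not guaranteed to be the pod in the client's zone, so compare it with the output of step 2 and substitute the pod name if needed. The Deployment recreates the deleted pod, so run the next step before the replacement becomes ready.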

    POD_SAME_ZONE=$(kubectl --context=${context1} get pod -n httpbin -l app=in-ambient -o jsonpath='{.items[0].metadata.name}')
    kubectl --context=${context1} delete pod -n httpbin $POD_SAME_ZONE
  6. Send requests and verify that traffic fails over to endpoints in the same region but different zone (us-east-1b).

    kubectl --context=${context1} exec -n httpbin deploy/client-in-ambient -- sh -c "
    for i in \$(seq 1 10); do
      curl -s in-ambient.httpbin.mesh.internal:8000/hostname
    done" | sort | uniq -c

    Example output showing traffic now going to the pod in us-east-1b:

    10 in-ambient-7d8f9b6c54-def34
  7. Scale down the in-ambient deployment in cluster 1 to simulate all endpoints in the region becoming unavailable.

    kubectl --context=${context1} scale deployment in-ambient -n httpbin --replicas=0
  8. Send requests and verify that traffic fails over to endpoints in cluster 2 (different region).

    kubectl --context=${context1} exec -n httpbin deploy/client-in-ambient -- sh -c "
    for i in \$(seq 1 10); do
      curl -s in-ambient.httpbin.mesh.internal:8000/hostname
    done" | sort | uniq -c

    Example output showing traffic now going to the pod in cluster 2 (us-west-2):

    10 in-ambient-8e9f0c7d65-ghi78
  9. Scale the deployment back to 2 replicas to restore the endpoints.

    kubectl --context=${context1} scale deployment in-ambient -n httpbin --replicas=2
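
Step 5 above needs the in-ambient pod that shares a zone with the client, and the node names in the pod listing do not always reveal the zone. The following sketch looks up the zone of the node that hosts a pod. The helper name `pod_zone` is hypothetical.

```shell
# Hypothetical helper: print the zone label of the node that a pod runs on.
# Usage: pod_zone <context> <namespace> <pod>
pod_zone() {
  pz_node=$(kubectl --context="$1" get pod -n "$2" "$3" -o jsonpath='{.spec.nodeName}')
  # Dots inside the label key are escaped for kubectl's jsonpath parser.
  kubectl --context="$1" get node "$pz_node" \
    -o jsonpath='{.metadata.labels.topology\.kubernetes\.io/zone}'
}

# Example usage: list each in-ambient pod in cluster 1 with its zone
# (commented out so that the block only defines the helper).
# for p in $(kubectl --context="${context1}" get pods -n httpbin -l app=in-ambient \
#     -o jsonpath='{.items[*].metadata.name}'); do
#   echo "$p $(pod_zone "${context1}" httpbin "$p")"
# done
```

Run the same helper on the client pod to find its zone, then delete the in-ambient pod whose zone matches.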

Step 5: Add a waypoint for L7 multicluster failover

Create waypoint proxies for L7 policy enforcement and HTTP-aware failover.

Traffic flow with waypoints in multicluster: The following diagram shows how traffic flows when waypoints are deployed in each cluster. The client’s ztunnel routes to the local waypoint, which then performs L7 load balancing to backend endpoints across clusters. The PreferClose setting configured on the service in the previous step continues to apply, but is now enforced at the waypoint instead of the ztunnel.

    graph TB
    subgraph Cluster1["Cluster 1 (us-east-1)"]
        Client1[client-in-ambient]
        Ztunnel1[Client ztunnel]
        Waypoint1["Waypoint proxy (L7)"]
        Backend1["in-ambient pod<br/>(us-east-1a)"]
        Client1 -->|Request to global hostname| Ztunnel1
        Ztunnel1 -->|HBONE| Waypoint1
        Waypoint1 --> Backend1
    end

    subgraph Cluster2["Cluster 2 (us-west-2)"]
        Waypoint2["Waypoint proxy (L7)"]
        Backend2["in-ambient pod<br/>(us-west-2a)"]
    end

    Waypoint1 -.->|L7 failover| Backend2

    style Waypoint1 fill:#2068F3,color:#fff
    style Waypoint2 fill:#2068F3,color:#fff
  
  1. Create a waypoint Gateway in both clusters.

    for ctx in ${context1} ${context2}; do
      kubectl --context=$ctx apply -f- <<EOF
    apiVersion: gateway.networking.k8s.io/v1
    kind: Gateway
    metadata:
      name: httpbin-waypoint
      namespace: httpbin
    spec:
      gatewayClassName: istio-waypoint
      listeners:
      - name: mesh
        port: 15008
        protocol: HBONE
        allowedRoutes:
          namespaces:
            from: Same
    EOF
    done
  2. Label the httpbin namespaces to use the waypoints.

    for ctx in ${context1} ${context2}; do
      kubectl --context=$ctx label namespace httpbin istio.io/use-waypoint=httpbin-waypoint
    done
  3. Wait for the waypoints to be deployed.

    for ctx in ${context1} ${context2}; do
      kubectl --context=$ctx -n httpbin rollout status deployment/httpbin-waypoint
    done
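  4. Send a request from the client in cluster 1 to verify that the global service is still reachable now that traffic is routed through the waypoint. To confirm that the waypoint handled the request, check the waypoint logs as shown in Step 6.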

    kubectl --context=${context1} exec -n httpbin deploy/client-in-ambient -- \
      curl -s in-ambient.httpbin.mesh.internal:8000/hostname

Step 6: Apply DestinationRule for explicit multicluster failover priority

Create a DestinationRule with explicit failover priority and outlier detection for HTTP-aware failover.

DestinationRule failover with explicit priority: The following diagram shows how the DestinationRule with failoverPriority controls multicluster failover. The waypoint routes to endpoints based on the priority order (zone first, then region), and uses HTTP-aware outlier detection to quickly eject unhealthy endpoints.

    graph TB
    subgraph Cluster1["Cluster 1 (us-east-1)"]
        Client1[client-in-ambient]
        Ztunnel1[Client ztunnel]
        Waypoint1["Waypoint proxy<br/>Enforces L7 DestinationRule<br/>(failoverPriority: zone, region)"]
        Backend1["in-ambient pod<br/>(us-east-1a)<br/>✗ Unavailable"]
        Client1 -->|Request to global hostname| Ztunnel1
        Ztunnel1 --> Waypoint1
        Waypoint1 -.->|Unhealthy| Backend1
    end

    subgraph Cluster2["Cluster 2 (us-west-2)"]
        Backend2a["in-ambient pod<br/>(us-west-2a)<br/>✓ Healthy"]
        Backend2b["in-ambient pod<br/>(us-west-2b)<br/>✓ Healthy"]
    end

    Waypoint1 -->|Failover to<br/>next region| Backend2a
    Waypoint1 -->|Failover to<br/>next region| Backend2b

    style Waypoint1 fill:#2068F3,color:#fff
  
  1. Apply the following DestinationRule in both clusters, which configures:

    • Failover priority: Routes to endpoints in the same zone first, then same region, then other regions.
    • Outlier detection: Ejects an endpoint after 5 consecutive 5xx errors for a base ejection time of 3 minutes. Endpoints are evaluated every 10 seconds, and at most 50% of them can be ejected at once.

    for ctx in ${context1} ${context2}; do
      kubectl --context=$ctx apply -f- <<EOF
    apiVersion: networking.istio.io/v1
    kind: DestinationRule
    metadata:
      name: in-ambient-failover
      namespace: httpbin
    spec:
      host: in-ambient.httpbin.mesh.internal
      trafficPolicy:
        loadBalancer:
          localityLbSetting:
            enabled: true
            # Labels are compared in order, so the broader locality (region) comes first.
            failoverPriority:
            - topology.kubernetes.io/region
            - topology.kubernetes.io/zone
          simple: ROUND_ROBIN
        outlierDetection:
          consecutive5xxErrors: 5
          interval: 10s
          baseEjectionTime: 3m
          maxEjectionPercent: 50
    EOF
    done
  2. Verify that the DestinationRule is applied.

    kubectl --context=${context1} get destinationrule -n httpbin
    kubectl --context=${context2} get destinationrule -n httpbin
  3. Scale the in-ambient deployment in cluster 1 down to zero replicas to simulate a failure.

    kubectl --context=${context1} scale deployment in-ambient -n httpbin --replicas=0
  4. Send requests to in-ambient to verify that traffic fails over to cluster 2, according to the failover priority.

    kubectl --context=${context1} exec -n httpbin deploy/client-in-ambient -- sh -c "
    for i in \$(seq 1 5); do
      curl -s in-ambient.httpbin.mesh.internal:8000/hostname
    done"

    Example output, in which responses now come from the cluster 2 pod after failover:

    in-ambient-8e9f0c7d65-xyz98
    in-ambient-8e9f0c7d65-xyz98
    in-ambient-8e9f0c7d65-xyz98
    in-ambient-8e9f0c7d65-xyz98
    in-ambient-8e9f0c7d65-xyz98
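  5. Review the waypoint access logs. Each entry records the upstream address that served the request, which helps you confirm where the waypoint routed traffic during the failure.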

    kubectl --context=${context1} logs -n httpbin deploy/httpbin-waypoint | tail -20

    Example output:

    [2025-03-06T17:15:23.456Z] "GET /hostname HTTP/1.1" 200 - via_upstream - "-" 0 32 5 4 "-"
    "curl/7.88.1" "abc123-def456" "in-ambient.httpbin.mesh.internal:8000"
    "10.10.0.15:8080" inbound-vip|8000|http|in-ambient.httpbin.mesh.internal
    10.10.0.14:45678 10.96.45.123:8000 10.10.0.14:45678 - default
  6. Scale the in-ambient deployment in cluster 1 back up to 2 replicas.

    kubectl --context=${context1} scale deployment in-ambient -n httpbin --replicas=2
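
The baseEjectionTime value in the DestinationRule is a starting point, not a fixed duration. Envoy's outlier detection documentation describes the ejection time as the base time multiplied by the number of consecutive ejections, capped at a maximum that defaults to 300 seconds. The following sketch works through that arithmetic. It assumes that the waypoint follows the documented Envoy behavior and that the maximum is left at its default; the helper name `ejection_seconds` is hypothetical.

```shell
# Hypothetical helper: ejection duration in seconds for the Nth consecutive
# ejection, following Envoy's documented rule: base * N, capped at a maximum.
# Usage: ejection_seconds <base_seconds> <ejection_count> [max_seconds]
ejection_seconds() {
  es_max="${3:-300}"   # assumption: Envoy's documented default maximum of 300s
  es_t=$(( $1 * $2 ))
  if [ "$es_t" -gt "$es_max" ]; then
    es_t="$es_max"
  fi
  echo "$es_t"
}
```

With the 3-minute base from this guide, `ejection_seconds 180 1` prints 180 and `ejection_seconds 180 2` prints 300, because the second ejection would exceed the default cap.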

Cleanup

You can optionally remove the resources that you created in this guide.

for ctx in ${context1} ${context2}; do
  kubectl --context=$ctx delete destinationrule in-ambient-failover -n httpbin
  kubectl --context=$ctx delete gateway httpbin-waypoint -n httpbin
  kubectl --context=$ctx delete -f https://raw.githubusercontent.com/solo-io/doc-examples/main/istio/sample-apps/client-in-ambient.yaml
  kubectl --context=$ctx delete -f https://raw.githubusercontent.com/solo-io/doc-examples/main/istio/sample-apps/in-ambient.yaml
done

Next steps

For information about ztunnel outlier detection settings, see ztunnel outlier detection.