Multicluster maintenance challenges

When you run a multicluster service mesh, performing maintenance on one or more clusters can introduce significant challenges. For example, the cluster that needs to be updated might host apps that other services in the mesh depend on, or a gateway that serves traffic for the services in the mesh. During the update, endpoints might disappear or become stale, which can result in DNS and endpoint lookup failures. To avoid disruption for the services in your service mesh, you must carefully drain existing connections in your cluster before proceeding with the maintenance.

About connection draining

The Solo distribution of Istio version 1.28 and later introduces the solo.io/draining-weight annotation. This annotation lets you set a draining weight that indicates how much traffic you want a given cluster to accept.

Review the following weights and how they affect traffic to the cluster.

| Draining mode | Draining weight | Amount of traffic | Draining annotation example | Description |
| --- | --- | --- | --- | --- |
| Off | 0% | 100% | solo.io/draining-weight: "0" | No draining. The cluster is fully functional and not undergoing any maintenance. This is the default setting. |
| Soft | 1-99% | 99-1% | solo.io/draining-weight: "99" | The cluster accepts n% of the traffic, where n is the difference between 100% and the draining weight. For example, if you set the draining weight to solo.io/draining-weight: "75", the cluster accepts 25% of the overall traffic (100%-75%). This way, you can gradually increase traffic, such as after you finish maintenance on a cluster. |
| Firm | 100% | 0% | solo.io/draining-weight: "100" | No new connections are allowed on the cluster. Note that existing connections are not automatically terminated when the annotation is added. |

Service mesh operators can add the annotation to either an east-west gateway or a remote peering gateway. To understand when to use which option, see Draining use cases.
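For example, a soft drain that turns away half of the incoming traffic might look like the following. The gateway name and namespace (istio-eastwest) follow the setup used later in this guide; adjust them for your environment.

```shell
# A sketch of a soft drain on the east-west gateway. With a draining weight
# of 50, the cluster accepts 100 - 50 = 50% of traffic.
# Gateway name and namespace (istio-eastwest) are assumptions from this guide.
kubectl --context ${REMOTE_CONTEXT2} annotate gateway istio-eastwest \
  -n istio-eastwest solo.io/draining-weight=50 --overwrite
```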

Draining use cases

To understand common use cases for adding the solo.io/draining-weight annotation, consider the following 3-cluster setup:

  • cluster1 is linked to cluster2 through remote peering gateways
  • cluster3 is linked to cluster2 through remote peering gateways
  • cluster1 and cluster3 are not linked

The client services in cluster1 and cluster3 can connect to the server service in cluster2 by sending requests to cluster2’s east-west gateway.

Cluster maintenance

With cluster1 and cluster3 both connecting to services in cluster2, performing maintenance on cluster2 can become a challenging task. During the maintenance window, you typically want to prevent all incoming connections to this cluster.

To achieve this, you can add the solo.io/draining-weight: "100" annotation to cluster2’s east-west gateway, which serves as the inbound gateway for connections from other peered clusters. The draining weight is automatically applied to all resources that are shared with the remote peering gateways in all linked clusters. This way, istiod in cluster1 and cluster3 can discover the affected endpoints from cluster2 and update the local ztunnel configuration accordingly. The ztunnel then removes the cluster2 endpoints from the client’s ztunnel socket. As a result, the client services in cluster1 and cluster3 no longer consider the server service in cluster2 a healthy endpoint.

Gradually add traffic to a cluster

After maintenance is finished, you can gradually add traffic back to cluster2 by reducing the draining weight in the solo.io/draining-weight annotation. The amount of allowed traffic for a cluster is calculated as 100% of traffic minus the draining weight.

For example, to add 25% of traffic back to cluster2, add the solo.io/draining-weight: "75" annotation to the east-west gateway (100%-75%=25%). You can then further decrease the draining weight until you reach 0, which allows 100% of traffic back to the cluster. At this point, you can also remove the solo.io/draining-weight annotation from the gateway entirely.
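A ramp-up might be sketched as the following sequence. The gateway name and namespace (istio-eastwest) follow this guide; in practice, pause and verify traffic health between steps.

```shell
# Hypothetical ramp-up: step the draining weight down to let traffic back in.
for weight in 75 50 0; do
  kubectl --context ${REMOTE_CONTEXT2} annotate gateway istio-eastwest \
    -n istio-eastwest solo.io/draining-weight=$weight --overwrite
  echo "cluster2 now accepts $((100 - weight))% of traffic"
  # In practice, wait and verify traffic health before the next step.
done

# Once the weight is 0, you can remove the annotation entirely.
kubectl --context ${REMOTE_CONTEXT2} annotate gateway istio-eastwest \
  -n istio-eastwest solo.io/draining-weight-
```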

While the amount of traffic that is allowed for cluster2 is determined by the draining weight annotation, the actual amount of traffic that is sent from each of the clients in cluster1 and cluster3 can vary due to several factors. These include the number of other server endpoints that the clients can send traffic to, whether these endpoints are local or remote, and whether specific load balancing algorithms were set.

The following image shows a setup where only one server endpoint exists in cluster2. Allowed traffic is split 50:50 between the clients in cluster1 and cluster3. As a consequence, each client sends approximately 12.5% of traffic to cluster2.
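As a quick check of the arithmetic: a draining weight of 75 leaves a 25% allowed share, and an even split across the two client clusters gives each client 12.5%:

```shell
# allowed share = 100% - draining weight; per-client share = allowed / clients
awk 'BEGIN { weight=75; clients=2; allowed=100-weight; print allowed/clients "%" }'
# prints: 12.5%
```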

Network issues to remote clusters and local service testing

You might have cases where you cannot or do not want to inform all peered gateways that a cluster is in maintenance mode. For example, you might experience networking issues between cluster1 and cluster2, but cluster3 can connect to cluster2 successfully. However, if you add the solo.io/draining-weight annotation on cluster2’s east-west gateway, you prevent connections from both cluster1 and cluster3.

You might also want to test services without impacting services in other clusters. For example, assume that you want to test failover for the client service in cluster1 when cluster2 is unavailable, without impacting the client service in cluster3, which must still be able to send traffic to cluster2.

In such cases, you can add the solo.io/draining-weight: "100" annotation to the local remote peering gateway that points to the east-west gateway of the cluster for which you want to prevent new connections.

Consider the following example where you want to prevent connections from cluster1 to cluster2, but keep the connections from cluster3 to cluster2. To accomplish this, you add the solo.io/draining-weight: "100" annotation to the remote peering gateway in cluster1 that points to cluster2’s east-west gateway.
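A sketch of this step follows. The remote peering gateway name and namespace below are assumptions; the actual names depend on how you linked the clusters, so list the gateways first and substitute your own.

```shell
# Find the remote peering gateway in cluster1 that points to cluster2.
kubectl --context ${REMOTE_CONTEXT1} get gateway -A

# Drain new connections from cluster1 to cluster2 only. The gateway name and
# namespace below are hypothetical; substitute the ones from the output above.
kubectl --context ${REMOTE_CONTEXT1} annotate gateway istio-remote-peer-cluster2 \
  -n istio-eastwest solo.io/draining-weight=100
```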

Limitations

For the draining weight to apply for an east-west gateway, the gateway must have the topology.istio.io/cluster label.
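To check whether your east-west gateway has the required label, you can inspect it, for example:

```shell
# Print the topology.istio.io/cluster label on the east-west gateway, if set.
# An empty result means the label is missing and the draining weight does not apply.
kubectl --context ${REMOTE_CONTEXT2} get gateway istio-eastwest -n istio-eastwest \
  -o jsonpath='{.metadata.labels.topology\.istio\.io/cluster}'
```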

Before you begin

Follow the multicluster getting started guide to set up two clusters in ambient mode that are linked with remote peering gateways.

Install sample apps

  1. Create the demo namespace and add it to the ambient mesh. Then, deploy the sleep sample app into it. You use this app as a client to test connectivity to the services in the mesh later.

      for ctx in $REMOTE_CONTEXT1 $REMOTE_CONTEXT2; do
        kubectl --context=$ctx create namespace demo
        kubectl --context=$ctx label namespace demo istio.io/dataplane-mode=ambient
        kubectl --context=$ctx apply -n demo -f https://raw.githubusercontent.com/solo-io/gloo-mesh-use-cases/refs/heads/main/gloo-mesh/istio-install/manual/flat-network/client/sleep-client.yaml
      done
  2. Deploy the global service app. The app is configured to print Hello version: v1 in cluster1 and Hello version: v2 in cluster2. The service has the label solo.io/service-scope: global, which exposes the app under the common domain name global-service.demo.mesh.internal across both of your clusters. For more information, see Make services available across clusters.

      curl -L https://raw.githubusercontent.com/solo-io/gloo-mesh-use-cases/refs/heads/main/gloo-mesh/istio-install/manual/flat-network/services/global/global-service.yaml -o global-service.yaml
      sed 's/VERSION_PLACEHOLDER/v1/g' global-service.yaml | kubectl --context=$REMOTE_CONTEXT1 apply -n demo -f -
      sed 's/VERSION_PLACEHOLDER/v2/g' global-service.yaml | kubectl --context=$REMOTE_CONTEXT2 apply -n demo -f -
  3. Apply the networking.istio.io/traffic-distribution=Any annotation to the services. This annotation allows requests to the global service to be routed to each service endpoint equally.

      kubectl --context ${REMOTE_CONTEXT1} annotate service global-service -n demo networking.istio.io/traffic-distribution=Any
      kubectl --context ${REMOTE_CONTEXT2} annotate service global-service -n demo networking.istio.io/traffic-distribution=Any
  4. Send a curl request from the sleep sample app in cluster1 to the global service name. In your CLI output, verify that you see replies from the global service app from both of your clusters.

      kubectl --context=$REMOTE_CONTEXT1 exec -n demo deploy/sleep -- sh -c "
      for i in \$(seq 1 10); do
        curl -s global-service.demo.mesh.internal:5000/hello
        echo
      done" | grep -o 'version: v[0-9]' | sort | uniq -c

    Example output:

      5 version: v1
      5 version: v2

    Note that you might see a different CLI output, such as 6 version: v1 4 version: v2. The more requests you send, the closer you get to a 50:50 distribution of requests.

Drain connections

  1. Annotate the east-west gateway in cluster2 to drain all traffic to the endpoints in cluster2.

      kubectl --context ${REMOTE_CONTEXT2} annotate gateway istio-eastwest \
        -n istio-eastwest solo.io/draining-weight=100
  2. Repeat the requests from the sleep sample app in cluster1 to the global service name. Verify that this time, you only see responses from the service instance in cluster1.

      kubectl --context=$REMOTE_CONTEXT1 exec -n demo deploy/sleep -- sh -c "
      for i in \$(seq 1 10); do
        curl -s global-service.demo.mesh.internal:5000/hello
        echo
      done" | grep -o 'version: v[0-9]' | sort | uniq -c

    Example output:

      10 version: v1
      

Cleanup

You can optionally remove the resources that you set up as part of this guide.
  1. Remove the sample apps.

      kubectl --context=$REMOTE_CONTEXT1 delete ns demo
      kubectl --context=$REMOTE_CONTEXT2 delete ns demo
  2. Remove the draining annotation from cluster2’s east-west gateway.

      kubectl --context ${REMOTE_CONTEXT2} annotate gateway istio-eastwest \
        -n istio-eastwest solo.io/draining-weight-