Drain clusters in the mesh
Prevent new connections to clusters in the mesh by using the solo.io/draining-weight annotation.
This feature requires your mesh to be installed with the Solo distribution of Istio and an Enterprise-level license for Gloo Mesh (OSS APIs). Contact your account representative to obtain a valid license. The draining feature is in the alpha state. Alpha features are likely to change, are not fully tested, and are not supported for production. For more information, see Solo feature maturity.
Multicluster maintenance challenges
When you run a multicluster service mesh, performing maintenance on one or more clusters can introduce significant challenges. For example, the cluster that needs to be updated might host apps that other services in the mesh depend on, or a gateway that serves traffic for the services in the mesh. During the update, endpoints might disappear or become stale, which can result in DNS and endpoint lookup failures. To avoid disruption for the services in your service mesh, carefully drain any existing connections in the cluster before you proceed with the maintenance.
About connection draining
The Solo distribution of Istio version 1.28 and later introduces the solo.io/draining-weight annotation. This annotation allows you to set a draining weight that indicates how much traffic you want to accept for a given cluster.
Review the following weights and how they affect traffic to the cluster.
| Draining mode | Draining weight | Amount of traffic | Draining annotation example | Description |
|---|---|---|---|---|
| Off | 0% | 100% | solo.io/draining-weight: "0" | No draining. The cluster is fully functional and not undergoing any maintenance. This is the default setting. |
| Soft | 1-99% | 99-1% | solo.io/draining-weight: "99" | The cluster accepts n% of the traffic, with n being the difference between 100% and the draining weight. For example, if you set the draining weight to solo.io/draining-weight: "75", the cluster accepts 25% of the overall traffic (100%-75%). This way, you can gradually increase traffic, such as after you finish maintenance on a cluster. |
| Firm | 100% | 0% | solo.io/draining-weight: "100" | No new connections are allowed on the cluster. Note that existing connections are not automatically terminated when the annotation is added. |
Service mesh operators can add the annotation to either an east-west or remote peering gateway. To understand when to use which option, see Draining use cases.
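For example, a soft drain can be applied by annotating a gateway directly. The following is a sketch; the gateway name `istio-eastwest` and its namespace are assumptions based on a typical installation and might differ in your setup.

```shell
# Soft drain: accept only 25% of traffic (draining weight 75).
# Gateway name and namespace are assumptions; adjust for your installation.
kubectl annotate gateway istio-eastwest -n istio-eastwest \
  solo.io/draining-weight=75 --overwrite
```

The `--overwrite` flag lets you rerun the command with a different weight as you adjust the drain.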
Draining use cases
To understand common use cases for adding the solo.io/draining-weight annotation, consider the following 3-cluster setup:
- cluster1 is linked to cluster2 through remote peering gateways
- cluster3 is linked to cluster2 through remote peering gateways
- cluster1 and cluster3 are not linked
The client services in cluster1 and cluster3 can connect to the server service in cluster2 by sending requests to cluster2’s east-west gateway.
Cluster maintenance
With cluster1 and cluster3 both connecting to services in cluster2, performing maintenance on cluster2 can become a challenging task. During the maintenance window, you typically want to prevent all incoming connections to this cluster.
To achieve this, you can add the solo.io/draining-weight: "100" annotation to cluster2’s east-west gateway, which serves as the inbound gateway for connections from other peered clusters. The draining weight is automatically applied to all resources that are shared with the remote peering gateway in all linked clusters. This way, istiod in cluster1 and cluster3 can discover the affected endpoints from cluster2 and update the local ztunnel configuration accordingly. The ztunnel then removes the cluster2 endpoints from the client's ztunnel configuration. As a result, the client services in cluster1 and cluster3 no longer consider the server service in cluster2 a healthy endpoint.
The draining annotation on the east-west gateway prevents only new connections. Existing connections are not automatically drained. To drain existing connections, scale the east-west gateway in cluster2 to zero instances.
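For example, existing connections can be terminated by scaling the gateway deployment down after the annotation is in place. This is a sketch; the deployment name and namespace are assumptions based on a typical installation.

```shell
# Terminate existing connections by removing all gateway instances.
# Deployment name and namespace are assumptions; adjust for your installation.
kubectl --context ${REMOTE_CONTEXT2} scale deployment istio-eastwest \
  -n istio-eastwest --replicas=0
```

Scale the deployment back up after the maintenance window ends.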
Gradually add traffic to a cluster
After maintenance is finished, you can gradually add traffic back to cluster2 by reducing the draining weight in the solo.io/draining-weight annotation. The amount of allowed traffic for a cluster is calculated as 100% of traffic minus the draining weight.
For example, to add 25% of traffic back to cluster2, add the solo.io/draining-weight: "75" (100%-75%) annotation to the east-west gateway. You can then further decrease the draining weight until you reach 0, which allows 100% of all traffic back to the cluster. At this point, you can also remove the solo.io/draining-weight annotation from the gateway entirely.
While the amount of traffic that is allowed for cluster2 is determined by the draining weight annotation, the actual amount of traffic that is sent from each of the clients in cluster1 and cluster3 can vary due to several factors. These include the number of other server endpoints that the clients can send traffic to, whether these endpoints are local or remote, and whether specific load balancing algorithms were set.
The following image shows a setup where only one server endpoint exists in cluster2. Allowed traffic is split 50:50 between the clients in cluster1 and cluster3. As a consequence, each client sends approximately 12.5% of traffic to cluster2.
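The 12.5% figure above can be derived with simple arithmetic. This is a minimal sketch, assuming a draining weight of 75 and an even split of the allowed share between the two clients.

```shell
# A cluster with draining weight 75 accepts 100 - 75 = 25% of traffic.
draining_weight=75
allowed=$((100 - draining_weight))
# Two clients (cluster1 and cluster3) split the allowed share evenly.
per_client=$(awk -v a="$allowed" 'BEGIN { printf "%.1f", a / 2 }')
echo "allowed: ${allowed}%"        # allowed: 25%
echo "per client: ${per_client}%"  # per client: 12.5%
```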
Network issues to remote clusters and local service testing
You might have cases where you cannot or do not want to inform all peered gateways that a cluster is in maintenance mode. For example, you might experience networking issues between cluster1 and cluster2, but cluster3 can connect to cluster2 successfully. However, if you add the solo.io/draining-weight annotation on cluster2’s east-west gateway, you prevent connections from both cluster1 and cluster3.
You might also want to test services without impacting services in other clusters. For example, assume you want to test failover for the client service in cluster1 in the case that cluster2 is not available. However, you do not want to impact the client service in cluster3 so that it can still send traffic to cluster2.
In such cases, you can add the solo.io/draining-weight: "100" annotation to the local remote peering gateway that points to the east-west gateway of the cluster for which you want to prevent new connections.
Consider the following example where you want to prevent connections from cluster1 to cluster2, but keep the connections from cluster3 to cluster2. To accomplish this, you add the solo.io/draining-weight: "100" annotation to the remote peering gateway in cluster1 that points to cluster2’s east-west gateway.
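A sketch of this step follows. The remote peering gateway name `istio-remote-peer-cluster2` is hypothetical; list the gateways in cluster1 first to find the one that points to cluster2's east-west gateway.

```shell
# List gateways in cluster1 to find the remote peering gateway for cluster2.
kubectl --context ${REMOTE_CONTEXT1} get gateways -A

# Annotate that gateway to block new connections from cluster1 to cluster2 only.
# The gateway name and namespace below are hypothetical placeholders.
kubectl --context ${REMOTE_CONTEXT1} annotate gateway istio-remote-peer-cluster2 \
  -n istio-eastwest solo.io/draining-weight=100
```

Because the annotation is set only on cluster1's local peering gateway, cluster3 continues to send traffic to cluster2 unaffected.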
Limitations
For the draining weight to apply for an east-west gateway, the gateway must have the topology.istio.io/cluster label.
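One way to check whether the label is present is to print the gateway's labels. This is a sketch; the gateway name and namespace are assumptions based on a typical installation.

```shell
# Show the labels on the east-west gateway; verify that
# topology.istio.io/cluster appears in the LABELS column.
kubectl --context ${REMOTE_CONTEXT2} get gateway istio-eastwest \
  -n istio-eastwest --show-labels
```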
Before you begin
Follow the multicluster getting started guide to set up two clusters in ambient mode that are linked with remote peering gateways.
Install sample apps
1. Create the `demo` namespace and add it to the ambient mesh. Then, deploy the sleep sample app into it. You use this app as a client to test connectivity to the services in the mesh later.
   ```shell
   for ctx in $REMOTE_CONTEXT1 $REMOTE_CONTEXT2; do
     kubectl --context=$ctx create namespace demo
     kubectl --context=$ctx label namespace demo istio.io/dataplane-mode=ambient
     kubectl --context=$ctx apply -n demo -f https://raw.githubusercontent.com/solo-io/gloo-mesh-use-cases/refs/heads/main/gloo-mesh/istio-install/manual/flat-network/client/sleep-client.yaml
   done
   ```
2. Deploy the global service app. The app is configured to print out `Hello version: v1` in `cluster-1` and `Hello version: v2` in `cluster-2`. The service has the label `solo.io/service-scope: global`, which exposes the app under a common domain name, `global-service.demo.mesh.internal`, across both of your clusters. For more information, see Make services available across clusters.
   ```shell
   curl -L https://raw.githubusercontent.com/solo-io/gloo-mesh-use-cases/refs/heads/main/gloo-mesh/istio-install/manual/flat-network/services/global/global-service.yaml -o global-service.yaml
   sed 's/VERSION_PLACEHOLDER/v1/g' global-service.yaml | kubectl --context=$REMOTE_CONTEXT1 apply -n demo -f -
   sed 's/VERSION_PLACEHOLDER/v2/g' global-service.yaml | kubectl --context=$REMOTE_CONTEXT2 apply -n demo -f -
   ```
3. Apply the `networking.istio.io/traffic-distribution=Any` annotation to the services. This annotation allows requests to the global service to be routed to each service endpoint equally.
   ```shell
   kubectl --context ${REMOTE_CONTEXT1} annotate service global-service -n demo networking.istio.io/traffic-distribution=Any
   kubectl --context ${REMOTE_CONTEXT2} annotate service global-service -n demo networking.istio.io/traffic-distribution=Any
   ```
4. Send a curl request from the example sleep app in `cluster-1` to the global service name. In your CLI output, verify that you see replies from the global service app from both of your clusters.
   ```shell
   kubectl --context=$REMOTE_CONTEXT1 exec -n demo deploy/sleep -- sh -c "
   for i in \$(seq 1 10); do
     curl -s global-service.demo.mesh.internal:5000/hello
     echo
   done" | grep -o 'version: v[0-9]' | sort | uniq -c
   ```
   Example output:
   ```
   5 version: v1
   5 version: v2
   ```
   Note that you might see a different CLI output, such as `6 version: v1` and `4 version: v2`. The more requests you send, the closer you get to a 50:50 distribution of requests.
Drain connections
1. Annotate the east-west gateway in `$REMOTE_CLUSTER2` to drain all traffic to the endpoints in `cluster2`.
   ```shell
   kubectl --context ${REMOTE_CONTEXT2} annotate gateway istio-eastwest \
     -n istio-eastwest solo.io/draining-weight=100
   ```
2. Repeat the requests from the example sleep app in `cluster-1` to the global service name. Verify that this time, you see responses only from the service instance in `cluster1`.
   ```shell
   kubectl --context=$REMOTE_CONTEXT1 exec -n demo deploy/sleep -- sh -c "
   for i in \$(seq 1 10); do
     curl -s global-service.demo.mesh.internal:5000/hello
     echo
   done" | grep -o 'version: v[0-9]' | sort | uniq -c
   ```
   Example output:
   ```
   10 version: v1
   ```
Cleanup
You can optionally remove the resources that you set up as part of this guide.
1. Remove the sample apps.
   ```shell
   kubectl --context=$REMOTE_CONTEXT1 delete ns demo
   kubectl --context=$REMOTE_CONTEXT2 delete ns demo
   ```
2. Remove the draining annotation from `$REMOTE_CLUSTER2`'s east-west gateway.
   ```shell
   kubectl --context ${REMOTE_CONTEXT2} annotate gateway istio-eastwest \
     -n istio-eastwest solo.io/draining-weight-
   ```