Multicluster peering
Review best practices and recommendations for using multicluster peering.
Multicluster peering relies on the east-west gateway pods that are provisioned in each cluster. To ensure high availability and avoid data plane outages, follow these best practices for managing node lifecycle events, configuring data plane resiliency, and understanding automatic health detection.
These practices apply to both LoadBalancer and NodePort peering methods, with NodePort having additional node-level considerations that are covered in the control plane resiliency and automatic health detection sections.
LoadBalancer vs NodePort for east-west and peering gateways
In each cluster, you create an east-west gateway, which is implemented as a ztunnel that facilitates traffic between services across clusters in your multicluster mesh. In the Solo distribution of Istio 1.28 and later, you can use either LoadBalancer or NodePort addresses to resolve cross-cluster traffic requests through this gateway. Note that the NodePort method is considered beta in the Solo distribution of Istio version 1.29.
LoadBalancer: In the standard LoadBalancer peering method, cross-cluster traffic through the east-west gateway resolves to its LoadBalancer address.
NodePort (beta): If you prefer to use direct pod-to-pod traffic across clusters, you can annotate the east-west and peering gateways so that cross-cluster traffic resolves to NodePort addresses. This method allows you to avoid LoadBalancer services to reduce cross-cluster traffic costs. Review the following considerations:
- The gateways must still be created with stable IP addresses, which are required for xDS communication with the istiod control plane in each cluster. NodePort peering applies to data plane communication only: requests to services resolve to the NodePort instead of the LoadBalancer address. Also, the east-west gateway must have the topology.istio.io/cluster label.
- If a node in a target cluster becomes inaccessible, such as during a restart or replacement, a delay can occur before the client cluster becomes aware of the new east-west gateway NodePort. In this case, you might see a connection error when trying to send cross-cluster traffic to an east-west gateway that is no longer accepting connections.
- Only nodes where an east-west gateway pod is provisioned are considered targets for traffic.
- NodePort peering uses only InternalIP node addresses. Ensure that your environment is configured so that nodes are reachable via their InternalIP address. If you need to use another address type, such as ExternalIP, contact Solo for engineering design and implementation support.
- Like LoadBalancer gateways, NodePort gateways support traffic from Envoy-based ingress gateways, waypoints, and sidecars.
- This feature is in a beta state. For more information, see Solo feature maturity.
- This feature is only supported for default network setups, and is not applicable in flat network setups.
Configuration steps for using either LoadBalancer or NodePort addresses are included in all multicluster peering guides.
Manage node lifecycle events
Node lifecycle events occur during Kubernetes upgrades and can affect east-west gateway availability. During a Kubernetes upgrade for a specific node, the node is first cordoned, then drained of its pods; the node is restarted; and finally the node is uncordoned, at which point pod scheduling resumes on the node.
To control where the east-west gateway pods are provisioned and how many are available at any point in time on the cluster, specify node affinity on the east-west gateway deployment and introduce a PodDisruptionBudget.
Configure node affinity
Use node affinity to choose which nodes are eligible for scheduling east-west gateway pods. Node affinity acts as a filter: you include or exclude specific nodes. For example, you might want to exclude control plane nodes or nodes that you put into maintenance mode from scheduling east-west gateway pods. For more information on node affinity, see the Kubernetes assign Pods to nodes documentation.
To specify node affinity for east-west gateways, create a ConfigMap with the gateway.istio.io/defaults-for-class label and a value of istio-eastwest, which corresponds to the GatewayClass that the east-west gateway is based on. Then, define your node affinity by using deployment template overrides. Alternatively, if you use the peering Helm chart to deploy east-west gateways, you can set node affinity in your Helm values file.
To find node hostnames, run:
kubectl get nodes -o jsonpath='{.items[*].metadata.labels.kubernetes\.io/hostname}'
If you have more east-west gateway replicas than available nodes after node affinity exclusions are applied, excess pods can remain in the Pending state indefinitely.
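The ConfigMap-based approach described above can be sketched as follows. This is a minimal example, assuming the gateway runs in the istio-system namespace; the ConfigMap name and the excluded hostname are illustrative placeholders that you must replace with values from your cluster.

```yaml
# Sketch: node affinity for east-west gateway pods via a GatewayClass
# defaults ConfigMap. The name, namespace, and hostname below are
# assumptions; adjust them to your environment.
apiVersion: v1
kind: ConfigMap
metadata:
  name: istio-eastwest-defaults
  namespace: istio-system
  labels:
    # Applies these defaults to gateways of the istio-eastwest class
    gateway.istio.io/defaults-for-class: istio-eastwest
data:
  # Deployment template override merged into the generated deployment
  deployment: |
    spec:
      template:
        spec:
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                - matchExpressions:
                  # Exclude a node that is in maintenance mode
                  - key: kubernetes.io/hostname
                    operator: NotIn
                    values:
                    - node-in-maintenance
```

Using the NotIn operator excludes specific nodes while leaving all other nodes eligible; an In operator would instead restrict scheduling to an allowlist of nodes.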
Configure a PodDisruptionBudget
When multiple nodes must be upgraded, use a PodDisruptionBudget to specify the minimum number of pods that must be available at a time on the cluster. The PodDisruptionBudget adds protection during node draining events (not catastrophic node failure) and blocks draining until the minimum number of pods are in a Ready state on the cluster. This allows nodes to upgrade simultaneously without making the east-west gateway unavailable.
Karpenter NodePool expiration considerations
If you use Karpenter for automated node provisioning and decommissioning, be aware of how NodePool expiration can affect east-west gateway availability. Karpenter uses NodePool custom resources to define blueprints for what kind of nodes to create and under what constraints. The expireAfter field in a NodePool specifies when nodes can be removed and new ones provisioned.
If you define multiple NodePool resources with the same expireAfter value, nodes expire simultaneously. When this happens, pods on those nodes enter a Pending state until new nodes are provisioned and the pods are scheduled to them. If all east-west gateway pods are on nodes that expire at the same time, data plane traffic drops. This applies to both LoadBalancer and NodePort peering methods due to the unavailability of the gateway pods, not the routing method.
To prevent this scenario, stagger the expireAfter values across your NodePool resources so that nodes do not expire simultaneously. In addition, use a PodDisruptionBudget to ensure that a minimum number of east-west gateway pods remain available throughout the node expiration and replacement process.
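The staggering described above can be sketched with two NodePool fragments. The pool names, durations, and field placement follow the Karpenter v1 API as an assumption; only the expireAfter fields are the point of the example, and the omitted template fields (requirements, nodeClassRef) must be filled in for your environment.

```yaml
# Sketch: two Karpenter NodePool fragments with staggered expireAfter
# values so their nodes do not expire at the same time. Names and
# durations are illustrative; other required NodePool fields are omitted.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: pool-a
spec:
  template:
    spec:
      expireAfter: 720h  # 30 days
---
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: pool-b
spec:
  template:
    spec:
      expireAfter: 696h  # 29 days, offset from pool-a by one day
```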
Data plane resiliency
To ensure high availability and avoid data plane outages, run the east-west gateway in a multi-replica state with pod anti-affinity. Unlike node affinity (which filters which nodes are eligible), pod anti-affinity controls where replicas are placed relative to each other: it spreads east-west gateway pods across different nodes so that no two gateway pods run on the same node. For more information on pod affinity and anti-affinity, see the Kubernetes assign Pods to nodes documentation. That way, if one node fails, the other replica(s) remain available and data plane traffic can still traverse at least one path in a multi-cluster scenario.
For pod anti-affinity to take effect, the label that you specify in topologyKey must exist on all relevant nodes. For zone-level resilience in multi-zone deployments, consider using topology.kubernetes.io/zone instead of kubernetes.io/hostname.
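The anti-affinity for east-west gateway replicas can be sketched as the following deployment override fragment. The pod label is an assumption; match it to the labels on your gateway pods.

```yaml
# Sketch: pod anti-affinity fragment for the east-west gateway
# deployment, preventing two gateway replicas from landing on the
# same node. The label selector is an assumption; verify it against
# your gateway pods.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          gateway.networking.k8s.io/gateway-name: istio-eastwest
      # Hostname-level spreading: one replica per node
      topologyKey: kubernetes.io/hostname
```

Because this uses requiredDuringSchedulingIgnoredDuringExecution, you need at least as many eligible nodes as gateway replicas; otherwise excess replicas stay Pending.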
Control plane resiliency (NodePort only)
When NodePort peering is enabled, if a node fails non-gracefully and an istiod pod is provisioned to that node, config propagation can be disrupted. This can affect the ability of the control plane to propagate configuration updates to workloads across clusters.
To mitigate this risk, run istiod in a multi-replica deployment with pod anti-affinity or topology spread constraints. This ensures that istiod pods are distributed across different nodes, providing resiliency against catastrophic node failure. With LoadBalancer peering, the LoadBalancer IP abstracts away node-level details, so this configuration is not required.
The following example shows how to configure istiod with pod anti-affinity using Helm values. This configuration uses preferredDuringSchedulingIgnoredDuringExecution to prefer spreading istiod pods across different zones, but allows scheduling on the same node if necessary.
pilot:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: istio.io/rev
              operator: In
              values:
              - main
          topologyKey: topology.kubernetes.io/zone
        weight: 100
  enabled: true
For stricter node-level distribution, you can use requiredDuringSchedulingIgnoredDuringExecution with kubernetes.io/hostname as the topology key. Alternatively, you can use topology spread constraints instead of pod anti-affinity to achieve similar distribution goals.
When using requiredDuringSchedulingIgnoredDuringExecution, ensure that you have enough nodes to satisfy the anti-affinity requirements. If you request more replicas than available nodes after applying the constraints, excess pods can remain in the Pending state indefinitely.
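The topology spread constraint alternative mentioned above can be sketched as follows. This assumes your istiod Helm chart version exposes a pilot.topologySpreadConstraints value and that istiod pods carry the app: istiod label; verify both against your chart.

```yaml
# Sketch: topology spread constraints for istiod as an alternative to
# pod anti-affinity. The values key and the label selector are
# assumptions; check them against your istiod Helm chart and pod labels.
pilot:
  topologySpreadConstraints:
  - maxSkew: 1
    # Spread across nodes; use topology.kubernetes.io/zone for zones
    topologyKey: kubernetes.io/hostname
    # Prefer spreading, but still schedule if the constraint cannot be met
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: istiod
```

Compared with required anti-affinity, ScheduleAnyway degrades gracefully when nodes are scarce instead of leaving replicas Pending.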
Automatic node health detection (NodePort only)
When NodePort peering is enabled, Istio monitors node health status. When a node transitions to a NotReady state (for example, during catastrophic failure), Istio excludes the node from traffic routing. This health detection is specific to NodePort peering because traffic routes directly to nodes. With LoadBalancer peering, the LoadBalancer IP abstracts away node-level details, so this health detection is not needed.
How quickly an unhealthy node transitions from Ready to NotReady depends on the node monitor grace period that is configured for the kube-controller-manager. By default, this timeout is 40 seconds. You can change the timeout by updating the --node-monitor-grace-period flag in the kube-controller-manager manifest or Helm values.
For example, you might want to set a shorter value to detect node failure faster. However, keep in mind that shorter values can cause false positives during brief network blips. A longer value tolerates transient issues but delays excluding truly failed nodes.
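On clusters where kube-controller-manager runs as a static pod, the flag change can be sketched as the following manifest fragment. The file path and surrounding fields vary by distribution, and the 20s value is an illustration of a shorter timeout, not a recommendation; managed Kubernetes offerings may not expose this flag at all.

```yaml
# Sketch: static pod manifest fragment for kube-controller-manager,
# typically at /etc/kubernetes/manifests/kube-controller-manager.yaml
# on kubeadm clusters. Only the relevant flag is shown; other flags
# and fields are omitted.
spec:
  containers:
  - name: kube-controller-manager
    command:
    - kube-controller-manager
    # Shorter grace period: faster NotReady detection, but more
    # sensitive to brief network blips
    - --node-monitor-grace-period=20s
```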
Complete configuration example
Applying all of these best practices results in the following example resources. These configurations apply to both LoadBalancer and NodePort peering methods. Note that the automatic node health detection feature and control plane resiliency configuration described earlier are specific to NodePort peering only.
For NodePort peering, you can also configure istiod with a multi-replica deployment and pod anti-affinity or topology spread constraints, as described in the control plane resiliency section.