Increase resiliency for individual apps

To make your apps more resilient, you can follow the resiliency guides in the community ambient mesh docs to configure retries, timeouts, circuit breakers, outlier detection, and fault injection within the ambient mesh. Note that these guides use Istio resources, such as DestinationRules and VirtualServices, to modify how requests are routed from a gateway or waypoint to your apps.

Resiliency through ztunnel outlier detection

In versions 1.26.1-patch0 and later of the Solo distribution of Istio, outlier detection is enabled by default on ztunnels for outbound client app connections and for connections through east-west network gateways.

About

Outlier detection is performed at the level of the ztunnel to the backend pods. For example, in a single cluster, you might have appA on node1, which sends a request to appB on node2. This request path looks like appA -> node1 ztunnel -> node2 ztunnel -> appB. In this example, the node2 ztunnel performs outlier detection for the appB backends by using TCP connections. This allows for detection of unhealthy destinations, including workloads, network gateways, and network services. The ztunnel then uses an exponentially weighted moving average (EWMA) to ensure that failing backends receive fewer connections, and performs circuit breaking to eject backends with a growing backoff for consecutive failures.
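The combination of EWMA scoring and ejection with a growing backoff can be sketched as follows. This is a minimal illustration only, not the ztunnel's actual implementation: the BackendHealth class, the smoothing factor, the base and maximum ejection times, and the doubling backoff are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class BackendHealth:
    """Illustrative per-backend health record (hypothetical, not ztunnel code)."""

    alpha: float = 0.3              # assumed EWMA smoothing factor
    base_ejection_s: float = 5.0    # assumed ejection time for the first failure
    max_ejection_s: float = 300.0   # assumed cap on the ejection time
    score: float = 1.0              # EWMA of outcomes: 1.0 = success, 0.0 = failure
    consecutive_failures: int = 0
    ejected_until: float = 0.0

    def record(self, success: bool, now: float) -> None:
        """Fold one TCP connection outcome into the backend's health."""
        sample = 1.0 if success else 0.0
        self.score = self.alpha * sample + (1 - self.alpha) * self.score
        if success:
            self.consecutive_failures = 0
            return
        self.consecutive_failures += 1
        # The backoff doubles with each consecutive failure, up to the cap.
        backoff = self.base_ejection_s * 2 ** (self.consecutive_failures - 1)
        self.ejected_until = now + min(backoff, self.max_ejection_s)

    def weight(self, now: float) -> float:
        """Load-balancing weight: zero while ejected, otherwise the EWMA score."""
        return 0.0 if now < self.ejected_until else self.score
```

In this sketch, a backend that fails twice in a row is ejected for 5 seconds and then 10 seconds, and after the ejection ends it still receives a reduced share of connections until successful connections raise its score again.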

In the ztunnel outlier detection system, a backend’s “health” is determined by whether a TCP connection succeeds or fails. Because the TCP connection checks are performed at the level of the ztunnel to the backend pods, this system does not inspect Layer 7 traffic or check HTTP status codes. For example, if your server accepts the TCP connection request but then returns an HTTP 500 error, the ztunnel still marks the backend as healthy because it accepts TCP traffic and is responsive. To set up additional outlier detection based on HTTP status codes, you can configure classic outlier detection in an Istio destination rule. See the community ambient mesh docs for outlier detection in waypoint or gateway routing.
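To make the TCP-only notion of health concrete, the following sketch shows a check that looks only at whether a connection can be established. The tcp_healthy function is a hypothetical helper written for this example, not part of ztunnel or Istio. Note that the check never reads an HTTP response, so a server that always answers with HTTP 500 still counts as healthy.

```python
import socket


def tcp_healthy(host: str, port: int, timeout_s: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) can be established.

    Only the TCP handshake is considered. Anything the server sends
    afterwards, such as an HTTP 500 response, is never inspected.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Refused, unreachable, and timed-out connections all count as unhealthy.
        return False
```

For example, tcp_healthy("127.0.0.1", 8080) reports on a local listener regardless of which status codes the app behind it returns.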

Known limitations

East-west gateways

Consider a multicluster request path, in which appA on node1 of cluster1 sends a request to appB on node2 of cluster2. This request path looks like appA -> node1 ztunnel -> east-west gateway (ztunnel) -> node2 ztunnel -> appB. In this example, the node1 ztunnel performs outlier detection on the connection to the cluster2 east-west gateway and the connection to the network service that is behind the east-west gateway. Network service refers to the portion of an app’s global service on some specific network, like the set of backends for the appB.my-ns.mesh.internal global service in cluster2.

However, while the east-west gateway can detect whether the node2 ztunnel is unhealthy, it cannot detect whether appB backends on node2 are unhealthy. Although the east-west gateway is aware of the outer tunnel of the double HBONE protocol that connects to the node2 ztunnel, it is not aware of the inner tunnel that connects all the way to the appB backend.

This limitation can cause a problem when, for example, only a subset of the appB.my-ns.mesh.internal backends in cluster2 (such as those on node2) are failing while the other backend instances in cluster2 are healthy. The east-west gateway is unable to detect which backends are unavailable, and reports all backends of appB as unavailable. Thus, the east-west gateway might report an unhealthy connection for the appB.my-ns.mesh.internal global service in cluster2. As a result, the node1 ztunnel for client appA might eject or deprioritize the entire set of appB.my-ns.mesh.internal backends in the remote network of cluster2 when only some proportion of the remote backends are unhealthy, rather than ejecting only the unhealthy backends.
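The loss of detail can be illustrated with a small sketch that compares the failure rate visible at the level of the whole remote network service with the per-backend rates that the client cannot see. The failure_rates function and the backend names are hypothetical and exist only for this example. They do not reflect how the ztunnel stores or computes health.

```python
from collections import Counter


def failure_rates(results):
    """Compute the failure rate for the whole set and for each backend.

    results: iterable of (backend_name, succeeded) pairs for connections
    that were sent through the east-west gateway.
    Returns (network_rate, per_backend_rates).
    """
    results = list(results)
    failed = sum(1 for _, ok in results if not ok)
    network_rate = failed / len(results) if results else 0.0

    totals = Counter(name for name, _ in results)
    failures = Counter(name for name, ok in results if not ok)
    per_backend = {name: failures[name] / count for name, count in totals.items()}
    return network_rate, per_backend


# Hypothetical traffic: only the backend on node2 fails.
observed = [("appB-node2", False), ("appB-node3", True), ("appB-node4", True)] * 2
network_rate, per_backend = failure_rates(observed)
```

Here, network_rate is one third while per_backend shows that only appB-node2 fails. A client that sees only the network-level figure can act on the whole remote set, not on the single failing backend.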

Multi-port workloads

The current outlier detection implementation tracks health on a per-workload basis. If one workload serves on multiple ports, any unhealthy port can cause the entire workload to be ejected or deprioritized.
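The effect of per-workload tracking can be sketched as follows. The WorkloadTracker class, its failure threshold, and the workload names are assumptions made for illustration and are not the ztunnel's real data structures. The key point is that the port is not part of the health key, so failures on one port affect traffic to every port of that workload.

```python
from collections import defaultdict


class WorkloadTracker:
    """Illustrative tracker that keys health by workload only (hypothetical)."""

    def __init__(self, threshold: int = 3):  # assumed consecutive-failure threshold
        self.threshold = threshold
        self.failures = defaultdict(int)

    def record(self, workload: str, port: int, success: bool) -> None:
        # The port is deliberately ignored: health is tracked per workload.
        if success:
            self.failures[workload] = 0
        else:
            self.failures[workload] += 1

    def is_ejected(self, workload: str, port: int) -> bool:
        # The answer is the same for every port of the workload.
        return self.failures[workload] >= self.threshold


tracker = WorkloadTracker(threshold=2)
tracker.record("appB-pod-1", 9090, success=False)
tracker.record("appB-pod-1", 9090, success=False)
```

After the two failures on port 9090, tracker.is_ejected("appB-pod-1", 8080) also returns True, even though port 8080 never failed.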