Resiliency
Review options for making your individual apps and overall ambient mesh more resilient.
Increase resiliency for individual apps
To make your apps more resilient, you can follow the resiliency guides in the community ambient mesh docs to configure retries, timeouts, circuit breakers, outlier detection, and fault injection within the ambient mesh. Note that these guides use Istio resources, such as DestinationRules and VirtualServices, to modify how requests are routed from a gateway or waypoint to your apps.
Resiliency through ztunnel outlier detection
In versions 1.26.1-patch0 and later of the Solo distribution of Istio, outlier detection is enabled by default on ztunnels for outbound client app connections and for connections through east-west network gateways.
About
Outlier detection is performed at the level of the ztunnel to the backend pods. For example, in a single cluster, you might have appA on node1, which sends a request to appB on node2. This request path looks like appA -> node1 ztunnel -> node2 ztunnel -> appB. In this example, the node2 ztunnel performs outlier detection for the appB backends by using TCP connections. This allows for detection of unhealthy destinations, including workloads, network gateways, and network services. The ztunnel then uses an Exponentially Weighted Moving Average (EWMA) to ensure that failing backends receive fewer connections, and performs circuit breaking to eject backends with a growing backoff for consecutive failures.
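To make the EWMA weighting and growing-backoff ejection more concrete, the following Python sketch models one backend's health state. It is only an illustration under stated assumptions, not the ztunnel implementation: the class name BackendHealth, the smoothing factor, the base and maximum ejection times, the doubling backoff, and the choice to eject on every failed connection are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BackendHealth:
    """Hypothetical per-backend state; not ztunnel's actual data structure."""

    alpha: float = 0.3             # assumed EWMA smoothing factor
    base_ejection_s: float = 5.0   # assumed ejection time after the first failure
    max_ejection_s: float = 300.0  # assumed cap on the ejection time
    score: float = 1.0             # 1.0 means all recent connections succeeded
    consecutive_failures: int = 0
    ejected_until: float = 0.0     # timestamp until which the backend is ejected

    def record(self, success: bool, now: float) -> None:
        """Fold one TCP connection result into the moving average."""
        sample = 1.0 if success else 0.0
        self.score = self.alpha * sample + (1 - self.alpha) * self.score
        if success:
            self.consecutive_failures = 0
            return
        # Each consecutive failure doubles the ejection time, up to the cap.
        self.consecutive_failures += 1
        backoff = self.base_ejection_s * 2 ** (self.consecutive_failures - 1)
        self.ejected_until = now + min(backoff, self.max_ejection_s)

    def weight(self, now: float) -> float:
        """Load-balancing weight: zero while ejected, else the EWMA score."""
        return 0.0 if now < self.ejected_until else self.score
```

In this sketch, a backend that fails once is skipped for a short time and then returns with a reduced weight, while a backend that keeps failing is ejected for progressively longer periods.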
In the ztunnel outlier detection system, a backend’s “health” is determined by whether a TCP connection succeeds or fails. Because the TCP connection checks are performed at the level of the ztunnel to the backend pods, this system does not involve Layer 7 processing or inspect HTTP status codes. For example, if your server accepts the TCP connection request but sends an HTTP 500 error, the ztunnel still marks the backend as healthy because it accepts TCP traffic and is responsive. To set up additional outlier detection based on HTTP status codes, you can configure classic outlier detection in an Istio destination rule. See the community ambient mesh docs for outlier detection in waypoint or gateway routing.
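The difference between a TCP-level and an HTTP-level health signal can be illustrated with a short Python sketch. The helper name tcp_healthy and its timeout are hypothetical, and the ztunnel derives health from real traffic rather than from a probe like this one. The point is only that a check of this kind reports success as soon as the TCP handshake completes, regardless of what the server later sends at Layer 7.

```python
import socket


def tcp_healthy(host: str, port: int, timeout_s: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) can be established.

    Nothing is read from the socket, so an application that answers every
    request with HTTP 500 still counts as healthy here.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

A server that listens on the port but returns only HTTP 500 responses passes this check, which is why status-code-based detection needs a Layer 7 component such as a waypoint or gateway.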
Known limitations
East-west gateways
Consider a multicluster request path, in which appA on node1 of cluster1 sends a request to appB on node2 of cluster2. This request path looks like appA -> node1 ztunnel -> east-west gateway (ztunnel) -> node2 ztunnel -> appB. In this example, the node1 ztunnel performs outlier detection on the connection to the cluster2 east-west gateway and the connection to the network service that is behind the east-west gateway. Network service refers to the portion of an app’s global service on some specific network, like the set of backends for the appB.my-ns.mesh.internal global service in cluster2.
However, while the east-west gateway can detect whether the node2 ztunnel is unhealthy, it cannot detect whether appB backends on node2 are unhealthy. Although the east-west gateway is aware of the outer tunnel of the double HBONE protocol that connects to the node2 ztunnel, it is not aware of the inner tunnel that connects all the way to the appB backend.
This might cause a problem when, for example, only a subset of the appB.my-ns.mesh.internal backends in cluster2 (such as those on node2) is failing, while other backend instances in cluster2 are healthy. Because the east-west gateway cannot detect which individual backends are unavailable, it might report an unhealthy connection for the entire appB.my-ns.mesh.internal global service in cluster2. As a result, the node1 ztunnel for client appA might eject or deprioritize the entire set of appB.my-ns.mesh.internal backends in the remote network of cluster2 when only some proportion of the remote backends are unhealthy, rather than ejecting only the unhealthy backends.
Multi-port workloads
The current outlier detection implementation tracks health on a per-workload basis. If one workload serves on multiple ports, any unhealthy port can cause the entire workload to be ejected or deprioritized.
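The effect of per-workload tracking can be sketched in Python as follows. This is a hypothetical model, not the ztunnel code: the class name WorkloadTracker, the failure threshold, and the workload names are assumptions made for illustration. The key detail is that the port is not part of the health key, so failures on one port affect every port of the same workload.

```python
class WorkloadTracker:
    """Hypothetical health tracker keyed by workload only, ignoring the port."""

    def __init__(self, failure_threshold: int = 3):
        self.failure_threshold = failure_threshold  # assumed threshold
        self.failures: dict[str, int] = {}          # workload -> consecutive failures

    def record(self, workload: str, port: int, success: bool) -> None:
        # The port is deliberately left out of the key, so results from all
        # ports of a workload share one failure counter.
        if success:
            self.failures[workload] = 0
        else:
            self.failures[workload] = self.failures.get(workload, 0) + 1

    def is_ejected(self, workload: str, port: int) -> bool:
        # The answer is the same for every port of the workload.
        return self.failures.get(workload, 0) >= self.failure_threshold
```

For example, three failed connections to port 9090 of a workload would also mark its port 8080 as ejected in this model, even if port 8080 is serving traffic normally.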