Resiliency
Review options for making your individual apps and overall ambient mesh more resilient.
Increase resiliency for individual apps
To make your apps more resilient, you can follow the resiliency guides in the community ambient mesh docs to configure retries, timeouts, circuit breakers, outlier detection, and fault injection within the ambient mesh. Note that these guides use Istio resources, such as DestinationRules and VirtualServices, to modify how requests are routed from a gateway or waypoint to your apps.
Resiliency through ztunnel outlier detection
In versions 1.26.1-patch0 and later of the Solo distribution of Istio, outlier detection is enabled by default on ztunnels for outbound client app connections and for connections through east-west network gateways.
About ztunnel outlier detection
Outlier detection is performed on the connections from the ztunnel to the backend pods. As the ztunnel socket for an app proxies traffic to the backend, it records whether the connection was successful or not in an EndpointHealth table in ztunnel. This way, ztunnel makes outlier detection decisions for the socket configuration based on actual traffic.
Because these checks happen at the connection level between the ztunnel and the backend pods, they do not involve Layer 7 processing or inspection of HTTP status codes. HTTP error responses, such as 503 status codes, are not factored into the current implementation.
For example, if your server accepts the connection request, but sends an HTTP 500 error, the ztunnel still marks the backend as healthy because it accepts traffic and is responsive. To set up additional outlier detection based on HTTP status codes, you can configure outlier detection in an Istio destination rule that is enforced by a waypoint. For more information, see the community ambient mesh docs for outlier detection in waypoint or gateway routing.
If ztunnel has no healthy workloads available during load balancing, it enters a “dazed” state. This state is similar to what Envoy Proxy load balancing calls “panicking”. In this state, ztunnel weights all workloads equally instead of failing requests outright with a “no healthy upstream” error.
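The fallback described above can be sketched in a few lines of Python. This is only an illustration of the idea, not ztunnel's implementation: the function names `pick_backend` and `is_dazed` and the dictionary-based health table are hypothetical.

```python
import random


def is_dazed(health):
    """Return True when backends exist but none of them is healthy."""
    return bool(health) and not any(health.values())


def pick_backend(health, rng=random):
    """Pick a backend from a {name: is_healthy} mapping.

    Assumption: when no backend is healthy (the "dazed" state), every
    backend is considered with equal weight instead of raising an error.
    """
    if not health:
        raise LookupError("no backends configured")
    healthy = [name for name, ok in health.items() if ok]
    candidates = healthy if healthy else list(health)
    return rng.choice(candidates)


# One healthy backend: it is always selected.
print(pick_backend({"pod-a": True, "pod-b": False}))
# No healthy backends: selection falls back to all backends.
print(pick_backend({"pod-a": False, "pod-b": False}))
```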
Configuration modes
Outlier detection for ztunnel is configured according to the following modes:
- Circuit breaking: Eject backends with a growing backoff for consecutive failures.
- Exponentially Weighted Moving Average (EWMA): Send fewer connections to failing backends.
- Hybrid (default): A mixture of both modes that combines EWMA to ensure failing backends receive fewer connections and circuit breaking to eject backends with a growing backoff for consecutive failures.
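To make the hybrid mode more concrete, the following Python sketch tracks both signals for a single backend. It is a simplified model under stated assumptions, not ztunnel's code: the `BackendHealth` class is hypothetical, the score is assumed to be a standard EWMA of success (1.0) and failure (0.0) samples, the floor is assumed to be a lower bound on that score, and the backoff timing of ejections is left out. The default values mirror the environment variable defaults listed in this section.

```python
from dataclasses import dataclass


@dataclass
class BackendHealth:
    alpha: float = 0.3           # assumed EWMA smoothing factor
    floor: float = 0.05          # assumed lower bound for the score
    failure_threshold: int = 2   # consecutive failures before ejection
    score: float = 1.0           # 1.0 means fully healthy
    consecutive_failures: int = 0

    def record(self, success: bool) -> None:
        """Fold one connection result into both signals."""
        sample = 1.0 if success else 0.0
        ewma = self.alpha * sample + (1 - self.alpha) * self.score
        self.score = max(self.floor, ewma)
        self.consecutive_failures = 0 if success else self.consecutive_failures + 1

    @property
    def ejected(self) -> bool:
        """Circuit breaking: eject after enough consecutive failures."""
        return self.consecutive_failures >= self.failure_threshold


health = BackendHealth()
health.record(False)
print(round(health.score, 2), health.ejected)  # lower score, not yet ejected
health.record(False)
print(round(health.score, 2), health.ejected)  # lower still, now ejected
```

In this model, the score could be used as a load balancing weight so that a flaky backend receives fewer connections, while the consecutive failure counter handles backends that fail outright.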
Typically, you do not need to adjust the default values for these modes. However, for specific use cases, you can set the following environment variables in the env field of the ztunnel Helm chart.
| ztunnel env variable name | Default value | Description |
|---|---|---|
| EWMA_HEALTH_FLOOR | 0.05 | Minimum health score threshold for EWMA. |
| EWMA_ALPHA | 0.3 | Smoothing factor for EWMA calculations. This field emits a warning if you set a value that is technically valid but unsuitable. For example, if the value is greater than 0.5 or less than 0.05, the warning advises that low values cause slow updates and high values cause jitteriness. |
| CIRCUIT_BREAKER_FAILURE_THRESHOLD | 2 | Number of consecutive failures before circuit breaking is triggered. |
| CIRCUIT_BREAKER_BASE_BACKOFF_SECS | 60 | Base backoff duration in seconds for circuit breaking. |
| CIRCUIT_BREAKER_MAX_BACKOFF_SECS | 300 | Maximum backoff duration in seconds for circuit breaking. |
| CIRCUIT_BREAKER_JITTER_FRAC | 0.2 | Jitter fraction applied to backoff calculations. |
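As a rough illustration of how the three backoff settings might interact, the following Python sketch computes an ejection duration. The growth rule is an assumption: the source only says that the backoff grows, so doubling per consecutive ejection is used here as a common choice. The function name `backoff_secs` is hypothetical, and jitter is assumed to scale the result by a random factor within plus or minus the jitter fraction.

```python
import random


def backoff_secs(consecutive_ejections, base=60.0, max_backoff=300.0,
                 jitter_frac=0.2, rng=random):
    """Return an ejection duration in seconds.

    Assumptions: the backoff doubles with each consecutive ejection,
    is capped at max_backoff, and is then jittered by +/- jitter_frac.
    """
    if consecutive_ejections < 1:
        raise ValueError("consecutive_ejections must be at least 1")
    # Cap the exponent so very long failure streaks cannot overflow.
    exponent = min(consecutive_ejections - 1, 32)
    capped = min(base * 2 ** exponent, max_backoff)
    jitter = rng.uniform(-jitter_frac, jitter_frac) if jitter_frac > 0 else 0.0
    return capped * (1 + jitter)


# Without jitter, the defaults give 60, 120, 240, and then the 300 cap.
print([backoff_secs(n, jitter_frac=0) for n in range(1, 6)])
```

Jitter is typically used so that many clients that ejected the same backend at the same time do not all retry it at the same moment.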
Disable outlier detection
You can disable outlier detection behaviors by setting the environment variables to negative values.
- EWMA behaviors: Set either EWMA_HEALTH_FLOOR or EWMA_ALPHA to a negative value.
- Circuit breaking behaviors: Set either CIRCUIT_BREAKER_BASE_BACKOFF_SECS or CIRCUIT_BREAKER_MAX_BACKOFF_SECS to a negative value.
- All ztunnel outlier detection: Disable both EWMA and circuit breaking by setting the respective variables to negative values.
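The rules above can be expressed as a small check over a set of environment variables, which might be useful for validating Helm values before you apply them. This Python sketch is a hypothetical helper and does not reflect how ztunnel parses its environment. It assumes that unset or non-numeric values leave the behavior enabled.

```python
def outlier_detection_status(env):
    """Report which behaviors stay enabled for a {name: value} mapping."""

    def is_negative(name):
        raw = env.get(name)
        if raw is None:
            return False
        try:
            return float(raw) < 0
        except ValueError:
            # Assumption: unparseable values do not disable anything.
            return False

    ewma_disabled = is_negative("EWMA_HEALTH_FLOOR") or is_negative("EWMA_ALPHA")
    breaker_disabled = (
        is_negative("CIRCUIT_BREAKER_BASE_BACKOFF_SECS")
        or is_negative("CIRCUIT_BREAKER_MAX_BACKOFF_SECS")
    )
    return {"ewma": not ewma_disabled, "circuit_breaking": not breaker_disabled}


# Disables EWMA only; circuit breaking keeps its defaults.
print(outlier_detection_status({"EWMA_ALPHA": "-1"}))
```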
Known limitations
Review the following known limitations of ztunnel outlier detection.
East-west gateways
The ztunnel-based east-west gateway used for ambient multi-network cannot detect failures on the hop between the gateway and the backend if the backend application is causing the failure. It can only detect failures in connecting to the server-side ztunnel.
Consider a multicluster request path, in which appA on node1 of cluster1 sends a request to appB on node2 of cluster2.
In this example, the node1 ztunnel performs outlier detection on the connection to the cluster2 east-west gateway and the connection to the network service that is behind the cluster2 east-west gateway. Network service refers to the portion of an app’s global service on some specific network, like the set of backends for the appB.my-ns.mesh.internal global service in cluster2.
However, while the east-west gateway can detect whether the node2 ztunnel is unhealthy, it cannot detect whether appB backends on node2 are unhealthy. Although the east-west gateway is aware of the outer tunnel of the double HBONE protocol that connects to the node2 ztunnel, it is not aware of the inner tunnel that connects all the way to the appB backend.
This might cause a problem when, for example, only a subset of the appB.my-ns.mesh.internal backends in cluster2 (such as those on node2) are failing while other backend instances in cluster2 are healthy. The east-west gateway cannot detect which backends are unavailable, so it might report an unhealthy connection for the entire appB.my-ns.mesh.internal global service in cluster2. As a result, the node1 ztunnel for client appA might eject or deprioritize the entire set of appB.my-ns.mesh.internal backends in the remote network of cluster2 when only some of the remote backends are unhealthy, rather than ejecting only the unhealthy backends.
Multi-port workloads
The current outlier detection implementation tracks health on a per-workload basis. If one workload serves on multiple ports, any unhealthy port can cause the entire workload to be ejected or deprioritized.
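The effect of per-workload tracking can be seen in a small Python model. This sketch is illustrative only: the `WorkloadHealth` class is hypothetical, and it assumes that the health record is keyed by workload alone, so the port of a failed connection is not part of the key.

```python
from collections import defaultdict


class WorkloadHealth:
    """Count consecutive failures per workload, ignoring the port."""

    def __init__(self, failure_threshold=2):
        self.failure_threshold = failure_threshold
        self.failures = defaultdict(int)  # keyed by workload name only

    def record(self, workload, port, success):
        # The port is accepted but deliberately not used in the key.
        if success:
            self.failures[workload] = 0
        else:
            self.failures[workload] += 1

    def ejected(self, workload):
        return self.failures[workload] >= self.failure_threshold


tracker = WorkloadHealth()
tracker.record("appB-pod", 8080, True)    # healthy port
tracker.record("appB-pod", 9090, False)   # failing port
tracker.record("appB-pod", 9090, False)
print(tracker.ejected("appB-pod"))  # the whole workload is ejected
```

In this model, clients that only use port 8080 also lose access to the workload, which matches the limitation described above.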