Increase resiliency for individual apps

To make your apps more resilient, you can follow the resiliency guides in the community ambient mesh docs to configure retries, timeouts, circuit breakers, outlier detection, and fault injection within the ambient mesh. Note that these guides use Istio resources, such as DestinationRules and VirtualServices, to modify how requests are routed from a gateway or waypoint to your apps.

Resiliency through ztunnel outlier detection

In versions 1.26.1-patch0 and later of the Solo distribution of Istio, outlier detection is enabled by default on ztunnels for outbound client app connections and for connections through east-west network gateways.

About ztunnel outlier detection

Outlier detection is performed at the level of the ztunnel to the backend pods. As the ztunnel socket for an app proxies traffic to the backend, ztunnel records whether each connection succeeded in its EndpointHealth table. This way, ztunnel makes outlier detection decisions for the socket configuration based on actual traffic.

Because the connection checks are performed between the ztunnel and the backend pods, this mechanism does not operate at Layer 7 and does not check HTTP status codes. HTTP error responses, such as 503 status codes, are not factored into the current implementation.

For example, if your server accepts the connection request but returns an HTTP 500 error, the ztunnel still marks the backend as healthy because the backend accepts traffic and is responsive. To set up additional outlier detection based on HTTP status codes, you can configure outlier detection in an Istio destination rule that is enforced by a waypoint. For more information, see the community ambient mesh docs for outlier detection in waypoint or gateway routing.

If ztunnel has no healthy workloads available during load balancing, it enters a “dazed” state. This state is similar to what Envoy Proxy load balancing calls “panicking”. In this state, ztunnel weights all workloads equally instead of failing outright with a “no healthy upstream” error.
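
As a rough illustration of the dazed fallback, the following Python sketch picks a backend from a set of workloads. The Workload type, the pick_backend function, and the uniform random choice are hypothetical and do not reflect ztunnel's actual implementation. Only the fallback behavior is taken from the description above.

```python
import random
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    healthy: bool  # hypothetical flag derived from recorded connection results


def pick_backend(workloads: list[Workload], rng: random.Random) -> Workload:
    """Prefer healthy workloads; if none are healthy, fall back to all of them."""
    if not workloads:
        raise ValueError("no workloads to choose from")
    healthy = [w for w in workloads if w.healthy]
    # "Dazed" state: no healthy workloads remain, so every workload is weighted
    # equally instead of the request failing with "no healthy upstream".
    candidates = healthy if healthy else workloads
    return rng.choice(candidates)
```

With at least one healthy workload, only healthy workloads are candidates. When every workload is unhealthy, the sketch still returns a backend rather than raising an error.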

Configuration modes

Outlier detection for ztunnel is configured according to the following modes:

  • Circuit breaking: Eject backends with a growing backoff after consecutive failures.
  • Exponentially Weighted Moving Average (EWMA): Send fewer connections to failing backends.
  • Hybrid (default): Combine both modes. EWMA ensures that failing backends receive fewer connections, and circuit breaking ejects backends with a growing backoff after consecutive failures.
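
To make the hybrid mode more concrete, the following Python sketch tracks one backend with both an EWMA score and a consecutive-failure counter. The BackendHealth class, the starting score of 1.0, the use of the health floor as a lower clamp on the weight, and the exact update formula are assumptions for illustration, not ztunnel's actual code. The default values are taken from the environment variable defaults in this section.

```python
from dataclasses import dataclass


@dataclass
class BackendHealth:
    """Hypothetical per-backend health record for a hybrid-mode sketch."""

    alpha: float = 0.3          # EWMA_ALPHA default
    floor: float = 0.05         # EWMA_HEALTH_FLOOR default
    failure_threshold: int = 2  # CIRCUIT_BREAKER_FAILURE_THRESHOLD default
    score: float = 1.0          # assumption: backends start fully healthy
    consecutive_failures: int = 0

    def record(self, success: bool) -> None:
        """Fold one connection result into the score and the failure streak."""
        sample = 1.0 if success else 0.0
        # Standard EWMA update: recent results count more than older ones.
        self.score = self.alpha * sample + (1.0 - self.alpha) * self.score
        self.consecutive_failures = 0 if success else self.consecutive_failures + 1

    @property
    def ejected(self) -> bool:
        """Circuit breaking: eject after enough consecutive failures."""
        return self.consecutive_failures >= self.failure_threshold

    @property
    def weight(self) -> float:
        """EWMA: a lower score means fewer connections; ejected means none."""
        if self.ejected:
            return 0.0
        # Assumption: the floor keeps a small share of traffic flowing.
        return max(self.score, self.floor)
```

In this sketch, one failure lowers the weight from 1.0 to 0.7, and a second consecutive failure ejects the backend. Re-admitting the backend after its backoff expires is omitted for brevity.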

Typically, you do not need to adjust the default values for these modes. However, for specific use cases, you can set the following environment variables in the env field of the ztunnel Helm chart.

| ztunnel env variable name | Default value | Description |
| --- | --- | --- |
| EWMA_HEALTH_FLOOR | 0.05 | Minimum health score threshold for EWMA. |
| EWMA_ALPHA | 0.3 | Smoothing factor for EWMA calculations. This field emits a warning if you set a value that is technically valid but unsuitable. For example, if the value is greater than 0.5 or less than 0.05, the warning advises that low values cause slow updates and high values cause jitteriness. |
| CIRCUIT_BREAKER_FAILURE_THRESHOLD | 2 | Number of consecutive failures before circuit breaking is triggered. |
| CIRCUIT_BREAKER_BASE_BACKOFF_SECS | 60 | Base backoff duration in seconds for circuit breaking. |
| CIRCUIT_BREAKER_MAX_BACKOFF_SECS | 300 | Maximum backoff duration in seconds for circuit breaking. |
| CIRCUIT_BREAKER_JITTER_FRAC | 0.2 | Jitter fraction applied to backoff calculations. |
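
The following Python sketch shows one way that the base backoff, maximum backoff, and jitter fraction could interact. The doubling schedule, the symmetric jitter, and the backoff_secs function are assumptions for illustration. This section states only that the backoff grows with consecutive failures, not the exact formula that ztunnel uses.

```python
import random

BASE_BACKOFF_SECS = 60.0  # CIRCUIT_BREAKER_BASE_BACKOFF_SECS default
MAX_BACKOFF_SECS = 300.0  # CIRCUIT_BREAKER_MAX_BACKOFF_SECS default
JITTER_FRAC = 0.2         # CIRCUIT_BREAKER_JITTER_FRAC default


def backoff_secs(ejection_count: int, rng: random.Random) -> float:
    """Return a jittered backoff for the Nth consecutive ejection (N >= 1)."""
    if ejection_count < 1:
        raise ValueError("ejection_count must be at least 1")
    # Assumption: the backoff doubles per ejection and is capped at the maximum.
    capped = min(BASE_BACKOFF_SECS * 2 ** (ejection_count - 1), MAX_BACKOFF_SECS)
    # Assumption: jitter is symmetric, within +/- JITTER_FRAC of the capped value.
    jitter = rng.uniform(-JITTER_FRAC, JITTER_FRAC)
    return capped * (1.0 + jitter)
```

Under these assumptions, the first ejection lasts between 48 and 72 seconds, and later ejections settle around the 300-second cap, spread by the jitter so that clients do not all retry a backend at the same moment.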

Disable outlier detection

You can disable outlier detection behaviors by setting the environment variables to negative values.

  • EWMA behaviors: Set either EWMA_HEALTH_FLOOR or EWMA_ALPHA to a negative value.
  • Circuit breaking behaviors: Set either CIRCUIT_BREAKER_BASE_BACKOFF_SECS or CIRCUIT_BREAKER_MAX_BACKOFF_SECS to a negative value.
  • All ztunnel outlier detection: Disable both EWMA and circuit breaking by setting the respective variables to negative values.
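
The disable rules above can be sketched as a small Python check. The enabled_behaviors function and its return shape are hypothetical. The variable names and the negative-value rule come from this section.

```python
def enabled_behaviors(env: dict[str, float]) -> dict[str, bool]:
    """Report which behaviors stay enabled for the given ztunnel env overrides."""

    def is_negative(name: str) -> bool:
        # Variables that are not overridden keep their non-negative defaults.
        return env.get(name, 0.0) < 0

    ewma_off = is_negative("EWMA_HEALTH_FLOOR") or is_negative("EWMA_ALPHA")
    breaker_off = (
        is_negative("CIRCUIT_BREAKER_BASE_BACKOFF_SECS")
        or is_negative("CIRCUIT_BREAKER_MAX_BACKOFF_SECS")
    )
    return {"ewma": not ewma_off, "circuit_breaking": not breaker_off}
```

For example, passing {"EWMA_ALPHA": -1} reports EWMA as disabled and circuit breaking as still enabled.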

Known limitations

Review the following known limitations of ztunnel outlier detection.

East-west gateways

The ztunnel-based east-west gateway used for ambient multi-network cannot detect failures on the hop between the gateway and the backend if the backend application is causing the failure. It can only detect failures when connecting to the server ztunnel.

Consider a multicluster request path, in which appA on node1 of cluster1 sends a request to appB on node2 of cluster2.

In this example, the node1 ztunnel performs outlier detection on the connection to the cluster2 east-west gateway and the connection to the network service that is behind the cluster2 east-west gateway. Network service refers to the portion of an app’s global service on some specific network, like the set of backends for the appB.my-ns.mesh.internal global service in cluster2.

However, while the east-west gateway can detect whether the node2 ztunnel is unhealthy, it cannot detect whether appB backends on node2 are unhealthy. Although the east-west gateway is aware of the outer tunnel of the double HBONE protocol that connects to the node2 ztunnel, it is not aware of the inner tunnel that connects all the way to the appB backend.

This might cause a problem when, for example, only a subset of the appB.my-ns.mesh.internal backends in cluster2 (such as those on node2) are failing while other backend instances in cluster2 are healthy. The east-west gateway cannot detect which backends are unavailable, so it might report an unhealthy connection for the entire appB.my-ns.mesh.internal global service in cluster2. As a result, the node1 ztunnel for client appA might eject or deprioritize the entire set of appB.my-ns.mesh.internal backends in the remote network of cluster2 when only some of the remote backends are unhealthy, rather than ejecting only the unhealthy backends.

Multi-port workloads

The current outlier detection implementation tracks health on a per-workload basis. If one workload serves on multiple ports, any unhealthy port can cause the entire workload to be ejected or deprioritized.
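
The effect of per-workload tracking can be sketched in Python as follows. The failures_by_workload table, the function names, and the simple failure count are hypothetical simplifications. The point is only that the port is not part of the health key.

```python
from collections import defaultdict

# Hypothetical health table keyed by workload only, as in the limitation above.
failures_by_workload: defaultdict[str, int] = defaultdict(int)


def record_failure(workload: str, port: int) -> None:
    """Record a failed connection; the port is not part of the key."""
    failures_by_workload[workload] += 1


def is_ejected(workload: str, threshold: int = 2) -> bool:
    """A workload is ejected once its failure count reaches the threshold."""
    return failures_by_workload[workload] >= threshold
```

In this sketch, two failed connections to port 9090 of a workload eject the whole workload, so a healthy port 8080 on the same workload also stops receiving traffic.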