Reduce transient failures and hanging systems by setting retries and timeouts. For more information, see the API docs.

About

You can use failover, outlier detection, and retry timeout policies together to build a more resilient application network. For example, an outlier detection policy can remove unhealthy destinations, a failover policy can redirect traffic to healthy destinations, and a retry policy can retry requests in case of failure. Review the following table to understand what each policy does.

Policy | Purpose
Failover | Choose destinations to re-route traffic to, based on the closest locality.
Outlier detection | Determine when and for how long to remove unhealthy destinations from the pool of healthy destinations.
Retry timeout | Decide how many times to retry requests before the outlier detection policy considers the request as failing and removes the service from the pool of healthy destinations.

About timeouts

A timeout is the amount of time that an Envoy proxy waits for replies from a service, ensuring that services don’t hang around waiting for replies forever. This allows calls to succeed or fail within a predictable timeframe.

By default, the Envoy timeout for HTTP requests is disabled in Istio. This impacts the default timeouts depending on the type of gateway as follows:

  • For north-south traffic through the ingress gateway, no default timeout is applied.
  • For service mesh traffic through the Istio east-west gateway, the Istio default timeout applies.

For some applications and services, Istio’s default timeout might not be appropriate. For example, a timeout that is too long can result in excessive latency from waiting for replies from failing services. On the other hand, a timeout that is too short can result in calls failing unnecessarily while waiting for an operation that needs responses from multiple services.

To find and use your optimal timeout settings, you can set timeouts dynamically per route with Gloo’s retry timeout policy.

For more information, see the Istio documentation.

About retries

A retry specifies the maximum number of times a gateway’s Envoy proxy attempts to connect to an upstream service if the initial call fails. Retries can enhance service availability and application performance by making sure that calls don’t fail permanently because of transient problems such as a temporarily overloaded service or network.

The interval between retries (25ms+) is variable and determined automatically by the gateway, to prevent the called service from being overwhelmed with requests. The default retry behavior for HTTP requests is to retry twice before returning the error.
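To make the variable interval concrete, the following shell sketch prints the upper bound of the wait before each retry. It assumes the jittered exponential backoff that Envoy documents: a 25ms base interval, a wait drawn at random from 0 up to (2^N - 1) times the base for retry N, and a cap of 10 times the base (250ms). The function name `backoff_upper_bound_ms` is hypothetical, and your gateway might be configured with different values.

```shell
# Sketch only: upper bound (in ms) of the jittered wait before retry N.
# Assumes a 25ms base interval and a cap of 10x the base (250ms).
backoff_upper_bound_ms() {
  retry="$1"
  base_ms="${2:-25}"
  max_ms=$(( base_ms * 10 ))
  bound_ms=$(( ((1 << retry) - 1) * base_ms ))
  if [ "$bound_ms" -gt "$max_ms" ]; then
    bound_ms="$max_ms"
  fi
  echo "$bound_ms"
}

for n in 1 2 3 4 5; do
  echo "retry ${n}: wait up to $(backoff_upper_bound_ms "$n")ms"
done
```

Under these assumptions the bounds are 25, 75, 175, 250, and 250 milliseconds. The actual wait before each retry is a random value below the bound.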

Like timeouts, the gateway’s default retry behavior might not suit your application needs in terms of latency or availability. For example, too many retries to a failed service can slow things down. Also like timeouts, you can adjust your retry settings on a per-route basis.
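As a rough way to reason about the latency cost of retries, the following sketch estimates how long a caller might wait when every try runs into its per-try timeout. It assumes that the retry count is in addition to the initial call, and it ignores the backoff between tries. The function name and the sample values (5 retries, 2000ms per try) are illustrative only.

```shell
# Sketch only: worst-case wait in ms when every try hits the per-try timeout.
# Assumes tries = 1 initial call + N retries; ignores backoff between tries.
worst_case_wait_ms() {
  retries="$1"
  per_try_timeout_ms="$2"
  echo $(( (1 + retries) * per_try_timeout_ms ))
}

worst_case_wait_ms 5 2000   # prints 12000
```

If an overall request timeout is also set on the route, that timeout would typically cap the total wait, so treat this number as an upper bound and not a prediction.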

Before you begin

  1. Set up Gloo Mesh Gateway in a single cluster.
  2. Install Bookinfo and other sample apps.
  3. Configure an HTTP listener on your gateway and set up basic routing for the sample apps.

Configure retry and timeout policies

You can apply a retry or timeout policy at the route level. For more information, see Applying policies.

Review the following sample configuration files.
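The retry sample appears in full in the verification steps of this guide. For the timeout side, a minimal sketch might look like the following. The policy name `timeout-only` is hypothetical, and the `requestTimeout` field name is an assumption that is not shown elsewhere in this guide, so check it against the API docs before you apply it.

```yaml
apiVersion: resilience.policy.gloo.solo.io/v2
kind: RetryTimeoutPolicy
metadata:
  name: timeout-only        # hypothetical name
  namespace: bookinfo
spec:
  applyToRoutes:
    - route:
        labels:
          route: ratings    # same route label as the retry sample
  config:
    # Assumed field name: overall time to wait for a reply on the selected routes.
    requestTimeout: 2s
```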

Verify retry and timeout policies

  1. Apply the following example retry policy in the cluster with the Bookinfo workspace in your example setup.

      kubectl apply -f - <<EOF
      apiVersion: resilience.policy.gloo.solo.io/v2
      kind: RetryTimeoutPolicy
      metadata:
        name: retry-only
        namespace: bookinfo
      spec:
        applyToRoutes:
          - route:
              labels:
                route: ratings # selects routes in the route table by label
        config:
          retries:
            attempts: 5 # optional (default is 2)
            perTryTimeout: 2s
            # retryOn lists the conditions that trigger a retry. Separate multiple conditions with commas.
            retryOn: "connect-failure,refused-stream,unavailable,cancelled,retriable-status-codes"
            # retryRemoteLocalities controls whether retries can go to other localities (default: false)
            retryRemoteLocalities: true
      EOF
      
  2. Verify that you can still send requests to the ratings app.

    • HTTP:
        curl -vik --resolve www.example.com:80:${INGRESS_GW_ADDRESS} http://www.example.com:80/ratings/1
        
    • HTTPS:
        curl -vik --resolve www.example.com:443:${INGRESS_GW_ADDRESS} https://www.example.com:443/ratings/1
        
  3. Send the ratings app to sleep to mimic an unresponsive app.

      kubectl -n bookinfo patch deploy ratings-v1 --patch '{"spec":{"template":{"spec":{"containers":[{"name":"ratings","command":["sleep","20h"]}]}}}}' 
      
  4. Verify that requests to ratings now fail.

    • HTTP:
        curl -vik --resolve www.example.com:80:${INGRESS_GW_ADDRESS} http://www.example.com:80/ratings/1
        
    • HTTPS:
        curl -vik --resolve www.example.com:443:${INGRESS_GW_ADDRESS} https://www.example.com:443/ratings/1
        
  5. Optional: Check the ingress gateway logs.

      kubectl logs deploy/ingress-gateway -n gloo-mesh-gateways
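
To compare behavior before and after you put the ratings app to sleep, it can help to print only the status code and total time for a few requests. The following is a sketch for the HTTP case. `probe_route` is a hypothetical helper, and it assumes that `INGRESS_GW_ADDRESS` is set as in the earlier steps.

```shell
# Sketch only: send COUNT requests and print "<status> <seconds>" for each.
# probe_route is a hypothetical helper; INGRESS_GW_ADDRESS must already be set.
probe_route() {
  url="$1"
  count="${2:-3}"
  i=1
  while [ "$i" -le "$count" ]; do
    # -o discards the body; -w prints the HTTP status and the total time.
    curl -sk -o /dev/null -w '%{http_code} %{time_total}\n' \
      --resolve "www.example.com:80:${INGRESS_GW_ADDRESS}" "$url"
    i=$(( i + 1 ))
  done
}

# Usage: probe_route http://www.example.com:80/ratings/1 3
```

While the ratings app sleeps, expect a non-200 status, consistent with step 4. The total time gives a rough indication of how long the gateway spent on retries before it returned the error.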
      

Cleanup

You can optionally remove the resources that you set up as part of this guide.

  kubectl delete RetryTimeoutPolicy retry-only -n bookinfo
  kubectl -n bookinfo patch deploy ratings-v1 --patch '{"spec":{"template":{"spec":{"containers":[{"name":"ratings","command":[]}]}}}}'