Retry and timeout

Reduce transient failures and hanging systems by setting retries and timeouts. For more information, see the API docs.

About timeouts

A timeout is the amount of time that an Envoy proxy waits for replies from a service, ensuring that services don’t hang around waiting for replies forever. This allows calls to succeed or fail within a predictable timeframe.

By default, Istio disables the Envoy timeout for HTTP requests. This default might not be appropriate for some applications and services.

For example, a timeout that is too long can result in excessive latency from waiting for replies from failing services. On the other hand, a timeout that is too short can result in calls failing unnecessarily while waiting for an operation that needs responses from multiple services.

To find and use your optimal timeout settings, you can set timeouts dynamically per route.

For more information, see the Istio documentation.

About retries

A retry setting specifies the maximum number of times an Envoy proxy attempts to connect to a service if the initial call fails. Retries can enhance service availability and application performance by making sure that calls don’t fail permanently because of transient problems, such as a temporarily overloaded service or network.

The interval between retries (25ms+) is variable and determined automatically by Istio, to prevent the called service from being overwhelmed with requests. The default retry behavior for HTTP requests is to retry twice before returning the error.
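A variable retry interval of this kind is commonly implemented as a capped exponential backoff with jitter. The following is a minimal sketch of that general idea, not Istio's or Envoy's exact algorithm. The 25 ms base comes from the text above. The 250 ms cap and both function names are assumptions made for illustration.

```shell
# Illustrative capped exponential backoff with jitter.
# BASE_MS mirrors the 25ms figure above; MAX_MS is an assumed cap for this sketch.
BASE_MS=25
MAX_MS=250

# backoff_cap_ms RETRY_NUMBER
# Prints the upper bound (ms) of the wait before the given retry (1-based).
backoff_cap_ms() {
  bc_cap=$(( BASE_MS * (1 << ($1 - 1)) ))
  if [ "$bc_cap" -gt "$MAX_MS" ]; then
    bc_cap=$MAX_MS
  fi
  echo "$bc_cap"
}

# jittered_backoff_ms RETRY_NUMBER
# Prints a random wait (ms) between BASE_MS and the cap for that retry.
jittered_backoff_ms() {
  jb_cap=$(backoff_cap_ms "$1")
  awk -v lo="$BASE_MS" -v hi="$jb_cap" \
    'BEGIN { srand(); print lo + int(rand() * (hi - lo + 1)) }'
}
```

In this sketch, the cap doubles with each retry until it reaches MAX_MS, and the jitter spreads retries from many clients so that they do not all hit the called service at the same moment.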

Like timeouts, Istio’s default retry behavior might not suit your application needs in terms of latency or availability. For example, too many retries to a failed service add latency, because the caller waits for each attempt to fail before it receives an error. Also like timeouts, you can adjust your retry settings on a per-route basis.
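To estimate how retries and timeouts combine, you can bound the time a caller might wait. The following is a rough sketch under stated assumptions: the function name and its parameters are hypothetical, it ignores the interval between retries, and it treats the first argument as the total number of tries, so check whether your retry count includes the initial call.

```shell
# worst_case_wait_s TRIES PER_TRY_TIMEOUT_S [REQUEST_TIMEOUT_S]
# Prints an upper bound (whole seconds) on the time spent waiting for replies.
# A REQUEST_TIMEOUT_S of 0, or omitting it, means no overall request timeout.
worst_case_wait_s() {
  wc_total=$(( $1 * $2 ))
  wc_overall=${3:-0}
  # An overall request timeout caps the total, however many tries remain.
  if [ "$wc_overall" -gt 0 ] && [ "$wc_total" -gt "$wc_overall" ]; then
    wc_total=$wc_overall
  fi
  echo "$wc_total"
}
```

For example, `worst_case_wait_s 5 2` prints 10, while `worst_case_wait_s 5 2 2` prints 2 because the overall timeout ends the request first.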

For more information, see the Istio documentation.

Before you begin

  1. Complete the demo setup to install Gloo Mesh, Istio, and Bookinfo in your cluster.

  2. Create the Gloo Mesh resources for this policy in the management and workload clusters.

    The following files are examples only for testing purposes. Your actual setup might vary. You can use the files as a reference for creating your own tests.

    1. Download the following Gloo Mesh resources:
    2. Apply the files to your management cluster.
      kubectl apply -f kubernetes-cluster_gloo-mesh_cluster-1.yaml --context ${MGMT_CONTEXT}
      kubectl apply -f kubernetes-cluster_gloo-mesh_cluster-2.yaml --context ${MGMT_CONTEXT}
      kubectl apply -f workspace_gloo-mesh_anything.yaml --context ${MGMT_CONTEXT}
      
    3. Download the following Gloo Mesh resources:
    4. Apply the files to your workload cluster.
      kubectl apply -f route-table_bookinfo_www-example-com.yaml --context ${REMOTE_CONTEXT1}
      kubectl apply -f virtual-gateway_bookinfo_north-south-gw.yaml --context ${REMOTE_CONTEXT1}
      kubectl apply -f workspace-settings_bookinfo_anything.yaml --context ${REMOTE_CONTEXT1}
      

Configure retry and timeout policies

You can apply a retry or timeout policy at the route level. For more information, see Applying policies.

Review the following sample configuration files.

apiVersion: resilience.policy.gloo.solo.io/v2
kind: RetryTimeoutPolicy
metadata:
  name: retry-only
  namespace: bookinfo
  clusterName: cluster-1
spec:
  applyToRoutes:
    - route:
        labels:
          route: ratings # matches on route table route's labels
  config:
    retries:
      attempts: 5 # optional (default is 2)
      perTryTimeout: 2s
      # retryOn specifies the conditions under which a retry takes place. Specify one or more conditions as a comma-delimited list.
      retryOn: "connect-failure,refused-stream,unavailable,cancelled,retriable-status-codes"
      # retryRemoteLocalities specifies whether retries can be sent to other localities. Defaults to false.
      retryRemoteLocalities: true
---
apiVersion: resilience.policy.gloo.solo.io/v2
kind: RetryTimeoutPolicy
metadata:
  name: retry-timeout
  namespace: bookinfo
  clusterName: cluster-1
spec:
  applyToRoutes:
    - route:
        labels:
          route: ratings # matches on route table route's labels
  config:
    requestTimeout: 2s
    retries:
      attempts: 5
      # If perTryTimeout is not set, it defaults to the requestTimeout value.

Verify retry and timeout policies

  1. Apply the previous example timeout policy in the cluster with the Bookinfo workspace in your example setup.
    kubectl apply --context ${REMOTE_CONTEXT1} -f <file.yaml>
    
  2. Send a request to the ratings app through the ingress gateway.
    curl -vik --connect-timeout 1 --max-time 5 --resolve www.example.com:32010:127.0.0.1 https://www.example.com:32010/ratings/1
    
  3. Verify that the request is retried as configured in the previous example. In this example, the proxy tries inbound requests to the ratings service up to 5 times, and marks an attempt as failed if it takes longer than 2 seconds to complete.
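One way to check the timeout side of the policy is to measure how long the request takes. The following is a minimal sketch that reuses the URL and curl options from the previous step. The helper names `timed_request` and `within_timeout`, and the 0.5 second default slack, are assumptions made for illustration.

```shell
# timed_request URL [extra curl args...]
# Prints "<http_code> <time_total_seconds>" for a single request.
timed_request() {
  tr_url=$1
  shift
  curl -sk -o /dev/null --connect-timeout 1 --max-time 5 \
    -w '%{http_code} %{time_total}\n' "$@" "$tr_url"
}

# within_timeout ELAPSED_S TIMEOUT_S [SLACK_S]
# Succeeds if ELAPSED_S is at most TIMEOUT_S plus SLACK_S (default 0.5).
within_timeout() {
  awk -v e="$1" -v t="$2" -v s="${3:-0.5}" 'BEGIN { exit !(e <= t + s) }'
}

# Example usage against the demo setup (requires the ingress gateway to be reachable):
#   result=$(timed_request https://www.example.com:32010/ratings/1 \
#     --resolve www.example.com:32010:127.0.0.1)
#   within_timeout "${result#* }" 2 && echo "returned within the request timeout"
```

With a 2 second request timeout in place, a response that takes much longer than 2 seconds suggests that the policy is not applied to the route.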