Retry and timeout

Reduce transient failures and prevent requests from hanging by setting retries and timeouts. For more information, see the API docs.

If you import or export resources across workspaces, your policies might not apply. For more information, see Import and export policies.

About timeouts

A timeout is the amount of time that an Envoy proxy waits for replies from a service, ensuring that services don’t wait for replies indefinitely. Timeouts make calls succeed or fail within a predictable timeframe.

By default, the Envoy timeout for HTTP requests is disabled in Istio. For some applications and services, Istio’s default timeout might not be appropriate.

For example, a timeout that is too long can result in excessive latency while waiting for replies from failing services. On the other hand, a timeout that is too short can cause calls to fail unnecessarily, such as when an operation needs responses from multiple services and requires more time to complete.

To find and use your optimal timeout settings, you can set timeouts dynamically per route.

For more information, see the Istio documentation.
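In plain Istio, a per-route timeout is set directly on a VirtualService route. The following is a minimal sketch; the reviews host and bookinfo namespace are assumptions borrowed from the Bookinfo sample, not from this guide's policy resources.

```yaml
# Minimal Istio VirtualService sketch with a per-route timeout.
# Host and destination names are illustrative assumptions.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: reviews-timeout
  namespace: bookinfo
spec:
  hosts:
  - reviews
  http:
  - timeout: 2s            # fail the call if no reply arrives within 2 seconds
    route:
    - destination:
        host: reviews
```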

About retries

A retry specifies the maximum number of times an Envoy proxy attempts to connect to a service if the initial call fails. Retries can enhance service availability and application performance by making sure that calls don’t fail permanently because of transient problems such as a temporarily overloaded service or network.

The interval between retries (25 ms or more) is variable and determined automatically by Istio to prevent the called service from being overwhelmed with requests. The default retry behavior for HTTP requests is to retry twice before returning the error.

Like timeouts, Istio’s default retry behavior might not suit your application needs in terms of latency or availability. For example, too many retries to a failed service can slow things down. Also like timeouts, you can adjust your retry settings on a per-route basis.

For more information, see the Istio documentation.
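As with timeouts, plain Istio configures per-route retries on a VirtualService. A minimal sketch under the same assumptions as before (illustrative names, values chosen for the example); note that Envoy's overall route timeout bounds the total time across all attempts, so keep attempts × perTryTimeout consistent with any request timeout you set.

```yaml
# Minimal Istio VirtualService sketch with per-route retries.
# Host, name, and retry values are illustrative assumptions.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: ratings-retries
  namespace: bookinfo
spec:
  hosts:
  - ratings
  http:
  - retries:
      attempts: 3                               # try up to 3 times
      perTryTimeout: 2s                         # budget for each attempt
      retryOn: connect-failure,refused-stream,503
    route:
    - destination:
        host: ratings
```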

Before you begin

This guide assumes that you use the same names for components like clusters, workspaces, and namespaces as in the getting started guide. If you use different names, make sure to update the sample configuration files in this guide.
  1. Complete the multicluster getting started guide to set up the following testing environment.
    • Three clusters along with environment variables for the clusters and their Kubernetes contexts.
    • The Gloo Platform CLI, meshctl, along with other CLI tools such as kubectl and istioctl.
    • The Gloo management server in the management cluster, and the Gloo agents in the workload clusters.
    • Istio installed in the workload clusters.
    • A simple Gloo workspace setup.
  2. Install Bookinfo and other sample apps.

Configure retry and timeout policies

You can apply a retry or timeout policy at the route level. For more information, see Applying policies.

Review the following sample configuration files.

apiVersion: resilience.policy.gloo.solo.io/v2
kind: RetryTimeoutPolicy
metadata:
  name: retry-only
  namespace: bookinfo
  annotations:
    cluster.solo.io/cluster: $REMOTE_CLUSTER1
spec:
  applyToRoutes:
    - route:
        labels:
          route: ratings # matches on route table route's labels
  config:
    retries:
      attempts: 5 # optional (default is 2)
      perTryTimeout: 2s
      # retryOn specifies the conditions under which retry takes place. One or more policies can be specified using a ‘,’ delimited list.
      retryOn: "connect-failure,refused-stream,unavailable,cancelled,retriable-status-codes"
      # retryRemoteLocalities specifies whether the retries should retry to other localities, will default to false
      retryRemoteLocalities: true
---
apiVersion: resilience.policy.gloo.solo.io/v2
kind: RetryTimeoutPolicy
metadata:
  name: retry-timeout
  namespace: bookinfo
  annotations:
    cluster.solo.io/cluster: $REMOTE_CLUSTER1
spec:
  applyToRoutes:
    - route:
        labels:
          route: ratings # matches on route table route's labels
  config:
    requestTimeout: 2s
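The two samples show retries and a request timeout separately, but both settings live under the same config block, so they can plausibly be combined in one policy. The following is a sketch under that assumption; the resource name is illustrative. Because the overall request timeout bounds the total time across all retries in Envoy, it is set larger than attempts × perTryTimeout here.

```yaml
apiVersion: resilience.policy.gloo.solo.io/v2
kind: RetryTimeoutPolicy
metadata:
  name: retry-and-timeout   # illustrative name, not from the docs
  namespace: bookinfo
  annotations:
    cluster.solo.io/cluster: $REMOTE_CLUSTER1
spec:
  applyToRoutes:
    - route:
        labels:
          route: ratings
  config:
    requestTimeout: 12s     # overall budget; exceeds attempts x perTryTimeout
    retries:
      attempts: 5
      perTryTimeout: 2s
```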

Verify retry and timeout policies

  1. Apply the example retry policy in the cluster with the Bookinfo workspace in your example setup. This policy selects routes with the reviews label, which you create in the next step.

    kubectl apply --context ${REMOTE_CONTEXT1} -f - << EOF
    apiVersion: resilience.policy.gloo.solo.io/v2
    kind: RetryTimeoutPolicy
    metadata:
      name: retry-only
      namespace: bookinfo
      annotations:
        cluster.solo.io/cluster: $REMOTE_CLUSTER1
    spec:
      applyToRoutes:
        - route:
            labels:
              route: reviews # matches on route table route's labels
      config:
        retries:
          attempts: 5 # optional (default is 2)
          perTryTimeout: 2s
          # retryOn specifies the conditions under which retry takes place. One or more policies can be specified using a ‘,’ delimited list.
          retryOn: "connect-failure,refused-stream,unavailable,cancelled,retriable-status-codes,5xx"
          # retryRemoteLocalities specifies whether the retries should retry to other localities, will default to false
          retryRemoteLocalities: true
    EOF
    
  2. Create a route table for the reviews app. Because retry and timeout policies apply at the route level, Gloo checks for the route in a route table resource.

    kubectl apply --context ${REMOTE_CONTEXT1} -f - << EOF
    apiVersion: networking.gloo.solo.io/v2
    kind: RouteTable
    metadata:
      name: reviews-rt
      namespace: bookinfo
    spec:
      hosts:
      - reviews
      http:
      - forwardTo:
          destinations:
          - ref:
              name: reviews
              namespace: bookinfo
              cluster: ${REMOTE_CLUSTER1}
        labels:
          route: reviews
      workloadSelectors:
      - {}
    EOF
    

    Review the following table to understand this configuration. For more information, see the API docs.

    Setting Description
    hosts The host that the route table routes traffic for. In this example, the reviews host matches the reviews service within the mesh.
    http.forwardTo.destinations The destination to forward requests that come in along the host route. In this example, the reviews service is selected.
    http.labels The label for the route. This label must match the label that the policy selects.
    workloadSelectors The source workloads within the mesh that this route table routes traffic for. In this example, all workloads are selected. This way, the curl container that you create in subsequent steps can send a request along the reviews route.
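The label match described in the table is the entire link between the policy and the route. A minimal local sketch of that matching rule, with values copied from the manifests in these steps (no cluster needed):

```shell
# A policy applies to a route only when the route's label in the RouteTable
# equals the label in the policy's applyToRoutes selector.
route_label="reviews"    # RouteTable: spec.http[].labels.route
policy_label="reviews"   # RetryTimeoutPolicy: spec.applyToRoutes[].route.labels.route
if [ "$route_label" = "$policy_label" ]; then
  echo "labels match: the policy applies to this route"
else
  echo "labels differ: the policy is skipped" >&2
fi
```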
  3. Put the reviews v1 and v2 apps to sleep to mimic an app failure.

    kubectl --context ${REMOTE_CONTEXT1} -n bookinfo patch deploy reviews-v1 --patch '{"spec":{"template":{"spec":{"containers":[{"name":"reviews","command":["sleep","20h"]}]}}}}'
    kubectl --context ${REMOTE_CONTEXT1} -n bookinfo patch deploy reviews-v2 --patch '{"spec":{"template":{"spec":{"containers":[{"name":"reviews","command":["sleep","20h"]}]}}}}'
    
  4. Enable Istio debug logging on the reviews v1 app.

    istioctl pc log --level debug deploy/reviews-v1 -n bookinfo --context $REMOTE_CONTEXT1 
    
  5. Send a request to the reviews app from within the mesh.

    Create a temporary curl pod in the bookinfo namespace so that you can test the app setup. This method works on any Kubernetes version, including 1.23 or later, but in 1.23 or later an ephemeral container might be simpler, as shown in the next option.

    1. Create the curl pod.
      kubectl run -it -n bookinfo --context $REMOTE_CONTEXT1 curl \
        --image=curlimages/curl:7.73.0 --rm  -- sh
      
    2. Send a request to the reviews app.
      curl -v http://reviews:9080/reviews/1
      
    3. Exit the temporary pod. The pod deletes itself.
      exit
      

    Use the kubectl debug command to create an ephemeral curl container in the deployment. This way, the curl container inherits any permissions from the app that you want to test. If you don't run Kubernetes 1.23 or later, you can deploy a separate curl pod or manually add the curl container as shown in the previous option.

    kubectl --context ${REMOTE_CONTEXT1} -n bookinfo debug -i pods/$(kubectl get pod --context ${REMOTE_CONTEXT1} -l app=reviews -A -o jsonpath='{.items[0].metadata.name}') --image=curlimages/curl -- curl -v http://reviews:9080/reviews/1
    

    If the output has an error about EphemeralContainers, see Ephemeral containers don’t work when testing Bookinfo.

  6. Verify that the retries appear in the logs for the reviews v1 app.

    kubectl logs deploy/reviews-v1 -c istio-proxy -n bookinfo --context $REMOTE_CONTEXT1 
    

    Example output:

    'x-envoy-attempt-count', '5'
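The x-envoy-attempt-count header records how many times Envoy attempted the call, so a value of 5 confirms that all configured attempts ran. A small sketch of pulling the count out of a captured log line; the sample line is copied from the output above:

```shell
# Extract the attempt count from a captured proxy log line.
line="'x-envoy-attempt-count', '5'"
count=$(echo "$line" | sed -E "s/.*'([0-9]+)'\$/\1/")
echo "$count"   # the number of attempts Envoy made
```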
    
  7. Optional: Clean up the resources that you created.

    istioctl pc log -n bookinfo --context $REMOTE_CONTEXT1 --level off deploy/reviews-v1
    kubectl --context ${REMOTE_CONTEXT1} -n bookinfo patch deploy reviews-v1 --patch '{"spec":{"template":{"spec":{"containers":[{"name":"reviews","command":[]}]}}}}'
    kubectl --context ${REMOTE_CONTEXT1} -n bookinfo patch deploy reviews-v2 --patch '{"spec":{"template":{"spec":{"containers":[{"name":"reviews","command":[]}]}}}}'
    kubectl --context $REMOTE_CONTEXT1 -n bookinfo delete routetable reviews-rt
    kubectl --context $REMOTE_CONTEXT1 -n bookinfo delete RetryTimeoutPolicy retry-only