Fault injection

Test the resilience of your apps by injecting delays and connection failures.

Inject faults into a percentage of your requests to test how your app handles errors. With the policy, you can simulate failures without deleting pods, delaying packets, or corrupting packets.

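You can set two types of fault injection:

  • Delays: Delay the request for a fixed duration.
  • Aborts: Abort the request and return an HTTP status code that you specify.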

If you import or export resources across workspaces, your policies might not apply. For more information, see Import and export policies.

Before you begin

This guide assumes that you use the same names for components like clusters, workspaces, and namespaces as in the getting started guide. If you have different names, make sure to update the sample configuration files in this guide.
  1. Complete the multicluster getting started guide to set up the following testing environment.
    • Three clusters along with environment variables for the clusters and their Kubernetes contexts.
    • The Gloo Platform CLI, meshctl, along with other CLI tools such as kubectl and istioctl.
    • The Gloo management server in the management cluster, and the Gloo agents in the workload clusters.
    • Istio installed in the workload clusters.
    • A simple Gloo workspace setup.
  2. Install Bookinfo and other sample apps.

Configure fault injection policies

You can apply a fault injection policy at the route level. For more information, see Applying policies.

Review the following sample configuration files.

The following example is for a simple fault injection abort policy with a default value for the percentage. No delay is configured.

apiVersion: resilience.policy.gloo.solo.io/v2
kind: FaultInjectionPolicy
metadata:
  annotations:
    cluster.solo.io/cluster: ""
  name: faultinjection-basic
  namespace: bookinfo
spec:
  applyToRoutes:
  - route:
      labels:
        route: ratings
  config:
    abort:
      httpStatus: 418

Review the following table to understand this configuration. For more information, see the API docs.

Setting Description
spec.applyToRoutes Configure which routes to apply the policy to, by using labels. The label matches the app and the route from the route table. If omitted, the policy applies to all routes in the workspace.
spec.config.abort Because no percentage field is set, the policy defaults to aborting 100% of requests. The httpStatus field sets the HTTP status code to return for an aborted request, such as 418. The value must be an integer in the range [200, 600]. For HTTP response status codes, see the MDN Web Docs.
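
You can check the range constraint on httpStatus before you apply a policy. The following shell sketch is illustrative only. The valid_http_status helper is a hypothetical name and is not part of kubectl or meshctl. It accepts only whole numbers from 200 to 600.

```shell
# Hypothetical helper: succeed only if the argument is an integer in [200, 600].
valid_http_status() {
  case "$1" in
    # Reject empty input and anything that contains a non-digit character.
    ''|*[!0-9]*) return 1 ;;
  esac
  [ "$1" -ge 200 ] && [ "$1" -le 600 ]
}

# Example use: warn before applying a policy with an unsupported status code.
if ! valid_http_status 418; then
  echo "httpStatus must be an integer from 200 to 600" >&2
fi
```

For example, valid_http_status 418 succeeds, while valid_http_status 199 and valid_http_status abc both fail.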

The following example is for a simple fault injection delay policy with a default value for the percentage. No abort is configured.

apiVersion: resilience.policy.gloo.solo.io/v2
kind: FaultInjectionPolicy
metadata:
  name: faultinjection-basic-delay
  namespace: bookinfo
  annotations:
    cluster.solo.io/cluster: $REMOTE_CLUSTER1
spec:
  applyToRoutes:
    - route:
        labels:
          route: ratings
  config:
    delay:
      fixedDelay: 5s

Review the following table to understand this configuration. For more information, see the API docs.

Setting Description
spec.applyToRoutes Configure which routes to apply the policy to, by using labels. The label matches the app and the route from the route table. If omitted, the policy applies to all routes in the workspace.
spec.config.delay Because no percentage field is set, the policy defaults to delaying 100% of requests. The fixedDelay field is required, and sets how long to delay the request, such as 5s for five seconds.

The following example is for a fault injection policy that both delays and aborts requests. Delays and aborts are independent of one another. When both are set, both happen, with the delay happening first.

apiVersion: resilience.policy.gloo.solo.io/v2
kind: FaultInjectionPolicy
metadata:
  name: faultinjection-basic-abort-and-delay
  namespace: bookinfo
  annotations:
    cluster.solo.io/cluster: $REMOTE_CLUSTER1
spec:
  applyToRoutes:
    - route:
        labels:
          route: ratings
  config:
    abort:
      httpStatus: 418
      percentage:
        value: 10
    delay:
      percentage:
        value: 40
      fixedDelay: 5s

Review the following table to understand this configuration. For more information, see the API docs.

Setting Description
spec.applyToRoutes Configure which routes to apply the policy to, by using labels. The label matches the app and the route from the route table. If omitted, the policy applies to all routes in the workspace.
spec.config.abort The httpStatus field sets the HTTP status code to return for an aborted request, such as 418. The value must be an integer in the range [200, 600]. For HTTP response status codes, see the MDN Web Docs. The percentage field is set to 10, so 10% of the requests are aborted. If the request is also chosen for a delay, the delay happens before the request is aborted.
spec.config.delay The fixedDelay field is required, and sets how long to delay the request, such as 5s for five seconds. The percentage field is set to 40, so 40% of the requests are delayed. If the request is also chosen to be aborted, the delay happens before the request is aborted.
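
To get a feel for how the two percentages combine, the following shell sketch estimates the share of requests for each outcome. It assumes that "independent" means each request is selected for a delay and for an abort separately, so the shares multiply. The fault_mix function is a hypothetical helper for illustration, not part of any Gloo tooling.

```shell
# Hypothetical helper: expected share of requests per outcome.
# Assumption: abort and delay are selected independently for each request.
# Usage: fault_mix <abort percentage> <delay percentage>
fault_mix() {
  awk -v a="$1" -v d="$2" 'BEGIN {
    printf "delayed, then aborted: %.1f%%\n", a * d / 100
    printf "aborted only: %.1f%%\n", a * (100 - d) / 100
    printf "delayed only: %.1f%%\n", (100 - a) * d / 100
    printf "unaffected: %.1f%%\n", (100 - a) * (100 - d) / 100
  }'
}

# Values from the example policy: abort 10%, delay 40%.
fault_mix 10 40
```

Under that assumption, the example policy delays and then aborts about 4% of requests, aborts 6% without a delay, delays 36% without an abort, and leaves 54% unaffected.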

Verify fault injection policies

  1. Create the example fault injection policy for the ratings app.
    kubectl apply --context ${REMOTE_CONTEXT1} -f - << EOF
    apiVersion: resilience.policy.gloo.solo.io/v2
    kind: FaultInjectionPolicy
    metadata:
      annotations:
        cluster.solo.io/cluster: ""
      name: faultinjection-basic
      namespace: bookinfo
    spec:
      applyToRoutes:
      - route:
          labels:
            route: ratings
      config:
        abort:
          httpStatus: 418
    EOF
    
  2. Create a route table for the ratings app. Because the policy applies at the route level, Gloo checks for the route in a route table resource.
    kubectl apply --context ${REMOTE_CONTEXT1} -f - << EOF
    apiVersion: networking.gloo.solo.io/v2
    kind: RouteTable
    metadata:
      name: ratings-rt
      namespace: bookinfo
    spec:
      hosts:
      - ratings
      http:
      - forwardTo:
          destinations:
          - ref:
              name: ratings
              namespace: bookinfo
        labels:
          route: ratings
      workloadSelectors:
      - {}
    EOF
    

    Review the following table to understand this configuration. For more information, see the API docs.

    Setting Description
    hosts The host that the route table routes traffic for. In this example, the ratings host matches the ratings service within the mesh.
    http.forwardTo.destinations The destination to forward requests that come in along the host route. In this example, the ratings service is selected.
    http.labels The label for the route. This label must match the label that the policy selects.
    workloadSelectors The source workloads within the mesh that this route table routes traffic for. In the example, all workloads are selected. This way, the curl container that you create in subsequent steps can send a request along the ratings route.
  3. Send a request to the ratings app from within the mesh.

    Option 1: Temporary curl pod. Create a temporary curl pod in the bookinfo namespace, so that you can test the app setup. You can also use this method in Kubernetes 1.23 or later, but an ephemeral container might be simpler, as shown in the kubectl debug option that follows.

    1. Create the curl pod.

      kubectl run -it -n bookinfo --context $REMOTE_CONTEXT1 curl \
        --image=curlimages/curl:7.73.0 --rm -- sh
      
    2. Send a request to the ratings app.

      curl http://ratings:9080/ratings/1 -v
      

      Example output:

      HTTP/1.1 418 Unknown
      ...
      * Connection #0 to host ratings left intact
      fault filter abort
      
    3. Exit the temporary pod. The pod deletes itself.

      exit
      

    Option 2: Ephemeral container. Use the kubectl debug command to create an ephemeral curl container in the deployment. This way, the curl container inherits any permissions from the app that you want to test. If you don't run Kubernetes 1.23 or later, you can deploy a separate curl pod or manually add the curl container, as shown in the preceding curl pod option.

    kubectl --context ${REMOTE_CONTEXT1} -n bookinfo debug -i pods/$(kubectl get pod --context ${REMOTE_CONTEXT1} -l app=reviews -A -o jsonpath='{.items[0].metadata.name}') --image=curlimages/curl -- curl -v http://ratings:9080/ratings/1
    

    Example output:

    HTTP/1.1 418 Unknown
    

    If the output has an error about EphemeralContainers, see Ephemeral containers don’t work when testing Bookinfo.

  4. Verify that the response shows the fault for the policy that you applied. The faultinjection-basic policy from step 1 aborts requests. If you applied one of the other example policies instead, expect the corresponding behavior.
    • Abort: All inbound requests to the ratings service return a 418 Unknown HTTP status code.
    • Delay: All inbound requests to the ratings service are delayed by five seconds.
    • Both abort and delay: 10% of the requests return a 418 Unknown HTTP status code, and 40% are delayed by five seconds before a response is sent.
  5. Optional: Clean up the resources that you created.
    kubectl --context $REMOTE_CONTEXT1 -n bookinfo delete routetable ratings-rt
    kubectl --context $REMOTE_CONTEXT1 -n bookinfo delete faultinjectionpolicy faultinjection-basic
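
To check the percentages from step 4 before you clean up, you might send a batch of requests and tally the results. The following is a minimal sketch, not part of the product. The function names send_requests and summarize are hypothetical. It assumes that you run it in the temporary curl pod from step 3, and that sh, awk, and curl are available there and the ratings host resolves. It relies on the http_code and time_total variables of the curl --write-out option.

```shell
# Hypothetical helper: send <count> requests to <url> and print one line
# per request in the form "<status code> <total seconds>".
send_requests() {
  count="$1"
  url="$2"
  i=0
  while [ "$i" -lt "$count" ]; do
    curl -s -o /dev/null -w '%{http_code} %{time_total}\n' "$url"
    i=$((i + 1))
  done
}

# Hypothetical helper: read "<status code> <seconds>" lines from stdin, then
# count responses per status code and responses that took at least
# <min seconds>.
summarize() {
  awk -v min="$1" '
    { codes[$1]++; if ($2 + 0 >= min) slow++ }
    END {
      for (c in codes) printf "status %s: %d\n", c, codes[c]
      printf "responses slower than %ss: %d\n", min, slow + 0
    }'
}

# Example (run inside the curl pod):
#   send_requests 100 http://ratings:9080/ratings/1 | summarize 5
```

With the combined abort and delay policy, roughly 10 of 100 responses would be expected to show status 418 and roughly 40 to take five seconds or longer. Because requests are selected at random, the counts vary from run to run.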