Fault injection

Test the resilience of your apps by injecting delays and connection failures.

Inject faults into a percentage of your requests to test how your app handles the errors. With the policy, you can simulate failures without deleting pods, delaying packets, or corrupting packets.

You can set two types of fault injection: delays and aborts.

Before you begin

  1. Complete the demo setup to install Gloo Mesh, Istio, and Bookinfo in your cluster.

  2. Create the Gloo Mesh resources for this policy in the management and workload clusters.

    The following files are examples only for testing purposes. Your actual setup might vary. You can use the files as a reference for creating your own tests.

    1. Download the following Gloo Mesh resources:
    2. Apply the files to your management cluster.
      kubectl apply -f kubernetes-cluster_gloo-mesh_cluster-1.yaml --context ${MGMT_CONTEXT}
      kubectl apply -f kubernetes-cluster_gloo-mesh_cluster-2.yaml --context ${MGMT_CONTEXT}
      kubectl apply -f workspace_gloo-mesh_anything.yaml --context ${MGMT_CONTEXT}
      
    3. Download the following Gloo Mesh resources:
    4. Apply the files to your workload cluster.
      kubectl apply -f route-table_bookinfo_www-example-com.yaml --context ${REMOTE_CONTEXT1}
      kubectl apply -f virtual-gateway_bookinfo_north-south-gw.yaml --context ${REMOTE_CONTEXT1}
      kubectl apply -f workspace-settings_bookinfo_anything.yaml --context ${REMOTE_CONTEXT1}
      
  3. Send a request to verify that you can reach the ratings app. If the request fails, see Debugging your route.

    curl -vik --resolve www.example.com:443:${INGRESS_GW_IP} https://www.example.com:443/ratings/1
    

    Example output:

    HTTP/1.1 200 OK
    ...
    {"id":1,"ratings":{"Reviewer1":5,"Reviewer2":4}}
    

Configure fault injection policies

You can apply a fault injection policy at the route level. For more information, see Applying policies.

Review the following sample configuration files.

The following example is for a simple fault injection abort policy with a default value for the percentage. No delay is configured.

apiVersion: resilience.policy.gloo.solo.io/v2
kind: FaultInjectionPolicy
metadata:
  name: faultinjection-basic
  namespace: bookinfo
spec:
  applyToRoutes:
  - route:
      labels:
        route: ratings
  config:
    abort:
      httpStatus: 418

Review the following table to understand this configuration.

Setting | Description
spec.applyToRoutes | Configure which routes to apply the policy to, by using labels. The label matches the app and the route from the route table. If omitted, the policy applies to all routes in the workspace.
spec.config.abort | Because no percentage field is set, the policy defaults to aborting 100% of requests. The httpStatus field sets the HTTP status code to return for an aborted request, such as 418. The value must be an integer in the range [200, 600]. For HTTP response status codes, see the mdn web docs.

The following example is for a simple fault injection delay policy with a default value for the percentage. No abort is configured.

apiVersion: resilience.policy.gloo.solo.io/v2
kind: FaultInjectionPolicy
metadata:
  name: faultinjection-basic-delay
  namespace: bookinfo
  clusterName: cluster-1
spec:
  applyToRoutes:
    - route:
        labels:
          route: ratings
  config:
    delay:
      fixedDelay: 5s

Review the following table to understand this configuration.

Setting | Description
spec.applyToRoutes | Configure which routes to apply the policy to, by using labels. The label matches the app and the route from the route table. If omitted, the policy applies to all routes in the workspace.
spec.config.delay | Because no percentage field is set, the policy defaults to delaying 100% of requests. The fixedDelay field is required, and sets the duration in seconds to delay the request.

The following example is for a fault injection policy that both delays and aborts requests. Delays and aborts are independent of one another. When both are set, both happen, with the delay happening first.

apiVersion: resilience.policy.gloo.solo.io/v2
kind: FaultInjectionPolicy
metadata:
  name: faultinjection-basic-abort-and-delay
  namespace: bookinfo
  clusterName: cluster-1
spec:
  applyToRoutes:
    - route:
        labels:
          route: ratings
  config:
    abort:
      httpStatus: 418
      percentage:
        value: 10
    delay:
      percentage:
        value: 40
      fixedDelay: 5s

Review the following table to understand this configuration.

Setting | Description
spec.applyToRoutes | Configure which routes to apply the policy to, by using labels. The label matches the app and the route from the route table. If omitted, the policy applies to all routes in the workspace.
spec.config.abort | The httpStatus field sets the HTTP status code to return for an aborted request, such as 418. The value must be an integer in the range [200, 600]. For HTTP response status codes, see the mdn web docs. The percentage field is set to 10, so 10% of the requests are aborted. If the request is also chosen for a delay, the delay happens before the request is aborted.
spec.config.delay | The fixedDelay field is required, and sets the duration in seconds to delay the request. The percentage field is set to 40, so 40% of the requests are delayed. If the request is also chosen to be aborted, the delay happens before the request is aborted.
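
Because a request can be chosen for both a delay and an abort, it can help to estimate how the traffic splits across the four possible outcomes. The following sketch assumes that "independent" means the two selections are statistically independent, which the policy description suggests but does not state in those terms. The expected_outcomes function is a hypothetical helper for illustration and is not part of Gloo Mesh.

```shell
# Hypothetical helper: expected share of requests per outcome.
# ASSUMPTION: the abort and delay selections are statistically independent.
# Arguments: abort percentage, delay percentage (both 0-100).
expected_outcomes() {
  awk -v a="$1" -v d="$2" 'BEGIN {
    printf "delayed then aborted: %.1f%%\n", a * d / 100
    printf "aborted only: %.1f%%\n", a * (100 - d) / 100
    printf "delayed only: %.1f%%\n", (100 - a) * d / 100
    printf "unaffected: %.1f%%\n", (100 - a) * (100 - d) / 100
  }'
}

# Values from the example policy: abort 10%, delay 40%.
expected_outcomes 10 40
```

Under that assumption, the example policy yields about 4% of requests that are delayed and then aborted, 6% that are aborted without a delay, 36% that are delayed only, and 54% that pass through unaffected. The totals still match the configured values: 10% aborted and 40% delayed.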

Verify fault injection policies

  1. Apply the example fault injection policy in the cluster with the Bookinfo workspace in your example setup.
    kubectl apply --context ${REMOTE_CONTEXT1} -f fault-injection.yml
    
  2. Send a request to the ratings app through the ingress gateway.
    curl -vik --resolve www.example.com:443:${INGRESS_GW_IP} https://www.example.com:443/ratings/1
    
  3. Verify that the response shows the fault that you configured in the previous examples.
    • Abort: All inbound requests to the ratings service result in a 418 Unknown HTTP status code.
    • Delay: All inbound requests to the ratings service have a five second delay.
    • Both abort and delay: 10% of the calls return 418 Unknown HTTP status code responses, and 40% have a five second delay before they send a response.
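
A single request cannot confirm a percentage-based policy, so one way to check the abort and delay rates is to send many requests and tally the results. The following is a minimal sketch, not an official verification script. The send_requests and summarize_faults functions are hypothetical helpers, and the sketch assumes that INGRESS_GW_IP is set as in the earlier steps. It uses curl's --write-out variables http_code and time_total to record the status code and the total time of each request.

```shell
# Hypothetical helpers for checking fault percentages; not part of Gloo Mesh.

# Send N requests (default 100) through the ingress gateway and print one
# "<status> <seconds>" line per request.
# ASSUMPTION: INGRESS_GW_IP is already exported in your shell.
send_requests() (
  count="${1:-100}"
  i=0
  while [ "$i" -lt "$count" ]; do
    curl -sk -o /dev/null -w '%{http_code} %{time_total}\n' \
      --resolve "www.example.com:443:${INGRESS_GW_IP}" \
      https://www.example.com:443/ratings/1
    i=$((i + 1))
  done
)

# Read "<status> <seconds>" lines on stdin. Count the requests that returned
# the abort status (default 418) and the requests that took at least the
# delay threshold in seconds (default 5).
summarize_faults() {
  awk -v code="${1:-418}" -v slow="${2:-5}" '
    { total++ }
    $1 == code { aborted++ }
    $2 + 0 >= slow + 0 { delayed++ }
    END { printf "total=%d aborted=%d delayed=%d\n", total, aborted, delayed }'
}
```

For example, run `send_requests 100 | summarize_faults 418 5` after you apply the policy that both aborts and delays. With a 10% abort rate and a 40% delay rate, expect roughly 10 aborted and 40 delayed requests, although the exact counts vary from run to run because each request is selected at random.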