High availability and disaster recovery

Review options for high availability and disaster recovery of the Gloo Platform management plane. For more information about the management plane components, and how the management plane and data plane communicate, see Relay architecture.

About

High availability (HA) is a key requirement for large enterprises that must prepare for potential regional outages. While teams often prioritize application uptime and availability in the data plane, regional HA and disaster recovery for the management plane are frequently overlooked. However, regional outages or disasters that affect your management plane can cause interruptions in the rest of your Gloo setup. For example, if your management plane is unavailable, it might delay the propagation of Gloo traffic policies that determine the failover behavior of applications.

To increase the resiliency of your Gloo Platform management plane, you can use horizontal replica scaling and deploy multiple management clusters. You can also combine both methods. For example, you can create two management clusters, and deploy multiple replicas of the management server pods in each management cluster.

Horizontal replica scaling

In Gloo Platform 2.1.0 and later, horizontal replica scaling provides both resilience and distributed scaling by increasing the replica count of the management server deployment within one Kubernetes cluster. Each replica handles a subset of the connected agents in workload clusters, which distributes the translation load. If a replica pod fails, the agents that were connected to that replica automatically connect to a different replica. Additionally, you can enable HA at the level of availability zones by ensuring that replicas are scheduled onto worker nodes in different zones, so that the management server deployment spans multiple availability zones (depending on your hosting topology).

When you have multiple replicas, the translation load is distributed between them according to the workload agents connected to each replica. In a multicluster environment, this can significantly improve translation time. However, note that because each management server replica must have access to the entire snapshot of Gloo and discovered resources, resource usage (in particular memory) increases overall with multiple replicas.

Additionally, the number of agent connections per management server replica might not be evenly distributed. For example, if you have two management server replicas and three registered workload clusters, the first management server replica might generate the translation snapshot for two workload agents while the second replica generates the snapshot for only one agent. To attempt to balance the number of agents across your management server replicas, you can enable the experimental glooMgmtServer.enableClusterLoadBalancing setting. However, due to statistical variance, a setup with a small number of workload clusters (such as three) might not result in an exactly even split of agents across the replicas.

In the following diagram, three workload cluster agents are connected to two management server replicas. One replica handles the translation for two agents, and the other replica handles the translation for one.

Figure: Relay agents on workload clusters are registered with multiple replicas of the Gloo management server on the management cluster.

To deploy multiple replicas of your management server, include the following Helm settings in your Gloo Platform management plane installation:

glooMgmtServer:
  deploymentOverrides:
    spec:
      # Required. Increase as needed.
      replicas: 2
  # Required. Mark one replica as the active or leader replica.
  leaderElection: true
  # Optional (experimental). Attempt to evenly load balance the number of
  # connected Gloo agents across the number of management server replicas.
  enableClusterLoadBalancing: true
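
To additionally spread the replicas across availability zones, you can add standard Kubernetes scheduling constraints through the same deploymentOverrides section. The following sketch uses a topologySpreadConstraints rule; the app: gloo-mesh-mgmt-server label is an assumption, so match the selector to the labels that your management server pods actually use.

glooMgmtServer:
  deploymentOverrides:
    spec:
      template:
        spec:
          # Spread the management server replicas across availability zones.
          # The label selector is an assumption; match it to the labels on
          # your management server pods.
          topologySpreadConstraints:
          - maxSkew: 1
            topologyKey: topology.kubernetes.io/zone
            whenUnsatisfiable: ScheduleAnyway
            labelSelector:
              matchLabels:
                app: gloo-mesh-mgmt-server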

Multiple management clusters

Support for multiple management clusters is a beta feature available in Gloo Platform 2.4.0 and later. Beta features might change and are not supported for production. For more information, see Gloo feature maturity.

The following information is applicable to multicluster setups in which you deploy the Gloo management components to a dedicated cluster, and the Gloo agent to one or more workload clusters. For more information about how the management plane and data plane communicate in a multicluster setup, see Relay architecture.

In Gloo Platform 2.4.0 and later, you can add redundancy by deploying a Gloo management server in multiple Kubernetes clusters. For example, you might create two management clusters, and install the Gloo Platform management components in each cluster. You configure the management server in one management cluster as active, and the management servers that are deployed in other clusters as passive standbys. In the event that all replicas of the active management server become unavailable, a standby management server can seamlessly become the active server.

Setup

When you set up Gloo Mesh, you create two or more Gloo management clusters in different regions.

The following diagram shows the architecture of a multicluster management plane setup.

Figure: Overview of a multicluster management plane. Workload agents connect to the active management server, which writes all Gloo configuration to the shared Redis instance.

Review the following decision points for the management server and workload agent installation settings that are required in a multicluster management plane.

When you install the Gloo Platform management components into your clusters, do not also register those clusters as workload clusters by deploying a Gloo agent, because doing so can cause issues during the failover process.

Redis

The Gloo Platform management server uses Redis to store translated resources as snapshots for all connected agents in workload clusters. To support multiple management servers, you supply a shared Redis solution that all management servers can read from. For example, you might use AWS Global ElastiCache to provide a two-region Redis configuration with a primary and secondary Redis endpoint. The secondary endpoint is read-only until it is promoted to become the primary endpoint.

Because all management servers use the same Redis store, configurations are always immediately consistent, even if all agents are not yet reconnected after a failover or if the configuration change volume is high. Additionally, if you create Gloo resources on workload clusters instead of management clusters, the shared Redis store allows each management server to access the same processed configurations that are collected from all workload clusters.

Installation settings:

  1. Choose a Redis solution that meets the following requirements.
    • All management servers must be able to read from the Redis store.
    • The Redis solution must be able to switch which management server endpoint can write to the store. For example, in ElastiCache, only one endpoint can write to the store at any given time, so you must be able to switch to the endpoint for the currently active server.
  2. Follow the steps in Bring your own Redis to use your shared Redis instance in the Helm settings for the management plane installations.
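
For example, the management plane Helm values might disable the built-in Redis deployment and point the management server at the shared endpoint, as in the following sketch. The exact keys can vary by Gloo Platform version, so treat these values as illustrative and follow the Bring your own Redis guide for your release.

redis:
  deployment:
    # Do not deploy the built-in Redis; use the shared instance instead.
    enabled: false
glooMgmtServer:
  redis:
    # Primary endpoint of the shared Redis solution, such as the ElastiCache
    # primary endpoint. Illustrative value.
    address: shared-redis.example.com:6379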

Load balancer, DNS pointer, or IP pointer

Provide a load balancer, DNS pointer, or IP pointer that resolves to the currently active management server, which agents in the workload clusters connect to. For example, you might use AWS Route 53 with automatic failover. You add all of the management servers’ IP addresses to the route, but the route resolves to the IP address of only the management server that you want to mark as active. When you register workload clusters with the management plane, you specify the route address that you chose, such as mgmt-server.example.com, instead of an individual IP address for one management server. If the active server fails, you change this load balancer, DNS pointer, or IP pointer to instead resolve to the IP address of another management server that you want to use.

Installation settings:

  1. Create a load balancer, DNS pointer, or IP pointer solution that meets the following requirements.
    • Resolve the DNS record to the management server that you want to serve as the active server.
    • Do not balance the load across the management clusters (active-active). Currently, the multicluster management plane is supported for active-passive scenarios in which the agents connect to only one management server at a time.
  2. When you register each workload cluster with the Gloo management server, specify the address for your solution in the glooAgent.relay.serverAddress field.
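
For example, the agent Helm values for each workload cluster might look like the following sketch, where mgmt-server.example.com is the failover route that you created and 9900 is assumed to be the relay port that your management server exposes (check the grpc port of the gloo-mesh-mgmt-server service if yours differs).

glooAgent:
  relay:
    # Route that resolves to the currently active management server.
    # Do not use the IP address of an individual management server.
    serverAddress: mgmt-server.example.com:9900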

Metrics

In the default OpenTelemetry (OTel) metrics pipeline for your Gloo setup, a telemetry gateway is deployed to each management cluster, and telemetry collector agents are deployed to each workload cluster. To ensure that the collector agents can send metrics to each telemetry gateway, you define one exporter in the collector settings for each telemetry gateway address.

Installation settings: In the Helm values file that you create for the Gloo agent when you register each workload cluster, be sure to include one exporter for each OTel gateway address, such as in the following example:

telemetryCollector:
  config:
    exporters:
      otlp:
        # Address of gateway in mgmt cluster 1
        # The default port is 4317.
        # If you set up an external load balancer between the telemetry gateway and collector agents, and you configured TLS passthrough to forward data to the telemetry gateway on port 4317, use port 443 instead.
        endpoint: [domain]:4317
        tls:
          ca_file: /etc/otel-certs/ca.crt
          server_name_override: gloo-telemetry-gateway.gloo-mesh
      otlp/2:
        # Address of gateway in mgmt cluster 2
        # The default port is 4317.
        # If you set up an external load balancer between the telemetry gateway and collector agents, and you configured TLS passthrough to forward data to the telemetry gateway on port 4317, use port 443 instead.
        endpoint: [domain]:4317
        tls:
          ca_file: /etc/otel-certs/ca.crt
          server_name_override: gloo-telemetry-gateway.gloo-mesh
      ...

Relay certificates

For a multicluster management plane, you must provide certificates to secure the agent-management server connection. Do not use insecure relay connections or the default self-signed CAs that Gloo deploys. Instead, explore the relay certificate setup options where you bring your own certificates for the Gloo management server and agent. When you bring your own certificates for the relay connection, you must also bring your own certificates for the Gloo telemetry pipeline.

Installation settings: Decide how you want to secure the relay connection between the Gloo management server and agent, and follow the setup steps for the relay certificate option that you choose.

CI/CD pipeline configuration

As a recommended best practice for Gloo Mesh, write all Gloo resources to management clusters, not workload clusters. In multicluster management plane setups, you can either write your configuration to all management clusters, or to one management cluster.

Write configuration to all management clusters (recommended): To support configuration consistency during and after a failover, write your Gloo configuration to all management clusters in your setup. For example, you might use simple CI pipelines that write to the passive and active management clusters in serial, or GitOps tooling such as ArgoCD or Flux that implements a pull model to keep Kubernetes resources in sync across cluster fleets. Writing to all management clusters is recommended so that you do not need to change your CI/CD pipeline in the case of a management server failure.
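
For example, a minimal CI step that writes the same Gloo configuration to both management clusters in serial might look like the following sketch. The contexts and directory path are placeholders for your own setup.

# Apply the same Gloo configuration to the active and passive management
# clusters in serial. Replace the contexts and path with your own values.
kubectl apply -f gloo-config/ --context mgmt-cluster-1
kubectl apply -f gloo-config/ --context mgmt-cluster-2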

Write configuration to one management cluster: You might have certain restrictions that require your CI/CD pipeline to write to only one management cluster at a time. If you write configuration to only the active management cluster, during a failure event, you must first run the meshctl experimental switch-active command to switch the Redis global lock so that the newly-active server can write to the Redis store. Then, you must manually change your CI/CD pipeline to write to the newly-active management server. Additionally, when you install the Gloo management plane in each management cluster, you must be sure to use the same name for each management cluster in the common.cluster Helm setting. This ensures that any resources that might require you to specify the management cluster name, such as the Workspace Gloo CR, can be written to any management cluster and still be correctly applied. For these reasons, writing to one cluster can be a more difficult process than writing to all management clusters.

Failover process

Switch to another management server by following these steps.

  1. Verify the issue. In the event of a failure, you might notice an issue from various sources.

    • You might notice that your entire management cluster is experiencing a regional outage based on your cloud provider's alerts.
    • You might see bad responses in your DNS solution from the unavailable server's IP address.
    • To verify an issue with the active management server, you can use the Gloo operations dashboard. For example, you might check your Gloo Platform alerts and metrics to detect translation or connection errors and disconnected agents.
    • To check the connection between your management cluster and workload agents, you can run meshctl check --context $MGMT_CONTEXT.
  2. Switch the DNS record. Change your DNS solution to resolve to the IP address of a different, healthy management server. When you update the address resolution, the workload agents automatically connect to the new management server, without needing to be restarted. Because the newly active management server reads from the shared Redis store, it gets the latest configuration that the previous management server wrote. You can find the IP address for the management server that you need to switch to by running the following command.

    export MGMT_SERVER_NETWORKING_DOMAIN=$(kubectl get svc -n gloo-mesh gloo-mesh-mgmt-server --context $MGMT_CONTEXT -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
    export MGMT_SERVER_NETWORKING_PORT=$(kubectl -n gloo-mesh get service gloo-mesh-mgmt-server --context $MGMT_CONTEXT -o jsonpath='{.spec.ports[?(@.name=="grpc")].port}')
    export MGMT_SERVER_NETWORKING_ADDRESS=${MGMT_SERVER_NETWORKING_DOMAIN}:${MGMT_SERVER_NETWORKING_PORT}
    echo $MGMT_SERVER_NETWORKING_ADDRESS
    
    Figure: Switch DNS record to resolve to the IP address of the newly-active management server, which workload agents seamlessly connect to.
  3. Writing to one cluster only: Switch Redis and your CI/CD pipeline. If you write your Gloo configuration to only one cluster at a time:

    1. Switch the Redis global lock. Run the following command to allow the newly-active server to write to the Redis store. Specify the context and name of the management cluster that you want to switch to, and the address for the shared Redis instance. Be sure to use the same cluster name and Redis address that you specified during the management plane installation time. Note that the switch might take a few seconds to take effect. After you run this command, this management server now writes to Redis, and serves as the active server as long as the previous management server is unavailable. Additionally, because this management server was already connected to the shared Redis, it has the latest configuration that your previous management server wrote. For more information, see the CLI command reference.

      meshctl experimental switch-active \
        --kubecontext <new_cluster_context> \
        --mgmt-cluster-name <new_cluster_name> \
        --redis-address <shared_redis_address>
      
      Figure: Switch the Redis global lock so that the newly-active management server writes all Gloo configuration to Redis.
    2. Switch your CI/CD pipeline. Be sure to write your latest Gloo configuration to the newly-active server by switching your CI/CD pipeline to the newly active cluster. For any new Gloo configuration that the pipeline writes to the cluster, the management server writes it to the shared Redis.

  4. Switch back to the original management server. After the original management server recovers and you verify its health, make the transition back to that server by following the same steps again.

    1. Switch the DNS record back to the original management server, and verify that the agents can successfully connect to the server by running commands such as meshctl check.
    2. If you write Gloo configuration to one cluster only, switch the Redis global lock by re-running the meshctl x switch-active command with your original management cluster's name and the same Redis address, and switch your CI/CD pipeline back to the original cluster.
    Figure: Switch the DNS record, Redis global lock, and if applicable, your CI/CD pipeline back to the original management server.
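
For example, after you switch the DNS record back to the original management server, you might verify that the agents reconnected by running a check such as the following, where the context refers to your original management cluster.

# Confirm that the management server and the agents in each workload cluster
# are healthy and connected again.
meshctl check --context $MGMT_CONTEXT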