Gloo Mesh scalability

Solo runs internal scalability tests for every release to confirm that the translation and user experience times remain within expected boundaries, and to measure scalability and performance improvements between releases.

This page summarizes the factors that impact Gloo Mesh scalability and provides recommendations for how to improve scalability and performance in large-scale environments. To learn about the scalability tests that Solo runs, see Internal scalability tests.

If you are interested in validating the scalability of Gloo Mesh components in your environment or if you need help with optimizing the performance, contact your account representative. Solo engineers can help with replicating your production environment and running performance tests to verify the maximum number of workloads and resources that Gloo Mesh supports for your use case.

Gloo Mesh scalability threshold definition

Gloo Mesh Enterprise is considered performant when Gloo resources that the user creates are translated into Envoy and Istio resources, and are applied in the user's environment in a reasonable amount of time. The following images illustrate the components that are involved during the Gloo resource translation (A to B) and reconciliation (A to C) process when these resources are applied in the workload or management cluster.

Figure: Gloo Mesh resource translation and reconciliation for resources applied in workload clusters
Figure: Gloo Mesh resource translation and reconciliation for resources applied in the management cluster

A scalability threshold is reached when one of the following conditions is met:

- The translation time (A to B) is continuously above 60 seconds.
- The reconciliation time (A to C) is continuously above 120 seconds.

Factors that impact the scalability threshold and recommendations

Review the factors that impact the scalability threshold in Gloo Mesh. By accounting for these factors in your environment setup, you can optimize Gloo Mesh scalability and performance.

Workspace boundaries

Workspaces define the cluster resources that users have access to. These resources can be spread across multiple Kubernetes namespaces and clusters. To allow services within a workspace to communicate with each other across namespaces and clusters, Gloo Mesh automatically translates the Gloo resources into Istio resources and applies them to the namespace and cluster where they are needed. This process is also referred to as federation.

The complexity of this setup increases even more when you choose to import and export Gloo resources in your workspace to allow services to communicate across workspaces. To learn more about how resources are federated in your workspace when you import from and export to other workspaces, see Import and export resources across workspaces.

The more services a workspace includes, or imports from and exports to, the more Istio resources must be added in each of your clusters. As resources scale up into the thousands, the Gloo Mesh scalability threshold might be reached more quickly. To optimize Gloo Mesh scalability, define proper workspace boundaries. This way, Gloo only federates the necessary resources across Kubernetes namespaces and clusters.
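For example, a narrowly scoped workspace selects only the clusters and namespaces that a team actually owns, instead of all namespaces across all clusters. The following sketch uses the Gloo `Workspace` API with hypothetical cluster and namespace names; adjust the selection to your environment.

```yaml
apiVersion: admin.gloo.solo.io/v2
kind: Workspace
metadata:
  name: bookinfo                  # hypothetical workspace name
  namespace: gloo-mesh
spec:
  workloadClusters:
    - name: cluster-1             # select only the clusters this team uses
      namespaces:
        - name: bookinfo-frontends
        - name: bookinfo-backends
```

Because only two namespaces in one cluster are in scope, Gloo federates resources for just these services rather than for every service in the environment.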

Multicluster routing with virtual destinations and external services

To enable routing between services across cluster boundaries, Istio resources must be federated across Kubernetes namespaces and clusters. You have the option to enable service federation in your workspace settings. With this setting, Gloo automatically creates Istio ServiceEntries in every namespace of each cluster. Depending on the number of services in your environment and the number of services that you export to and import from other workspaces, the number of ServiceEntries can quickly grow and impact the Gloo Mesh scalability threshold in your environment.
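Workspace-wide federation is controlled in the `WorkspaceSettings` resource. The following sketch shows the setting to avoid enabling broadly; the names are hypothetical and exact field names can vary slightly between Gloo Mesh versions.

```yaml
apiVersion: admin.gloo.solo.io/v2
kind: WorkspaceSettings
metadata:
  name: bookinfo                  # hypothetical workspace name
  namespace: bookinfo-frontends
spec:
  options:
    federation:
      enabled: false              # avoid federating every service in the workspace
      serviceSelector:
        - {}                      # when enabled, an empty selector matches all services
```

With `enabled: true` and an empty service selector, every service in the workspace is federated to every namespace, which drives up the ServiceEntry count described above.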

Instead of enabling federation for your entire workspace, use Gloo virtual destinations and external services to enable intelligent, multicluster routing for the services that these resources select. Gloo Mesh retrieves the services that the virtual destinations and external services select, and automatically federates them in each namespace within the workspace, across clusters, and even within other workspaces if you set up importing and exporting. With this setup, you federate only the services that must be federated across clusters. For more information about federation, see Federated services.
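A virtual destination federates only the services it selects. The sketch below selects services by label and exposes them on an internal hostname for multicluster routing; all names, labels, and ports are hypothetical placeholders.

```yaml
apiVersion: networking.gloo.solo.io/v2
kind: VirtualDestination
metadata:
  name: reviews-global            # hypothetical name
  namespace: bookinfo-frontends
spec:
  hosts:
    - reviews.global              # internal hostname that clients call
  services:
    - labels:
        app: reviews              # federate only services with this label
  ports:
    - number: 9080
      protocol: HTTP
```

Only the matching `reviews` services are federated across clusters, instead of every service in the workspace.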

Output snapshot size

When Gloo resources are created, the Gloo agent creates an input snapshot that includes the Gloo resources that must be translated. This input snapshot is sent to the Gloo management server for translation. After the Gloo management server translates the Gloo resources into the corresponding Istio resources, an output snapshot is sent back to the Gloo agents. The agents use this snapshot to apply the Istio resources in the workload clusters.

The number of services that belong to a workspace and the number of services that are imported and exported from other workspaces impact the size of the output snapshot that the Gloo management server creates.

Internal scalability tests show that the Gloo Mesh scalability threshold is reached more quickly if the snapshot size is greater than 20 MB. As such, define proper workspace boundaries to reduce the number of services that you import from and export to other workspaces. Scoping a workspace to keep the number of services small usually results in smaller output snapshot sizes.
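One way to keep output snapshots small is to import and export only selected resources instead of entire workspaces. The following sketch narrows an export with selectors; the workspace and label names are hypothetical, and the exact selector fields depend on your Gloo Mesh version.

```yaml
apiVersion: admin.gloo.solo.io/v2
kind: WorkspaceSettings
metadata:
  name: bookinfo                  # hypothetical workspace name
  namespace: bookinfo-frontends
spec:
  exportTo:
    - workspaces:
        - name: web-team          # export only to this workspace
      resources:
        - kind: SERVICE
          labels:
            app: reviews          # and only services with this label
```

Exporting a handful of labeled services, rather than everything in the workspace, keeps the translated output for importing workspaces, and therefore the snapshot size, small.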

Gloo management server compute resources

The Gloo management server's compute resource consumption varies depending on the changes that must be applied in the Gloo Mesh environment. For example, the resource consumption is high when a change occurs and the management server must translate the resource and propagate the change to the gateways and workload proxies. However, if no changes occur in the environment, the CPU and memory resources that are allocated to the Gloo management server are usually underutilized.

The recommended compute resources for the Gloo management server are 8 vCPUs and 16Gi of memory. However, 4 vCPUs and 8Gi of memory might be enough in test or pre-prod environments where performance is secondary. If you find that the Gloo management server translation time is continuously above 60s in your environment, you can try to improve the performance by allocating more CPU and memory resources to the Gloo management server.
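If you install with Helm, you can raise the management server's compute resources through the Helm values. The fragment below is a sketch that assumes the `glooMgmtServer` section of the Gloo Platform Helm chart; key names can differ between chart versions, so verify them against your chart's values reference.

```yaml
# values.yaml fragment for the Gloo Platform Helm chart (assumed key names)
glooMgmtServer:
  resources:
    requests:
      cpu: "8"        # recommended: 8 vCPUs
      memory: 16Gi    # recommended: 16Gi of memory
```

For test or pre-prod environments, you might lower these requests to 4 vCPUs and 8Gi of memory as described above.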

For an overview of the minimum resource requirements, see System requirements.

Number of clusters

The number of clusters that you add to your Gloo environment impacts the number of Gloo agents that need to be kept in sync and the time it takes to properly propagate changes in your environment. If you find that the reconciliation time (A to C) is continuously above 120s, you can try to scale the Gloo management server pod to multiple replicas.
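Scaling the management server to multiple replicas can likewise be done through Helm values. The fragment below uses a deployment override, which is one possible pattern; verify the exact key for your chart version before applying it.

```yaml
# values.yaml fragment for the Gloo Platform Helm chart (assumed key names)
glooMgmtServer:
  deploymentOverrides:
    spec:
      replicas: 2     # run multiple management server replicas
```

Multiple replicas spread the load of keeping many Gloo agents in sync, which can bring the reconciliation time (A to C) back under the 120-second threshold.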

Internal scalability tests

Solo runs internal scalability tests for every release to verify translation and user experience times, measure scalability and performance improvements between releases, and to confirm scalability in emulated customer environments.

To access the results of a sample scalability test that was run by Solo engineers, log in to Zendesk and review the Gloo Mesh Enterprise internal scalability tests article.