Solo runs internal scalability tests for every release to confirm that the translation and user experience times remain within expected boundaries, and to measure scalability and performance improvements between releases.

This page summarizes factors that impact Gloo Mesh Enterprise scalability and recommendations for how to improve scalability and performance in large-scale environments. To learn about the scalability tests that Solo runs, see Internal scalability tests.

Scalability threshold definition

Gloo Mesh Enterprise is considered performant when Gloo resources that the user creates are translated into Envoy and Istio resources, and are applied in the user’s environment in a reasonable amount of time. The following image illustrates the Gloo components that are involved in the translation (A to B) and reconciliation (A to C) processes.

Figure: Gloo Mesh Enterprise resource translation and reconciliation for resources applied in workload clusters

A scalability threshold is reached when one of the following conditions is met:

  • Translation time too high: The time it takes from applying a Gloo resource in the cluster (A) to generating the output snapshot (B) is greater than 60 seconds.
  • User experience time too high: The time it takes from applying a Gloo resource in the cluster (A) to propagating the user’s config changes to the workload’s sidecar proxy or an Istio gateway (C) is greater than 120 seconds.
  • Gloo management server unavailable: The Gloo management server becomes unavailable or crashes, even though you provided enough compute resources.

Factors that impact the scalability threshold and recommendations

Review the factors that impact the scalability threshold in Gloo Mesh Enterprise. By accounting for these factors in your environment setup, you can optimize Gloo Mesh Enterprise scalability and performance.

Workspace boundaries

Workspaces define the cluster resources that users have access to. These resources can be spread across multiple Kubernetes namespaces and clusters. To allow services within a workspace to communicate with each other across namespaces and clusters, Gloo Mesh Enterprise automatically translates the Gloo resources into Istio resources and applies them to the namespace and cluster where they are needed. This process is also referred to as federation.

The complexity of this setup increases even more when you choose to import and export Gloo resources in your workspace to allow services to communicate across workspaces. To learn more about how resources are federated in your workspace when you import from and export to other workspaces, see Import and export resources across workspaces.

The more services a workspace includes, imports, or exports, the more Istio resources must be created in each of your clusters. As resources scale up into the thousands, the Gloo Mesh Enterprise scalability threshold might be reached more quickly. To optimize Gloo Mesh Enterprise scalability, define proper workspace boundaries so that Gloo federates only the necessary resources across Kubernetes namespaces and clusters. For a scoped workspace example, see the sketch after the following list.

  • ✅ Use multiple, smaller workspaces that include only the services that a team has access to.
  • ✅ Export and import only the services that you need access to.
  • 🟡 Avoid having a global workspace that selects all Kubernetes clusters and namespaces.
  • 🟡 Reduce the amount of cross-workspace traffic required in multicluster environments.
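
For example, the following manifests sketch a workspace that is scoped to one team's namespaces and a workspace settings resource that exports and imports only selected services instead of using wildcards. The workspace, cluster, namespace, and label names are placeholders, and the field layout is based on the Gloo Mesh Enterprise 2.x Workspace and WorkspaceSettings APIs; verify it against the API reference for your version.

```yaml
# A workspace scoped to one team's namespaces on selected clusters
# (all names are illustrative placeholders).
apiVersion: admin.gloo.solo.io/v2
kind: Workspace
metadata:
  name: web-team
  namespace: gloo-mesh
spec:
  workloadClusters:
    - name: cluster-1
      namespaces:
        - name: web-ui
    - name: cluster-2
      namespaces:
        - name: web-ui
---
# Workspace settings that export and import only the services the team
# needs, rather than selecting everything with wildcards.
apiVersion: admin.gloo.solo.io/v2
kind: WorkspaceSettings
metadata:
  name: web-team
  namespace: web-ui
spec:
  exportTo:
    - workspaces:
        - name: api-team
      resources:
        - kind: SERVICE
          labels:
            app: frontend
  importFrom:
    - workspaces:
        - name: api-team
      resources:
        - kind: SERVICE
          labels:
            app: checkout
```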

Multicluster routing with virtual destinations and external services

To enable routing between services across cluster boundaries, Istio resources must be federated across Kubernetes namespaces and clusters. You have the option to enable service federation in your workspace settings. With this setting, Gloo automatically creates Istio ServiceEntries in every namespace of each cluster. Depending on the number of services in your environment and the number of services that you export to and import from other workspaces, the number of ServiceEntries can grow quickly and cause the Gloo Mesh Enterprise scalability threshold to be reached sooner.

Instead of enabling federation for your entire workspace, use Gloo virtual destinations and external services to enable intelligent, multicluster routing only for the services that these resources select. Gloo Mesh Enterprise retrieves the services that the virtual destinations and external services select, and automatically federates them in each namespace within the workspace, across clusters, and even in other workspaces if you set up importing and exporting. With this setup, you federate only the services that must be reachable across clusters, as shown in the sketch after the following list. For more information about federation, see Federated services.

  • ✅ Use virtual destinations when you want to access services that are located in other clusters.
  • ✅ Export and import only the services that you need access to in a particular workspace.
  • 🟡 Avoid using wildcards to import or export all services in a workspace.
  • 🟡 Avoid using federation for all services at the workspace level.
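
As an illustration, the following manifest sketches a virtual destination that selects one service by label and exposes it behind a shared hostname, so that only this service is federated across the clusters in the workspace. The hostname, labels, and port are placeholders, and the field layout is based on the Gloo Mesh Enterprise 2.x VirtualDestination API; verify it against the API reference for your version.

```yaml
# A virtual destination that federates only the selected service
# (hostname, labels, and port are illustrative placeholders).
apiVersion: networking.gloo.solo.io/v2
kind: VirtualDestination
metadata:
  name: checkout-global
  namespace: web-ui
spec:
  hosts:
    - checkout.web-team.internal
  services:
    - labels:
        app: checkout
  ports:
    - number: 8080
      protocol: HTTP
```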

Output snapshot size

When Gloo resources are created, the Gloo agent creates an input snapshot that includes the Gloo resources that must be translated. This input snapshot is sent to the Gloo management server for translation. After the Gloo management server translates the Gloo resources into the corresponding Istio resources, an output snapshot is sent back to the Gloo agents. The agents use this snapshot to apply the Istio resources in the workload clusters.

The number of services that belong to a workspace and the number of services that are imported and exported from other workspaces impact the size of the output snapshot that the Gloo management server creates.

Internal scalability tests show that the Gloo Mesh Enterprise scalability threshold is reached more quickly if the snapshot size is greater than 20 MB. As such, define proper workspace boundaries to reduce the number of services that you import from and export to other workspaces. Scoping a workspace to keep the number of services small usually results in smaller output snapshot sizes.

  • ✅ Use multiple, smaller workspaces that include only the services that a team has access to.
  • ✅ Export and import only the services that you need access to.
  • 🟡 Avoid having a global workspace that selects all Kubernetes clusters and namespaces.

Gloo management server compute resources

The Gloo management server’s compute resource consumption varies depending on the changes that must be applied in the Gloo Mesh Enterprise environment. For example, the resource consumption is high when a change occurs and the management server must translate the resource and propagate the change to the gateways and workload proxies. However, if no changes occur in the environment, the CPU and memory resources that are allocated to the Gloo management server are usually underutilized.

To find an overview of the minimum resource requirements, see System requirements for size and memory.
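
For example, you might set the management server's requests and limits in your Helm values file, as in the following sketch. The glooMgmtServer key and its nesting are assumptions based on recent Gloo Platform Helm charts and can differ in your chart version; check your chart's values reference before applying.

```yaml
# Example Helm values for sizing the management server.
# Key names are assumptions; verify against your chart version.
glooMgmtServer:
  resources:
    requests:
      cpu: "2"
      memory: 4Gi
    limits:
      cpu: "4"
      memory: 8Gi
```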

Number of clusters

The number of clusters that you add to your Gloo environment impacts the number of Gloo agents that must be kept in sync and the time it takes to properly propagate changes in your environment. If you find that the reconciliation time (A to C) is continuously above 120 seconds, you can try to scale the Gloo management server deployment to multiple replicas.
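
For example, a Helm values snippet such as the following sketch runs two management server replicas. The replicas field under glooMgmtServer is an assumption; some chart versions use a deployment override instead, so check your chart's values reference.

```yaml
# Example Helm values to run multiple management server replicas.
# The field name is an assumption; some chart versions use a
# deployment override rather than a top-level replicas setting.
glooMgmtServer:
  replicas: 2
```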

Internal scalability tests

Solo runs internal scalability tests for every release to verify translation and user experience times, measure scalability and performance improvements between releases, and to confirm scalability in emulated customer environments.

You can review the results of a sample scalability test that was run by Solo engineers. The test results are meant to show the scalability that was achieved and verified for Gloo Mesh in a given test setup with a predefined set of test data and workloads.

Gloo Mesh Enterprise version

Tests were performed on Gloo Mesh version 2.2.5.

Test environment setup

The following compute resources were used to run the scalability test.

Compute resource | Unit
Management server pod CPU | 4 vCPUs
Management server pod memory | 8Gi
Management server pod replica count | 1
Agent pod CPU | 2 vCPUs
Agent pod memory | 2Gi
Agent pod replica count | 1
Istiod pod CPU | 2 vCPUs
Istiod pod memory | 2Gi
Number of management clusters | 1
Number of workload clusters | 3
Node config for management cluster | 2 nodes with 8 vCPU and 32Gi memory
Node config for workload clusters | 4 nodes with 4 vCPU and 16Gi memory

Load increments

During the scalability test, load was added to the test environment in increments as shown in the following table.

Resource | Amount
Namespaces | 3
Workspaces | 6
Workspace connections (number of workspaces to export to / import from) | 2
Workloads (Kubernetes services) per namespace | 4
Total number of workloads (namespaces x workspaces x workloads per namespace) | 72
Route tables | 1 per workload
Virtual destinations | 1 per workload
Header manipulation policies | 1 per workspace
Transformation policies | 1 per workspace

Test procedure

The scalability test consisted of increasing the load and the total number of Gloo resources as defined in the load increments. After each load increment, the performance of Gloo Mesh was measured. Tests were stopped when one of the scalability thresholds was reached, as defined in the Gloo Mesh scalability threshold definition.

Test results

The following graphs illustrate the scalability that was achieved in Gloo Mesh during the scalability tests before the translation time threshold (A to B) was reached.

Figure: Scalability results

Refer to the following tables for a detailed overview of the number of workloads and Gloo Mesh resources, and the translation and user experience times, that were achieved during the scalability tests. These numbers are represented in the graphs.