Scalability testing

Solo runs internal scalability tests for every release to confirm that the translation and user experience times remain within expected boundaries, and to measure scalability and performance improvements between releases.

This page summarizes factors that impact Gloo Mesh scalability and recommendations for how to improve scalability and performance in large-scale environments. To learn about the scalability tests that Solo runs, see Internal scalability tests.

If you are interested in validating the scalability of Gloo Mesh components in your environment or if you need help with optimizing the performance, contact your account representative. Solo engineers can help with replicating your production environment and running performance tests to verify the maximum number of workloads and resources that Gloo Mesh supports for your use case.

Gloo Mesh scalability threshold definition

Gloo Mesh Enterprise is considered performant when Gloo resources that the user creates are translated into Envoy and Istio resources, and are applied in the user's environment in a reasonable amount of time. The following image illustrates the Gloo components that are involved in the translation (A to B) and reconciliation (A to C) processes.

Gloo Mesh resource translation and reconciliation process.

A scalability threshold is reached when one of the following conditions is met:

- The translation time (A to B) is continuously above 60 seconds.
- The reconciliation time (A to C) is continuously above 120 seconds.

Factors that impact the scalability threshold and recommendations

Review the factors that impact the scalability threshold in Gloo Mesh. By accounting for these factors in your environment setup, you can optimize Gloo Mesh scalability and performance.

Workspace boundaries

Workspaces define the cluster resources that users have access to. These resources can be spread across multiple Kubernetes namespaces and clusters. To allow services within a workspace to communicate with each other across namespaces and clusters, Gloo Mesh automatically translates the Gloo resources into Istio resources and applies them to the namespace and cluster where they are needed. This process is also referred to as federation.

The complexity of this setup increases even more when you choose to import and export Gloo resources in your workspace to allow services to communicate across workspaces. To learn more about how resources are federated in your workspace when you import from and export to other workspaces, see Import and export resources across workspaces.

The more services a workspace includes, or imports from and exports to, the more Istio resources must be added in each of your clusters. As resources scale up into the thousands, the Gloo Mesh scalability threshold might be reached more quickly. To optimize Gloo Mesh scalability, define proper workspace boundaries. This way, Gloo only federates the necessary resources across Kubernetes namespaces and clusters.
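For example, a workspace scoped to a single team's namespaces across two clusters might look like the following sketch. The workspace, cluster, and namespace names are placeholders; adjust them to your environment.

```yaml
apiVersion: admin.gloo.solo.io/v2
kind: Workspace
metadata:
  name: app-team          # placeholder workspace name
  namespace: gloo-mesh    # created on the management cluster
spec:
  workloadClusters:
  - name: cluster-1       # placeholder cluster name
    namespaces:
    - name: app-team-ns   # only this namespace is included in the workspace
  - name: cluster-2
    namespaces:
    - name: app-team-ns
```

Keeping the namespace selection narrow limits the number of Istio resources that Gloo must federate on each cluster.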

Multicluster routing with virtual destinations and external services

To enable routing between services across cluster boundaries, Istio resources must be federated across Kubernetes namespaces and clusters. You have the option to enable service federation in your workspace settings. With this setting, Gloo automatically creates Istio ServiceEntries in every namespace of each cluster. Depending on the number of services you have in your environment and the number of services that you export to and import from other workspaces, the number of ServiceEntries can quickly grow and impact the Gloo Mesh scalability threshold in your environment.

Instead of enabling federation for your entire workspace, use Gloo virtual destinations and external services to enable intelligent, multicluster routing for the services that are selected by these resources. Gloo Mesh retrieves the services that the virtual destinations and external services select, and automatically federates them in each namespace within the workspace, across clusters, and even within other workspaces if you set up importing and exporting. With this setup, you federate only the services that must be federated across clusters. For more information about federation, see Federated services.
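For example, a virtual destination that selects a single service by label might look like the following sketch. The hostname, labels, namespace, and port are placeholders.

```yaml
apiVersion: networking.gloo.solo.io/v2
kind: VirtualDestination
metadata:
  name: reviews-vd
  namespace: app-team-ns   # placeholder namespace in the workspace
spec:
  hosts:
  - reviews.mesh.internal  # internal-only hostname for multicluster routing
  services:                # select only the services that must be federated
  - labels:
      app: reviews
  ports:
  - number: 8080
    protocol: HTTP
```

Because only the selected services are federated, this approach avoids the ServiceEntry growth that workspace-wide federation can cause.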

Output snapshot size

When Gloo resources are created, the Gloo agent creates an input snapshot that includes the Gloo resources that must be translated. This input snapshot is sent to the Gloo management server for translation. After the Gloo management server has translated the Gloo resources into the corresponding Istio resources, an output snapshot is sent back to the Gloo agents. The agents use this snapshot to apply the Istio resources in the workload clusters.

The number of services that belong to a workspace and the number of services that are imported from and exported to other workspaces impact the size of the output snapshot that the Gloo management server creates.

Internal scalability tests show that the Gloo Mesh scalability threshold is reached more quickly if the snapshot size is greater than 20 MB. As such, define proper workspace boundaries to reduce the number of services that you import from and export to other workspaces. Scoping a workspace to keep the number of services small usually results in smaller output snapshot sizes.

Gloo management server compute resources

The Gloo management server's compute resource consumption varies depending on the changes that must be applied in the Gloo Mesh environment. For example, the resource consumption is high when a change occurs and the management server must translate the resource and propagate the change to the gateways and workload proxies. However, if no changes occur in the environment, the CPU and memory resources that are allocated to the Gloo management server are usually underutilized.

The recommended compute resources for the Gloo management server are 8 vCPUs and 16Gi of memory. However, 4 vCPUs and 8Gi of memory might be enough in test or pre-prod environments where performance is secondary. If you find that the Gloo management server translation time is continuously above 60s in your environment, you can try to improve the performance by allocating more CPU and memory resources to the Gloo management server.
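If you install Gloo with Helm, you can set the recommended requests in your values file. The following is a sketch only; the exact value keys depend on your chart version, so verify them against the Helm chart's values reference before use.

```yaml
# Assumed gloo-platform Helm values structure -- verify against your chart version.
glooMgmtServer:
  resources:
    requests:
      cpu: "8"        # recommended for production; 4 vCPUs may suffice for test environments
      memory: 16Gi    # recommended for production; 8Gi may suffice for test environments
```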

To find an overview of the minimum resource requirements, see System requirements.

Number of clusters

The number of clusters that you add to your Gloo environment impacts the number of Gloo agents that need to be kept in sync and the time it takes to properly propagate changes in your environment. If you find that the reconciliation time (A to C) is continuously above 120s, you can try to scale the Gloo management server pod to multiple replicas.
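One way to scale the management server is a standard Kubernetes patch. This sketch assumes the Deployment is named gloo-mesh-mgmt-server in the gloo-mesh namespace; verify the names in your installation first.

```yaml
# replicas-patch.yaml
spec:
  replicas: 2   # run two management server replicas
```

Apply the patch with, for example, `kubectl patch deployment gloo-mesh-mgmt-server -n gloo-mesh --patch-file replicas-patch.yaml`.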

Internal scalability tests

Solo runs internal scalability tests for every release to verify translation and user experience times, to measure scalability and performance improvements between releases, and to confirm scalability in emulated customer environments.

You can review the results of a sample scalability test that was run by Solo engineers. The test results are meant to show the scalability that was achieved and verified for Gloo Mesh in a given test setup with a predefined set of test data and workloads.

Scalability of Gloo Mesh can vary greatly and is highly dependent on various factors, such as the workspace boundary definitions, the number of workloads, the number of namespaces, available compute resources, and more. If you are interested in validating the performance and scalability of Gloo Mesh in your environment or if you need help with optimizing the performance, contact your account representative. Solo engineers can help with replicating your production environment and running performance tests to verify the maximum number of workloads and resources that Gloo Mesh supports for your environment.

Gloo Mesh Enterprise version

Tests were performed on Gloo Mesh version 2.2.5.

Test environment setup

The following compute resources were used to run the scalability test.

| Compute resource | Value |
|---|---|
| Management server pod CPU | 4 vCPUs |
| Management server pod memory | 8Gi |
| Management server pod replica count | 1 |
| Agent pod CPU | 2 vCPUs |
| Agent pod memory | 2Gi |
| Agent pod replica count | 1 |
| Istiod pod CPU | 2 vCPUs |
| Istiod pod memory | 2Gi |
| Number of management clusters | 1 |
| Number of workload clusters | 3 |
| Node config for management cluster | 2 nodes with 8 vCPU and 32Gi memory |
| Node config for workload clusters | 4 nodes with 4 vCPU and 16Gi memory |

Load increments

During the scalability test, load was added to the test environment in increments as shown in the following table.

| Resource | Amount |
|---|---|
| Namespaces | 3 |
| Workspaces | 6 |
| Workspace connections (number of workspaces to export to/import from) | 2 |
| Workloads (Kubernetes Services) per namespace | 4 |
| Total number of workloads (namespaces × workspaces × workloads per namespace) | 72 |
| Route tables | 1 per total workload |
| Virtual destinations | 1 per total workload |
| Header manipulation policies | 1 per workspace |
| Transformation policies | 1 per workspace |

Test procedure

The scalability test consisted of incrementally increasing the load and the total number of Gloo resources as defined in the load increments. After each load increment, the performance of Gloo Mesh was measured. Tests were stopped when one of the scalability thresholds was reached, as defined in the Gloo Mesh scalability threshold definition.

Test results

The following graphs illustrate the scalability that was achieved in Gloo Mesh during the scalability tests before the translation time threshold (B) was reached.

Scalability results

Refer to the following tables for a detailed overview of the numbers of workloads and Gloo Mesh resources, and the translation and user experience times, that were achieved during the scalability tests. These numbers are represented in the graphs.

| Run | Total number of workloads | Route tables | Virtual destinations | Header manipulation policies | Transformation policies | Total number of Gloo resources |
|---|---|---|---|---|---|---|
| 1 | 72 | 72 | 72 | 9 | 9 | 162 |
| 2 | 144 | 144 | 144 | 15 | 15 | 318 |
| 3 | 216 | 216 | 216 | 21 | 21 | 474 |
| 4 | 288 | 288 | 288 | 27 | 27 | 630 |
| 5 | 360 | 360 | 360 | 33 | 33 | 786 |
| 6 | 432 | 432 | 432 | 39 | 39 | 942 |
| 7 | 504 | 504 | 504 | 45 | 45 | 1098 |
| 8 | 576 | 576 | 576 | 51 | 51 | 1254 |
| 9 | 648 | 648 | 648 | 57 | 57 | 1410 |
| 10 | 720 | 720 | 720 | 63 | 63 | 1566 |
| 11 | 792 | 792 | 792 | 69 | 69 | 1722 |
| 12 | 864 | 864 | 864 | 75 | 75 | 1878 |
| 13 | 936 | 936 | 936 | 81 | 81 | 2034 |
| 14 | 1008 | 1008 | 1008 | 87 | 87 | 2190 |
| 15 | 1080 | 1080 | 1080 | 93 | 93 | 2346 |
| 16 | 1152 | 1152 | 1152 | 99 | 99 | 2502 |
| 17 | 1224 | 1224 | 1224 | 105 | 105 | 2658 |
| 18 | 1296 | 1296 | 1296 | 111 | 111 | 2814 |
| 19 | 1368 | 1368 | 1368 | 117 | 117 | 2970 |
| 20 | 1440 | 1440 | 1440 | 123 | 123 | 3126 |
| 21 | 1512 | 1512 | 1512 | 129 | 129 | 3282 |
| 22 | 1584 | 1584 | 1584 | 135 | 135 | 3438 |
| 23 | 1656 | 1656 | 1656 | 141 | 141 | 3594 |
| 24 | 1728 | 1728 | 1728 | 147 | 147 | 3750 |
| 25 | 1800 | 1800 | 1800 | 153 | 153 | 3906 |
| 26 | 1872 | 1872 | 1872 | 159 | 159 | 4062 |
| 27 | 1944 | 1944 | 1944 | 165 | 165 | 4218 |
| 28 | 2016 | 2016 | 2016 | 171 | 171 | 4374 |
| 29 | 2088 | 2088 | 2088 | 177 | 177 | 4530 |
| 30 | 2160 | 2160 | 2160 | 183 | 183 | 4686 |
| 31 | 2232 | 2232 | 2232 | 189 | 189 | 4842 |
| 32 | 2304 | 2304 | 2304 | 195 | 195 | 4998 |

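The per-run totals above follow a simple pattern that can be read off the table: route tables and virtual destinations each equal the total number of workloads, and the two policy counts grow by 6 per run from a base of 9. For run n:

```latex
% Pattern observed in the table above, for runs n = 1, ..., 32:
W_n = 72n                        % total workloads = route tables = virtual destinations
P_n = 9 + 6(n-1)                 % header manipulation = transformation policies
T_n = 2W_n + 2P_n = 156n + 6     % total number of Gloo resources
```

For example, run 32 gives T = 156 × 32 + 6 = 4998, matching the last row of the table.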
| Run | User experience time (s) | Translation time (s) | Input snapshot size (MB) | Output snapshot size (MB) |
|---|---|---|---|---|
| 1 | 3.630771399 | 0.6768629946 | 2.497 | 1.379 |
| 2 | 5.789293289 | 1.566540955 | 3.459 | 2.708 |
| 3 | 8.204482794 | 2.510457065 | 4.423 | 4.04 |
| 4 | 10.23336458 | 3.319648577 | 5.387 | 5.372 |
| 5 | 10.37551022 | 4.527299008 | 6.351 | 6.704 |
| 6 | 17.11259079 | 5.73963795 | 7.315 | 8.037 |
| 7 | 14.73311281 | 6.281648328 | 8.279 | 9.369 |
| 8 | 14.95479655 | 8.694260323 | 9.242 | 10.701 |
| 9 | 19.2776525 | 8.221313328 | 10.207 | 12.033 |
| 10 | 19.3824265 | 9.931342876 | 11.17 | 13.366 |
| 11 | 19.0717845 | 10.84953012 | 12.134 | 14.698 |
| 12 | 38.0724237 | 13.18595717 | 13.098 | 16.03 |
| 13 | 34.86034393 | 14.60871485 | 14.062 | 17.362 |
| 14 | 25.97134995 | 15.3668277 | 15.025 | 18.695 |
| 15 | 30.40479279 | 16.72147813 | 15.966 | 20.027 |
| 16 | 37.39088464 | 19.83334754 | 16.954 | 21.359 |
| 17 | 41.77419281 | 20.65257513 | 17.89 | 22.695 |
| 18 | 41.59029174 | 21.78748936 | 18.887 | 24.034 |
| 19 | 57.85606718 | 23.83600715 | 19.854 | 25.374 |
| 20 | 39.10190034 | 25.32169496 | 20.737 | 26.713 |
| 21 | 37.00853896 | 24.74112544 | 21.692 | 28.053 |
| 22 | 64.06909776 | 26.40160928 | 22.737 | 29.392 |
| 23 | 55.11466908 | 28.49822162 | 23.724 | 30.731 |
| 24 | 64.04015565 | 26.24104102 | 24.629 | 32.071 |
| 25 | 71.12533164 | 31.3953137 | 25.59 | 33.41 |
| 26 | 54.75058913 | 36.37591711 | 26.513 | 34.75 |
| 27 | 54.98820114 | 37.08584548 | 27.49 | 36.089 |
| 28 | 70.60482359 | 32.99328406 | 28.491 | 37.429 |
| 29 | 79.73433065 | 37.73269123 | 29.419 | 38.768 |
| 30 | 73.30540466 | 47.46800078 | 30.392 | 40.108 |
| 31 | 97.94193006 | 44.47086639 | 31.359 | 41.447 |
| 32 | 96.01194096 | 62.545793 | 32.327 | 42.786 |

If you plan to create thousands of workloads in your Gloo Mesh Enterprise environment, contact your account representative so that Solo engineers can help with replicating your production environment and running performance tests to verify the maximum number of workloads and resources that Gloo Mesh supports for your environment.