Best practices for production

Review the recommended practices for preparing optional security measures and setting up Gloo Mesh Gateway in a production environment.

Deployment model

A production Gloo Mesh Gateway setup consists of one management cluster, in which the Gloo management components are installed, and one or more workload clusters that run service meshes that are registered with and managed by Gloo Mesh Gateway. The management cluster serves as the management plane, and the workload clusters serve as the data plane, as depicted in the diagram in this section.

By default, the management server is deployed with one replica. To increase availability, you can increase the number of replicas that you deploy in the management cluster.
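As a minimal sketch of raising the replica count, the following Helm values fragment assumes that the deploymentOverrides setting (described under "Overrides for default components" on this page) accepts the standard Kubernetes Deployment field spec.replicas. The value 2 is only an illustration; verify the field against the Helm values documentation for your chart version.

```yaml
glooMgmtServer:
  deploymentOverrides:
    spec:
      # Assumption: the override is merged into the management server
      # Deployment, so the standard 'replicas' field applies here.
      replicas: 2
```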

Additionally, you can create multiple management clusters, and deploy one or more replicas of the management server to each cluster. For more information, see High availability and disaster recovery.

In a production deployment, you typically want to avoid installing the management plane into a workload cluster that also runs a service mesh. Although Gloo Mesh Gateway remains fully functional when the management and agent components both run within the same cluster, you might have noisy neighbor concerns in which workload pods consume cluster resources and potentially constrain the management processes. This constraint on management processes can in turn affect other workload clusters that the management components oversee. However, you can prevent resource consumption issues by using Kubernetes best practices, such as node affinity, resource requests, and resource limits. If you do run both in the same cluster, make sure that you use the same cluster name during both the management plane installation and cluster registration.
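One way to apply the node affinity practice is sketched below. It assumes that deploymentOverrides (described later on this page) passes standard Kubernetes pod scheduling fields through to the management server deployment. The node label key and value are hypothetical; replace them with a label that exists on your dedicated nodes.

```yaml
glooMgmtServer:
  deploymentOverrides:
    spec:
      template:
        spec:
          # Standard Kubernetes node affinity. Schedules the management
          # server only on nodes that carry the (hypothetical) label
          # example.com/node-pool=management.
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                  - matchExpressions:
                      - key: example.com/node-pool
                        operator: In
                        values:
                          - management
```

Resource requests and limits are deliberately left out of this sketch. They are set per container, and because list settings in overrides replace the default list, overriding the container list this way could drop required defaults.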

Figure of a multicluster Gloo quick-start architecture, with a dedicated management cluster.

Management plane settings

Before you install the Gloo management plane into your management cluster, review the following options to help secure your installation. Each section details the benefits of the security option, and the necessary settings to specify in a Helm values file to use during your Helm installation.

You can see all possible fields for the Helm chart by running the following command:

  helm show values gloo-platform/gloo-platform --version v2.4.16 > all-values.yaml

You can also review these fields in the Helm values documentation.

Certificate management

When you install Gloo Mesh Gateway by using meshctl or the instructions that are provided in the getting started guide, Gloo Mesh Gateway generates a self-signed root CA certificate and key that are used to generate the server TLS certificate for the Gloo management server. In addition, an intermediate CA certificate and key are generated that are used to sign client TLS certificates for every Gloo agent. For more information about the default setup, see Self-signed CAs with automatic client certificate rotation.

Using self-signed certificates and keys for the root CA and storing them on the management cluster is not a recommended security practice. The root CA certificate and key are highly sensitive and, if compromised, can be used to issue certificates for all agents in a workload cluster. In a production-level setup, store the root CA credentials with your preferred PKI provider, such as AWS Private CA, Google Cloud CA, or Vault, and use a certificate management tool, such as cert-manager, to automate the issuing and renewing of certificates.
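To illustrate what automated issuing and renewal can look like, the following is a minimal cert-manager Certificate sketch. It is not the setup from the linked guides. The resource name, the secret name, the durations, and the vault-issuer ClusterIssuer are hypothetical placeholders, and the DNS name is taken from the relay authority value that appears in the data plane settings on this page. Check the certificate guides for the secret names that your installation expects.

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: relay-server-tls          # hypothetical resource name
  namespace: gloo-mesh
spec:
  # Hypothetical secret name; it must match the secret that your
  # management server is configured to read its TLS certificate from.
  secretName: relay-server-tls-secret
  duration: 8760h                 # example: certificate valid for 1 year
  renewBefore: 720h               # example: renew 30 days before expiry
  commonName: gloo-mesh-mgmt-server.gloo-mesh
  dnsNames:
    - gloo-mesh-mgmt-server.gloo-mesh
  usages:
    - server auth
    - client auth
  issuerRef:
    # Hypothetical issuer that is backed by your PKI provider.
    name: vault-issuer
    kind: ClusterIssuer
    group: cert-manager.io
```

With a resource like this, cert-manager requests the certificate from the referenced issuer, stores it in the named secret, and renews it before it expires, so the root CA key never needs to be stored on the management cluster.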

Use the following links to learn about your setup options in production:

Overrides for default components

In some cases, you might need to modify the default deployment or service for the Gloo Mesh Gateway components, such as the management server or agent. To do so, you can configure the deploymentOverrides and serviceOverrides settings for each component in your Helm values file. Then, you can upgrade your Gloo Mesh Gateway installation to apply these new settings. Keep in mind that the component might be restarted in order to apply the new settings.

For settings that are key-value dictionaries, the overrides replace the values of any matching keys in the default template. If an override key does not match any existing key, the override is added to the existing values, as in the following example.

In the Helm values file, you configure override values for labels on the management server.

  
glooMgmtServer:
  deploymentOverrides:
    metadata:
      labels:
        # Set your own value to override the 'app' key.
        app: my-gloo
        # Add a new label for the team.
        team: infra

The default Helm template for the management server includes several labels.

apiVersion: apps/v1
kind: Deployment
metadata:
  ...
  labels:
    # The default value for the 'app' key.
    app: gloo-mesh-mgmt-server
    app.kubernetes.io/managed-by: Helm
  name: gloo-mesh-mgmt-server
  namespace: gloo-mesh
...

Notice that the app key’s value is overridden and the app.kubernetes.io/managed-by value is merged.

apiVersion: apps/v1
kind: Deployment
metadata:
  ...
  labels:
    # Overridden app label.
    app: my-gloo
    # Merged from the default template.
    app.kubernetes.io/managed-by: Helm
    # Merged from the override values.
    team: infra
  name: gloo-mesh-mgmt-server
  namespace: gloo-mesh
...

For settings that are lists, the overrides replace the entire list in the default template, as in the following example.

In the Helm values file, you configure the management server with one override volume for your own Redis.

  
glooMgmtServer:
  deploymentOverrides:
    spec:
      template:
        spec:
          volumes:
            - name: my-redis
              secret:
                secretName: my-redis
                optional: false

The default Helm template for the management server includes several volumes for different purposes, such as the product license, Redis authentication, and Redis configuration.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: gloo-mesh-mgmt-server
  namespace: gloo-mesh
spec:
  template:
    spec:
      volumes:
        - name: license-keys
          secret:
            secretName: license-keys
            optional: false
        - name: redis-auth-secrets
          secret:
            secretName: redis-auth-secrets
            optional: false
        - name: redis-client-config
          configMap:
            name: redis-client-config
            optional: false
        - name: redis-certs
          secret:
            secretName: redis-certs
            optional: false
...

Notice that the default template’s list of volumes is replaced entirely. Be especially careful that you do not accidentally override required volumes, such as the product license, or other list settings that you need, such as containers, env, or imagePullSecrets.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: gloo-mesh-mgmt-server
  namespace: gloo-mesh
spec:
  template:
    spec:
      volumes:
        - name: my-redis
          secret:
            secretName: my-redis
            optional: false
...
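One way to add a volume without dropping the defaults is to restate the default entries in the override list. The following sketch assumes that the default volumes are exactly the four shown in the template excerpt above. That excerpt is truncated, so compare the list against the rendered template for your chart version before you use it.

```yaml
glooMgmtServer:
  deploymentOverrides:
    spec:
      template:
        spec:
          volumes:
            # Default volumes, restated so that the list override keeps them.
            - name: license-keys
              secret:
                secretName: license-keys
                optional: false
            - name: redis-auth-secrets
              secret:
                secretName: redis-auth-secrets
                optional: false
            - name: redis-client-config
              configMap:
                name: redis-client-config
                optional: false
            - name: redis-certs
              secret:
                secretName: redis-certs
                optional: false
            # Additional volume for your own Redis.
            - name: my-redis
              secret:
                secretName: my-redis
                optional: false
```

The trade-off is that the restated defaults can drift from the chart defaults after an upgrade, so review the list each time you change chart versions.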

Example service override

Most commonly, the serviceOverrides section specifies cloud provider-specific annotations that might be required for your environment. For example, the following section applies the recommended Amazon Web Services (AWS) annotations for modifying the created load balancer service.

  
glooMgmtServer:
  serviceOverrides:
    metadata:
      annotations:
        # AWS-specific annotations
        service.beta.kubernetes.io/aws-load-balancer-healthcheck-healthy-threshold: "2"
        service.beta.kubernetes.io/aws-load-balancer-healthcheck-unhealthy-threshold: "2"
        service.beta.kubernetes.io/aws-load-balancer-healthcheck-interval: "10"
        service.beta.kubernetes.io/aws-load-balancer-healthcheck-port: "9900"
        service.beta.kubernetes.io/aws-load-balancer-healthcheck-protocol: "tcp"

        service.beta.kubernetes.io/aws-load-balancer-type: external
        service.beta.kubernetes.io/aws-load-balancer-scheme: internal
        service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
        service.beta.kubernetes.io/aws-load-balancer-backend-protocol: TCP
        service.beta.kubernetes.io/aws-load-balancer-private-ipv4-addresses: 10.0.50.50, 10.0.64.50
        service.beta.kubernetes.io/aws-load-balancer-subnets: subnet-0478784f04c486de5, subnet-09d0cf74c0117fcf3
        service.beta.kubernetes.io/aws-load-balancer-target-group-attributes: deregistration_delay.connection_termination.enabled=true,deregistration_delay.timeout_seconds=1
  # Kubernetes load balancer service type
  serviceType: LoadBalancer
  ...

You can apply service overrides to the following components:

glooAgent
glooAnalyzer
glooInsightsEngine
glooMgmtServer
glooPortalServer
glooSpireServer
glooUi
redis
redisStore for the management plane (insights and snapshot) and data plane (external auth service and rate limiter)

Example deployment overrides

For some components, you might want to modify the default deployment settings, such as the metadata or resource limits for CPU and memory. Or, you might want to provide your own resource such as a config map, service account, or volume that you mount to the deployment. This example shows how you might use the deploymentOverrides to specify a config map for a volume.

  
glooMgmtServer:
  deploymentOverrides:
    spec:
      template:
        spec:
          volumes:
            - name: envoy-config
              configMap:
                name: my-custom-envoy-config
  ...

You can apply deployment overrides to the following components:

glooAgent
glooAnalyzer
glooInsightsEngine
glooMgmtServer
glooPortalServer
glooSpireServer
glooUi
redis
redisStore for the management plane (insights and snapshot) and data plane (external auth service and rate limiter)

FIPS-compliant image

If your environment runs workloads that must comply with Federal Information Processing Standards (FIPS), you can use images of Gloo Mesh Gateway components that are specially built to comply with NIST FIPS. Open the values.yaml file, search for the image section, and append -fips to the tag, as in the following example.

  ...
glooMgmtServer:
  image:
    pullPolicy: IfNotPresent
    registry: gcr.io/gloo-mesh
    repository: gloo-mesh-mgmt-server
    tag: 2.4.16-fips

Licensing

During installation, you can provide your license key strings directly in license fields such as glooMeshLicenseKey. For a more secure setup, you might want to provide those license keys in a secret named license-secret instead. For more information, see Provide your license key during installation.
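As a rough sketch, a secret of that name might look like the following. The secret name and the gloo-mesh namespace come from this page. The data key name gloo-gateway-license-key is an assumption, and the license value is a placeholder; confirm the expected key names in the linked licensing guide.

```yaml
apiVersion: v1
kind: Secret
type: Opaque
metadata:
  name: license-secret
  namespace: gloo-mesh
stringData:
  # Assumed key name; confirm it against the licensing guide.
  gloo-gateway-license-key: "<your-license-key>"
```

Using stringData lets you supply the key as plain text and have Kubernetes encode it, which avoids base64 mistakes when you create the secret by hand.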

Prometheus metrics

By default, a Prometheus instance is deployed with the management plane Helm chart to collect metrics for the Gloo management server. For a production deployment, you can either replace the built-in Prometheus server with your own instance, or remove high cardinality labels. For more information on each option, see Customization options.

Redis instance

By default, a Redis instance is deployed for certain management plane components, such as the Gloo management server and Gloo UI. For a production deployment, you can disable the default Redis deployment and provide your own backing instance instead.

For more information, see Backing databases.

Redis safe mode

In versions 2.4.11 and earlier, a race condition was identified that can be triggered when the management plane and Redis restart simultaneously, including during an upgrade to a newer Gloo Mesh Gateway version. If triggered, this failure mode can lead to partial translations on the Gloo management server, which can result in Istio resources being temporarily deleted from the output snapshots that are sent to the Gloo agents. For more information about this failure scenario, see Redis and Gloo management server restart.

To resolve this issue, a safe mode feature was added. For more information, see Safe mode. With safe mode enabled, the Gloo management server halts translation until the input snapshots of all workload clusters are present in the Redis cache.

To enable safe mode, follow these general steps:

Scale down the number of Gloo management server pods to 0.

  kubectl scale deployment gloo-mesh-mgmt-server --replicas=0 -n gloo-mesh

Upgrade your Gloo Mesh Gateway installation. Add the following settings in the Helm values file for the Gloo management plane.
```
glooMgmtServer:
  safeMode: true
```
Scale the Gloo management server back up to the number of desired replicas. The following example uses 1 replica.
```
kubectl scale deployment gloo-mesh-mgmt-server --replicas=1 -n gloo-mesh
```

Safe start window

Safe mode halts translation until the input snapshots of all workload clusters are present in the Redis cache. However, if some clusters have connectivity issues, translation might be halted for a long time, even for healthy clusters. You might want translation to resume after a certain period of time, even if some input snapshots are missing. To do so, use the glooMgmtServer.safeStartWindow field in your Gloo management server Helm values file. This window is the time, in seconds, that the Gloo management server halts translation while it waits for the Gloo agents of all workload clusters to connect and send their input snapshots to populate the Redis cache.

To enable a safe start window as part of your Gloo Mesh Gateway upgrade:

Scale down the number of Gloo management server pods to 0.

  kubectl scale deployment gloo-mesh-mgmt-server --replicas=0 -n gloo-mesh

Upgrade your Gloo Mesh Gateway installation. Add the following settings in the Helm values file for the Gloo management plane.
```
glooMgmtServer:
  safeMode: false
  safeStartWindow: 90
```
Scale the Gloo management server back up to the number of desired replicas. The following example uses 1 replica.
```
kubectl scale deployment gloo-mesh-mgmt-server --replicas=1 -n gloo-mesh
```

Redis I/O threads

If you plan to use the built-in Redis instance in production and you experience performance issues, you can increase the number of I/O threads in Redis by using the redis.deployment.ioThreads Helm option. Redis is mostly single threaded. However, some operations, such as UNLINK or slow I/O accesses, can be performed on side threads. Increasing the number of side threads can help improve the performance of Redis because these operations can run in parallel.

The default and minimum valid value for this setting is 1. If you plan to increase the number of I/O side threads, make sure that you also change the CPU requests and CPU limits for the Redis pod. Set the CPU requests and limits to the same number that you use for the I/O side threads plus 1. That way, you can ensure that each side thread has an available CPU core, and that an additional CPU core is left for the main Redis thread. For example, if you want to set I/O threads to 2, make sure to add 3 CPU cores to the resource requests and limits for the Redis pod. You can find further recommendations regarding I/O threads in this Redis configuration example.

If you set I/O threads, the Redis pod must be restarted during the upgrade so that the changes can be applied. During the restart, the input snapshots from all connected Gloo agents are removed from the Redis cache. If you also update settings in the Gloo management server that require the management server pod to restart, the management server’s local memory is cleared and all Gloo agents are disconnected. Although the Gloo agents attempt to reconnect to send their input snapshots and re-populate the Redis cache, some agents might take longer to connect or fail to connect at all. To ensure that the Gloo management server halts translation until the input snapshots of all workload cluster agents are present in Redis, it is recommended to enable safe mode on the management server alongside updating the I/O threads for the Redis pod. For more information, see Safe mode. Note that in version 2.6.0 and later, safe mode is enabled by default.

To update I/O side threads in Redis as part of your Gloo Mesh Gateway upgrade:

Scale down the number of Gloo management server pods to 0.

  kubectl scale deployment gloo-mesh-mgmt-server --replicas=0 -n gloo-mesh

Upgrade Gloo Mesh Gateway and use the following settings in your Helm values file for the management server. Make sure to also increase the number of CPU cores to one core per thread, and add an additional CPU core for the main Redis thread. The following example also enables safe mode on the Gloo management server to ensure translation is done with the complete context of all workload clusters.
```
glooMgmtServer:
  safeMode: true
redis:
  deployment:
    ioThreads: 2
    resources:
      requests:
        cpu: 3
      limits:
        cpu: 3
```
Scale the Gloo management server back up to the number of desired replicas. The following example uses 1 replica.
```
kubectl scale deployment gloo-mesh-mgmt-server --replicas=1 -n gloo-mesh
```

Break up large Envoy filters

Some Gloo policies, such as JWT or other external auth policies, are translated into Envoy filters during the Gloo translation process. These Envoy filters are stored in the Kubernetes data store etcd alongside other Gloo configurations and applied to the ingress gateway or sidecar proxy to enforce the policies. In environments where you apply policies to many apps and routes, an Envoy filter can become very large and exceed the maximum file size limit in etcd. When the limit is reached, new configuration is rejected in etcd and Istio, and policies are not applied or enforced properly.

To prevent this issue in your environment, it is recommended to set the new EXPERIMENTAL_SEGMENT_ENVOY_FILTERS_BY_MATCHER environment variable on the Gloo management server to instruct the server to break up large Envoy filters into multiple smaller Envoy filters. In your Helm values file for the Gloo management server, add the following snippet:

  
glooMgmtServer:
  extraEnvs:
    EXPERIMENTAL_SEGMENT_ENVOY_FILTERS_BY_MATCHER:
      value: "true"

Important: To safely upgrade and ensure that existing Envoy filters are correctly re-created, the Gloo management server and the Istio control plane istiod must temporarily be scaled down to 0 replicas. This upgrade procedure can have the following implications for your environment:

Delayed configuration updates: During the upgrade, the Gloo management server and istiod control plane are temporarily scaled down. Because of that, the propagation of configuration changes to the sidecar or gateway proxy, such as new routing rules or security policies, is delayed. This can cause inconsistencies in traffic management and policy enforcement.
Complex environments with long translation times: If you have a complex environment and your average translation time regularly takes more than 60 seconds, scaling down istiod might have unexpected impacts and delay the time for your traffic to continue as normal.
New pods cannot be added to the mesh: The Istio control plane istiod implements the sidecar injection webhook. When the control plane is scaled down, sidecar injection does not work and new pods cannot be added to the service mesh. You can manually inject sidecars into your pods. However, keep in mind that these pods do not receive traffic as endpoint discovery is also disabled when the Istio control plane is scaled down. After the control plane is scaled back up, pods are automatically injected with sidecars and added to the mesh.
mTLS certificate issues: If certificates expire while the Istio control plane is not available, mutual TLS between services in the mesh might be impacted.

Note that the EXPERIMENTAL_SEGMENT_ENVOY_FILTERS_BY_MATCHER environment variable is removed in Gloo Mesh Gateway version 2.5.0, because Envoy filter segmentation is promoted to standard behavior and enabled by default. In version 2.5.0 and later, you no longer need to set the environment variable. If you want to enable this feature in version 2.3.x or 2.4.x, use the upgrade steps in version 2.5 as general guidance for how to safely scale down the Gloo management server, Gloo agent, and istiod, and re-create the Envoy filters in your environment.

UI authentication

The Gloo UI supports OpenID Connect (OIDC) authentication from common providers such as Google, Okta, and Auth0. Users who access the UI must authenticate with the OIDC provider, and all requests to retrieve data from the API are authenticated.

You can configure OIDC authentication for the UI by providing your OIDC provider details in the glooUi section, such as the following.

  ...
glooUi:
  enabled: true
  auth:
    enabled: true
    backend: oidc
    oidc:
      appUrl: # The URL that the UI for the OIDC app is available at, from the DNS and other ingress settings that expose the OIDC app UI service.
      clientId: # From the OIDC provider
      clientSecret: # From the OIDC provider. Stored in a secret.
      clientSecretName: dashboard
      issuerUrl: # The issuer URL from the OIDC provider, usually something like 'https://<domain>.<provider_url>/'.

Data plane settings

Before you register workload clusters with Gloo Mesh Gateway, review the following options to help secure your registration. Each section details the benefits of the security option, and the necessary settings to specify in a Helm values file to use during your Helm registration.

You can see all possible fields for the Helm chart by running the following command:

  helm show values gloo-platform/gloo-platform --version v2.4.16 > all-values.yaml

You can also review these fields in the Helm values documentation.

FIPS-compliant image

  ...
glooAgent:
  image:
    pullPolicy: IfNotPresent
    registry: gcr.io/gloo-mesh
    repository: gloo-mesh-agent
    tag: 2.4.16-fips

Certificate management

If you use the default self-signed certificates during Gloo Mesh Gateway installation, you can follow the steps in the cluster registration documentation to use these certificates during cluster registration. If you set up Gloo Mesh Gateway without secure communication for quick demonstrations, include the --set insecure=true flag during registration. Note that neither the default self-signed certificate authorities (CAs) nor insecure mode is suitable for production environments.

In production environments, use the same custom certificates during cluster registration that you set up for the Gloo Mesh Gateway installation:

Ensure that when you installed Gloo Mesh Gateway, you set up the relay certificates, such as with AWS Certificate Manager, HashiCorp Vault, or your own custom certs, including the relay forwarding and identity secrets in the management and workload clusters.

The relay certificate instructions include steps to modify your Helm values file to use the custom CAs, such as in the following relay section. Note that you might need to update the clientTlsSecret name and rootTlsSecret name values, depending on your certificate setup.

  
common:
  insecure: false
glooAgent:
  insecure: false
  relay:
    authority: gloo-mesh-mgmt-server.gloo-mesh
    clientTlsSecret:
      name: gloo-mesh-agent-$REMOTE_CLUSTER-tls-cert
      namespace: gloo-mesh
    rootTlsSecret:
      name: relay-root-tls-secret
      namespace: gloo-mesh
    serverAddress: $MGMT_SERVER_NETWORKING_ADDRESS
...

Kubernetes RBAC

To review the permissions of deployed Gloo components such as the management server and agent, see Gloo component permissions.
