Best practices for production

Review the following recommended practices for preparing optional security measures and setting up Gloo Mesh Enterprise in a production environment.

Deployment model

A production Gloo Mesh setup consists of one management cluster that the Gloo Mesh management components are installed in, and one or more workload clusters that run service meshes, which are registered with and managed by Gloo Mesh. The management cluster serves as the management plane, and the workload clusters serve as the data plane, as depicted in the following diagram.

By default, the management server is deployed with one replica. To increase availability, you can increase the number of replicas that you deploy in the management cluster. Additionally, you can create multiple management clusters, and deploy one or more replicas of the management server to each cluster. For more information, see High availability and disaster recovery.
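
For example, one way to raise the replica count is through the glooMgmtServer.deploymentOverrides field that is described later on this page. The following is a minimal sketch, assuming that the override accepts standard Kubernetes Deployment fields:

glooMgmtServer:
  deploymentOverrides:
    spec:
      # Run multiple replicas of the management server for higher availability
      replicas: 3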

In a production deployment, you typically want to avoid installing the management plane into a workload cluster that also runs a service mesh. Although Gloo Mesh remains fully functional when the management and agent components both run within the same cluster, you might have noisy neighbor concerns in which workload pods consume cluster resources and potentially constrain the management processes. This constraint on management processes can in turn affect other workload clusters that the management components oversee. However, you can prevent resource consumption issues by using Kubernetes best practices, such as node affinity, resource requests, and resource limits, as sketched after the following diagram. Note that if you do run the management plane and a service mesh in the same cluster, you must use the same name for the cluster during both the management plane installation and cluster registration.

Figure of a multicluster Gloo Mesh quick-start architecture, with a dedicated management cluster.
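
To limit noisy neighbor effects in a shared cluster, you might constrain the management server's resource consumption through the same deploymentOverrides mechanism. The following is a minimal sketch; the container name and the resource values are illustrative assumptions, not sizing recommendations:

glooMgmtServer:
  deploymentOverrides:
    spec:
      template:
        spec:
          containers:
            - name: gloo-mesh-mgmt-server  # assumed container name
              resources:
                requests:
                  cpu: "1"      # illustrative values; size for your environment
                  memory: 1Gi
                limits:
                  memory: 2Gi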

Management plane settings

Before you install the Gloo Mesh management plane into your management cluster, review the following options to help secure your installation. Each section details the benefits of the security option, and the necessary settings to specify in a Helm values file to use during your Helm installation.

You can see all possible fields for the Helm chart by running the following command:

helm show values gloo-platform/gloo-platform --version v2.5.4 > all-values.yaml

You can also review these fields in the Helm values documentation.

Licensing

During installation, you can provide your license key strings directly in license fields such as glooMeshLicenseKey. For a more secure setup, you might want to provide those license keys in a secret instead.

  1. Before you install Gloo Mesh, create a secret with your license keys in the gloo-mesh namespace of your management cluster.
    cat << EOF | kubectl apply -n gloo-mesh -f -
    apiVersion: v1
    kind: Secret
    type: Opaque
    metadata:
      name: license-secret
      namespace: gloo-mesh
    data:
      # Provide each license key as a base64-encoded string
      gloo-mesh-license-key: ""
      gloo-network-license-key: ""
      gloo-gateway-license-key: ""
      gloo-trial-license-key: ""
    EOF
    
  2. When you install the Gloo Mesh management plane in your management cluster, specify the secret name as the value for the licensing.licenseSecretName field in your Helm values file.
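
For example, the corresponding section of your Helm values file might look like the following. This is a minimal sketch that assumes the secret name from the previous step:

licensing:
  # Name of the secret in the gloo-mesh namespace that contains the license keys
  licenseSecretName: license-secret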

FIPS-compliant image

If your environment runs workloads that require federal information processing compliance, you can use images of Gloo Mesh Enterprise components that are specially built to comply with NIST FIPS. Open the values.yaml file, search for the image section, and append -fips to the tag, such as in the following example.
...
glooMgmtServer:
  image:
    pullPolicy: IfNotPresent
    registry: gcr.io/gloo-mesh
    repository: gloo-mesh-mgmt-server
    tag: 2.5.4-fips

Certificate management

When you install Gloo Mesh Enterprise by using meshctl or the instructions in the getting started guide, Gloo Mesh generates a self-signed root CA certificate and key that are used to generate the server TLS certificate for the Gloo management server. In addition, an intermediate CA certificate and key are generated and used to sign client TLS certificates for every Gloo agent. For more information about the default setup, see Option 2: Gloo Mesh self-signed CAs with automatic client certificate rotation.

Using self-signed certificates and keys for the root CA and storing them on the management cluster is not a recommended security practice. The root CA certificate and key are highly sensitive: if compromised, they can be used to issue certificates for all agents in a workload cluster. In a production-level setup, make sure that the root CA credentials are properly stored with your preferred PKI provider, such as AWS Private CA, Google Cloud CA, or Vault, and that you use a certificate management tool, such as cert-manager, to automate issuing and renewing certificates.

Use the following links to learn about your setup options in production:
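
To illustrate the idea, the following is a minimal cert-manager sketch that issues a server TLS certificate for the management server from an existing issuer. The issuer and secret names are assumptions for illustration only; follow the linked guides for the exact setup that matches your PKI provider.

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: gloo-mesh-mgmt-server-tls
  namespace: gloo-mesh
spec:
  # Secret that the signed certificate and key are written to (assumed name)
  secretName: relay-server-tls-secret
  # Issuer that is backed by your PKI provider, such as Vault (assumed name)
  issuerRef:
    name: relay-root-issuer
    kind: Issuer
  # DNS name that the Gloo agents use to reach the management server
  dnsNames:
    - gloo-mesh-mgmt-server.gloo-mesh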

Deployment and service overrides

In some cases, you might need to modify the default deployment of the glooMgmtServer with your own Kubernetes resources. You can specify resources and annotations for the management server deployment in the glooMgmtServer.deploymentOverrides field, and resources and annotations for the service that exposes the deployment in the glooMgmtServer.serviceOverrides field.

Most commonly, the serviceOverrides section specifies cloud provider-specific annotations that might be required for your environment. For example, the following section applies the recommended Amazon Web Services (AWS) annotations for modifying the created load balancer service.

glooMgmtServer:
  serviceOverrides:
    metadata:
      annotations:
        # AWS-specific annotations
        service.beta.kubernetes.io/aws-load-balancer-healthcheck-healthy-threshold: "2"
        service.beta.kubernetes.io/aws-load-balancer-healthcheck-unhealthy-threshold: "2"
        service.beta.kubernetes.io/aws-load-balancer-healthcheck-interval: "10"
        service.beta.kubernetes.io/aws-load-balancer-healthcheck-port: "9900"
        service.beta.kubernetes.io/aws-load-balancer-healthcheck-protocol: "tcp"

        service.beta.kubernetes.io/aws-load-balancer-type: external
        service.beta.kubernetes.io/aws-load-balancer-scheme: internal
        service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
        service.beta.kubernetes.io/aws-load-balancer-backend-protocol: TCP
        service.beta.kubernetes.io/aws-load-balancer-private-ipv4-addresses: 10.0.50.50, 10.0.64.50
        service.beta.kubernetes.io/aws-load-balancer-subnets: subnet-0478784f04c486de5, subnet-09d0cf74c0117fcf3
        service.beta.kubernetes.io/aws-load-balancer-target-group-attributes: deregistration_delay.connection_termination.enabled=true,deregistration_delay.timeout_seconds=1
  # Kubernetes load balancer service type
  serviceType: LoadBalancer
  ...

In less common cases, you might want to provide other resources, like a config map or service account. This example shows how you might use the deploymentOverrides to add a config map as a volume in the deployment spec.

glooMgmtServer:
  deploymentOverrides:
    spec:
      template:
        spec:
          volumes:
            - name: envoy-config
              configMap:
                name: my-custom-envoy-config
  ...
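
For the config map to be available to the management server process, the container must also mount the volume. The following is a minimal sketch, assuming the container name gloo-mesh-mgmt-server and an illustrative mount path:

glooMgmtServer:
  deploymentOverrides:
    spec:
      template:
        spec:
          containers:
            - name: gloo-mesh-mgmt-server  # assumed container name
              volumeMounts:
                - name: envoy-config
                  mountPath: /etc/envoy    # illustrative mount path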

UI authentication

The Gloo UI supports OpenID Connect (OIDC) authentication from common providers such as Google, Okta, and Auth0. Users that access the UI will be required to authenticate with the OIDC provider, and all requests to retrieve data from the API will be authenticated.

You can configure OIDC authentication for the UI by providing your OIDC provider details in the glooUi section, such as the following.

...
glooUi:
  enabled: true
  auth:
    enabled: true
    backend: oidc
    oidc:
      appUrl: # The URL that the UI for the OIDC app is available at, from the DNS and other ingress settings that expose the OIDC app UI service.
      clientId: # From the OIDC provider
      clientSecret: # From the OIDC provider. Stored in a secret.
      clientSecretName: dashboard
      issuerUrl: # The issuer URL from the OIDC provider, usually something like 'https://<domain>.<provider_url>/'.
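
The client secret itself is not set inline; it is read from the secret that clientSecretName references. The following is a minimal sketch of creating that secret; the key name oidc-client-secret is an assumption, so check the Gloo UI authentication documentation for the exact secret format:

kubectl create secret generic dashboard \
  --namespace gloo-mesh \
  --from-literal=oidc-client-secret=$OIDC_CLIENT_SECRET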

Redis instance

By default, a Redis instance is deployed for certain management plane components, such as the Gloo management server and Gloo UI. For a production deployment, you can disable the default Redis deployment and provide your own backing database instead.

For more information, see Backing databases.
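
As a rough sketch, disabling the built-in Redis deployment and pointing the management server at your own instance might look like the following. The exact field names depend on your chart version, so verify them against the Helm values documentation; the address is illustrative:

redis:
  deployment:
    # Disable the built-in Redis deployment
    enabled: false
glooMgmtServer:
  redis:
    # Endpoint of your own backing Redis instance (illustrative)
    address: redis.example.internal:6379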

Redis safe mode options

Configure how the Gloo management server handles translation if Redis restarts. Translation is how your Gloo custom resources are turned into Istio resources to configure your gateway and service mesh. For more information, see Gloo custom resource translation.

Redis is a key component of the Gloo management plane. Gloo stores the state of all your Gloo custom resources in the Redis cache, in the form of an input snapshot for each registered cluster. The Gloo management server translates the input snapshots of Gloo configuration into output snapshots that contain gateway and service mesh-specific Istio resources. Then, the Gloo agent in each cluster applies the resources in the output snapshots to configure your gateway and service mesh.

During startup, Gloo Mesh Enterprise waits 180 seconds before starting translation. This warmup period lets the agents in the workload clusters connect for the first time. Then, they can send their input snapshots to populate the Redis cache. This way, translation begins with a more complete context of resources across workload clusters.

In the event that Redis restarts, such as due to insufficient memory (OOMKill), the Redis cache loses all its data, including the input snapshots. When the agents reconnect to the management server, the input snapshots are re-populated in Redis and a new translation is started. However, if you have a multicluster setup and the workload clusters interact with each other, such as when you set up multicluster routing with virtual destinations, the translated output snapshot might be incomplete if other agents take longer to reconnect to the management server or fail to reconnect at all. If incomplete output snapshots are applied in the workload cluster, Istio resources might get removed, which can break multicluster routing.

Default setup

Review the following scenario to understand the default behavior in Gloo Mesh Enterprise during a Redis restart.

The following image shows a management cluster and two successfully registered workload clusters. Gloo resources that are applied in each workload cluster are sent as input snapshots to the management server where they are stored in the Redis cache. During the translation, the management server evaluates the input snapshots for all clusters to create the appropriate output snapshot for each workload cluster. The output snapshot is returned to the workload cluster. The agent uses the output snapshot to create and modify your gateway and service mesh resources.

Figure: Initial state (default mode)
In this scenario, cluster 1 disconnects and does not send updated input snapshots to the management server. However, because the input snapshot for cluster 1 is still present in the Redis cache, the management server includes the last known input snapshot of cluster 1 in the creation of the output snapshot for cluster 2. Multicluster routing continues to work across clusters. However, configuration updates from cluster 1 are not applied until cluster 1 reconnects to the management server.

Figure: Cluster 1 disconnects (default mode)
In this scenario, Redis restarts, which causes the Redis cache to be wiped. Cluster 1 fails to reconnect to the management server. The Redis cache for cluster 1 is not re-populated. However, cluster 2 reconnects successfully and sends an input snapshot that re-populates the Redis cache for cluster 2. During the translation, the management server considers the input snapshot of cluster 2 only and creates an output snapshot that is missing the multicluster routing resources for cluster 1. When the output snapshot is applied in cluster 2, resources might get removed and multicluster routing between cluster 1 and cluster 2 might break.

Figure: Redis restarts (default mode)

Review the options that you have to prevent incomplete translations from being applied in workload clusters. The option that is right for you depends on your environment, especially the number of workload clusters that are connected to the Gloo management server, your network latency, and Gloo translation times.

Option 1: Safe mode

In the event that Redis restarts and has its cache deleted, the Gloo management server halts translation. Translation does not resume until the agents in each workload cluster reconnect to the management server. At that point, the management server uses the agents’ input snapshots to re-populate the Redis cache. Then, the management server resumes translation and provides an updated output snapshot back to the agents. Until translation resumes, the agents use the last provided output snapshot. This way, the agents only apply and modify your resources based on a complete translation context.

To enable this setting, add the following values to the Helm values file for the Gloo management plane:

glooMgmtServer:
  safeMode: true
featureGates: 
  safeMode: true  

Review the following example to learn more about how safe mode works.

The following image shows a management cluster and two successfully registered workload clusters. The Gloo management server is configured for safe mode. Gloo resources that are applied in each workload cluster are sent as input snapshots to the management server where they are stored in the Redis cache. During the translation, the management server evaluates the input snapshots for all clusters to create the appropriate output snapshot for each workload cluster. The output snapshot is returned to the workload cluster.

Figure: Initial state (safe mode)
In this scenario, cluster 1 disconnects and does not send updated input snapshots to the management server. However, because the input snapshot for cluster 1 is still present in the Redis cache, the management server includes the last known input snapshot of cluster 1 in the creation of the output snapshot for cluster 2. Multicluster routing continues to work across clusters. However, configuration updates from cluster 1 are not applied until cluster 1 reconnects to the management server.

Figure: Cluster 1 disconnects (safe mode)
In this scenario, Redis restarts, which causes the Redis cache to be wiped. Cluster 1 fails to reconnect to the management server. The Redis cache for cluster 1 is not re-populated. However, cluster 2 reconnects successfully and sends an input snapshot that re-populates the Redis cache for cluster 2. The management server halts the translation for all workload clusters until the Redis cache is re-populated for all workload clusters. Cluster 2 continues to serve the old configuration that was last sent by the Gloo management server. Because of that, multicluster routing between cluster 1 and cluster 2 continues to work.

Figure: Redis restarts (safe mode)

Safe mode applies only to workload clusters that were connected to the Gloo management server before Redis restarted. If you registered a workload cluster, but the cluster never connected to the management server to populate the Redis cache before the Redis restart, the cluster is not included in safe mode and can receive incomplete translations if it connects to the management server while safe mode is in effect.

Exclude clusters from safe mode

You can optionally exclude clusters from safe mode by adding the skipWarming option to the KubernetesCluster custom resource that represents the cluster that you want to exclude. The skipWarming setting instructs the Gloo management server to start the translation even if the Redis cache was not populated with the latest input snapshot for that cluster.

kubectl apply --context $MGMT_CONTEXT -f- <<EOF
apiVersion: admin.gloo.solo.io/v2
kind: KubernetesCluster
metadata:
  name: $REMOTE_CLUSTER1
  namespace: gloo-mesh
  labels:
    env: prod
spec:
  clusterDomain: cluster.local
  skipWarming: true
EOF

For example, you might want to register a new workload cluster. During the registration process, the KubernetesCluster resource is created. However, the agent in the workload cluster cannot connect to the management server due to a connectivity issue. If safe mode is turned on, translation for all workload clusters halts until the connectivity issue is resolved and an input snapshot is sent from the newly registered cluster so that the Redis cache is populated. You can prevent this scenario by setting skipWarming: true in the KubernetesCluster resource of the workload cluster that you want to register. After the cluster connects successfully, you can remove this setting or explicitly set skipWarming: false to include the cluster in safe mode.

Option 2: Safe start window

With safe mode, the Gloo management server halts translation until the input snapshots of all workload clusters are in the Redis cache. However, if clusters have connectivity issues, translation might be halted for a long time, even for healthy clusters. You might want translation to resume after a certain period of time, even if some input snapshots are missing. To do so, use the glooMgmtServer.safeStartWindow field in your Gloo management server Helm values file. This window represents the maximum time in seconds that the Gloo management server waits for the Gloo agents of all workload clusters to connect and send their input snapshots to populate the Redis cache, as described in Option 1: Safe mode. After the time expires, the default behavior kicks in and the management server starts translation by using whatever input snapshots are in Redis. Missing snapshots are not included in the output snapshot for each workload cluster.

The default wait time of the Gloo management server is 180 seconds. If you do not want the management server to wait, you can set this option to 0 (zero). However, keep in mind that setting this option to 0 can lead to incomplete output snapshots in multicluster setups.

To set a safe start window, add the following values to your Helm values file for the Gloo management plane:

glooMgmtServer:
  safeStartWindow: 90
featureGates: 
  safeMode: true  

If you enabled safe mode on the Gloo management server, the safe start window setting is ignored.

Prometheus metrics

By default, a Prometheus instance is deployed with the management plane Helm chart to collect metrics for the Gloo Mesh management server. For a production deployment, you can either replace the built-in Prometheus server with your own instance, or remove high cardinality labels. For more information on each option, see Customization options.
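
For reference, turning off the built-in Prometheus instance so that you can bring your own is a single Helm toggle. The following is a minimal sketch; verify the field against the Helm values documentation for your chart version:

prometheus:
  # Disable the built-in Prometheus server in favor of your own instance
  enabled: false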

Data plane settings

Before you register workload clusters with Gloo Mesh, review the following options to help secure your registration. Each section details the benefits of the security option and the necessary settings to specify in a Helm values file when you register clusters with Helm.

You can see all possible fields for the Helm chart by running the following command:

helm show values gloo-platform/gloo-platform --version v2.5.4 > all-values.yaml

You can also review these fields in the Helm values documentation.

FIPS-compliant image

If your environment runs workloads that require federal information processing compliance, you can use images of Gloo Mesh Enterprise components that are specially built to comply with NIST FIPS. Open the values.yaml file, search for the image section, and append -fips to the tag, such as in the following example.
...
glooAgent:
  image:
    pullPolicy: IfNotPresent
    registry: gcr.io/gloo-mesh
    repository: gloo-mesh-agent
    tag: 2.5.4-fips

Certificate management

If you use the default self-signed certificates during Gloo Mesh installation, you can follow the steps in the cluster registration documentation to use these certificates during cluster registration. If you set up Gloo Mesh without secure communication for quick demonstrations, include the --set insecure=true flag during registration. Note that using the default self-signed certificate authorities (CAs) or using insecure mode are not suitable for production environments.

In production environments, you use the same custom certificates that you set up for Gloo Mesh installation during cluster registration:

  1. Ensure that when you installed Gloo Mesh, you set up the relay certificates, such as with AWS Certificate Manager, HashiCorp Vault, or your own custom certs, including the relay forwarding and identity secrets in the management and workload clusters.
  2. The relay certificate instructions include steps to modify your Helm values file to use the custom CAs, such as in the following relay section. Note that you might need to update the clientTlsSecret name and rootTlsSecret name values, depending on your certificate setup.
common:
  insecure: false
glooAgent:
  insecure: false
  relay:
    authority: gloo-mesh-mgmt-server.gloo-mesh
    clientTlsSecret:
      name: gloo-mesh-agent-$REMOTE_CLUSTER-tls-cert
      namespace: gloo-mesh
    rootTlsSecret:
      name: relay-root-tls-secret
      namespace: gloo-mesh
    serverAddress: $MGMT_SERVER_NETWORKING_ADDRESS
...

Kubernetes RBAC

For information about controlling access to your Gloo resources with Kubernetes role-based access control (RBAC), see User access.

To review the permissions of deployed Gloo components such as the management server and agent, see Gloo component permissions.