Production Deployments

This document shows some of the Production options that may be useful. We will continue to add to this document and welcome users of Gloo Edge to send PRs to this as well.

Safeguarding the control plane configuration

Enable replacing invalid routes

In some cases it may be desirable to update a virtual service even if its config becomes partially invalid. This is particularly useful when delegating to Route Tables as it ensures that a single Route Table will not block updates for other Route Tables which share the same Virtual Service. More information on why and how to enable this can be found here

Example:

gloo:
  settings:
    invalidConfigPolicy:
      invalidRouteResponseBody: Gloo Gateway has invalid configuration. Administrators
        should run `glooctl check` to find and fix config errors.
      invalidRouteResponseCode: 404
      replaceInvalidRoutes: true

OpenAPI schema validation on CRDs

Since Gloo v1.8, CRDs are given OpenAPI schemas. This way, the kube api-server (validation webhook) will refuse Custom Resources with invalid definitions.

More validation hooks

In addition to the CRD schemas, Gloo can perform a deeper inspection of the Custom Resources.

You can use these flags:

gloo:
  gateway:
    validation:
      allowWarnings: false # reject if warning status or rejected status
      alwaysAcceptResources: false # reject invalid resources
      warnRouteShortCircuiting: true

xDS Relay

This feature is not supported for the following non-default installation modes of Gloo Edge: REST Endpoint Discovery (EDS), Gloo Edge mTLS mode, Gloo Edge with Istio mTLS mode

To protect against control plane downtime, you can install Gloo Edge alongside the xds-relay Helm chart. This Helm chart installs a deployment of xds-relay pods that serve as intermediaries between Envoy proxies and the xDS server of Gloo Edge.

The presence of xds-relay intermediary pods serve two purposes. First, it separates the lifecycle of Gloo Edge from the xDS cache proxies. For example, a failure during a Helm upgrade will not cause the loss of the last valid xDS state. Second, it allows you to scale xds-relay to as many replicas as needed, since Gloo Edge uses only one replica. Without xds-relay, a failure of the single Gloo Edge replica causes any new Envoy proxies to be created without a valid configuration.

To enable:

Install the xds-relay Helm chart, which supports version 2 and 3 of the Envoy API:

If needed, change the default values for the xds-relay chart:

deployment:
  replicas: 3
  image:
    pullPolicy: IfNotPresent
    registry: gcr.io/gloo-edge
    repository: xds-relay
    tag: %version%
# might want to set resources for prod deploy, e.g.:
#  resources:
#    requests:
#      cpu: 125m
#      memory: 256Mi
service:
  port: 9991
bootstrap:
  cache:
    # zero means no limit
    ttl: 0s
    # zero means no limit
    maxEntries: 0
  originServer:
    address: gloo.gloo-system.svc.cluster.local
    port: 9977
    streamTimeout: 5s
  logging:
    level: INFO
# might want to add extra, non-default identifiers
#extraLabels:
#  k: v
#extraTemplateAnnotations:
#  k: v
gatewayProxies:
  gatewayProxy: # do the following for each gateway proxy
    xdsServiceAddress: xds-relay.default.svc.cluster.local
    xdsServicePort: 9991

Performance tips

Disable Kubernetes destinations

Gloo Edge routes to upstreams by default, but it can alternatively be configured to bypass upstreams and route directly to Kubernetes destinations. Because routing to upstreams is the recommended configuration, you can disable the option to route to the Kubernetes destinations with the settings.disableKubernetesDestinations: true setting. This setting saves memory because the Gloo Edge pod doesn’t cache both upstreams and Kubernetes destinations.

You can set this value in the default Settings CR by adding the following content:

apiVersion: gloo.solo.io/v1
kind: Settings
metadata:
  name: default
  namespace: gloo-system
spec:
  gloo:
    disableKubernetesDestinations: true
    ...

You can set this value in your Helm overrides file by adding the following setting:

settings:
  disableKubernetesDestinations: true

Configure appropriate resource usage

Before running in production it is important to ensure you have correctly configured the resources allocated to the various components of Gloo Edge. Ideally this tuning will be done in conjunction with load/performance testing.

These values can be configured via helm values for the various deployments, such as

See the helm chart value reference for a full list.

Transformations

gloo:
  gateway:
    validation:
      disableTransformationValidation: true # better performances but more risky

Discovery

If you have hundreds of Kubernetes services, you can get better performances on the control-plane if you disable the Upstream and Function discovery.

gloo:
  discovery:
    enabled: false

You can also patch the default Settings CR with this value and delete the discovery deployment.

Enable Envoy’s gzip filter

Optionally, you may choose to enable Envoy’s gzip filter through Gloo Edge. More information on that can be found here.

Set up an EDS warming timeout

Set up the endpoints warming timeout to a non-zero value. More details here.

Access Logging

Envoy provides a powerful access logging mechanism which enables users and operators to understand the various traffic flowing through the proxy. Before deploying Gloo Edge in production, consider enabling access logging to help with monitoring traffic as well as to provide helpful information for troubleshooting. The access logging documentation should be consulted for more details.

Make sure you have the %RESPONSE_FLAGS% item in the log pattern.

Horizontally scaling the data plane

Gateway proxy

You can scale up the gateway-proxies Envoy instances by using a Deployment or a DeamonSet. The amount of requests that the gateway-proxies can process increases with the amount of CPU that the gateway-proxies have access to.

Ext-Auth

You can also scale up the ExtAuth service. Typically, one to two instances are sufficient.

Horizontally scaling the control plane

DO NOT scale the control plane components, such as the Gateway deployment, the Gloo deployment or the Discovery deployment. Scaling these components provides no benefit and can lead to race conditions.

If you are using the xDS-Relay architecture, you can scale the xds-relay deployment up.

Enhancing the data-plane reliability

Downstream to Envoy health checks

Liveness/readiness probes on Envoy are disabled by default. This is because Envoy’s behavior can be surprising: When there are no routes configured, Envoy reports itself as un-ready. As it becomes configured with a nonzero number of routes, it will start to report itself as ready.

Upstream health checks

In addition to defining health checks for Envoy, you should strongly consider defining health checks for your Upstreams. These health checks are used by Envoy to determine the health of the various upstream hosts in an upstream cluster, for example checking the health of the various pods that make up a Kubernetes Service. This is known as “active health checking” and can be configured on the Upstream resource directly. See the documentation for additional info.

Other considerations

Additionally, “outlier detection” can be configured which allows Envoy to passively check the health of upstream hosts. A helpful overview of this feature is available in Envoy’s documentation. This can be configured via the outlierDetection field on the Upstream resource. See the API reference for more detail .

Also, consider using retries on your routes. The default value for this attribute is 1, which you can increase to 3 for better results.

Metrics and monitoring

Grafana dashboards

A dedicated Grafana dashboard exists for each Upstream Custom Resource. These dashboards can help you verify whether your active/passive health checks are working, as well as provide insight into retries, requests rate and latency, responses codes, number of active connections, and more.

Upstream dashboard - part 1

Upstream dashboard - part 2

Upstream dashboard - part 3

Prometheus

Gloo Edge default prometheus server and grafana instance are not meant to be used as-is in production. Please provide your own instance or configure the provided one with production values

When running Gloo Edge (or any application for that matter) in a production environment, it is important to have a monitoring solution in place. Gloo Edge Enterprise provides a simple deployment of Prometheus and Grafana to assist with this necessity. However, depending on the requirements on your organization you may require a more robust solution, in which case you should make sure the metrics from the Gloo Edge components (especially Envoy) are available in whatever solution you are using. The general documentation for monitoring/observability has more info.

Some metrics that may be useful to monitor (listed in Prometheus format):

Troubleshooting monitoring components

A common issue in production (or environments with high traffic) is to have sizing issues. This will result in an abnormal number of restarts like this:

$ kubectl get all -n gloo-system
NAME                                                       READY   STATUS             RESTARTS   AGE
pod/discovery-9d4c7fb4c-5wq5m                              1/1     Running            13         35d
pod/extauth-77bb4fc79b-dsl6q                               1/1     Running            0          35d
pod/gateway-f774b4d5b-jfhwn                                1/1     Running            0          35d
pod/gateway-proxy-7656d9df87-qtn2s                         1/1     Running            0          35d
pod/gloo-db4fb8c4-lfcrp                                    1/1     Running            13         35d
pod/glooe-grafana-78c6f96db-wgl5k                          1/1     Running            0          41d
pod/glooe-prometheus-kube-state-metrics-5dd77b76fc-s8prb   1/1     Running            0          41d
pod/glooe-prometheus-server-59dcf7bc5b-jt654               1/2     CrashLoopBackOff   10692      41d
pod/observability-656d47787-2fskq                          0/1     CrashLoopBackOff   9558       33d
pod/rate-limit-7d6cf64fbf-ldgbp                            1/1     Running            0          35d
pod/redis-55d6dbb6b7-ql89p                                 1/1     Running            0          41d

Looking at the cause of these restarts, we can see that the PV is exhausted:

kubectl logs -f pod/glooe-prometheus-server-59dcf7bc5b-jt654 -n gloo-system -c glooe-prometheus-server
evel=info ts=2021-07-07T05:12:29.474Z caller=main.go:574 msg="Stopping scrape discovery manager..."
level=info ts=2021-07-07T05:12:29.474Z caller=main.go:588 msg="Stopping notify discovery manager..."
level=info ts=2021-07-07T05:12:29.474Z caller=main.go:610 msg="Stopping scrape manager..."
level=info ts=2021-07-07T05:12:29.474Z caller=manager.go:908 component="rule manager" msg="Stopping rule manager..."
level=info ts=2021-07-07T05:12:29.474Z caller=manager.go:918 component="rule manager" msg="Rule manager stopped"
level=info ts=2021-07-07T05:12:29.474Z caller=notifier.go:601 component=notifier msg="Stopping notification manager..."
level=info ts=2021-07-07T05:12:29.474Z caller=main.go:778 msg="Notifier manager stopped"
level=info ts=2021-07-07T05:12:29.474Z caller=main.go:604 msg="Scrape manager stopped"
level=info ts=2021-07-07T05:12:29.474Z caller=main.go:570 msg="Scrape discovery manager stopped"
level=info ts=2021-07-07T05:12:29.474Z caller=main.go:584 msg="Notify discovery manager stopped"
level=error ts=2021-07-07T05:12:29.474Z caller=main.go:787 err="opening storage failed: open /data/wal/00000721: no space left on device"

Next step is to check the pv size and the retention time:

kubectl get deploy/glooe-prometheus-server -n gloo-system -oyaml|grep "image: prom/prometheus" -C 10
        - mountPath: /etc/config
          name: config-volume
          readOnly: true
      - args:
        - --storage.tsdb.retention.time=15d
        - --config.file=/etc/config/prometheus.yml
        - --storage.tsdb.path=/data
        - --web.console.libraries=/etc/prometheus/console_libraries
        - --web.console.templates=/etc/prometheus/consoles
        - --web.enable-lifecycle
        image: prom/prometheus:v2.21.0
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /-/healthy
            port: 9090
            scheme: HTTP
          initialDelaySeconds: 30
          periodSeconds: 15
          successThreshold: 1
kubectl get pv -oyaml|grep "glooe-prometheus-server" -C 10
    selfLink: /api/v1/persistentvolumes/pvc-a616d9c6-9733-428f-bac5-c054ca8a025b
    uid: 816c7e99-55e3-4888-83f0-2f31a8314b47
  spec:
    accessModes:
    - ReadWriteOnce
    capacity:
      storage: 16Gi
    claimRef:
      apiVersion: v1
      kind: PersistentVolumeClaim
      name: glooe-prometheus-server
      namespace: gloo-system
      resourceVersion: "36337"
      uid: a616d9c6-9733-428f-bac5-c054ca8a025b
    gcePersistentDisk:
      fsType: ext4
      pdName: gke-jesus-lab-observability-9bc1029-pvc-a616d9c6-9733-428f-bac5-c054ca8a025b
    nodeAffinity:
      required:
        nodeSelectorTerms:
        - matchExpressions:

In this case, 16Gi volume size and 15d retention is not working well, so we must tune one or both parameters using these helm values:

prometheus.server.retention
prometheus.server.persistentVolume.size

Choosing the right values requires some business knowledge, but as a rule of thumb a fair approach is:

persistentVolume.size = retention_in_seconds * ingested_samples_per_second * bytes_per_sample

Security concerns

One of the more important (and unique) things about Gloo Edge is the ability to significantly lock down the edge proxy. Other proxies require privileges to write to disk or access the Kubernetes API, while Gloo Edge splits those responsibilities between control plane and data plane. The data plane can be locked down with zero privileges while separating out the need to secure the control plane differently.

For example, Gloo Edge’s data plane (the gateway-proxy pod) has ReadOnly file system. Additionally it doesn’t require any additional tokens mounted in or OS-level privileges. By default some of these options are enabled to simplify developer experience, but if your use case doesn’t need them, you should lock them down.

Other Envoy-specific guidance