Prepare for production

The built-in Prometheus server is a great way to gain insight into the performance of your service mesh. However, the pod is not set up with persistent storage, so metrics are lost when the pod restarts or when the deployment is scaled down. Many organizations also run their own Prometheus-compatible solution or time series database that is hardened for production and integrates with other applications that might exist outside of the service mesh.

To build a production-level Prometheus setup, you can choose between the following options:

  • Replace the built-in Prometheus server with your own instance
  • Recommended: Federate metrics with recording rules and provide them to your production monitoring instance

To read more about each option, see Best practices for collecting metrics in production.

Prometheus annotations are automatically added during the Istio installation to enable scraping of metrics for the Istio control plane (istiod), ingress, and proxy pods. These metrics are automatically merged with app metrics and made available to the Gloo management server. For more information about these annotations and how you can disable them, see the Istio documentation.
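
For reference, Istio-injected workload pods typically carry annotations along the following lines so that Prometheus can scrape the merged application and proxy metrics from the pilot-agent. The exact values depend on your Istio revision and mesh configuration; treat this as a sketch, not your pods' actual annotations.

    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "15020"          # pilot-agent merged metrics port
        prometheus.io/path: "/stats/prometheus"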

Replace the built-in Prometheus server with your own instance

In this setup, you disable the built-in Prometheus server and configure Gloo to use your production Prometheus instance instead.

  1. Configure Gloo to disable the default Prometheus instance and instead connect to your custom Prometheus server. Make sure that the instance runs Prometheus version 2.16.0 or later. In the prometheusUrl field, enter the Prometheus URL that your instance is exposed on, such as http://kube-prometheus-stack-prometheus.monitoring:9090. You can get this value from the --web.external-url field in your Prometheus Helm values file or by selecting Status > Command-Line-Flags from the Prometheus UI. Do not use the FQDN for the Prometheus URL.

    helm upgrade --install gloo-platform gloo-platform/gloo-platform \
       --namespace gloo-mesh \
       --version $GLOO_VERSION \
       --values mgmt-server.yaml \
       --set common.cluster=$MGMT_CLUSTER \
       --set licensing.glooMeshLicenseKey=$GLOO_MESH_LICENSE_KEY \
       --set prometheus.enabled=false \
       --set common.prometheusUrl=<Prometheus_server_URL_and_port>
    

    If you installed Gloo Mesh by using the gloo-mesh-enterprise, gloo-mesh-agent, and other included Helm charts, or by using meshctl version 2.2 or earlier, these Helm charts are considered legacy. Migrate your legacy installation to the new gloo-platform Helm chart.

    helm upgrade --install gloo-mgmt gloo-mesh-enterprise/gloo-mesh-enterprise \
       --namespace gloo-mesh \
       --version $GLOO_VERSION \
       --values values-mgmt-plane-env.yaml \
       --set prometheus.enabled=false \
       --set prometheusUrl=<Prometheus_server_URL_and_port> \
       --set glooMeshLicenseKey=${GLOO_MESH_LICENSE_KEY} \
       --set global.cluster=$MGMT_CLUSTER
    

    Make sure to include your Helm values when you upgrade, either in a configuration file that you pass with the --values flag or with --set flags. Otherwise, any previous custom values that you set might be overwritten. In single cluster setups, this might mean that your Gloo agent and ingress gateways are removed. To get your current values, such as for a release named gloo-platform, you can run helm get values gloo-platform -n gloo-mesh > gloo-gateway-single.yaml. For more information, see Get your Helm chart values in the upgrade guide.

  2. Configure your Prometheus server to scrape metrics from the Gloo management server admin endpoint, gloo-mesh-mgmt-server-admin.gloo-mesh:9091. This setup might vary depending on the Prometheus server that you use. For example, if you use the Prometheus community Helm chart, update the Helm values.yaml file as follows to scrape metrics from the Gloo management server.

    serverFiles:
      prometheus.yml:
        scrape_configs:
        - job_name: gloo-mesh
          scrape_interval: 15s
          scrape_timeout: 10s
          static_configs:
          - targets:
            - gloo-mesh-mgmt-server-admin.gloo-mesh:9091
    
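
    After your Prometheus server picks up the new scrape job, you can spot-check that the gloo-mesh target is healthy. The following sketch assumes the community chart's default service name prometheus-server on service port 80 in the monitoring namespace; adjust the names for your installation.

    # In one terminal, forward the Prometheus UI port
    kubectl port-forward svc/prometheus-server -n monitoring 9090:80
    # In another terminal, open the scrape targets page
    open http://localhost:9090/targets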

Recommended: Federate metrics with recording rules and provide them to your production monitoring instance

In this setup, you inject recording rules into the built-in Prometheus server in Gloo to federate the metrics that you want and reduce high cardinality labels. Then, you set up another Prometheus instance in the Gloo management cluster to scrape the federated metrics. You can optionally forward the federated metrics to a Prometheus-compatible solution or a time series database that sits outside of your Gloo management cluster and is hardened for production.

Before you begin, make sure that you installed Gloo with the default Helm values to set up the built-in Prometheus server. If you did not set up the built-in Prometheus server, upgrade your existing installation and set the prometheus.enabled Helm value to true.
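
For example, a minimal upgrade command that re-enables the built-in server might look like the following. It assumes the same release name, values file, and environment variables as the earlier examples; keep any other custom values that you set.

    helm upgrade --install gloo-platform gloo-platform/gloo-platform \
       --namespace gloo-mesh \
       --version $GLOO_VERSION \
       --values mgmt-server.yaml \
       --set common.cluster=$MGMT_CLUSTER \
       --set licensing.glooMeshLicenseKey=$GLOO_MESH_LICENSE_KEY \
       --set prometheus.enabled=true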

  1. Get the configuration of the built-in Prometheus server in Gloo and save it to a local file on your machine.

    kubectl get configmap prometheus-server -n gloo-mesh --context $MGMT_CONTEXT -o yaml > config.yaml
    
  2. Review the metrics that are sent to the built-in Prometheus server by default.

    1. Set up port forwarding for the metrics endpoint of your Gloo management server to your local host.

      kubectl port-forward -n gloo-mesh --context $MGMT_CONTEXT deploy/gloo-mesh-mgmt-server 9091
      
    2. View the metrics that are collected by default.

      open http://localhost:9091/metrics
      
    3. Decide on the subset of metrics that you want to federate.
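
      One quick way to enumerate candidate metric names is to strip the label sets from the /metrics output. The following sketch assumes that the port forward from the previous step is still running.

      curl -s http://localhost:9091/metrics | grep -oE '^[a-zA-Z_:][a-zA-Z0-9_:]*' | sort -u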

  3. Add a recording rule to the configmap of your Gloo Prometheus instance that you retrieved earlier to define how you want to aggregate the metrics. Recording rules let you precompute frequently needed or computationally expensive expressions. For example, you can remove high cardinality labels and federate only the labels that you need in future dashboards or alert queries. The results are saved in a new set of time series that you can later scrape or send to an external monitoring instance that is hardened for production. With this setup, you can protect your production instance as you send only the metrics that you need. In addition, you use the compute resources in the Gloo management cluster to prepare and aggregate the metrics.

    In this example, you use the istio_requests_total metric to record the total number of requests at the workload level in your service mesh. As part of this aggregation, pod labels are removed as they might lead to cardinality issues in certain environments. The result is saved as the workload:istio_requests_total metric to make sure that you can distinguish the original istio_requests_total metric from the aggregated one.

    apiVersion: v1
    data:
      alerting_rules.yml: |
        {}
      alerts: |
        {}
      prometheus.yml: |
        ...
      recording_rules.yml: |
        groups:
        - name: istio.workload.istio_requests_total
          interval: 10s
          rules:
          - record: workload:istio_requests_total
            expr: |
              sum(istio_requests_total{source_workload!=""})
              by (
                source_workload,
                source_workload_namespace,
                destination_service,
                source_app,
                destination_app,
                destination_workload,
                destination_workload_namespace,
                response_code,
                response_flags,
                reporter
              )
      rules: |
        {}
    kind: ConfigMap
    ...
       

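    Apply the updated configmap back to the management cluster so that the built-in Prometheus server reloads the recording rule. This is a minimal sketch that assumes you saved your edits to the config.yaml file from the first step.

    kubectl apply -f config.yaml -n gloo-mesh --context $MGMT_CONTEXT
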
  4. Deploy another Prometheus instance in the Gloo management cluster to scrape the federated metrics from the Gloo Prometheus instance.

    1. Create the monitoring namespace in the Gloo management cluster.
      kubectl create namespace monitoring --context $MGMT_CONTEXT
      
    2. Add the Prometheus community Helm repository.
      helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
      
    3. Install the Prometheus community chart.
      helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack --version 30.0.1 -f values.yaml --kube-context ${MGMT_CONTEXT} -n monitoring --debug
      
    4. Verify that the Prometheus pods are running.
      kubectl get pods -n monitoring --context $MGMT_CONTEXT
      
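    The values.yaml file that you pass to the install command is not shown in this guide. A minimal sketch for this setup might disable components that you do not need and let the new Prometheus instance select ServiceMonitors that do not carry the Helm release label, such as the one that you create in the next step. The field names come from the kube-prometheus-stack chart; adjust them to your environment.

    grafana:
      enabled: false
    alertmanager:
      enabled: false
    prometheus:
      prometheusSpec:
        # Select ServiceMonitors even if they lack the Helm release label
        serviceMonitorSelectorNilUsesHelmValues: false
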
  5. Add a service monitor to the Prometheus instance that you just created to scrape the aggregated metrics from the /federate endpoint of the Gloo Prometheus instance.

    In the following example, metrics from the Gloo Prometheus instance that match the regular expression 'workload:(.*)' are scraped. With the recording rule that you defined earlier, workload:istio_requests_total is the only metric that matches this expression. The service monitor configuration also removes the workload: prefix from the metric name so that the metric is displayed as istio_requests_total in Prometheus queries. To access the aggregated metrics that you scraped, you send a request to the /federate endpoint and provide match[]={__name__=<metric>} as a request parameter.

    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      name: gloo-metrics-federation
      namespace: monitoring
      labels:
        app.kubernetes.io/name: gloo-prometheus
    spec:
      namespaceSelector:
        matchNames:
        - gloo-mesh
      selector:
        matchLabels:
          app: prometheus
      endpoints:
      - interval: 30s
        scrapeTimeout: 30s
        params:
          'match[]':
          - '{__name__=~"workload:(.*)"}'
        path: /federate
        targetPort: 9090
        honorLabels: true
        metricRelabelings:
        - sourceLabels: ["__name__"]
          regex: 'workload:(.*)'
          targetLabel: "__name__"
          action: replace
    
  6. Access the /federate endpoint to see the scraped metrics. Note that you must include the match[]={__name__=<metric>} request parameter to successfully see the aggregated metrics.

    1. Port forward the Prometheus service so that you can access the Prometheus UI on your local machine.

      kubectl port-forward service/kube-prometheus-stack-prometheus --context $MGMT_CONTEXT -n monitoring 9090
      
    2. Open the targets that are configured for your Prometheus instance.

      open http://localhost:9090/targets
      
    3. Select the gloo-metrics-federation target that you configured and verify that the endpoint address and match condition are correct, and that the State displays as UP.

      Gloo federation target

    4. Optional: Access the aggregated metrics on the /federate endpoint.

      open 'http://localhost:9090/federate?match[]={__name__="istio_requests_total"}'
      

      Example output:

      # TYPE istio_requests_total untyped
      istio_requests_total{container="prometheus-server",destination_app="ratings",destination_service="ratings.bookinfo.svc.cluster.local",destination_workload="ratings-v1",destination_workload_namespace="bookinfo",endpoint="9090",job="prometheus-server",namespace="gloo-mesh",pod="prometheus-server-647b488bb-ns748",reporter="destination",response_code="200",response_flags="-",service="prometheus-server",source_app="istio-ingressgateway",source_workload="istio-ingressgateway",source_workload_namespace="istio-system",instance="",prometheus="monitoring/kube-prometheus-stack-prometheus",prometheus_replica="prometheus-kube-prometheus-stack-prometheus-0"} 11 1654888576995
      istio_requests_total{container="prometheus-server",destination_app="ratings",destination_service="ratings.bookinfo.svc.cluster.local",destination_workload="ratings-v1",destination_workload_namespace="bookinfo",endpoint="9090",job="prometheus-server",namespace="gloo-mesh",pod="prometheus-server-647b488bb-ns748",reporter="source",response_code="200",response_flags="-",service="prometheus-server",source_app="istio-ingressgateway",source_workload="istio-ingressgateway",source_workload_namespace="istio-system",instance="",prometheus="monitoring/kube-prometheus-stack-prometheus",prometheus_replica="prometheus-kube-prometheus-stack-prometheus-0"} 11 1654888576995
      
  7. Forward the federated metrics to your external Prometheus-compatible solution or time series database that is hardened for production. Refer to the Prometheus documentation to explore your forwarding options or try out the Prometheus agent mode.
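
    For example, if you installed the kube-prometheus-stack chart as in the previous steps, you can add a remote write target to the Prometheus spec in your Helm values. The following sketch assumes a hypothetical receiver at https://metrics.example.com/api/v1/write and forwards only the federated request metric; the field names come from the Prometheus Operator spec.

    prometheus:
      prometheusSpec:
        remoteWrite:
        - url: https://metrics.example.com/api/v1/write
          writeRelabelConfigs:
          # Forward only the aggregated request metric
          - sourceLabels: ["__name__"]
            regex: "istio_requests_total"
            action: keep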

Remove high cardinality labels at creation time

With metrics federation, you can use recording rules to pre-compute frequently used metrics and reduce high cardinality labels before metrics are forwarded to an external Prometheus-compatible solution. The raw labels and metric dimensions are still available in the built-in Prometheus server and can be accessed if needed.

To reduce the amount of data that is collected even further, you can customize the Envoy filter of your workloads to change how Istio metrics are recorded at creation time. With this setup, you can remove unwanted high cardinality labels before metrics are scraped by the built-in Prometheus server.

Make sure to remove only labels that you do not need in any of your production queries, alerts, or dashboards. After you apply the Envoy filter, high cardinality labels are permanently removed and cannot be recovered later. If you are not sure whether you might need any of the labels later, follow the recommendations in Federate metrics with recording rules and provide them to your production monitoring instance to aggregate the metrics that you need.

  1. Decide which context of the Istio Envoy filter you want to modify. Each Istio release includes an Envoy filter that is named stats-filter-<istio_version> and that defines how metrics are collected for your workloads. Depending on whether you modify the Envoy filter directly or use the Istio Helm chart to configure the filter, you can choose between the following contexts:

    • SIDECAR_INBOUND or inboundSidecar: Used to collect metrics for traffic that is sent to a destination (reporter=destination).
    • SIDECAR_OUTBOUND or outboundSidecar: Used to collect metrics for traffic that leaves a microservice (reporter=source).
    • GATEWAY or gateway: Used to collect metrics for traffic that passes through the ingress gateway.
  2. Decide which metric labels you want to remove with your custom Envoy filter. To find an overview of metrics that are collected by default, see the Istio documentation. For an overview of labels that are collected, see Labels. You can start by looking at Istio histogram metrics, also referred to as distribution metrics. Histograms show the frequency distribution of data in a certain timeframe. While these metrics provide great insights and detail, they often come with many labels that lead to high cardinality.

    Removing labels from histograms can significantly reduce cardinality and the amount of data that you collect. For example, you might want to keep all the labels, including the high cardinality ones, of the istio_request_duration_milliseconds metric to monitor request latency for your workloads. However, collecting the same high cardinality labels in histograms such as istio_request_bytes_bucket or istio_response_bytes_bucket might not be important for your environment.

  3. Configure your Envoy filter to remove specific labels. To apply the same configuration across all of your Istio microservices, modify the filter in the Istio Helm chart. If you want to update the configuration for a particular workload only, you can patch the Envoy filter instead.

    To find the name of the metric that you need to use in your filter configuration, see Metrics. Note that you must remove the istio_ prefix from the metric name before you add it to your filter configuration. For example, if you want to customize the request size metric, use request_bytes. To find an overview of available labels that you can remove, see Labels. Note that this page lists the labels by their actual names, not by the values that you must provide in the Envoy filter or Helm chart. To find the corresponding label name value, refer to the Istio bootstrap config for your release.

    Upgrade your Helm installation and add the Envoy filter configuration.

    helm --kube-context=${REMOTE_CONTEXT} upgrade --install istio-1.17.2 ./istio-1.17.2/manifests/charts/istio-control/istio-discovery -n istio-system --values - <<EOF
    global:
      ...
    meshConfig:
      ...
    pilot:
      ...
    telemetry:
      v2:
        prometheus:
          configOverride:
            outboundSidecar:
              metrics:
              - name: request_bytes
                tags_to_remove:
                - destination_service
                - response_flags
              - name: response_bytes
                tags_to_remove:
                - destination_service
                - response_flags
            inboundSidecar:
              disable_host_header_fallback: true
              metrics:
              - name: request_bytes
                tags_to_remove:
                - destination_service
                - response_flags
              - name: response_bytes
                tags_to_remove:
                - destination_service
                - response_flags
            gateway:
              disable_host_header_fallback: true
              metrics:
              - name: request_bytes
                tags_to_remove:
                - destination_service
                - response_flags
              - name: response_bytes
                tags_to_remove:
                - destination_service
                - response_flags
    EOF
    
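    After the upgrade, you can check that istiod re-rendered the stats filters with your overrides. This is a quick, hedged check that assumes the filters are created in the istio-system namespace as described above; the exact resource names depend on your Istio version.

    kubectl get envoyfilter -n istio-system --context ${REMOTE_CONTEXT}
    kubectl get envoyfilter -n istio-system --context ${REMOTE_CONTEXT} -o yaml | grep -B 2 -A 3 tags_to_remove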

    In the following example, the Envoy filter for the productpage service from the Istio Bookinfo app is modified. All other workloads in the cluster continue to use the default Istio Envoy configuration. Note that this example is specific to Istio version 1.14. If you use a different Istio version, refer to the Istio Envoy documentation.

    apiVersion: networking.istio.io/v1alpha3
    kind: EnvoyFilter
    metadata:
      name: stats-filter-1.14-productpage
      namespace: bookinfo-frontends
    spec:
      workloadSelector:
        labels:
          app: productpage
          version: v1
      configPatches:
      - applyTo: HTTP_FILTER
        match:
          context: SIDECAR_OUTBOUND
          listener:
            filterChain:
              filter:
                name: envoy.filters.network.http_connection_manager
                subFilter:
                  name: envoy.filters.http.router
          proxy:
            proxyVersion: ^1\.14.*
        patch:
          operation: INSERT_BEFORE
          value:
            name: istio.stats
            typed_config:
              '@type': type.googleapis.com/udpa.type.v1.TypedStruct
              type_url: type.googleapis.com/envoy.extensions.filters.http.wasm.v3.Wasm
              value:
                config:
                  configuration:
                    '@type': type.googleapis.com/google.protobuf.StringValue
                    value: |
                      {"metrics":[{"name":"request_bytes","tags_to_remove":["destination_service","response_flags"]},{"name":"response_bytes","tags_to_remove":["destination_service","response_flags"]}]}
                  root_id: stats_outbound
                  vm_config:
                    code:
                      local:
                        inline_string: envoy.wasm.stats
                    runtime: envoy.wasm.runtime.null
                    vm_id: stats_outbound
      - applyTo: HTTP_FILTER
        match:
          context: SIDECAR_INBOUND
          listener:
            filterChain:
              filter:
                name: envoy.filters.network.http_connection_manager
                subFilter:
                  name: envoy.filters.http.router
          proxy:
            proxyVersion: ^1\.14.*
        patch:
          operation: INSERT_BEFORE
          value:
            name: istio.stats
            typed_config:
              '@type': type.googleapis.com/udpa.type.v1.TypedStruct
              type_url: type.googleapis.com/envoy.extensions.filters.http.wasm.v3.Wasm
              value:
                config:
                  configuration:
                    '@type': type.googleapis.com/google.protobuf.StringValue
                    value: |
                      {"disable_host_header_fallback":true,"metrics":[{"name":"request_bytes","tags_to_remove":["destination_service","response_flags"]},{"name":"response_bytes","tags_to_remove":["destination_service","response_flags"]}]}
                  root_id: stats_inbound
                  vm_config:
                    code:
                      local:
                        inline_string: envoy.wasm.stats
                    runtime: envoy.wasm.runtime.null
                    vm_id: stats_inbound
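
    To try out this patch, you might save the manifest to a file and apply it, then confirm that the removed labels no longer appear in the productpage proxy's metrics. The following sketch assumes a hypothetical file name stats-filter-productpage.yaml and the default Bookinfo deployment name productpage-v1.

    kubectl apply -f stats-filter-productpage.yaml --context ${REMOTE_CONTEXT}
    kubectl exec -n bookinfo-frontends deploy/productpage-v1 -c istio-proxy --context ${REMOTE_CONTEXT} \
      -- pilot-agent request GET stats/prometheus | grep istio_request_bytes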