Bring your own Prometheus

The built-in Prometheus server is the recommended approach for scraping metrics from Gloo components and feeding them to the Gloo UI Graph to visualize workload communication. When you enable the built-in Prometheus during your installation, it is set up with a custom scraping configuration that ensures that only a minimum set of metrics and metric labels are collected.
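If you install with Helm, the built-in Prometheus is typically toggled through the chart values. A minimal sketch, assuming a `prometheus.enabled` key in the Gloo Helm chart; verify the exact key name against the Helm values reference for your version:

```yaml
# Hedged sketch: enable the built-in Prometheus server at install time.
# The `prometheus.enabled` key is an assumption; check the Helm values
# reference for your Gloo version before using it.
prometheus:
  enabled: true
```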

However, the Prometheus pod is not set up with persistent storage, so metrics are lost when the pod restarts or the deployment is scaled down. Additionally, you might want to replace the built-in Prometheus server with your organization’s own Prometheus-compatible solution or time series database that is hardened for production and integrates with other applications that might exist outside the cluster where your API Gateway runs. Review the following options for bringing your own Prometheus server.

Forward metrics to the built-in Prometheus in OpenShift

OpenShift comes with built-in Prometheus instances that you can use to monitor metrics for your workloads. Instead of using the built-in Prometheus that Gloo Mesh Core provides, you might want to forward the metrics from the telemetry gateway and collector agents to the OpenShift Prometheus to have a single observability layer for all of your workloads in the cluster.

For more information, see Forward metrics to OpenShift.

Replace the built-in Prometheus with your own

If you have an existing Prometheus instance that you want to use in place of the built-in Prometheus server, you can configure Gloo Mesh Core to disable the built-in instance and use your production Prometheus instance instead. This setup is a reasonable approach if you want to scrape raw Istio metrics and collect them in your production Prometheus instance. However, you cannot control the number of metrics that you collect, or federate and aggregate the metrics before your production Prometheus scrapes them.
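If you go this route, your production Prometheus needs its own scrape configuration for the raw Istio metrics. A minimal sketch using standard Prometheus Kubernetes service discovery to scrape Istio sidecar proxies; the job name and relabel choices are illustrative assumptions:

```yaml
# Hedged sketch: scrape raw Istio metrics from sidecar proxies with your
# own Prometheus. Job name and relabel rules are illustrative assumptions.
scrape_configs:
  - job_name: istio-proxies
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods that run an istio-proxy container
      - source_labels: [__meta_kubernetes_pod_container_name]
        action: keep
        regex: istio-proxy
      # Honor the standard prometheus.io/scrape=true pod annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
```

Note that a plain scrape configuration like this collects every metric and label that the proxies expose, which is the trade-off this option carries.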

To query the metrics and compute results, you use the compute resources of the cluster where your production Prometheus instance runs. Note that depending on the number and complexity of the queries that you plan to run in your production Prometheus instance, especially if you use the instance to consolidate metrics of other apps as well, your production instance might get overloaded or start to respond more slowly.

Run another Prometheus instance alongside the built-in one

You might want to run multiple Prometheus instances in your cluster that each capture metrics for certain components. For example, you might use the built-in Prometheus server in Gloo Mesh Core to capture metrics for the Gloo components, and use a different Prometheus server for your own apps’ metrics. While this setup is supported, make sure that you check the scraping configuration for each of your Prometheus instances to prevent metrics from being scraped multiple times.
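One way to prevent double scraping is to exclude the namespaces that the built-in Prometheus already covers from your second instance's scrape configuration. A minimal sketch, assuming the Gloo components run in a `gloo-mesh` namespace (adjust to your installation):

```yaml
# Hedged sketch: drop targets in the namespace that the built-in
# Prometheus already scrapes. The `gloo-mesh` namespace is an assumption.
scrape_configs:
  - job_name: app-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_namespace]
        action: drop
        regex: gloo-mesh
```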

Remove high cardinality labels at creation time

To reduce the amount of data that is collected, you can modify how Istio metrics are recorded at creation time. With this setup, you can remove any unwanted cardinality labels before metrics are scraped by the built-in or your own custom Prometheus server.

Use the Istio Telemetry API to customize how metrics are recorded.

  1. Decide on the metrics that you want to remove labels from. To find an overview of the metric selectors that you can modify, see the Istio metric selector reference. You can start by looking at Istio histogram metrics, also referred to as distribution metrics. Histograms show the frequency distribution of data in a certain timeframe. While these metrics provide great insights and detail, they often come with lots of labels that lead to high cardinality. Note that these metric selectors correspond to the list of Istio Prometheus metrics that are collected. For example, the REQUEST_SIZE selector corresponds to the istio_request_bytes metric.

  2. Decide on the labels that you want to remove from the metrics. For an overview of labels that are collected, see Labels. Note that this page lists the labels with their actual names, which you must specify as underscore-separated names in your Telemetry resource. For example, the “Response Flags” label is specified as response_flags.

  3. Decide which mode of the collected metric you want to modify. For each metric, the mode defines how the metric is collected for a workload.

    • CLIENT_AND_SERVER: Scenarios in which the workload is either the source or destination of the network traffic.
    • CLIENT: Scenarios in which the workload is the source of the network traffic.
    • SERVER: Scenarios in which the workload is the destination of the network traffic.
  4. Configure an Istio Telemetry resource to remove specific labels. For example, this resource removes the response_flags label from the istio_request_bytes Prometheus metric by using the REQUEST_SIZE metric selector.

    apiVersion: telemetry.istio.io/v1
    kind: Telemetry
    metadata:
      name: remove-labels
      namespace: istio-system
    spec:
      metrics:
        - providers:
            - name: prometheus
          overrides:
            - match:
                mode: CLIENT_AND_SERVER
                metric: REQUEST_SIZE
              tagOverrides:
                response_flags:
                  operation: REMOVE
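To remove a label from every Istio metric at once instead of a single metric, the Telemetry API also accepts the ALL_METRICS selector. A sketch based on the example above:

```yaml
apiVersion: telemetry.istio.io/v1
kind: Telemetry
metadata:
  name: remove-labels-all
  namespace: istio-system
spec:
  metrics:
    - providers:
        - name: prometheus
      overrides:
        - match:
            mode: CLIENT_AND_SERVER
            # ALL_METRICS applies the override to every standard Istio metric
            metric: ALL_METRICS
          tagOverrides:
            response_flags:
              operation: REMOVE
```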