Prometheus server setup
During your installation, you have the option to deploy a Gloo Prometheus server alongside the Gloo OpenTelemetry pipeline that you can use to monitor the health of the Gloo control and data plane. The Prometheus server scrapes metrics from the components in your Gloo environment, such as the Gloo management server and agents, and forwards these metrics to other observability tools, such as the Gloo UI and the Gloo Platform operations dashboard.
Default Prometheus server configuration
The built-in Prometheus server scrapes metrics from various sources, such as the Gloo management server and the Gloo telemetry gateway. Metrics are scraped every 15 seconds. The scraping action times out after 10 seconds if no connection to the metrics endpoint can be established or no traffic is received from the metrics endpoint.
While metrics for the Gloo management server are available in Prometheus by default, an additional receiver is required to also make them available to the Gloo telemetry gateway. For more information, see Scrape Gloo management server metrics.
You can view the default Prometheus server configuration by running the following command.
kubectl get configmap prometheus-server -n gloo-mesh -o yaml
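The 15-second scrape interval and 10-second timeout are part of the Prometheus configuration embedded in that ConfigMap. The following excerpt is an illustrative sketch of how such settings look; the actual job names, labels, and relabeling rules in the prometheus-server ConfigMap depend on your Gloo version.

```yaml
# Illustrative sketch of a Prometheus scrape configuration.
# The job name below is hypothetical; check the ConfigMap in your
# cluster for the real job definitions.
global:
  scrape_interval: 15s   # how often targets are scraped
  scrape_timeout: 10s    # give up if the metrics endpoint does not respond
scrape_configs:
  - job_name: gloo-mesh-mgmt-server   # hypothetical job name
    kubernetes_sd_configs:
      - role: pod       # discover scrape targets from Kubernetes pods
```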
Retention period for metrics
Metrics are available only for as long as the prometheus-server pod runs in your management cluster. The pod is not set up with persistent storage, so metrics are lost when the pod restarts or when the deployment is scaled down. Additionally, you might want to replace the built-in Prometheus server with your organization's own Prometheus-compatible solution or time series database that is hardened for production and integrates with other applications that might exist outside the cluster where your API Gateway runs. To reduce the amount of data that is collected, consider removing high cardinality labels from your metrics.
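If you keep the built-in server but want metrics to survive restarts, persistent storage is typically enabled through Helm values. The keys below follow the conventions of the upstream prometheus-community chart and are an assumption; verify the exact paths against your Gloo Helm chart's values reference before applying them.

```yaml
# Hypothetical Helm values sketch: back the built-in Prometheus server
# with a PersistentVolume so metrics survive pod restarts.
# Key paths are assumptions based on the upstream prometheus chart;
# confirm them against your Gloo chart's documented values.
prometheus:
  server:
    persistentVolume:
      enabled: true
      size: 50Gi   # size for your scrape volume and retention needs
```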
Replace the built-in Prometheus server with your own instance
In this setup, you disable the built-in Prometheus instance and configure Gloo to use your production Prometheus instance instead. This approach is reasonable if you want to scrape raw Istio metrics from the Gloo management server and collect them in your production Prometheus instance. However, you cannot control the number of metrics that you collect, or federate and aggregate the metrics before your production Prometheus scrapes them. Querying the metrics and computing results uses the compute resources of the cluster where your production Prometheus instance runs. Depending on the number and complexity of the queries that you run, especially if the instance also consolidates metrics from other apps, it might become overloaded or start to respond more slowly.
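Disabling the built-in instance is typically done at installation time through Helm values. The snippet below is a hedged sketch: both the `prometheus.enabled` key and the URL key name are assumptions to verify against your Gloo Helm chart's values reference.

```yaml
# Sketch: disable the bundled Prometheus server and point Gloo at an
# existing Prometheus instance instead. Both keys are assumptions;
# check your chart's values reference for the exact names in your version.
prometheus:
  enabled: false
prometheusUrl: http://prometheus.monitoring:9090   # hypothetical key and address
```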
For more information, see Replace the built-in Prometheus server with your own instance.
For more granular control over the metrics that you collect, set up additional receivers, processors, and exporters in the Gloo telemetry pipeline to make these metrics available to the Gloo telemetry gateway. Then, forward the metrics to the third-party solution or time series database of your choice, such as your production Prometheus or Datadog instance. For more information, see Customize the pipeline.
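A pipeline customization of this kind generally adds an exporter to the telemetry gateway's OpenTelemetry collector configuration. The sketch below uses the upstream `prometheusremotewrite` exporter as an example destination; the receiver name, pipeline layout, and endpoint are assumptions, not the exact contents of the Gloo telemetry gateway configuration.

```yaml
# Illustrative OpenTelemetry collector fragment: forward metrics that
# reach the telemetry gateway to an external Prometheus-compatible
# backend via remote write. Pipeline and endpoint values are examples.
exporters:
  prometheusremotewrite:
    endpoint: https://prometheus.example.com/api/v1/write  # hypothetical backend
service:
  pipelines:
    metrics:
      receivers: [otlp]                      # example receiver
      exporters: [prometheusremotewrite]     # send to the external backend
```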
Remove high cardinality labels at creation time
To reduce the amount of data that is collected, you can customize the Envoy filter in the Istio proxy deployment to modify how Istio metrics are recorded at creation time. With this setup, you can remove any unwanted cardinality labels before metrics are scraped by the built-in Prometheus server. For more information, see Remove high cardinality labels at creation time.
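As a concrete illustration of dropping a label at creation time, shown here with Istio's Telemetry API rather than a hand-written Envoy filter, a `tagOverrides` rule with the `REMOVE` operation stops a tag from ever being recorded. The metric match and tag name below are examples; adjust them for the labels you want to drop.

```yaml
# Example: remove the request_protocol tag from all Istio metrics so it
# is never recorded, using the Istio Telemetry API. The metric match
# and tag name are illustrative choices.
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: remove-tags
  namespace: istio-system
spec:
  metrics:
    - providers:
        - name: prometheus
      overrides:
        - match:
            metric: ALL_METRICS        # apply to every standard metric
          tagOverrides:
            request_protocol:
              operation: REMOVE        # drop this label at creation time
```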