Service mesh metrics

Metrics provide important information about the performance and health of your service mesh, such as how long a request takes to be routed to your app, or which endpoints in your mesh are currently unavailable. You can use these metrics to detect failures, troubleshoot bottlenecks, and find ways to improve the performance and reliability of the services in your mesh.

Available options to collect metrics in Gloo Mesh

When you install Gloo Mesh with the default settings, a Prometheus server is automatically set up for you and configured to periodically scrape metrics from the Gloo Mesh management server.

Prometheus records and collects multi-dimensional data over time. With this data, you can easily see important health-related information for your service mesh, such as which routes perform well, where you might have a bottleneck, how fast your services respond to requests, or how many requests per second the services process. All data is stored in the Prometheus database, and you can use the Prometheus Query Language (PromQL) to perform complex queries and monitor how metrics change over time. In addition, you can easily set up alerts for when metrics reach a certain threshold.
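
For example, PromQL queries along the following lines show the per-workload request rate and the overall error ratio over the last five minutes. These are illustrative sketches: the metric and label names (istio_requests_total, destination_workload, response_code) follow standard Istio conventions and might differ in your setup.

# Requests per second, broken down by destination workload
sum(rate(istio_requests_total[5m])) by (destination_workload)

# Ratio of server errors to all requests, a common basis for threshold alerts
sum(rate(istio_requests_total{response_code=~"5.."}[5m])) / sum(rate(istio_requests_total[5m]))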

To view service mesh metrics, you can choose between the following options:

The Gloo Mesh Prometheus server typically runs in the Gloo Mesh management plane and periodically scrapes the metrics from the Gloo Mesh management server. To view the metrics that are automatically collected for you, you can run PromQL queries in the Prometheus dashboard directly or open the Gloo Mesh UI.
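
For example, you can port-forward the built-in Prometheus server to open its dashboard locally. The deployment name prometheus-server, the container port 9090, and the local port 9091 are assumptions based on the default installation and might differ in your environment.

kubectl port-forward deploy/prometheus-server -n gloo-mesh --context $MGMT_CONTEXT 9091:9090

Then open http://localhost:9091 in your browser to run PromQL queries against the collected metrics.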

Gloo Mesh metrics architecture

The following image shows how metrics are sent from the Envoy sidecar proxies in your workload clusters to the Prometheus server in the management cluster. As requests enter and leave a sidecar in your service mesh, metrics are immediately sent to the Gloo Mesh agent via gRPC push procedures. The agent forwards these metrics, again via gRPC, to the Gloo Mesh management server in the management cluster, where the data is enriched. For example, the management server adds the IDs of the source and destination workloads, which you can use to filter the metrics for the workloads that you are interested in. Every 15 seconds, the built-in Prometheus server scrapes the metrics from the Gloo Mesh management server. Scraped metrics are available in the Gloo Mesh UI and in the Prometheus UI.
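
As a sketch of how you can use these enriched labels, a query along the following lines narrows the request rate down to a single destination workload. The label name destination_workload_id and the ID format are illustrative and can differ depending on your Gloo Mesh version.

sum(rate(istio_requests_total{destination_workload_id="reviews-v1.bookinfo.cluster-1"}[5m]))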

Overview of how metrics are sent from the Istio proxies to the Prometheus server

Default Prometheus server configuration

By default, the built-in Prometheus server scrapes the Gloo Mesh management server metrics every 15 seconds and times out after 10 seconds.

You can review the default Prometheus server configuration by running the following command.

kubectl get configmap prometheus-server -n gloo-mesh --context $MGMT_CONTEXT -o yaml
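
The ConfigMap contains a standard Prometheus configuration file. The following fragment is an illustrative sketch of the defaults that are described above; the actual contents of your ConfigMap, including the scrape jobs that target the Gloo Mesh management server, can differ by version.

global:
  scrape_interval: 15s
  scrape_timeout: 10s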

Overview of available metrics in the Gloo Mesh management server

To find a list of metrics that are ready to be scraped by the built-in Prometheus server, you can access the metrics endpoint of the Gloo Mesh management server.

Prometheus annotations are automatically added during the Istio installation to enable scraping of metrics for the Istio control plane (istiod), ingress, and proxy pods. These metrics are automatically merged with app metrics and made available to the Gloo Mesh management server. For more information about these annotations and how you can disable them, see the Istio documentation.
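
For reference, the scraping annotations that Istio typically adds to injected pods look like the following. The exact port and path depend on your Istio version and mesh configuration.

annotations:
  prometheus.io/scrape: "true"
  prometheus.io/port: "15020"
  prometheus.io/path: /stats/prometheus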

  1. Open the metrics endpoint of the Gloo Mesh management server, for example by port-forwarding it to your local machine as shown after these steps.
  2. Review the available metrics at http://localhost:9091/metrics.
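
A minimal sketch of that port-forward, assuming the default deployment name gloo-mesh-mgmt-server and metrics port 9091:

kubectl port-forward deploy/gloo-mesh-mgmt-server -n gloo-mesh --context $MGMT_CONTEXT 9091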

Retention period for metrics

Metrics are available only for as long as the prometheus-server pod runs in your management cluster. They are lost when the pod restarts or when you scale down the deployment.

Because metrics are not persisted between pod restarts, make sure to consider best practices for how to set up monitoring with Prometheus in production.
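
If you keep the built-in server, one option is to enable persistent storage for it through your Gloo Mesh Helm values. The following sketch assumes that the bundled community Prometheus chart is exposed under the prometheus key; the exact value paths depend on your Gloo Mesh version and chart.

prometheus:
  server:
    persistentVolume:
      enabled: true
      size: 50Gi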

Monitored metrics in the Gloo Mesh UI

The Gloo Mesh UI monitors the following metrics and records how these metrics change over time. You can see and work with these metrics by using the Gloo Mesh UI Graph.

Best practices for collecting metrics in production

The built-in Prometheus server is a great way to gain insight into the performance of your service mesh. However, the pod is not set up with persistent storage, and metrics are lost when the pod restarts or when the deployment is scaled down. Additionally, you might want to use your organization's own Prometheus-compatible solution or time series database that is hardened for production and integrates with other applications that might exist outside of the service mesh.

To set up monitoring for production, you can choose between the following options: