Service mesh metrics
Metrics provide important information about the performance and health of your service mesh, such as the time a request takes to be routed to your app or the endpoints in your mesh that are currently unavailable. You can use these measures to detect failures, troubleshoot bottlenecks, and find ways to improve the performance and reliability of the services in your mesh.
Available options to collect metrics in Gloo Mesh
When you install Gloo Mesh with the default settings, a Prometheus server is automatically set up for you and configured to periodically scrape metrics from the Gloo Mesh management server. Prometheus collects and records multi-dimensional data over time. With this data, you can see important health-related information for your service mesh, such as which routes perform well, where you might have a bottleneck, how fast your services respond to requests, or how many requests per second the services process. All data is stored in the Prometheus database, and you can use the Prometheus Query Language (PromQL) to run complex queries and monitor how metrics change over time. In addition, you can set up alerts for when metrics reach a certain threshold.
To view service mesh metrics, you can choose between the following options:
- Built-in Prometheus server: You can use the built-in Prometheus server to run PromQL queries and view service mesh metrics in the Prometheus UI. For more information, see the Gloo Mesh metrics architecture. To find information about how to open the built-in Prometheus UI and view metrics, see Use the built-in Prometheus server.
- Gloo Mesh UI: Some of the metrics that are scraped from the Gloo Mesh management server are available to you in the Gloo Mesh UI. For more information, see Monitored metrics in the Gloo Mesh UI.
- Best practices for production: Explore best practices for metrics collection, such as federating metrics and removing high cardinality labels. For more information, see Best practices for collecting metrics in production.
The Gloo Mesh Prometheus server typically runs in the Gloo Mesh management plane and periodically scrapes the metrics from the Gloo Mesh management server. To view the metrics that are automatically collected for you, you can run PromQL queries in the Prometheus dashboard directly or open the Gloo Mesh UI.
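As a rough illustration of the PromQL queries mentioned above, the following Python sketch builds a request URL for the instant-query endpoint of the Prometheus HTTP API (`/api/v1/query`). The address `http://localhost:9090` and the 5-minute window are assumptions for illustration only; use the address at which your built-in Prometheus server is actually reachable.

```python
from urllib.parse import urlencode

# Assumption: the built-in Prometheus server is reachable at this address,
# for example through a local port-forward. Adjust it to your environment.
PROMETHEUS_URL = "http://localhost:9090"


def requests_per_second_query(window: str = "5m") -> str:
    """Return a PromQL expression for the mesh-wide per-second request rate."""
    return f"sum(rate(istio_requests_total[{window}]))"


def instant_query_url(base_url: str, promql: str) -> str:
    """Build a URL for the Prometheus HTTP API instant-query endpoint."""
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"


# Print the URL that you could open with curl or a browser.
print(instant_query_url(PROMETHEUS_URL, requests_per_second_query()))
```

You can paste the same PromQL expression directly into the Prometheus UI instead of calling the HTTP API.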
Gloo Mesh metrics architecture
The following image shows how metrics are sent from the Envoy sidecar proxies in your workload clusters to the Prometheus server in the management cluster. As requests enter and leave a sidecar in your service mesh, metrics are immediately sent to the Gloo Mesh agent via gRPC push procedures. The agent also uses gRPC push procedures to forward these metrics to the Gloo Mesh management server in the management cluster, where the data is enriched. For example, the management server adds the IDs of the source and destination workloads, which you can use to filter the metrics for the workload that you are interested in. Every 15 seconds, the built-in Prometheus server scrapes the metrics from the Gloo Mesh management server. Scraped metrics are available in the Gloo Mesh UI and in the Prometheus UI.
Default Prometheus server configuration
By default, the built-in Prometheus server scrapes the Gloo Mesh management server metrics every 15 seconds and times out after 10 seconds.
You can review the default Prometheus server configuration by running the following command.
kubectl get configmap prometheus-server -n gloo-mesh --context $MGMT_CONTEXT -o yaml
Overview of available metrics in the Gloo Mesh management server
To find a list of metrics that are ready to be scraped by the built-in Prometheus server, you can access the metrics endpoint of the Gloo Mesh management server.
Prometheus annotations are automatically added during the Istio installation to enable scraping of metrics for the Istio control plane (istiod), ingress, and proxy pods. These metrics are automatically merged with app metrics and made available to the Gloo Mesh management server. For more information about these annotations and how you can disable them, see the Istio documentation.
- Open the built-in Prometheus dashboard.
- Review available metrics at http://localhost:9091/metrics.
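The metrics endpoint returns data in the Prometheus text exposition format, where each metric family is announced by a `# TYPE <name> <type>` comment line. The following sketch, which is only one possible way to summarize that output, extracts the metric names and types from such text. The sample text is made up for illustration. To inspect real data, you could download the body of http://localhost:9091/metrics and pass it to `metric_types`.

```python
def metric_types(exposition_text: str) -> dict[str, str]:
    """Map each metric family name to its type, based on '# TYPE' lines."""
    types = {}
    for line in exposition_text.splitlines():
        parts = line.split()
        # A type line has the form: # TYPE <metric_name> <type>
        if len(parts) == 4 and parts[0] == "#" and parts[1] == "TYPE":
            types[parts[2]] = parts[3]
    return types


# Made-up sample in the text exposition format (not real Gloo Mesh output).
SAMPLE = """\
# HELP istio_requests_total Total number of requests.
# TYPE istio_requests_total counter
istio_requests_total{response_code="200"} 42
# TYPE istio_request_duration_milliseconds histogram
istio_request_duration_milliseconds_bucket{le="10"} 7
"""

for name, kind in sorted(metric_types(SAMPLE).items()):
    print(f"{kind:10} {name}")
```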
Retention period for metrics
Metrics are available for as long as the prometheus-server pod runs in your management cluster, but are lost between restarts or when you scale down the deployment.
Monitored metrics in the Gloo Mesh UI
The Gloo Mesh UI monitors the following metrics and records how these metrics change over time. You can see and work with these metrics by using the Gloo Mesh UI Graph.
istio_requests_total: This metric is used to determine the number of total requests, successful requests, and requests that failed within your service mesh.
istio_request_duration_milliseconds_bucket: To determine the latency between microservices, the Gloo Mesh UI monitors the milliseconds it takes for a request to reach its destination.
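To show how a latency value can be derived from a `_bucket` histogram metric, the following sketch implements a simplified form of the linear interpolation that the PromQL function `histogram_quantile` performs, for example in `histogram_quantile(0.95, sum(rate(istio_request_duration_milliseconds_bucket[5m])) by (le))`. The bucket boundaries and counts are invented for illustration, and the source does not state how the Gloo Mesh UI computes its latency figures, so treat this as background rather than a description of the UI.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound_ms, count) buckets.

    Simplified sketch: the lowest bucket is assumed to start at 0, and the
    bucket list is expected to include the +Inf bucket that holds the total.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in the interval (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # The quantile lies beyond the highest finite bucket boundary.
                return lower_bound
            # Interpolate linearly inside the bucket that contains the rank.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Invented cumulative bucket counts: 20 requests took <= 10 ms, 80 took <= 50 ms, ...
EXAMPLE_BUCKETS = [(10.0, 20.0), (50.0, 80.0), (100.0, 100.0), (math.inf, 100.0)]
print(histogram_quantile(0.5, EXAMPLE_BUCKETS))  # prints 30.0
```

Because the result is interpolated within a bucket, its accuracy depends on how finely the bucket boundaries are spaced.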
Best practices for collecting metrics in production
The built-in Prometheus server is a great way to gain insight into the performance of your service mesh. However, the pod is not set up with persistent storage, and metrics are lost when the pod restarts or when the deployment is scaled down. Additionally, you might want to use your organization's own Prometheus-compatible solution or time series database that is hardened for production and integrates with other applications that might exist outside of the service mesh.
To set up monitoring for production, you can choose between the following options:
Replace the built-in Prometheus server with your own instance
In this setup, you configure Gloo Mesh to disable the built-in Prometheus instance and to use your production Prometheus instance instead. This setup is a reasonable approach if you want to scrape raw Istio metrics from the Gloo Mesh management server to collect them in your production Prometheus instance. However, you cannot control the number of metrics that you collect, or federate and aggregate the metrics before you scrape them with your production Prometheus. To query the metrics and compute results, you use the compute resources of the cluster where your production Prometheus instance runs. Note that depending on the number and complexity of the queries that you plan to run in your production Prometheus instance, especially if you use the instance to consolidate metrics of other apps as well, your production instance might get overloaded or start to respond more slowly.
For more information, see Replace the built-in Prometheus server with your own instance.
Recommended: Federate metrics with recording rules and provide them to your production monitoring system
To build a robust production-level Prometheus setup that follows the Istio observability best practices, federate the metrics and cardinality labels that you want to collect in your production instance and use the compute capacity of the Gloo Mesh management cluster to aggregate and precompute the metrics. Then, you can scrape the federated metrics with your Prometheus-compatible solution or send them to a time series database that is hardened for production as shown in the following image.
While this setup is more complex than replacing the built-in Prometheus server with your own instance, it gives you granular control over the metrics that you want to collect. Because the metrics are precomputed on the Gloo Mesh management cluster, your queries in the production instance are faster and more scalable, and you can avoid overloading your production instance.
For more information, see Federate metrics with recording rules and provide them to your production monitoring instance.
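Prometheus servers expose a `/federate` endpoint that returns the current value of every series matched by one or more `match[]` selectors, and a production Prometheus instance can scrape that endpoint. The following sketch builds such a URL. It assumes that the federated setup described above relies on this standard endpoint, and both the base address and the `workload:` rule-name prefix are hypothetical names chosen for illustration.

```python
from urllib.parse import urlencode


def federate_url(base_url: str, matchers: list[str]) -> str:
    """Build a URL for the Prometheus /federate endpoint.

    Each matcher is a series selector that is passed as a `match[]` parameter.
    """
    query = urlencode([("match[]", matcher) for matcher in matchers])
    return f"{base_url}/federate?{query}"


# Hypothetical example: select only precomputed series whose names start
# with "workload:", a prefix that recording rules might assign.
print(federate_url("http://localhost:9090", ['{__name__=~"workload:.*"}']))
```

Restricting the selectors to precomputed series keeps the raw, high cardinality data out of the production instance.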
Remove high cardinality labels at creation time
With metrics federation, you can use recording rules to precompute frequently used metrics and reduce high cardinality labels before metrics are forwarded to an external Prometheus-compatible solution. The raw labels and metric dimensions are still available in the built-in Prometheus server and can be accessed if needed.
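To make the effect of such a recording rule concrete, the following sketch mimics what a PromQL aggregation like `sum without (pod) (...)` does: it removes selected labels and sums the series that become identical. The label names and values are hypothetical, and the function is a plain Python model of the idea, not part of Prometheus or Gloo Mesh.

```python
from collections import defaultdict

# A series is identified by its label set, stored as sorted (name, value) pairs.
Labels = tuple[tuple[str, str], ...]


def aggregate_without(samples: dict[Labels, float], drop: set[str]) -> dict[Labels, float]:
    """Sum sample values after removing the label names listed in `drop`."""
    totals: dict[Labels, float] = defaultdict(float)
    for labels, value in samples.items():
        kept = tuple(sorted((k, v) for k, v in labels if k not in drop))
        totals[kept] += value
    return dict(totals)


# Hypothetical raw series: one per pod and response code.
RAW = {
    (("destination_workload", "reviews"), ("pod", "reviews-a"), ("response_code", "200")): 120.0,
    (("destination_workload", "reviews"), ("pod", "reviews-b"), ("response_code", "200")): 80.0,
    (("destination_workload", "reviews"), ("pod", "reviews-a"), ("response_code", "500")): 3.0,
}

# Dropping the per-pod label collapses three series into two.
for labels, value in aggregate_without(RAW, {"pod"}).items():
    print(dict(labels), value)
```

The fewer distinct label combinations remain after aggregation, the fewer series the external system has to store and query.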
To reduce the amount of data that is collected even more, you can customize the Envoy filter of your workloads to modify how Istio metrics are recorded at creation time. With this setup, you can remove any unwanted cardinality labels before metrics are scraped by the built-in Prometheus server.
For more information, see Remove high cardinality labels at creation time.