Service mesh metrics
Metrics provide important information about the performance and health of your service mesh, such as how long a request takes to be routed to your app, or which endpoints in your mesh are currently unavailable. You can use these metrics to detect failures, troubleshoot bottlenecks, and find ways to improve the performance and reliability of the services in your mesh.
Available options to collect metrics in Gloo Mesh
Use the Prometheus server that is built in to Gloo Mesh to monitor the health and performance of your service mesh.
Prometheus records and collects multi-dimensional data over time. With this data, you can easily see important health-related information for your service mesh, such as which routes perform well, where you might have a bottleneck, how fast your services respond to requests, or how many requests per second they process. All data is stored in the Prometheus database, and you can use the Prometheus query language (PromQL) to perform complex queries and monitor how metrics change over time. In addition, you can easily set up alerts that fire when metrics reach a certain threshold.
The Gloo Mesh Prometheus server typically runs in the Gloo Mesh management plane and periodically scrapes the metrics from the Gloo Mesh management server. To view the metrics that are automatically collected for you, you can run PromQL queries in the Prometheus dashboard directly or open the Gloo Mesh UI. For more information about how to open the Prometheus dashboard and sample queries that you can run, see View service mesh metrics.
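For example, the following PromQL query is one you might run in the Prometheus dashboard. It assumes the standard Istio metric `istio_requests_total` with its default `destination_workload` label is being scraped, which is the case in the setup described on this page:

```promql
# Per-second request rate over the last 5 minutes, grouped by destination workload
sum(rate(istio_requests_total[5m])) by (destination_workload)
```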
Gloo Mesh metrics architecture
The following image shows how metrics are sent from the Envoy sidecar proxies in your workload clusters to the Prometheus server in the management cluster. As requests enter and leave a sidecar in your service mesh, the sidecar immediately pushes metrics to the Gloo Mesh agent via gRPC. The agent forwards these metrics, also via gRPC, to the Gloo Mesh management server in the management cluster, where the data is enriched. For example, the management server adds the IDs of the source and destination workloads, which you can use to filter the metrics for the workloads that you are interested in. Every 15 seconds, the Prometheus server scrapes the metrics from the Gloo Mesh management server. Scraped metrics are available in the Gloo Mesh UI and in the Prometheus dashboard.
Default Prometheus server configuration
By default, the built-in Prometheus server scrapes the Gloo Mesh management server metrics every 15 seconds and times out after 10 seconds.
You can review the default Prometheus server configuration by running the following command.
kubectl get configmap prometheus-server -n gloo-mesh --context $MGMT_CONTEXT -o yaml
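The relevant part of the output looks similar to the following excerpt. The exact structure of the ConfigMap can vary by Gloo Mesh version; the values shown here reflect the defaults described above:

```yaml
# Excerpt from the prometheus-server ConfigMap (structure may vary by version)
global:
  scrape_interval: 15s   # how often the management server metrics are scraped
  scrape_timeout: 10s    # how long a scrape can run before it times out
```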
Overview of available metrics in the Gloo Mesh management server
To find a list of metrics that are ready to be scraped by the built-in Prometheus server, you can access the metrics endpoint of the Gloo Mesh management server.
Set up port forwarding for the Gloo Mesh management server.
kubectl port-forward -n gloo-mesh --context $MGMT_CONTEXT deploy/gloo-mesh-mgmt-server 9091
Access available metrics at
Retention period for metrics
Metrics are available for as long as the prometheus-server pod runs in your management cluster, but are lost between restarts or when you scale down the deployment.
Monitored metrics in the Gloo Mesh UI
The Gloo Mesh UI monitors the following metrics and records how these metrics change over time. You can see and work with these metrics by using the Gloo Mesh UI Graph.
istio_requests_total: This metric is used to determine the number of total requests, successful requests, and requests that failed within your service mesh.
istio_request_duration_milliseconds_bucket: To determine the latency between microservices, the Gloo Mesh UI records the number of milliseconds it takes for a request to reach its destination.
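To illustrate how these two metrics are typically used, the following PromQL queries compute an error rate and a latency percentile. They rely only on the standard Istio labels (`response_code`, `le`, `destination_workload`); adjust the label filters to your environment as needed:

```promql
# Percentage of requests that failed with a 5xx response over the last 5 minutes
100 * sum(rate(istio_requests_total{response_code=~"5.*"}[5m]))
    / sum(rate(istio_requests_total[5m]))

# 99th percentile request latency in milliseconds, per destination workload
histogram_quantile(0.99,
  sum(rate(istio_request_duration_milliseconds_bucket[5m])) by (le, destination_workload))
```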
Best practices for collecting metrics in production
The built-in Prometheus server is a great way to gain insight into the performance of your service mesh. However, the pod is not set up with persistent storage, and metrics are lost when the pod restarts or when the deployment is scaled down. Additionally, you might want to use your organization's own Prometheus-compatible solution or time series database that is hardened for production and integrates with other applications that might exist outside of the service mesh.
To set up monitoring for production, you can choose between the following options:
Replace the built-in Prometheus server with your own instance
In this setup, you configure Gloo Mesh to disable the built-in Prometheus instance and use your production Prometheus instance instead. This setup is a reasonable approach if you want to scrape raw Istio metrics from the Gloo Mesh management server and collect them in your production Prometheus instance. However, you cannot control the number of metrics that you collect, or federate and aggregate the metrics before your production Prometheus scrapes them. To query the metrics and compute results, you use the compute resources of the cluster where your production Prometheus instance runs. Depending on the number and complexity of the queries that you run, especially if you also use the instance to consolidate metrics from other apps, your production instance might become overloaded or start to respond more slowly.
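A setup like this is usually expressed in Helm values. The following sketch illustrates the idea only: the key names `prometheus.enabled` and `prometheusUrl` are assumptions, so verify them against the Helm values reference for your Gloo Mesh chart version before applying:

```yaml
# Hypothetical Helm values sketch -- key names are assumptions; verify
# against the values reference for your Gloo Mesh chart version.
prometheus:
  enabled: false   # disable the built-in Prometheus server
glooMeshUi:
  # point the Gloo Mesh UI at your production Prometheus instance (example address)
  prometheusUrl: http://prometheus.monitoring:9090
```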
For more information, see Replace the built-in Prometheus server with your own instance.
Recommended: Locally federate metrics and provide them to your production monitoring system
To build a robust production-level Prometheus setup that follows the Istio observability best practices, federate the metrics and cardinality labels that you want to collect in your production instance and use the compute capacity of the Gloo Mesh management cluster to aggregate and precompute the metrics. Then, you can scrape the federated metrics with your Prometheus-compatible solution or send them to a time series database that is hardened for production as shown in the following image.
While this setup is more complex than [replacing the built-in Prometheus server with your own instance](/gloo-mesh-enterprise/latest/observability/metrics/#custom-prometheus), it gives you granular control over the metrics that you want to collect. Because the metrics are precomputed on the Gloo Mesh management cluster, queries in your production instance are faster and more scalable, and you can avoid overloading your production instance.
For more information, see Locally federate metrics and scrape them with your production Prometheus instance.
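The federation approach described above uses two standard Prometheus mechanisms: a recording rule on the management cluster's Prometheus to precompute and aggregate series, and a scrape job on the production instance that pulls only those series through the `/federate` endpoint. The following sketch shows both pieces; the rule name, metric name, and target address are illustrative examples, not the values Gloo Mesh ships with:

```yaml
# Recording rule on the management cluster's Prometheus: precompute an
# aggregated request rate so the production instance scrapes one compact
# series per workload instead of raw, high-cardinality data.
groups:
- name: istio.workload.rules
  rules:
  - record: workload:istio_requests_total:rate5m
    expr: sum(rate(istio_requests_total[5m])) by (destination_workload, response_code)
---
# Scrape job on the production Prometheus: pull only the precomputed
# series through the /federate endpoint. The target address is an
# example; use the endpoint of your management cluster's Prometheus.
scrape_configs:
- job_name: gloo-mesh-federation
  honor_labels: true
  metrics_path: /federate
  params:
    'match[]':
    - '{__name__=~"workload:.*"}'
  static_configs:
  - targets:
    - prometheus-server.gloo-mesh:9090
```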