Service mesh metrics

Metrics provide important information about the performance and health of your service mesh, such as the time that a request takes to be routed to your app, or which endpoints in your mesh are currently unavailable. You can use these metrics to detect failures, troubleshoot bottlenecks, and find ways to improve the performance and reliability of the services in your mesh.

Available options to collect metrics in Gloo Mesh

Use the Prometheus server that is built in to Gloo Mesh to monitor the health and performance of your service mesh.

Prometheus records and collects multi-dimensional data over time. With this data, you can easily see important health-related information for your service mesh, such as which routes perform well, where you might have a bottleneck, how fast your services respond to requests, or how many requests per second the services process. All data is stored in the Prometheus database, and you can use the Prometheus Query Language (PromQL) to perform complex queries and monitor how metrics change over time. In addition, you can easily set up alerts for when metrics reach a certain threshold.

The Gloo Mesh Prometheus server typically runs in the Gloo Mesh management plane and periodically scrapes the metrics from the Gloo Mesh management server. To view the metrics that are automatically collected for you, you can run PromQL queries in the Prometheus dashboard directly or open the Gloo Mesh UI. For more information about how to open the Prometheus dashboard and sample queries that you can run, see View service mesh metrics.
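For example, you can forward a local port to the built-in Prometheus server and run a PromQL query against its HTTP API instead of the dashboard. The following sketch assumes that the deployment is named prometheus-server, that it listens on the default port 9090, and that the standard Istio metric istio_requests_total is collected; adjust these values to match your installation.

# Forward a local port to the built-in Prometheus server.
# The deployment name and port 9090 are assumptions; adjust to your setup.
kubectl port-forward -n gloo-mesh --context $MGMT_CONTEXT deploy/prometheus-server 9090

# Query the overall request rate over the last 5 minutes via the Prometheus HTTP API.
# istio_requests_total is a standard Istio metric and is assumed to be available here.
curl -sG http://localhost:9090/api/v1/query --data-urlencode 'query=sum(rate(istio_requests_total[5m]))'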

Gloo Mesh metrics architecture

The following image shows how metrics are sent from the Envoy sidecar proxies in your workload clusters to the Prometheus server in the management cluster. As requests enter and leave a sidecar in your service mesh, metrics are immediately pushed to the Gloo Mesh agent via gRPC. The agent forwards these metrics, again via gRPC, to the Gloo Mesh management server in the management cluster, where the data is enriched. For example, the management server adds the IDs of the source and destination workloads, which you can use to filter the metrics for the workloads that you are interested in. Every 15 seconds, the Prometheus server scrapes the metrics from the Gloo Mesh management server. Scraped metrics are available in the Gloo Mesh UI and in the Prometheus dashboard.

Overview of how metrics are sent from the Istio proxies to the Prometheus server
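
To verify that each part of this pipeline is running, you can list the pods in the gloo-mesh namespace of each cluster. The agent runs in your workload clusters; the $REMOTE_CONTEXT variable below is an assumed name for a workload cluster context, so replace it with the context that your setup uses.

# Management cluster: the management server and the built-in Prometheus server.
kubectl get pods -n gloo-mesh --context $MGMT_CONTEXT

# Workload cluster: the Gloo Mesh agent that forwards the proxy metrics.
# $REMOTE_CONTEXT is an assumed context variable; replace it with your own.
kubectl get pods -n gloo-mesh --context $REMOTE_CONTEXT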

Default Prometheus server configuration

By default, the built-in Prometheus server scrapes the Gloo Mesh management server metrics every 15 seconds and times out after 10 seconds.

You can review the default Prometheus server configuration by running the following command.

kubectl get configmap prometheus-server -n gloo-mesh --context $MGMT_CONTEXT -o yaml
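
To check only the scrape settings, you can filter the output for the interval and timeout fields. This sketch assumes that both values are set explicitly in the scrape configuration; if a field is not listed, the Prometheus defaults apply.

kubectl get configmap prometheus-server -n gloo-mesh --context $MGMT_CONTEXT -o yaml | grep -E 'scrape_interval|scrape_timeout'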

Overview of available metrics in the Gloo Mesh management server

To find a list of metrics that are ready to be scraped by the built-in Prometheus server, you can access the metrics endpoint of the Gloo Mesh management server.

  1. Set up port forwarding for the Gloo Mesh management server.

    kubectl port-forward -n gloo-mesh --context $MGMT_CONTEXT deploy/gloo-mesh-mgmt-server 9091
    
  2. Access available metrics at http://localhost:9091/metrics.
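
With the port-forward from the first step still active, you can also list the unique metric names from the command line. The following pipeline is a sketch that strips comment lines and label sets from the standard Prometheus text format.

# Requires the port-forward from step 1 to remain active.
curl -s http://localhost:9091/metrics | grep -v '^#' | cut -d'{' -f1 | awk '{print $1}' | sort -u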

Retention period for metrics

Metrics are available for as long as the prometheus-server pod runs in your management cluster, but are lost when the pod restarts or when you scale down the deployment.

Because metrics are not persisted between pod restarts, make sure to consider best practices for how to set up monitoring with Prometheus in production.

Monitored metrics in the Gloo Mesh UI

The Gloo Mesh UI monitors the following metrics and records how they change over time. You can view and work with these metrics by using the Gloo Mesh UI Graph.

Best practices for collecting metrics in production

The built-in Prometheus server is a great way to gain insight into the performance of your service mesh. However, the pod is not set up with persistent storage, and metrics are lost when the pod restarts or when the deployment is scaled down. Additionally, you might want to use your organization's own Prometheus-compatible solution or time series database that is hardened for production and integrates with other applications that might exist outside of the service mesh.

To set up monitoring for production, you can choose between the following options: