Best practices for production
The built-in Prometheus server is a great way to gain insight into the performance of your service mesh. However, the pod is not set up with persistent storage, and metrics are lost when the pod restarts or when the deployment is scaled down. Additionally, you might want to use your organization's own Prometheus-compatible solution or time series database that is hardened for production and integrates with other applications that might exist outside of the service mesh.
To set up monitoring for production, you can choose between the following options:
Replace the built-in Prometheus server with your own instance
In this setup, you configure Gloo Mesh to disable the built-in Prometheus instance and to use your production Prometheus instance instead. This setup is a reasonable approach if you want to scrape raw Istio metrics from the Gloo Mesh management server and collect them in your production Prometheus instance. However, you cannot control the number of metrics that you collect, and you cannot federate or aggregate the metrics before your production Prometheus scrapes them. Queries against the metrics run entirely on the compute resources of the cluster where your production Prometheus instance is deployed. Depending on the number and complexity of the queries that you run, especially if you use the instance to consolidate metrics of other apps as well, your production instance might become overloaded or start to respond more slowly.
For more information, see Replace the built-in Prometheus server with your own instance.
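As a sketch only, the Helm values for this option might look like the following. The exact value keys (`prometheus.enabled`, `prometheusUrl`) vary between Gloo Mesh chart versions, so verify them against the values reference for the chart version that you use.

```yaml
# Illustrative Helm values only. Verify the exact keys against the
# values reference for your Gloo Mesh chart version.
prometheus:
  # Disable the built-in Prometheus deployment.
  enabled: false

# Point Gloo Mesh at your production Prometheus-compatible
# endpoint instead (hypothetical in-cluster URL).
prometheusUrl: http://prometheus.monitoring.svc.cluster.local:9090
```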
Recommended: Federate metrics with recording rules and provide them to your production monitoring system
To build a robust production-level Prometheus setup that follows the Istio observability best practices, federate the metrics and cardinality labels that you want to collect in your production instance and use the compute capacity of the Gloo Mesh management cluster to aggregate and precompute the metrics. Then, you can scrape the federated metrics with your Prometheus-compatible solution or send them to a time series database that is hardened for production as shown in the following image.
While this is a more complex setup than replacing the built-in Prometheus server with your own instance, you gain granular control over the metrics that you want to collect. Because the metrics are precomputed on the Gloo Mesh management cluster, your queries in the production instance are much faster and more scalable, and you can avoid overloading your production instance.
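To illustrate what such precomputation can look like, the following Prometheus recording rule aggregates the raw `istio_requests_total` counter into a lower-cardinality per-workload request rate on the built-in Prometheus server. The rule name and the set of labels to keep are examples; choose them based on the dashboards and alerts that you actually need.

```yaml
# Example recording rule group for the built-in Prometheus server.
groups:
- name: istio.workload.rules
  interval: 30s
  rules:
  # Precompute the 5-minute request rate per workload pair, dropping
  # high cardinality labels such as individual pod names.
  - record: workload:istio_requests_total:rate5m
    expr: |
      sum(rate(istio_requests_total[5m]))
      by (source_workload, destination_workload, response_code)
```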
For more information, see Federate metrics with recording rules and provide them to your production monitoring instance.
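Your production Prometheus instance can then pull only the precomputed series over the standard `/federate` endpoint. The following job is a generic Prometheus federation example; the target address of the built-in server and the `match[]` selector (here, every series whose name starts with `workload:`) are assumptions that you adapt to your setup.

```yaml
# Example scrape config in the production Prometheus instance.
scrape_configs:
- job_name: gloo-mesh-federation
  # Keep the labels as exposed by the federated server.
  honor_labels: true
  metrics_path: /federate
  params:
    'match[]':
    # Pull only the precomputed recording-rule series.
    - '{__name__=~"workload:.*"}'
  static_configs:
  - targets:
    # Hypothetical in-cluster address of the built-in Prometheus server.
    - prometheus-server.gloo-mesh.svc.cluster.local:9090
```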
Remove high cardinality labels at creation time
With metrics federation, you can use recording rules to precompute frequently used metrics and reduce high cardinality labels before metrics are forwarded to an external Prometheus-compatible solution. The raw labels and metric dimensions are still available in the built-in Prometheus server and can be accessed if needed.
To reduce the amount of data that is collected even more, you can customize the Envoy filter of your workloads to modify how Istio metrics are recorded at creation time. With this setup, you can remove any unwanted cardinality labels before metrics are scraped by the built-in Prometheus server.
For more information, see Remove high cardinality labels at creation time.
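In recent Istio versions, one alternative to hand-writing an Envoy filter patch is the Istio Telemetry API, which can drop individual tags from the generated metrics. The following sketch removes a hypothetical high cardinality `request_host` tag from all Prometheus metrics mesh-wide; check the Telemetry API reference for the tags and match options that your Istio version supports.

```yaml
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: remove-high-cardinality-tags
  # Applying the resource in the Istio root namespace
  # makes it take effect mesh-wide.
  namespace: istio-system
spec:
  metrics:
  - providers:
    - name: prometheus
    overrides:
    - match:
        metric: ALL_METRICS
      tagOverrides:
        # Hypothetical high cardinality label to drop at creation time.
        request_host:
          operation: REMOVE
```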