Best practices for production

The built-in Prometheus server is a convenient way to gain insight into the performance of your Gloo environment. However, the pod has no persistent storage, so metrics are lost when the pod restarts or when the deployment is scaled down. You might also want to use your organization's own Prometheus-compatible solution or time series database, one that is hardened for production and integrates with other applications, including applications that run outside the cluster where your API Gateway runs.

To set up monitoring for production, you can choose between the following options:

Replace the built-in Prometheus server with your own instance

In this setup, you disable the built-in Prometheus instance and configure Gloo to use your production Prometheus instance instead. This approach is reasonable if you want your production Prometheus instance to scrape raw Istio metrics directly from the Gloo management server. However, you cannot control the number of metrics that you collect, and you cannot federate or aggregate the metrics before your production Prometheus instance scrapes them. Queries and computations run on the compute resources of the cluster where your production Prometheus instance runs. Depending on the number and complexity of those queries, and especially if the instance also consolidates metrics from other apps, your production instance might become overloaded or respond more slowly.

For more information, see Replace the built-in Prometheus server with your own instance.

Recommended: Federate metrics with recording rules and provide them to your production monitoring system

To build a robust, production-level Prometheus setup that follows the Istio observability best practices, federate only the metrics and cardinality labels that you want to collect in your production instance, and use the compute capacity of the Gloo management cluster to aggregate and precompute them. Then, scrape the federated metrics with your Prometheus-compatible solution or send them to a time series database that is hardened for production, as shown in the following image.

Figure: Production Prometheus setup.
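
To make the precompute step more concrete, the following Python sketch imitates what an aggregating recording rule does: it drops high-cardinality labels and sums the remaining series, much like a PromQL `sum by (...)` expression. The label names, sample values, and the `aggregate` helper are hypothetical and only illustrate the idea; they are not the rules that Gloo ships.

```python
from collections import defaultdict


def aggregate(samples, keep_labels):
    """Sum sample values after dropping every label not in keep_labels."""
    totals = defaultdict(float)
    for labels, value in samples:
        # Build a hashable, order-independent key from the labels we keep.
        key = tuple(sorted((k, v) for k, v in labels.items() if k in keep_labels))
        totals[key] += value
    return dict(totals)


# Hypothetical raw series: one series per pod and response code.
raw = [
    ({"workload": "cart", "pod": "cart-1", "response_code": "200"}, 120.0),
    ({"workload": "cart", "pod": "cart-2", "response_code": "200"}, 80.0),
    ({"workload": "cart", "pod": "cart-1", "response_code": "500"}, 3.0),
    ({"workload": "web", "pod": "web-1", "response_code": "200"}, 50.0),
]

# Dropping the per-pod label merges the two "cart"/"200" series into one.
aggregated = aggregate(raw, keep_labels={"workload", "response_code"})
```

In this sketch, the production instance would store only the aggregated series, so the per-pod detail never leaves the Gloo management cluster.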

Although this setup is more complex than replacing the built-in Prometheus server with your own instance, it gives you granular control over the metrics that you collect. Because the metrics are precomputed on the Gloo management cluster, queries in your production instance are faster and more scalable, and you avoid overloading your production instance.
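
As a rough illustration of how a production system might pull only the precomputed series, the sketch below builds a request URL for Prometheus's `/federate` endpoint, which selects series through one or more `match[]` parameters. The host name and the `federate:` metric-name prefix are assumptions made for this example; substitute the address of your Gloo Prometheus instance and the names that your recording rules produce.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def federate_url(base_url, selectors):
    """Build a /federate URL with one match[] parameter per series selector."""
    query = urlencode([("match[]", selector) for selector in selectors])
    return f"{base_url.rstrip('/')}/federate?{query}"


# Hypothetical address and naming convention for the recorded metrics.
url = federate_url(
    "http://prometheus.example.internal:9090",
    ['{__name__=~"federate:.*"}'],
)

# Decode the query string again to confirm which selectors are requested.
requested = parse_qs(urlsplit(url).query)["match[]"]
```

Restricting the selector to the recorded series is what keeps raw, high-cardinality metrics out of the production instance.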

For more information, see Federate metrics with recording rules and provide them to your production monitoring instance.