Alerts
Review alerts for Gloo Mesh Core components that are automatically set up for you in Prometheus.
To help you monitor the Gloo Mesh Core components, Gloo automatically sets up alerts for certain metrics and notifies you when issues occur.
These metrics include:
- Latency: The time it takes to translate or reconcile Gloo resources in your environment.
- Gloo agents: Monitors the connection between the Gloo management server and the Gloo agents in your workload clusters.
- Translation errors: Reports the Gloo resources that cannot be correctly translated into Istio or Cilium resources.
- Redis errors: Lists connection failures between the Gloo management server and the Redis database where all of the Gloo configuration is stored.
View default alerts
Get the secret that holds the Prometheus server configuration.
kubectl get secret gloo-prometheus-server -n gloo-mesh -o yaml
Example output:
apiVersion: v1
data:
  alerting_rules.yml: Z3JvdXBzOgotIG5hbWU6IEdsb29QbGF0Zm9ybUFsZXJ0cwogIHJ1bGVzOgogIC0gYWxlcnQ6IEdsb29QbGF0Zm9ybVRyYW5zbGF0aW9uTGF0ZW5jeUlzSGlnaAogICAgYW5ub3RhdGlvbnM6CiAgICAgIHJ1bmJvb2s6IGh0dHBzOi8vZG9jcy5zb2xvLmlvL2dsb28tbWVzaC1lbnRlcnByaXNlL21haW4vdHJvdWJsZXNob290aW5nL2dsb28vCiAgICAgIHN1bW1hcnk6IFRoZSB0cmFuc2xhdGlvbiB0aW1lIGhhcyBpbmNyZWFzZWQgYWJvdmUgMTAgc2VjLiBJdCdzIGN1cnJlbnRseSB7eyAkdmFsdWUgfCBodW1hbml6ZSB9fS4KICAgIGV4cHI6IGhpc3RvZ3JhbV9xdWFudGlsZSgwLjk5LCBzdW0ocmF0ZShnbG9vX21lc2hfdHJhbnNsYXRpb25fdGltZV9zZWNfYnVja2V0WzVtXSkpIGJ5KGxlKSkgPiAxMAogICAgZm9yOiAxNW0KICAgIGxhYmVsczoKICAgICAgc2V2ZXJpdHk6IHdhcm5pbmcKICAtIGFsZXJ0OiBHbG9vUGxhdGZvcm1SZWNvbnNjaWxlckxhdGVuY3lJc0hpZ2gKICAgIGFubm90YXRpb25zOgogICAgICBydW5ib29rOiBodHRwczovL2RvY3Muc29sby5pby9nbG9vLW1lc2gtZW50ZXJwcmlzZS9tYWluL3Ryb3VibGVzaG9vdGluZy9nbG9vLwogICAgICBzdW1tYXJ5OiBUaGUgcmVjb25jaWxpYXRpb24gdGltZSBoYXMgaW5jcmVhc2VkIGFib3ZlIDgwIHNlYy4gSXQncyBjdXJyZW50bHkge3sgJHZhbHVlIHwgaHVtYW5pemUgfX0uCiAgICBleHByOiBoaXN0b2dyYW1fcXVhbnRpbGUoMC45OSwgc3VtKHJhdGUoZ2xvb19tZXNoX3JlY29uY2lsZXJfdGltZV9zZWNfYnVja2V0WzVtXSkpIGJ5KGxlKSkgPiA4MAogICAgZm9yOiAxNW0KICAgIGxhYmVsczoKICAgICAgc2V2ZXJpdHk6IHdhcm5pbmcKICAtIGFsZXJ0OiBHbG9vUGxhdGZvcm1BZ2VudHNBcmVEaXNjb25uZWN0ZWQKICAgIGFubm90YXRpb25zOgogICAgICBydW5ib29rOiBodHRwczovL2RvY3Muc29sby5pby9nbG9vLW1lc2gtZW50ZXJwcmlzZS9tYWluL3Ryb3VibGVzaG9vdGluZy9nbG9vLwogICAgICBzdW1tYXJ5OiAnVGhlIGZvbGxvd2luZyBjbHVzdGVyIGlzIGRpc2Nvbm5lY3RlZDoge3sgJGxhYmVscy5jbHVzdGVyIH19LiBDaGVjayB0aGUgR2xvbyBQbGF0Zm9ybSBBZ2VudCBwb2QgaW4gdGhlIGNsdXN0ZXIhJwogICAgZXhwcjogY291bnQgYnkoY2x1c3RlcikgKHN1bSBieShjbHVzdGVyKSAocmVsYXlfcHVzaF9jbGllbnRzX3dhcm1lZCA9PSAwKSkgPiAwCiAgICBmb3I6IDVtCiAgICBsYWJlbHM6CiAgICAgIHNldmVyaXR5OiB3YXJuaW5nCiAgLSBhbGVydDogR2xvb1BsYXRmb3JtVHJhbnNsYXRpb25XYXJuaW5ncwogICAgYW5ub3RhdGlvbnM6CiAgICAgIHJ1bmJvb2s6IGh0dHBzOi8vZG9jcy5zb2xvLmlvL2dsb28tbWVzaC1lbnRlcnByaXNlL21haW4vdHJvdWJsZXNob290aW5nL2dsb28vCiAgICAgIHN1bW1hcnk6IEdsb28gUGxhdGZvcm0gaGFzIGRldGVjdGVkIHt7ICR2YWx1ZSB8IGh1bWFuaXplIH19IHRyYW5zbGF0aW9uIHdhcm5pbmdzIGluIHRoZSBsYXN0IDVtLiBDaGVjayB5b3VyIHt7ICRsYWJlbHMuZ3ZrIH19IHJlc291cmNlcyEKICAgIGV4cHI6IGluY3JlYXNlKHRyYW5zbGF0aW9uX3dhcm5pbmdbNW1dKSA+IDAKICAgIGxhYmVsczoKICAgICAgc2V2ZXJpdHk6IHdhcm5pbmcKICAtIGFsZXJ0OiBHbG9vUGxhdGZvcm1UcmFuc2xhdGlvbkVycm9ycwogICAgYW5ub3RhdGlvbnM6CiAgICAgIHJ1bmJvb2s6IGh0dHBzOi8vZG9jcy5zb2xvLmlvL2dsb28tbWVzaC1lbnRlcnByaXNlL21haW4vdHJvdWJsZXNob290aW5nL2dsb28vCiAgICAgIHN1bW1hcnk6IEdsb28gUGxhdGZvcm0gaGFzIGRldGVjdGVkIHt7ICR2YWx1ZSB8IGh1bWFuaXplIH19IHRyYW5zbGF0aW9uIGVycm9ycyBpbiB0aGUgbGFzdCA1bS4gQ2hlY2sgeW91ciB7eyAkbGFiZWxzLmd2ayB9fSByZXNvdXJjZXMhCiAgICBleHByOiBpbmNyZWFzZSh0cmFuc2xhdGlvbl9lcnJvcls1bV0pID4gMAogICAgbGFiZWxzOgogICAgICBzZXZlcml0eTogd2FybmluZwogIC0gYWxlcnQ6IEdsb29QbGF0Zm9ybVJlZGlzRXJyb3JzCiAgICBhbm5vdGF0aW9uczoKICAgICAgcnVuYm9vazogaHR0cHM6Ly9kb2NzLnNvbG8uaW8vZ2xvby1tZXNoLWVudGVycHJpc2UvbWFpbi90cm91Ymxlc2hvb3RpbmcvZ2xvby8KICAgICAgc3VtbWFyeTogR2xvbyBQbGF0Zm9ybSBoYXMgZGV0ZWN0ZWQge3sgJHZhbHVlIHwgaHVtYW5pemUgfX0gUmVkaXMgc3luYyBlcnJvcnMgaW4gdGhlIGxhc3QgNW0uCiAgICBleHByOiBpbmNyZWFzZShnbG9vX21lc2hfcmVkaXNfc3luY19lcnJbNW1dKSA+IDAKICAgIGxhYmVsczoKICAgICAgc2V2ZXJpdHk6IHdhcm5pbmcK
  prometheus.yml: cnVsZV9maWxlczoKLSAvZXRjL2NvbmZpZy9yZWNvcmRpbmdfcnVsZXMueW1sCi0gL2V0Yy9jb25maWcvYWxlcnRpbmdfcnVsZXMueW1sCi0gL2V0Yy9jb25maWcvcnVsZXMKLSAvZXRjL2NvbmZpZy9hbGVydHMKc2NyYXBlX2NvbmZpZ3M6Ci0gam9iX25hbWU6IHByb21ldGhldXMKICBzdGF0aWNfY29uZmlnczoKICAtIHRhcmdldHM6CiAgICAtIGxvY2FsaG9zdDo5MDkwCi0gam9iX25hbWU6IG90ZWwtY29sbGVjdG9yCiAgaG9ub3JfbGFiZWxzOiB0cnVlCiAga3ViZXJuZXRlc19zZF9jb25maWdzOgogIC0gcm9sZTogcG9kCiAgICBuYW1lc3BhY2VzOgogICAgICBuYW1lczoKICAgICAgLSBnbG9vLW1lc2gKICBzY3JhcGVfaW50ZXJ2YWw6IDMwcwogIHNjcmFwZV90aW1lb3V0OiAyMHMKICByZWxhYmVsX2NvbmZpZ3M6CiAgLSBhY3Rpb246IGtlZXAKICAgIHJlZ2V4OiBzdGFuZGFsb25lLWNvbGxlY3RvcnxhZ2VudC1jb2xsZWN0b3IKICAgIHNvdXJjZV9sYWJlbHM6CiAgICAtIF9fbWV0YV9rdWJlcm5ldGVzX3BvZF9sYWJlbF9jb21wb25lbnQKICAtIGFjdGlvbjoga2VlcAogICAgcmVnZXg6IHRydWUKICAgIHNvdXJjZV9sYWJlbHM6CiAgICAtIF9fbWV0YV9rdWJlcm5ldGVzX3BvZF9hbm5vdGF0aW9uX3Byb21ldGhldXNfaW9fc2NyYXBlCiAgLSBhY3Rpb246IGRyb3AKICAgIHJlZ2V4OiB0cnVlCiAgICBzb3VyY2VfbGFiZWxzOgogICAgLSBfX21ldGFfa3ViZXJuZXRlc19wb2RfYW5ub3RhdGlvbl9wcm9tZXRoZXVzX2lvX3NjcmFwZV9zbG93CiAgLSBhY3Rpb246IHJlcGxhY2UKICAgIHJlZ2V4OiAoaHR0cHM/KQogICAgc291cmNlX2xhYmVsczoKICAgIC0gX19tZXRhX2t1YmVybmV0ZXNfcG9kX2Fubm90YXRpb25fcHJvbWV0aGV1c19pb19zY2hlbWUKICAgIHRhcmdldF9sYWJlbDogX19zY2hlbWVfXwogIC0gYWN0aW9uOiByZXBsYWNlCiAgICByZWdleDogKC4rKQogICAgc291cmNlX2xhYmVsczoKICAgIC0gX19tZXRhX2t1YmVybmV0ZXNfcG9kX2Fubm90YXRpb25fcHJvbWV0aGV1c19pb19wYXRoCiAgICB0YXJnZXRfbGFiZWw6IF9fbWV0cmljc19wYXRoX18KICAjIFN1cHBvcnRpbmcgYm90aCBJUHY0IGFuZCBJUHY2CiAgLSBhY3Rpb246IHJlcGxhY2UKICAgIHJlZ2V4OiAoXGQrKTsoKFtBLUZhLWYwLTldezEsNH06Oj8pezEsN31bQS1GYS1mMC05XXsxLDR9KQogICAgcmVwbGFjZW1lbnQ6ICdbJDJdOiQxJwogICAgc291cmNlX2xhYmVsczoKICAgICAgLSBfX21ldGFfa3ViZXJuZXRlc19wb2RfYW5ub3RhdGlvbl9wcm9tZXRoZXVzX2lvX3BvcnQKICAgICAgLSBfX21ldGFfa3ViZXJuZXRlc19wb2RfaXAKICAgIHRhcmdldF9sYWJlbDogX19hZGRyZXNzX18KICAtIGFjdGlvbjogcmVwbGFjZQogICAgcmVnZXg6IChcZCspOygoKFswLTldKz8pKFwufCQpKXs0fSkKICAgIHJlcGxhY2VtZW50OiAkMjokMQogICAgc291cmNlX2xhYmVsczoKICAgICAgLSBfX21ldGFfa3ViZXJuZXRlc19wb2RfYW5ub3RhdGlvbl9wcm9tZXRoZXVzX2lvX3BvcnQKICAgICAgLSBfX21ldGFfa3ViZXJuZXRlc19wb2RfaXAKICAgIHRhcmdldF9sYWJlbDogX19hZGRyZXNzX18KICAtIGFjdGlvbjogbGFiZWxtYXAKICAgIHJlZ2V4OiBfX21ldGFfa3ViZXJuZXRlc19wb2RfYW5ub3RhdGlvbl9wcm9tZXRoZXVzX2lvX3BhcmFtXyguKykKICAgIHJlcGxhY2VtZW50OiBfX3BhcmFtXyQxCiAgLSBhY3Rpb246IGxhYmVsbWFwCiAgICByZWdleDogX19tZXRhX2t1YmVybmV0ZXNfcG9kX2xhYmVsXyguKykKICAtIGFjdGlvbjogcmVwbGFjZQogICAgc291cmNlX2xhYmVsczoKICAgIC0gX19tZXRhX2t1YmVybmV0ZXNfbmFtZXNwYWNlCiAgICB0YXJnZXRfbGFiZWw6IG5hbWVzcGFjZQogIC0gYWN0aW9uOiByZXBsYWNlCiAgICBzb3VyY2VfbGFiZWxzOgogICAgLSBfX21ldGFfa3ViZXJuZXRlc19wb2RfbmFtZQogICAgdGFyZ2V0X2xhYmVsOiBjb2xsZWN0b3JfcG9kCiAgLSBhY3Rpb246IGRyb3AKICAgIHJlZ2V4OiBQZW5kaW5nfFN1Y2NlZWRlZHxGYWlsZWR8Q29tcGxldGVkCiAgICBzb3VyY2VfbGFiZWxzOgogICAgLSBfX21ldGFfa3ViZXJuZXRlc19wb2RfcGhhc2UKICAjIERyb3AgbGFiZWxzCiAgbWV0cmljX3JlbGFiZWxfY29uZmlnczoKICAtIGFjdGlvbjogbGFiZWxkcm9wCiAgICByZWdleDogYXBwX2t1YmVybmV0ZXNfaW9faW5zdGFuY2V8YXBwX2t1YmVybmV0ZXNfaW9fbmFtZXxpbnN0YW5jZXxqb2J8cG9kX3RlbXBsYXRlX2hhc2gK
kind: Secret
metadata:
  annotations:
    meta.helm.sh/release-name: gloo-mesh-core
    meta.helm.sh/release-namespace: gloo-mesh
  creationTimestamp: "2023-10-26T14:11:44Z"
  labels:
    app.kubernetes.io/managed-by: Helm
  name: gloo-prometheus-server
  namespace: gloo-mesh
  resourceVersion: "3195993"
  uid: 6585b914-8d49-4623-a62f-d9bec09a4448
type: Opaque
Decode the alerting_rules.yml configuration. Replace the example string with the full value from your output.
echo "Z3JvdXBzOgotIG5hbWU6IEdsb29QbGF0Zm..." | base64 -D
Note that base64 -D is the macOS flag. On Linux, use base64 -d instead.
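Instead of copying the base64 string by hand, you can extract and decode the key in one pipeline. The jsonpath expression in the comment below is an assumption based on the secret layout in the example output; the decode step itself is standard base64 and is shown as a runnable sketch.

```shell
# Extract and decode in one step (requires kubectl access to the
# management cluster; shown as a comment so the rest stays self-contained):
#   kubectl get secret gloo-prometheus-server -n gloo-mesh \
#     -o jsonpath='{.data.alerting_rules\.yml}' | base64 -d
#
# The decode step works on any base64 payload. For example, the first
# bytes of the alerting_rules.yml value decode to the YAML header:
echo "Z3JvdXBzOgotIG5hbWU6IEdsb29QbGF0Zm9ybUFsZXJ0cw==" | base64 -d
```

On macOS, swap `base64 -d` for `base64 -D`.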
Latency alerts
Gloo Mesh Core sets up default alerts that monitor the time it takes for a Gloo resource to get translated or reconciled.
GlooPlatformTranslationLatencyIsHigh
Use this alert to receive warnings when the Gloo management server takes longer than usual to translate Gloo resources into the corresponding Istio resources. The alert is based on a histogram and is set on the 99th percentile: an alert is sent when the 99th-percentile translation time exceeds 10 seconds over the evaluation window.
You can customize this alert to, for example, send critical alerts when you reach the 99th percentile and warnings for lower percentiles, such as the 70th. Note that the right percentile depends on your environment. For example, if the cluster that runs your Gloo management server is shared, you might want to use a lower percentile so that you have enough time to reschedule workloads or add resources if translation times become critical. On the other hand, if your cluster is dedicated to the Gloo management plane and has additional compute resources, you can use higher percentile values for your critical alerts.
Review the default configuration of the alert. You can customize values, such as the severity, the overall timeframe that the threshold must be met before an alert is sent (duration), or the interval in which data is collected for the histogram (bucket distribution).
Characteristic | Value |
---|---|
Type | Histogram |
Expression | histogram_quantile(0.99, sum(rate(gloo_mesh_translation_time_sec_bucket[5m])) by(le)) > 10 |
Duration | 15 Minutes |
Severity | Warning |
Bucket distribution in seconds | 1, 2, 5, 10, 15, 20, 25, 30, 45, 60, 120 |
Recommended troubleshooting guide | Link |
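The tiered setup described earlier can be sketched as two rules that follow the structure of the decoded alerting_rules.yml. The alert names, the 0.70 quantile, and the severities below are example values, not defaults:

```yaml
groups:
- name: GlooPlatformAlerts
  rules:
  # Example only: page when the 99th-percentile translation time is slow ...
  - alert: GlooPlatformTranslationLatencyIsCritical
    expr: histogram_quantile(0.99, sum(rate(gloo_mesh_translation_time_sec_bucket[5m])) by(le)) > 10
    for: 15m
    labels:
      severity: critical
  # ... and warn earlier when the 70th percentile starts to degrade.
  - alert: GlooPlatformTranslationLatencyIsElevated
    expr: histogram_quantile(0.70, sum(rate(gloo_mesh_translation_time_sec_bucket[5m])) by(le)) > 10
    for: 15m
    labels:
      severity: warning
```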
GlooPlatformReconcilerLatencyIsHigh
The Gloo reconciler applies translated Gloo resources in your workload clusters so that the desired state in your Gloo environment can be reached. This alert notifies you when the time that the reconciler needs to apply the desired resources takes longer than 80 seconds. The alert is configured as a histogram and set on the 99th percentile.
You can customize this alert to, for example, send critical alerts when you reach the 99th percentile and warnings for lower percentiles, such as the 70th. Note that the right percentile depends on your environment. For example, if the cluster that runs your Gloo management server is shared, you might want to use a lower percentile so that you have enough time to reschedule workloads or add resources if reconciliation times become critical. On the other hand, if your cluster is dedicated to the Gloo management plane and has additional compute resources, you can use higher percentile values for your critical alerts.
Review the default configuration of the alert. You can customize values, such as the severity, the overall timeframe that the threshold must be met before an alert is sent (duration), or the interval in which data is collected for the histogram (bucket distribution).
Characteristic | Value |
---|---|
Type | Histogram |
Expression | histogram_quantile(0.99, sum(rate(gloo_mesh_reconciler_time_sec_bucket[5m])) by(le)) > 80 |
Duration | 15 Minutes |
Severity | Warning |
Bucket distribution in seconds | 1, 2, 5, 10, 15, 30, 50, 80, 100, 200 |
Recommended troubleshooting guide | Link |
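Because the rules are rendered into the gloo-prometheus-server secret by Helm, you typically customize them through Helm values rather than by editing the secret directly. The snippet below is a sketch that assumes the bundled Prometheus follows the community chart's serverFiles convention; verify the exact key path against your chart's values reference before using it.

```yaml
# Assumed values layout -- confirm against your Helm chart's documentation.
prometheus:
  serverFiles:
    alerting_rules.yml:
      groups:
      - name: GlooPlatformAlerts
        rules:
        - alert: GlooPlatformReconcilerLatencyIsHigh
          expr: histogram_quantile(0.99, sum(rate(gloo_mesh_reconciler_time_sec_bucket[5m])) by(le)) > 80
          for: 30m             # example: longer duration than the default 15m
          labels:
            severity: critical # example: escalated from the default warning
```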
Gloo agents alerts
Gloo automatically monitors the relay connection between the Gloo management server and Gloo agents, and notifies you if issues are found.
GlooPlatformAgentsAreDisconnected
This alert notifies you when a Gloo agent in a workload cluster is not connected to the Gloo management server. By default, you get a warning alert as soon as one cluster loses connectivity to the management server. Depending on your cluster environment, you might want to change this alert's severity to critical.
Review the default configuration of the alert. You can customize values, such as the severity, or the overall timeframe that the threshold must be met before an alert is sent (duration).
Characteristic | Value |
---|---|
Type | Counter |
Expression | count by(cluster) (sum by(cluster) (relay_push_clients_warmed == 0)) > 0 |
Duration | 5 Minutes |
Severity | Warning |
Recommended troubleshooting guide | Relay connection, Agent |
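If a disconnected cluster is a paging event in your environment, you can escalate the default rule. The sketch below keeps the default expression and changes only the duration and severity; both values are examples, not defaults:

```yaml
# Example rule entry for the alerting_rules.yml groups[].rules list.
- alert: GlooPlatformAgentsAreDisconnected
  expr: count by(cluster) (sum by(cluster) (relay_push_clients_warmed == 0)) > 0
  for: 1m                  # example: react faster than the default 5m
  labels:
    severity: critical     # example: escalated from the default warning
  annotations:
    summary: 'The following cluster is disconnected: {{ $labels.cluster }}.'
```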
Translation alerts
Gloo automatically sets up alerts to monitor Gloo resources that cannot be translated correctly.
GlooPlatformTranslationWarnings
Sometimes Gloo resource configurations include partial errors or refer to unknown Gloo resources, such as a gateway or destination. The translation itself succeeds, but because a referenced resource cannot be found, the Gloo resource is marked with a warning state. The alert is triggered when a resource with a warning state is found during a 5-minute timeframe.
Review the default configuration of the alert. You can customize values, such as the expression or severity.
Characteristic | Value |
---|---|
Type | Counter |
Expression | increase(translation_warning[5m]) > 0 |
Severity | Warning |
Recommended troubleshooting guide | Link |
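Because the metric carries a gvk label (the default summary references {{ $labels.gvk }}), you can scope the alert to the resource kinds you care about most. The label matcher below is a hypothetical example, not a default:

```yaml
# Hypothetical example: alert only on warnings for one resource kind.
- alert: GlooPlatformVirtualDestinationWarnings
  expr: increase(translation_warning{gvk=~".*VirtualDestination.*"}[5m]) > 0
  labels:
    severity: warning
```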
GlooPlatformTranslationErrors
Translation errors can happen if the Gloo resource configuration is correct, but the Gloo management server has an issue with applying the resource or reconciling Gloo agents. If such an error occurs, the resource state changes to failed. The alert is triggered when a resource with a failed state is found during a 5-minute timeframe.
Review the default configuration of the alert. You can customize values, such as the expression or severity.
Characteristic | Value |
---|---|
Type | Counter |
Expression | increase(translation_error[5m]) > 0 |
Severity | Warning |
Recommended troubleshooting guide | Link |
Redis alerts
Gloo sets up alerts to monitor the read and write operations between the Gloo management server and Redis.
GlooPlatformRedisErrors
If the Gloo management server cannot read from the Gloo Redis instance during a 5-minute timeframe, an alert is automatically triggered.
Characteristic | Value |
---|---|
Type | Counter |
Expression | increase(gloo_mesh_redis_sync_err[5m]) > 0 |
Severity | Warning |
Recommended troubleshooting guide | Link |
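Redis connectivity failures usually mean the management server cannot read the state it needs, so you might treat them as critical and add a short duration to suppress single transient errors. A sketch with example values (the default rule has no for clause and uses warning severity):

```yaml
# Example rule entry for the alerting_rules.yml groups[].rules list.
- alert: GlooPlatformRedisErrors
  expr: increase(gloo_mesh_redis_sync_err[5m]) > 0
  for: 5m                  # example: require errors to persist before firing
  labels:
    severity: critical     # example: escalated from the default warning
```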