You are viewing the documentation for Solo Enterprise for Istio, formerly known as Gloo Mesh (OSS APIs).

Metrics

Review default metrics that are available in Prometheus so that you can monitor the health of Solo Enterprise for Istio components and Istio workloads.

View metrics

To view all metrics that are available in Prometheus, follow these steps:

  1. Port-forward the Prometheus pod in your cluster by using either meshctl or kubectl.

    With meshctl:

    meshctl proxy prometheus --kubecontext ${context1}

    With kubectl, port-forward the prometheus-server deployment on port 9091:

    kubectl -n gloo-mesh port-forward deploy/prometheus-server 9091 --context ${context1}

  2. Open the Prometheus expression browser to run PromQL queries on metrics.
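
As an alternative to the expression browser, you can send the same PromQL queries to the Prometheus HTTP API. The following is a minimal sketch, assuming that the port-forward from step 1 exposes Prometheus on localhost:9091. The helper names `build_query_url` and `run_query` are hypothetical and are not part of any Solo tooling.

```python
import json
import urllib.parse
import urllib.request

# Assumption: Prometheus is port-forwarded to localhost:9091 as in step 1.
PROMETHEUS_URL = "http://localhost:9091"


def build_query_url(promql: str, base_url: str = PROMETHEUS_URL) -> str:
    """Return the instant-query URL for a PromQL expression."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def run_query(promql: str, base_url: str = PROMETHEUS_URL) -> list:
    """Run an instant query and return the list of result samples."""
    url = build_query_url(promql, base_url)
    with urllib.request.urlopen(url, timeout=10) as resp:
        body = json.load(resp)
    if body.get("status") != "success":
        raise RuntimeError(f"Prometheus query failed: {body}")
    return body["data"]["result"]


# Example call (needs the port-forward from step 1 to be active):
#   for sample in run_query("relay_pull_clients_connected"):
#       print(sample["metric"], sample["value"])
```

Each returned sample holds the metric's label set under `metric` and a `[timestamp, value]` pair under `value`.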

Default metrics

If you follow one of the Solo Enterprise for Istio setup guides, the telemetry pipeline is automatically set up for you. The telemetry pipeline collects the following metrics that are required for the Gloo UI Graph. Prometheus scrapes the telemetry gateway and collector agent to feed other observability tools. You can also use the Prometheus expression browser to run PromQL queries on these metrics.

Istio proxy, ztunnel, and waypoint proxy metrics

| Metric | Description |
|---|---|
| istio_outlier_detection_endpoints | The total number of backend pod endpoints for a workload. |
| istio_outlier_detection_endpoints_unhealthy | The number of backend pod endpoints that ztunnel detects as unhealthy and does not route to. Check the istio_tcp_connections_failed metric for more information on failures. |
| istio_response_bytes | The number of bytes that are returned in the HTTP response. |
| istio_request_bytes | The number of bytes that were sent in the HTTP request. |
| istio_requests_total | The number of requests that were processed for an Istio proxy. For this metric to be collected at Layer 7 for ztunnels, you must set L7_ENABLED=true. |
| istio_request_duration_milliseconds | The time it takes for a request to reach its destination in milliseconds. For this metric and the bucket, count, and sum submetrics to be collected at Layer 7 for ztunnels, you must set L7_ENABLED=true. |
| istio_request_duration_milliseconds_bucket | The time it takes for a request to reach its destination in milliseconds. |
| istio_request_duration_milliseconds_count | The total number of Istio requests since the Istio proxy was last started. |
| istio_request_duration_milliseconds_sum | The sum of all request durations since the last start of the Istio proxy. |
| istio_tcp_sent_bytes | The number of bytes that are sent in a response at a particular moment in time. |
| istio_tcp_sent_bytes_total | The total number of bytes that are sent in a response. |
| istio_tcp_received_bytes | The number of bytes that are received in a request at a particular moment in time. |
| istio_tcp_received_bytes_total | The total number of bytes that are received in a request. |
| istio_tcp_connections_failed | The number of failed TCP connections to an Istio proxy at a particular moment in time. |
| istio_tcp_connections_opened | The number of open connections to an Istio proxy at a particular moment in time. |
| istio_tcp_connections_opened_total | The total number of open connections to an Istio proxy. |
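
To illustrate how these metrics combine in practice, the following sketch builds a PromQL expression for the share of requests to a workload that return a 5xx response code, based on istio_requests_total. It uses the destination_workload and response_code labels from the Metric labels section. The function name, the example workload name, and the 5-minute default window are assumptions for illustration only.

```python
def error_ratio_query(workload: str, window: str = "5m") -> str:
    """Build a PromQL expression for the 5xx error ratio of a workload.

    `workload` is matched against the destination_workload label.
    `window` is the range that rate() averages over (hypothetical default).
    """
    selector = f'destination_workload="{workload}"'
    errors = (
        f'sum(rate(istio_requests_total{{{selector},response_code=~"5.."}}[{window}]))'
    )
    total = f"sum(rate(istio_requests_total{{{selector}}}[{window}]))"
    return f"{errors} / {total}"


# "reviews" is a placeholder workload name.
print(error_ratio_query("reviews"))
```

You can paste the printed expression into the Prometheus expression browser. The query returns no data if the workload received no requests in the selected window.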

Istiod metrics

| Metric | Description |
|---|---|
| pilot_proxy_convergence_time | The time it takes between applying a configuration change and the Istio proxy receiving the configuration change. |

Solo Enterprise for Istio management server metrics

| Metric | Description |
|---|---|
| gloo_mesh_build_snapshot_metric_time_sec | The time in seconds for the management server to generate an output snapshot for connected agents. |
| gloo_mesh_garbage_collection_time_sec | The time in seconds that the garbage collector takes to clean up unused resources, such as after the custom resource translation. |
| gloo_mesh_reconciler_time_sec_bucket | The time the management server needs to sync with the agents in the workload clusters to apply the translated resources. This metric is captured in seconds for the following intervals (buckets): 1, 2, 5, 10, 15, 30, 50, 80, 100, and 200. |
| gloo_mesh_redis_relation_err_total | The number of errors that occurred during a read or write operation of relationship data to Redis. |
| gloo_mesh_redis_sync_err_total | The number of times the management server could not read from or write to the Redis instance. |
| gloo_mesh_redis_write_time_sec | The time in seconds that it takes for the management server to write to the Redis database. |
| gloo_mesh_relay_client_delta_pull_time_sec | The time in seconds that it takes for an agent to receive a delta output snapshot from the management server. |
| gloo_mesh_relay_client_delta_pull_err | The number of errors that occurred while sending a delta output snapshot to a connected agent. |
| gloo_mesh_relay_client_delta_push_last_loop_timestamp_seconds | The Unix timestamp (in seconds) of the last time the agent created a delta snapshot. This metric is generated even if the snapshot was empty and not sent to the management server. |
| gloo_mesh_relay_client_delta_push_time_sec | The time in seconds that it takes for an agent to send a delta input snapshot to the management server. |
| gloo_mesh_relay_client_delta_push_err | The number of errors that occurred while sending a delta input snapshot from the agent to the management server. |
| gloo_mesh_relay_client_last_delta_pull_received_timestamp_seconds | The Unix timestamp (in seconds) of the last time the agent received a delta snapshot from the management server. |
| gloo_mesh_relay_client_last_delta_push_timestamp_seconds | The Unix timestamp (in seconds) of the last time the agent pushed a delta snapshot (either non-empty, or the initial snapshot). |
| gloo_mesh_relay_client_last_server_communication_pull_stream_timestamp_seconds | The Unix timestamp (in seconds) of the last time the agent received a response from the management server. |
| gloo_mesh_snapshot_upserter_op_time_sec | The time in seconds that it takes for a snapshot to be updated or inserted in the management server's local memory. |
| gloo_mesh_safe_mode_active | Indicates whether safe mode is enabled in the management server. For more information, see Redis safe mode options. |
| gloo_mesh_translation_time_sec_bucket | The time the management server needs to translate Gloo resources into Istio or Envoy resources. This metric is captured in seconds for the following intervals (buckets): 1, 2, 5, 10, 15, 20, 25, 30, 45, 60, and 120. |
| gloo_mesh_translator_concurrency | The number of translation operations that the management server can perform at the same time. |
| object_write_fails_total | The total number of failures that occurred when attempting to write an Istio object to storage. For example, this metric increases if invalid Istio configuration is rejected by the Istio control plane istiod. Write failures can occur during an upsert, delete, or status upsert action. |
| relay_pull_clients_connected | The number of agents that are connected to the management server. |
| relay_push_clients_warmed | The number of agents that are ready to accept updates from the management server. |
| translation_error | The number of translation errors that were reported by the management server. |
| translation_warning | The number of translation warnings that were reported by the management server. |
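
The bucket metrics record cumulative counts per upper bound, so a percentile has to be estimated from those counts. The following sketch shows how, using the bucket bounds listed for gloo_mesh_reconciler_time_sec_bucket. It is a simplified version of the linear interpolation that PromQL's histogram_quantile function applies, and the sample counts are made up for illustration.

```python
import math

# Upper bounds in seconds of the gloo_mesh_reconciler_time_sec buckets.
RECONCILER_BUCKETS = [1, 2, 5, 10, 15, 30, 50, 80, 100, 200]


def estimate_quantile(q, upper_bounds, cumulative_counts):
    """Estimate the q-quantile from cumulative histogram bucket counts.

    cumulative_counts holds one count per upper bound, plus a final entry
    for the +Inf bucket (the total number of observations).
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    if len(cumulative_counts) != len(upper_bounds) + 1:
        raise ValueError("expected one count per bound plus the +Inf bucket")
    total = cumulative_counts[-1]
    if total == 0:
        return math.nan
    rank = q * total
    for i, count in enumerate(cumulative_counts[:-1]):
        if count >= rank:
            lower = upper_bounds[i - 1] if i > 0 else 0.0
            prev = cumulative_counts[i - 1] if i > 0 else 0
            if count == prev:  # empty bucket, nothing to interpolate
                return float(lower)
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev) / (count - prev)
            return lower + (upper_bounds[i] - lower) * fraction
    # The rank falls into the +Inf bucket: report the highest finite bound.
    return float(upper_bounds[-1])


# Made-up cumulative counts for 100 reconcile operations.
counts = [40, 70, 90, 96, 98, 100, 100, 100, 100, 100, 100]
p95 = estimate_quantile(0.95, RECONCILER_BUCKETS, counts)
print(f"estimated p95 reconcile time: {p95:.2f}s")
```

Because the estimate interpolates inside a bucket, its precision is limited by the bucket width. With the bounds above, values between 100 and 200 seconds cannot be told apart more finely than that range.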

Telemetry pipeline metrics

| Metric | Description |
|---|---|
| otelcol_processor_refused_metric_points | The number of metrics that were refused by the telemetry pipeline processor. For example, metrics might be refused to prevent collector agents from being overloaded in the case of insufficient memory resources. |
| otelcol_receiver_refused_metric_points | The number of metrics that were refused by the telemetry pipeline receiver. For example, metrics might be refused to prevent collector agents from being overloaded in the case of insufficient memory resources. |
| otelcol_processor_refused_spans | The number of spans that were refused by the memory_limiter in the telemetry pipeline to prevent collector agents from being overloaded. |
| otelcol_exporter_queue_capacity | The amount of telemetry data that can be stored in memory while waiting on a worker in the collector agent to become available to send the data. |
| otelcol_exporter_queue_size | The amount of telemetry data that is currently stored in memory. If the size is equal to or larger than otelcol_exporter_queue_capacity, new telemetry data is rejected. |
| otelcol_loadbalancer_backend_latency | The time the collector agents need to export telemetry data. |
| otelcol_exporter_send_failed_spans | The number of spans that could not be sent to a backend. |
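
The relationship between otelcol_exporter_queue_size and otelcol_exporter_queue_capacity can be turned into a simple health check. The following sketch classifies a pair of values that you read from Prometheus. The 80% warning threshold and the function name are assumptions for illustration, not values that the telemetry pipeline defines.

```python
def queue_status(size: float, capacity: float, warn_ratio: float = 0.8) -> str:
    """Classify exporter queue usage from the two queue metrics.

    size       -- current value of otelcol_exporter_queue_size
    capacity   -- current value of otelcol_exporter_queue_capacity
    warn_ratio -- hypothetical utilization threshold for an early warning
    """
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    if size >= capacity:
        # At or above capacity, new telemetry data is rejected.
        return "full"
    if size / capacity >= warn_ratio:
        return "warning"
    return "ok"


# Made-up sample values.
print(queue_status(size=850, capacity=1000))
```

A queue that stays near capacity suggests that the collector agents export data more slowly than they receive it.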

Metric labels

To reduce cardinality in the telemetry pipeline, only the following labels are collected for each metric.

| Metric group | Labels |
|---|---|
| Istio | ["cluster", "collector_pod", "connection_security_policy", "destination_cluster", "destination_principal", "destination_service", "destination_workload", "destination_workload_id", "destination_workload_namespace", "gloo_mesh", "namespace", "pod_name", "reporter", "response_code", "source_cluster", "source_principal", "source_workload", "source_workload_namespace", "version", "workload_id"] |
| Istio outlier detection | ["destination_cluster", "destination_network", "destination_workload", "destination_workload_namespace", "destination_workload_type"] |
| Peering | ["source", "peer"] |
| Telemetry pipeline | ["app", "cluster", "collector_name", "collector_pod", "component", "exporter", "namespace", "pod_template_generation", "processor", "service_version"] |
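
Because only the listed labels are collected, a query or dashboard that filters on any other label matches no series. The following sketch checks a set of label names against the Istio group before you use them in a query. The function name is hypothetical, and request_protocol serves only as an example of a label that is not in the list.

```python
# Labels that are collected for the Istio metric group, copied from the list above.
ISTIO_LABELS = frozenset({
    "cluster", "collector_pod", "connection_security_policy",
    "destination_cluster", "destination_principal", "destination_service",
    "destination_workload", "destination_workload_id",
    "destination_workload_namespace", "gloo_mesh", "namespace", "pod_name",
    "reporter", "response_code", "source_cluster", "source_principal",
    "source_workload", "source_workload_namespace", "version", "workload_id",
})


def uncollected_labels(labels, allowed=ISTIO_LABELS):
    """Return the label names that are not in the allowed set, sorted."""
    return sorted(set(labels) - allowed)


# request_protocol is not in the collected list, so it is reported.
print(uncollected_labels(["destination_workload", "request_protocol"]))
```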