You can gain insights into the health and performance of your cluster components by using the Gloo telemetry pipeline. Built on top of the OpenTelemetry open source project, the Gloo telemetry pipeline helps you to collect and export telemetry data, such as metrics, logs, traces, and Gloo insights, and to visualize this data by using Gloo observability tools.

Review the information on this page to learn more about the Gloo telemetry pipeline and how to use it in your cluster.

Setup

The Gloo telemetry pipeline is set up by default when you install Gloo Mesh.

To see the receivers, processors, and exporters that are set up by default for you, run the following commands:

  kubectl get configmap gloo-telemetry-gateway-config -n gloo-mesh -o yaml
  kubectl get configmap gloo-telemetry-collector-config -n gloo-mesh -o yaml

Disable the telemetry pipeline

If you want to disable the Gloo telemetry pipeline, follow the Upgrade guide and add the following configuration to your Helm values file:

  
  telemetryCollector:
    enabled: false
  telemetryGateway:
    enabled: false

Customize the pipeline

You can customize the Gloo telemetry pipeline and set up additional receivers, processors, and exporters in your pipeline. The Gloo telemetry pipeline is set up with pre-built pipelines that use a variety of receivers, processors, and exporters to collect and store telemetry data in your cluster. You can enable and disable these pipelines as part of your Helm installation.

Because the Gloo telemetry pipeline is built on top of the OpenTelemetry open source project, you also have the option to add your own custom receivers, processors, and exporters to the pipeline. For more information, see the pipeline architecture information in the OpenTelemetry documentation.

To add more telemetry data to the Gloo telemetry pipeline, see Customize the pipeline.

Architecture

The Gloo telemetry pipeline is decoupled from the Gloo agents and management server core functionality, and consists of two main components: the Gloo telemetry collector agent and telemetry gateway.

Learn more about the telemetry data that is collected in the Gloo telemetry pipeline.

Built-in telemetry pipelines

The Gloo telemetry pipeline is set up with default pipelines that you can enable to collect telemetry data in your cluster.

Default metrics in the pipeline

By default, the Gloo telemetry pipeline is configured to scrape the metrics that are required for the Gloo UI from various workloads in your cluster by using the metrics/ui and metrics/prometheus pipelines. The built-in Prometheus server is configured to scrape metrics from the Gloo collector agent (single cluster), or Gloo telemetry gateway and collector agent (multicluster). To reduce cardinality in the Gloo telemetry pipeline, only a few labels are collected for each metric. For more information, see Metric labels.

Review the metrics that are available in the Gloo telemetry pipeline. You can set up additional receivers to scrape other metrics, or forward the metrics to other observability tools, such as Datadog, by creating your own custom exporter for the Gloo telemetry gateway. To find an example setup, see Forward metrics to Datadog.

Istio proxy, ztunnel, and waypoint proxy metrics

| Metric | Description |
| --- | --- |
| istio_requests_total | The number of requests that were processed for an Istio proxy. |
| istio_request_duration_milliseconds | The time it takes for a request to reach its destination in milliseconds. |
| istio_request_duration_milliseconds_bucket | The time it takes for a request to reach its destination in milliseconds. |
| istio_request_duration_milliseconds_count | The total number of Istio requests since the Istio proxy was last started. |
| istio_request_duration_milliseconds_sum | The sum of all request durations since the last start of the Istio proxy. |
| istio_tcp_sent_bytes | The number of bytes that are sent in a response at a particular moment in time. |
| istio_tcp_sent_bytes_total | The total number of bytes that are sent in a response. |
| istio_tcp_received_bytes | The number of bytes that are received in a request at a particular moment in time. |
| istio_tcp_received_bytes_total | The total number of bytes that are received in a request. |
| istio_tcp_connections_opened | The number of open connections to an Istio proxy at a particular moment in time. |
| istio_tcp_connections_opened_total | The total number of open connections to an Istio proxy. |
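
The _bucket, _count, and _sum series of istio_request_duration_milliseconds together form a Prometheus histogram. As a minimal sketch of how they relate, the following Python example derives a mean duration from _sum and _count and estimates a percentile from cumulative bucket counts by linear interpolation, similar in spirit to the PromQL histogram_quantile function. The function names, bucket boundaries, and sample values are hypothetical and are not taken from a real cluster.

```python
import math


def average_duration_ms(duration_sum, request_count):
    """Mean request duration: the _sum value divided by the _count value."""
    if request_count == 0:
        return math.nan
    return duration_sum / request_count


def estimate_quantile_ms(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets is a list of (upper_bound_ms, cumulative_count) pairs sorted by
    upper bound, where the last bound is math.inf (the "+Inf" bucket).
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The rank falls into the +Inf bucket: report the highest finite bound.
                return prev_bound
            if count == prev_count:
                return bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical sample: 100 requests with a total duration of 3000 ms.
sample_buckets = [(10.0, 50.0), (50.0, 90.0), (100.0, 100.0), (math.inf, 100.0)]
mean_ms = average_duration_ms(3000.0, 100.0)
p95_ms = estimate_quantile_ms(0.95, sample_buckets)
```

Because these series are cumulative since the proxy started, a query against live data would typically apply a rate over a time window before computing the quantile.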

Istiod metrics

| Metric | Description |
| --- | --- |
| pilot_proxy_convergence_time | The time it takes between applying a configuration change and the Istio proxy receiving the configuration change. |

Gloo management server metrics

| Metric | Description |
| --- | --- |
| gloo_mesh_build_snapshot_metric_time_sec | The time in seconds for the Gloo management server to generate an output snapshot for connected Gloo agents. |
| gloo_mesh_garbage_collection_time_sec | The time in seconds that it takes for the garbage collector to clean up unused resources, such as after the custom resource translation. |
| gloo_mesh_reconciler_time_sec_bucket | The time the Gloo management server needs to sync with the Gloo agents in the workload clusters to apply the translated resources. This metric is captured in seconds for the following intervals (buckets): 1, 2, 5, 10, 15, 30, 50, 80, 100, and 200. |
| gloo_mesh_redis_relation_err_total | The number of errors that occurred during a read or write operation of relationship data to Redis. |
| gloo_mesh_redis_sync_err_total | The number of times the Gloo management server could not read from or write to the Gloo Redis instance. |
| gloo_mesh_redis_write_time_sec | The time it takes in seconds for the Gloo management server to write to the Redis database. |
| gloo_mesh_relay_client_delta_pull_time_sec | The time it takes for a Gloo agent to receive a delta output snapshot from the Gloo management server in seconds. |
| gloo_mesh_relay_client_delta_pull_err | The number of errors that occurred while sending a delta output snapshot to a connected Gloo agent. |
| gloo_mesh_relay_client_delta_push_last_loop_timestamp_seconds | The Unix timestamp (in seconds) of the last time the relay agent created a delta snapshot. This metric is generated even if the snapshot was empty and not sent to the management server. |
| gloo_mesh_relay_client_delta_push_time_sec | The time it takes for a Gloo agent to send a delta input snapshot to the Gloo management server in seconds. |
| gloo_mesh_relay_client_delta_push_err | The number of errors that occurred while sending a delta input snapshot from the Gloo agent to the Gloo management server. |
| gloo_mesh_relay_client_last_delta_pull_received_timestamp_seconds | The Unix timestamp (in seconds) of the last time the relay agent received a delta snapshot from the management server. |
| gloo_mesh_relay_client_last_delta_push_timestamp_seconds | The Unix timestamp (in seconds) of the last time the relay agent pushed a delta snapshot (either non-empty, or the initial snapshot). |
| gloo_mesh_relay_client_last_server_communication_pull_stream_timestamp_seconds | The Unix timestamp (in seconds) of the last time the relay agent received a response from the management server. |
| gloo_mesh_snapshot_upserter_op_time_sec | The time it takes for a snapshot to be updated or inserted in the Gloo management server local memory in seconds. |
| gloo_mesh_safe_mode_active | Indicates whether safe mode is enabled in the Gloo management server. For more information, see Redis safe mode options. |
| gloo_mesh_translation_time_sec_bucket | The time the Gloo management server needs to translate Gloo resources into Istio or Envoy resources. This metric is captured in seconds for the following intervals (buckets): 1, 2, 5, 10, 15, 20, 25, 30, 45, 60, and 120. |
| gloo_mesh_translator_concurrency | The number of translation operations that the Gloo management server can perform at the same time. |
| object_write_fails_total | The number of times the Gloo agent tried to write invalid Istio configuration to the cluster that was rejected by the Istio control plane istiod. |
| relay_pull_clients_connected | The number of Gloo agents that are connected to the Gloo management server. |
| relay_push_clients_warmed | The number of Gloo agents that are ready to accept updates from the Gloo management server. |
| solo_io_gloo_gateway_license | The number of minutes until the Gloo Mesh Gateway license expires. To prevent your management server from crashing when the license expires, make sure to upgrade the license before expiration. |
| solo_io_gloo_mesh_license | The number of minutes until the Gloo Mesh Enterprise license expires. To prevent your management server from crashing when the license expires, make sure to upgrade the license before expiration. |
| translation_error | The number of translation errors that were reported by the Gloo management server. |
| translation_warning | The number of translation warnings that were reported by the Gloo management server. |
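
Several of these metrics report raw values that need a small conversion before they are easy to read: the *_timestamp_seconds metrics are Unix timestamps, and the license metrics count minutes. The following Python sketch shows one way to turn a timestamp into an age and a minutes value into days. The helper names and the 300-second staleness threshold are assumptions chosen for illustration, not Gloo defaults.

```python
def seconds_since(timestamp_seconds, now_seconds):
    """Age of an event reported as a Unix timestamp, clamped at zero."""
    return max(0.0, now_seconds - timestamp_seconds)


def is_stale(timestamp_seconds, now_seconds, threshold_seconds=300.0):
    """True if the event is older than the (hypothetical) threshold."""
    return seconds_since(timestamp_seconds, now_seconds) > threshold_seconds


def license_days_remaining(minutes_remaining):
    """Convert a license metric value from minutes to days."""
    return minutes_remaining / (60 * 24)


# Hypothetical values: the agent last heard from the server 90 seconds ago,
# and the license metric reports 43200 minutes.
age = seconds_since(1_700_000_000, 1_700_000_090)
stale = is_stale(1_700_000_000, 1_700_000_090)
days = license_days_remaining(43_200)
```

A check like is_stale could be applied, for example, to gloo_mesh_relay_client_last_server_communication_pull_stream_timestamp_seconds to notice an agent that has stopped hearing from the management server.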

Gloo telemetry pipeline metrics

| Metric | Description |
| --- | --- |
| otelcol_processor_refused_metric_points | The number of metrics that were refused by the Gloo telemetry pipeline processor. For example, metrics might be refused to prevent collector agents from being overloaded in the case of insufficient memory resources. |
| otelcol_receiver_refused_metric_points | The number of metrics that were refused by the Gloo telemetry pipeline receiver. For example, metrics might be refused to prevent collector agents from being overloaded in the case of insufficient memory resources. |
| otelcol_processor_refused_spans | The metric spans that were refused by the memory_limiter in the Gloo telemetry pipeline to prevent collector agents from being overloaded. |
| otelcol_exporter_queue_capacity | The amount of telemetry data that can be stored in memory while waiting on a worker in the collector agent to become available to send the data. |
| otelcol_exporter_queue_size | The amount of telemetry data that is currently stored in memory. If the size is equal to or larger than otelcol_exporter_queue_capacity, new telemetry data is rejected. |
| otelcol_loadbalancer_backend_latency | The time the collector agents need to export telemetry data. |
| otelcol_exporter_send_failed_spans | The number of telemetry data spans that could not be sent to a backend. |
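
The relationship between otelcol_exporter_queue_size and otelcol_exporter_queue_capacity can be expressed as a utilization ratio. The following Python sketch is one way to compute it and to detect the point at which new telemetry data is rejected. The function names and sample values are hypothetical.

```python
def queue_utilization(queue_size, queue_capacity):
    """Fraction of the exporter queue that is currently in use."""
    if queue_capacity <= 0:
        raise ValueError("queue_capacity must be positive")
    return queue_size / queue_capacity


def queue_is_full(queue_size, queue_capacity):
    """True when the size has reached the capacity, so new data is rejected."""
    return queue_size >= queue_capacity


# Hypothetical sample: 250 items queued out of a capacity of 1000.
utilization = queue_utilization(250, 1000)
full = queue_is_full(250, 1000)
```

Watching the ratio rather than the raw size makes it easier to alert before the queue fills up, regardless of the configured capacity.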

Metric labels

To reduce cardinality in the Gloo telemetry pipeline, only the following labels are collected for each metric.

| Metric group | Labels |
| --- | --- |
| Istio | ["cluster", "collector_pod", "connection_security_policy", "destination_cluster", "destination_principal", "destination_service", "destination_workload", "destination_workload_id", "destination_workload_namespace", "gloo_mesh", "namespace", "pod_name", "reporter", "response_code", "source_cluster", "source_principal", "source_workload", "source_workload_namespace", "version", "workload_id"] |
| Telemetry pipeline | ["app", "cluster", "collector_name", "collector_pod", "component", "exporter", "namespace", "pod_template_generation", "processor", "service_version"] |
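
To illustrate why a label allow-list reduces cardinality, the following Python sketch drops every label that is not in the telemetry pipeline list above and counts the distinct series that remain. It only models the effect on the data and is not how the pipeline itself is implemented. The request_id label and the sample values are hypothetical.

```python
# Allow-list copied from the "Telemetry pipeline" metric group.
TELEMETRY_PIPELINE_LABELS = frozenset({
    "app", "cluster", "collector_name", "collector_pod", "component",
    "exporter", "namespace", "pod_template_generation", "processor",
    "service_version",
})


def keep_allowed_labels(labels, allowed):
    """Return a copy of labels that contains only allow-listed names."""
    return {name: value for name, value in labels.items() if name in allowed}


def count_series(samples, allowed):
    """Count the distinct label sets that remain after filtering."""
    distinct = {
        tuple(sorted(keep_allowed_labels(sample, allowed).items()))
        for sample in samples
    }
    return len(distinct)


# Two samples that differ only in a label that is not on the allow-list.
samples = [
    {"cluster": "cluster-1", "exporter": "otlp", "request_id": "a"},
    {"cluster": "cluster-1", "exporter": "otlp", "request_id": "b"},
]
series_before = count_series(samples, frozenset({"cluster", "exporter", "request_id"}))
series_after = count_series(samples, TELEMETRY_PIPELINE_LABELS)
```

In this sample, the two inputs collapse into a single series once the high-cardinality label is removed.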

Observability tools

The Gloo telemetry pipeline comes with several observability tools that help you monitor the health of Gloo and Istio components and the workloads in your cluster.

Gloo Mesh health

Gloo UI

View the configuration and status of Gloo custom resources. You can also view the health of the clusters that are registered with the Gloo management server. To monitor the health of your Gloo Mesh components, such as the Gloo management server or Gloo telemetry collector agent, use the Gloo UI log viewer to view, filter, search, or download logs for these components.

Prometheus

Use the Prometheus expression browser to run PromQL queries to analyze and aggregate Gloo Mesh metrics. To view metrics that are collected by default, see Gloo management server metrics. To view the alerts that are automatically set up for you, see Alerts.

Ingress gateway

Gloo UI

View the components of your gateway setup. To monitor the traffic to your gateway, you can access the Gloo UI Graph.

Insights

Check the Istio insights that the Gloo analyzer collects for your gateways and reports in the Gloo UI. These insights can help determine the security posture of your setup, the gateway health, and production readiness. The insights give you a checklist to address issues that might otherwise be hard to detect across your environment.

Prometheus

The Gloo telemetry pipeline collects Istio metrics from the ingress gateway proxy and exposes those metrics so that the built-in Prometheus server can scrape them. To view the metrics that are collected by default, see Istio proxy metrics. You can access these metrics by running PromQL queries in the Prometheus expression browser. To find example queries that you can run, see Ingress gateway queries.
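
As a rough sketch of what such a PromQL query might look like, the following Python example builds a query for the per-workload rate of 5xx responses from istio_requests_total and encodes it for the instant query endpoint (/api/v1/query) of the Prometheus HTTP API. The query itself, the helper names, and the http://localhost:9090 address are assumptions for illustration; adjust them to match how you reach the Prometheus server in your environment.

```python
from urllib.parse import urlencode


def error_rate_query(window="5m", group_by="destination_workload"):
    """PromQL for the per-second rate of 5xx responses, grouped by a label."""
    return (
        f"sum by ({group_by}) "
        f'(rate(istio_requests_total{{response_code=~"5.."}}[{window}]))'
    )


def instant_query_url(base_url, query):
    """Build a URL for the Prometheus instant query endpoint."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': query})}"


# Hypothetical address, for example a locally forwarded Prometheus port.
query = error_rate_query()
url = instant_query_url("http://localhost:9090/", query)
```

Both response_code and destination_workload are part of the Istio label set that the pipeline keeps, so they are available for filtering and grouping.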

Jaeger

You can enable request tracing for the ingress gateway and add these traces to the Gloo telemetry pipeline so that they can be forwarded to the built-in or a custom Jaeger instance. For more information about how to set up tracing, and how to enable Jaeger, see Add Istio request traces.

Istio access logs

Use the default Envoy access log collector to record logs for the apps that send requests to the Istio ingress gateway. You can review these logs to troubleshoot issues as needed, or scrape these logs to view them in your larger platform logging system.

Service mesh

Gloo UI

View your service mesh workloads. To monitor the traffic to your service mesh workloads, you can access the Gloo UI Graph.

Insights

Check the Istio insights that the Gloo analyzer collects for your service mesh workloads and reports in the Gloo UI. These insights can help determine the security posture of your workloads, their health, and production readiness. The insights give you a checklist to address issues that might otherwise be hard to detect across your environment.

Prometheus

The Gloo telemetry pipeline collects Istio metrics from the Istio-enabled workloads and exposes those metrics so that the built-in Prometheus server can scrape them. To view the metrics that are collected by default, see Istio proxy metrics. You can access these metrics by running PromQL queries in the Prometheus expression browser. To find example queries that you can run, see Service mesh workload queries.

Jaeger

You can enable request tracing for Istio-enabled workloads and add these traces to the Gloo telemetry pipeline so that they can be forwarded to the built-in or a custom Jaeger instance. For more information about how to set up tracing, and how to enable Jaeger, see Add Istio request traces.

Istio access logs

Use the default Envoy access log collector to record logs for the apps that send requests to Istio-enabled workloads in your service mesh. You can review these logs to troubleshoot issues as needed, or scrape these logs to view them in your larger platform logging system.