
About the telemetry pipeline

Learn about the Gloo telemetry pipeline architecture, its components, and default pipelines that you can choose from.

You can gain insights into the health and performance of your cluster components by using the Gloo telemetry pipeline. Built on top of the OpenTelemetry open source project, the Gloo telemetry pipeline helps you to collect and export telemetry data, such as metrics, logs, and Gloo insights, and to visualize this data by using Gloo observability tools.

Review the information on this page to learn more about the Gloo telemetry pipeline and how to use it in your cluster.

Setup

The Gloo telemetry pipeline is set up by default if you followed one of the installation guides:

Use a meshctl installation profile, such as in the Get started guide.
Use Helm to install Gloo Network.

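To see the receivers, processors, and exporters that are set up by default for you, run the following commands. In single cluster setups, no telemetry gateway is deployed, so the gateway ConfigMap might not exist.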

  kubectl get configmap gloo-telemetry-gateway-config -n gloo-mesh -o yaml
  kubectl get configmap gloo-telemetry-collector-config -n gloo-mesh -o yaml

Disable the telemetry pipeline

If you want to disable the Gloo telemetry pipeline, follow the Upgrade guide and add the following configuration to your Helm values file:

  
telemetryCollector:
  enabled: false
telemetryGateway:
  enabled: false


Disabling the Gloo telemetry pipeline removes the Gloo telemetry gateway and collector agent pods from your cluster. If you previously collected telemetry data and did not export it to a different observability tool, all telemetry data is removed. To keep your telemetry data, export it to another observability tool, such as Prometheus, Jaeger, or a tool of your own, before you disable the telemetry pipeline.

Customize the pipeline

You can customize the Gloo telemetry pipeline and set up additional receivers, processors, and exporters in your pipeline. The Gloo telemetry pipeline is set up with pre-built pipelines that use a variety of receivers, processors, and exporters to collect and store telemetry data in your cluster. You can enable and disable these pipelines as part of your Helm installation.
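For example, a Helm values sketch that enables the pre-built `metrics/cilium` pipeline on the collector agents might look like the following. This sketch assumes that your Gloo Helm chart exposes a `telemetryCollectorCustomization.pipelines` setting that is keyed by pipeline name. That key is an assumption here, so check the Helm values reference for your Gloo version before you apply it during an upgrade.

```yaml
# Assumption: the chart accepts telemetryCollectorCustomization.pipelines.
# Verify the exact key in the Helm values reference for your Gloo version.
telemetryCollectorCustomization:
  pipelines:
    # Keys are built-in pipeline names, such as metrics/cilium or logs/cilium_flows.
    metrics/cilium:
      enabled: true
```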

Because the Gloo telemetry pipeline is built on top of the OpenTelemetry open source project, you also have the option to add your own custom receivers, processors, and exporters to the pipeline. For more information, see the pipeline architecture information in the OpenTelemetry documentation.
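Because the pipeline is built on OpenTelemetry, custom components follow the OpenTelemetry Collector configuration layout: components are defined under `receivers`, `processors`, and `exporters`, and are then wired together under `service.pipelines`. The following minimal, generic configuration illustrates that layout. It is not the Gloo default configuration. The `otlp` receiver, `batch` processor, and `debug` exporter are standard OpenTelemetry components chosen for illustration, and `metrics/example` is a hypothetical pipeline name. Older collector versions name the `debug` exporter `logging`.

```yaml
# Generic OpenTelemetry Collector layout (illustrative, not the Gloo defaults).
receivers:
  otlp:                 # accept OTLP data from other collectors or SDKs
    protocols:
      grpc: {}
processors:
  batch: {}             # group data into batches before export
exporters:
  debug: {}             # write received telemetry to the collector log
service:
  pipelines:
    # Pipelines are named <type>/<name>, like the built-in metrics/ui pipeline.
    metrics/example:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug]
```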

For step-by-step instructions to add more telemetry data to the Gloo telemetry pipeline, see the Customize the pipeline guide.

Architecture

The Gloo telemetry pipeline is decoupled from the core functionality of the Gloo agents and management server, and consists of two main components: the Gloo telemetry collector agent and the Gloo telemetry gateway.

The following paragraphs describe how these components are set up in single cluster and multicluster environments.

In single cluster setups, only a Gloo telemetry collector agent is deployed to the cluster. The agent is configured to scrape metrics from workloads in the cluster and to enrich the data, such as by adding workload IDs that you can later use to filter metrics. In addition, it receives other telemetry data, such as traces and Gloo analyzer logs. Depending on the type of telemetry data, the collector agent then forwards this data to other observability tools, such as Jaeger, as shown in the following figure.

You also have the option to set up your own exporter to forward telemetry data to an observability tool of your choice. For an example of how to export data to Datadog, see Forward metrics to Datadog.

Figure: Gloo telemetry pipeline architecture in single cluster setups

In multicluster setups, Gloo telemetry collector agents are deployed to the management cluster and each workload cluster. The agents are configured to scrape metrics from workloads in the cluster, and to enrich the data, such as by adding workload IDs that you can later use to filter metrics. In addition, they receive other telemetry data, such as traces and Gloo analyzer logs.

A Gloo telemetry gateway is also deployed to the management cluster and exposed with a Kubernetes load balancer service. The gateway consolidates data in the Gloo management plane so that it can be forwarded to the built-in Gloo observability tools. The collector agents in each workload cluster send all their telemetry data to the telemetry gateway’s service endpoint. You can choose from a set of pre-built pipelines to configure how the telemetry gateway forwards telemetry data within the cluster.

You also have the option to forward telemetry data to an observability tool of your choice by adding custom exporters to either the telemetry gateway or to each collector agent. The option that is right for you depends on the size of your environment, the amount of telemetry data that you want to export, and the compute resources that are available to the Gloo telemetry pipeline components. For an example of how to export data to Datadog, see Forward metrics to Datadog.
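As an illustration, the following plain OpenTelemetry Collector fragment defines a Datadog exporter and a metrics pipeline that uses it. It assumes that an `otlp` receiver and a `batch` processor are already defined in the same configuration, that the collector build includes the Datadog exporter from the OpenTelemetry Collector contrib distribution, and that the API key is available in a `DD_API_KEY` environment variable. The `metrics/datadog` pipeline name is hypothetical. For the supported way to add an exporter to the Gloo telemetry gateway or collector agents, see Forward metrics to Datadog.

```yaml
# Illustrative fragment: assumes `otlp` and `batch` are defined elsewhere
# in the same collector configuration.
exporters:
  datadog:
    api:
      site: datadoghq.com        # your Datadog site
      key: ${env:DD_API_KEY}     # read the API key from an environment variable
service:
  pipelines:
    metrics/datadog:             # hypothetical pipeline name
      receivers: [otlp]
      processors: [batch]
      exporters: [datadog]
```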

Figure: Gloo telemetry pipeline architecture in multicluster setups


The diagram shows the default ports that are added as prometheus.io/port: "<port_number>" pod annotations to the workloads that expose metrics. This port is automatically used by the Gloo collector agent, Gloo telemetry gateway, and Prometheus to scrape the metrics from these workloads. You can change the port by changing the pod annotation. However, keep in mind that changing the default scraping ports might lead to unexpected results, because Gloo Network processes might depend on the default setting.

Learn more about the telemetry data that the Gloo telemetry pipeline collects: metrics, compute instance metadata, Cilium metrics, Cilium flow logs, and Cilium insights.

When you enable the Gloo telemetry pipeline, the collector agents and, if applicable, the telemetry gateway are configured to collect metrics in your Gloo Network environment.

Gloo telemetry collector agents scrape metrics from workloads in your cluster, such as the Gloo agents and management server. The agents use the prometheus.io/scrape: "true" and prometheus.io/port: "<port_number>" pod annotations to determine which workloads to scrape and on which port each workload exposes its metrics. All Gloo components that expose metrics and all Istio- and Cilium-specific workloads are automatically deployed with these annotations.
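For your own workloads, the annotations go in the pod metadata, as in the following sketch. The pod name, namespace, image, and port `9091` are hypothetical values. For a Deployment, set the annotations under `spec.template.metadata.annotations` instead. Annotation values must be strings, so quote both `"true"` and the port number. Whether the scraped metrics are kept might depend on the pipeline configuration; for example, the `metrics/ui` pipeline collects the metrics that the Gloo UI graph requires.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app                    # hypothetical workload
  namespace: my-namespace         # hypothetical namespace
  annotations:
    prometheus.io/scrape: "true"  # mark the pod for scraping
    prometheus.io/port: "9091"    # port where the pod serves metrics (example)
spec:
  containers:
    - name: my-app
      image: registry.example.com/my-app:1.0   # hypothetical image
      ports:
        - containerPort: 9091     # matches the prometheus.io/port annotation
```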


In Gloo version 2.5.0, the prometheus.io/port: "<port_number>" annotation was removed from the Gloo management server and agent, but the prometheus.io/scrape: "true" annotation is still present. If another Prometheus instance runs in your cluster and is not set up with custom scraping jobs for the Gloo management server and agent, that instance automatically scrapes all ports on the management server and agent pods, which can lead to error messages in the management server and agent logs. This issue is resolved in version 2.5.2. To resolve the issue in earlier patch versions, see Run another Prometheus instance alongside the built-in one.

The agents then enrich and convert the metrics. For example, the IDs of the source and destination workloads are added to the metrics so that you can filter the metrics for the workload that you are interested in.

The built-in Prometheus server is set up to scrape metrics from the Gloo telemetry collector agents. In multicluster setups, the Prometheus server also scrapes metrics from the Gloo telemetry gateway in the management cluster that receives metrics from all workload clusters.

Observability tools, such as the Gloo UI, read metrics from Prometheus and visualize this data so that you can monitor the health of the Gloo Network components and receive alerts if an issue is detected. For more information, see the Prometheus overview.

You can configure the Gloo telemetry pipeline to collect metadata about the compute instances, such as virtual machines, that the workload cluster is deployed to so that you can visualize your Gloo Network setup across your cloud provider infrastructure network. The metadata is added as labels to metrics and exposed on the Gloo telemetry collector agent (single cluster), or sent to the Gloo telemetry gateway (multicluster) where they can be scraped by the built-in Prometheus server. You can then use the Prometheus expression browser to analyze these metrics.

For more information, see Collect compute instance metadata.

If your cluster uses the Cilium CNI, some Cilium-specific metrics are collected by default to visualize network communication in the Gloo UI Graph. To add more Cilium, Hubble, and eBPF-specific metrics to the Gloo telemetry pipeline so that you can access them with the expression browser of the built-in Prometheus server, you can enable a pre-built Cilium processor on the Gloo telemetry collector agent. The processor exposes the metrics on the collector agent (single cluster) or sends them to the Gloo telemetry gateway (multicluster) where they can be scraped by the built-in Prometheus server. For more information, see Add Cilium metrics.

In addition, you have the option to add Hubble network flows to the Gloo telemetry collector agent configuration. Flow logs are exposed on the collector agent (single cluster) or sent to the Gloo telemetry gateway (multicluster), and can be accessed by using the meshctl hubble observe command. You can optionally set up a custom exporter to export these logs to an observability tool of your choice, such as Redis. For more information, see Add Cilium flow logs.

Gloo Network comes with an insights engine that automatically analyzes your Cilium setup for health issues. Then, Gloo shares these issues along with recommendations to harden your Cilium setup. The insights give you a checklist to address issues that might otherwise be hard to detect across your environment.

If you follow the Get started guide, insights are automatically enabled in your Gloo Network environment. The Gloo analyzer component monitors and analyzes Cilium workloads and sends analyzer results as logs to the Gloo telemetry collector agent. The agent writes these logs to Redis (single cluster), or forwards the logs to the Gloo telemetry gateway (multicluster) where they are written to Redis.

The Gloo management server uses the analyzer results in Redis and runs queries on Prometheus metrics to create Cilium insights. These insights are then stored in Redis so that the Gloo UI can read and display them.

For more information, see Add Cilium insights.

Built-in telemetry pipelines

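The Gloo telemetry pipeline is set up with built-in pipelines that collect telemetry data in your cluster. Some pipelines are enabled by default, and you can enable the others as needed. The available pipelines differ for single cluster and multicluster setups.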

Single cluster setups:

Telemetry data	Collector agent pipeline	Description
Cilium flow logs	`logs/cilium_flows`	The `logs/cilium_flows` pipeline collects Hubble flow logs for the workloads in the cluster. Flow logs are exposed on the Gloo telemetry collector agent. You can access the flow logs with the `meshctl hubble observe` command. Note that your cluster must be set up to use the Cilium CNI for flow logs to be collected. For more information, see Add Cilium flow logs.
Cilium metrics	`metrics/cilium`	The `metrics/cilium` pipeline collects Cilium, Hubble, and eBPF-specific metrics. Metrics are exposed on the Gloo telemetry collector agent where they are scraped by the built-in Prometheus server. You can access the metrics by using the Prometheus expression browser. Note that your cluster must be set up to use the Cilium CNI for Cilium metrics to be collected. For more information, see Add Cilium metrics.
Compute metadata	`metrics/otlp_relay`	The `metrics/otlp_relay` pipeline collects metadata about the compute instances, such as virtual machines, that the workload cluster is deployed to, and adds the metadata as labels on metrics. The metrics are exposed on the Gloo telemetry collector agent where they can be scraped by the built-in Prometheus server. For more information, see Collect compute instance metadata.
Insights	`logs/analyzer`	The `logs/analyzer` pipeline is enabled by default and collects analyzer results from the Gloo analyzer component. Analyzer results are stored in Redis Streams, where they can be picked up by the insights engine in the Gloo management server to generate insights later. If you follow the Get started guide, the insights engine is already enabled in your environment, and analyzer results are collected by the Gloo telemetry pipeline. For more information, see Add Cilium insights.
Metrics	`metrics/ui`	The `metrics/ui` pipeline is enabled by default and collects the metrics that are required for the Gloo UI graph. Metrics in the collector agent are then scraped by the built-in Prometheus server so that they can be provided to Gloo observability tools. To view the metrics that are captured with this pipeline, see Default metrics in the pipeline.

Multicluster setups:

Telemetry data	Collector agent pipeline	Gateway pipeline	Description
Cilium flow logs	`logs/cilium_flows`	N/A	The `logs/cilium_flows` pipeline collects Hubble flow logs for the workloads in the cluster. Flow logs are sent to the Gloo telemetry gateway where you can access them with the `meshctl hubble observe` command. Note that your cluster must be set up to use the Cilium CNI for flow logs to be collected. For more information, see Add Cilium flow logs.
Cilium metrics	`metrics/cilium`	N/A	The `metrics/cilium` pipeline collects Cilium, Hubble, and eBPF-specific metrics. Metrics are sent to the Gloo telemetry gateway where they are scraped by the built-in Prometheus server. You can access the metrics by using the Prometheus expression browser. Note that your cluster must be set up to use the Cilium CNI for Cilium metrics to be collected. For more information, see Add Cilium metrics.
Compute metadata	`metrics/otlp_relay`	N/A	The `metrics/otlp_relay` pipeline collects metadata about the compute instances, such as virtual machines, that the cluster is deployed to, and adds the metadata as labels on metrics. The metrics are sent to the Gloo telemetry gateway where they can be scraped by the built-in Prometheus server. For more information, see Collect compute instance metadata.
Insights	`logs/analyzer`	`logs/redis_stream`	The `logs/analyzer` pipeline is enabled by default, collects analyzer results from the Gloo analyzer component, and forwards them to the Gloo telemetry gateway. Analyzer results are then stored in Redis by using the `logs/redis_stream` pipeline so that they can be picked up by the insights engine in the Gloo management server to generate insights. If you follow the Get started guide, the insights engine is already enabled in your environment, and analyzer results are collected by the Gloo telemetry pipeline. For more information, see Add Cilium insights.
Metrics	`metrics/ui`	`metrics/prometheus`	The `metrics/ui` pipeline is enabled by default and collects the metrics that are required for the Gloo UI graph. Metrics in the collector agent are then scraped by the built-in Prometheus server so that they can be provided to Gloo observability tools. To view the metrics that are captured with this pipeline, see Default metrics in the pipeline.

Default metrics in the pipeline

By default, the Gloo telemetry pipeline is configured to scrape the metrics that are required for the Gloo UI from various workloads in your cluster by using the `metrics/ui` pipeline and, in multicluster setups, the `metrics/prometheus` pipeline. The built-in Prometheus server is configured to scrape metrics from the Gloo telemetry collector agent (single cluster), or from the Gloo telemetry gateway and collector agents (multicluster). To reduce cardinality in the Gloo telemetry pipeline, only a few labels are collected for each metric. For more information, see Metric labels.

Review the metrics that are available in the Gloo telemetry pipeline. You can set up additional receivers to scrape other metrics, or forward the metrics to other observability tools, such as Datadog, by creating your own custom exporter for the Gloo telemetry gateway. To find an example setup, see Forward metrics to Datadog.
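A custom receiver that scrapes other metrics can use the Prometheus receiver from the OpenTelemetry Collector, which embeds standard Prometheus `scrape_configs`. In the following sketch, the receiver name `prometheus/my-app`, the job name, the target address, and the `metrics/my-app` pipeline are hypothetical. The `batch` processor and `debug` exporter are assumed to be defined elsewhere in the same configuration; replace `debug` with the exporter for your observability tool.

```yaml
receivers:
  prometheus/my-app:                  # hypothetical receiver name
    config:
      scrape_configs:                 # standard Prometheus scrape configuration
        - job_name: my-app
          scrape_interval: 30s
          static_configs:
            - targets: ["my-app.my-namespace.svc.cluster.local:9091"]  # hypothetical target
service:
  pipelines:
    metrics/my-app:                   # hypothetical pipeline name
      receivers: [prometheus/my-app]
      processors: [batch]             # assumed to be defined elsewhere
      exporters: [debug]              # assumed to be defined elsewhere
```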

Cilium metrics

Metric	Description
`cilium_bpf_map_pressure`	The ratio of the required map size compared to its configured size. Values that are greater than or equal to 1.0 indicate that the map is full.
`cilium_drop_count_total`	The total number of dropped packets.
`cilium_endpoint_regeneration_time_stats_seconds`	The time in seconds that the Cilium agent needed to regenerate Cilium endpoints.
`cilium_identity`	The number of identities that are currently allocated.
`cilium_node_connectivity_status`	The connectivity status of each node in the cluster.
`cilium_operator_ipam_ips`	The number of IP addresses that are currently in use.
`cilium_policy_endpoint_enforcement_status`	The number of endpoints, labeled by their policy enforcement status.
`cilium_unreachable_nodes`	The number of nodes that are not reachable.
`hubble_flows_processed_total`	The total number of network flows that were processed by the Cilium agent.
`hubble_drop_total`	The total number of packets that were dropped by the Cilium agent.

Gloo management server metrics

Metric	Description
`gloo_mesh_build_snapshot_metric_time_sec`	The time in seconds for the Gloo management server to generate an output snapshot for connected Gloo agents.
`gloo_mesh_garbage_collection_time_sec`	The time in seconds that it takes for the garbage collector to clean up unused resources, such as after the custom resource translation.
`gloo_mesh_reconciler_time_sec_bucket`	The time the Gloo management server needs to sync with the Gloo agents in the workload clusters to apply the translated resources. This metric is captured in seconds for the following intervals (buckets): 1, 2, 5, 10, 15, 30, 50, 80, 100, and 200.
`gloo_mesh_redis_relation_err_total`	The number of errors that occurred during a read or write operation of relationship data to Redis.
`gloo_mesh_redis_sync_err_total`	The number of times the Gloo management server could not read from or write to the Gloo Redis instance.
`gloo_mesh_redis_write_time_sec`	The time it takes in seconds for the Gloo management server to write to the Redis database.
`gloo_mesh_relay_client_delta_pull_time_sec`	The time in seconds that it takes for a Gloo agent to receive a delta output snapshot from the Gloo management server.
`gloo_mesh_relay_client_delta_pull_err`	The number of errors that occurred while sending a delta output snapshot to a connected Gloo agent.
`gloo_mesh_relay_client_delta_push_time_sec`	The time in seconds that it takes for a Gloo agent to send a delta input snapshot to the Gloo management server.
`gloo_mesh_relay_client_delta_push_err`	The number of errors that occurred while sending a delta input snapshot from the Gloo agent to the Gloo management server.
`gloo_mesh_snapshot_upserter_op_time_sec`	The time in seconds that it takes for a snapshot to be updated or inserted (upserted) in the local memory of the Gloo management server.
`gloo_mesh_safe_mode_active`	Indicates whether safe mode is enabled in the Gloo management server. For more information, see Redis safe mode options.
`gloo_mesh_translation_time_sec_bucket`	The time the Gloo management server needs to translate Gloo resources into Istio or Envoy resources. This metric is captured in seconds for the following intervals (buckets): 1, 2, 5, 10, 15, 20, 25, 30, 45, 60, and 120.
`gloo_mesh_translator_concurrency`	The number of translation operations that the Gloo management server can perform at the same time.
`object_write_fails_total`	The number of times that the Istio control plane, istiod, rejected invalid Istio configuration that the Gloo agent tried to write to the cluster.
`relay_pull_clients_connected`	The number of Gloo agents that are connected to the Gloo management server.
`relay_push_clients_warmed`	The number of Gloo agents that are ready to accept updates from the Gloo management server.
`solo_io_gloo_gateway_license`	The number of minutes until the Gloo Mesh Gateway license expires. To prevent your management server from crashing when the license expires, make sure to upgrade the license before expiration.
`solo_io_gloo_mesh_license`	The number of minutes until the Gloo Mesh Enterprise license expires. To prevent your management server from crashing when the license expires, make sure to upgrade the license before expiration.
`solo_io_gloo_network_license`	The number of minutes until the Gloo Network for Cilium license expires. To prevent your management server from crashing when the license expires, make sure to upgrade the license before expiration.
`translation_error`	The number of translation errors that were reported by the Gloo management server.
`translation_warning`	The number of translation warnings that were reported by the Gloo management server.

Gloo telemetry pipeline metrics

Metric	Description
`otelcol_processor_refused_metric_points`	The number of metrics that were refused by the Gloo telemetry pipeline processor. For example, metrics might be refused to prevent collector agents from being overloaded in the case of insufficient memory resources.
`otelcol_receiver_refused_metric_points`	The number of metrics that were refused by the Gloo telemetry pipeline receiver. For example, metrics might be refused to prevent collector agents from being overloaded in the case of insufficient memory resources.
`otelcol_processor_refused_spans`	The number of spans that were refused by the `memory_limiter` processor in the Gloo telemetry pipeline to prevent collector agents from being overloaded.
`otelcol_exporter_queue_capacity`	The amount of telemetry data that can be stored in memory while waiting on a worker in the collector agent to become available to send the data.
`otelcol_exporter_queue_size`	The amount of telemetry data that is currently stored in memory. If the size is equal to or larger than `otelcol_exporter_queue_capacity`, new telemetry data is rejected.
`otelcol_loadbalancer_backend_latency`	The time the collector agents need to export telemetry data.
`otelcol_exporter_send_failed_spans`	The number of telemetry data spans that could not be sent to a backend.
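Several of these metrics correspond to standard OpenTelemetry Collector settings. The following generic sketch shows a `memory_limiter` processor, which refuses incoming data while memory usage is above its limit (refused data shows up in metrics such as `otelcol_processor_refused_metric_points`), and an exporter sending queue, whose `queue_size` is reported as `otelcol_exporter_queue_capacity` and whose `num_consumers` sets the number of workers that drain the queue. The values, the `otlp/example` exporter name, and its endpoint are hypothetical, and the Gloo defaults might differ.

```yaml
processors:
  memory_limiter:
    check_interval: 1s            # how often memory usage is checked
    limit_percentage: 80          # hard limit, as a percentage of available memory
    spike_limit_percentage: 20    # headroom reserved for sudden spikes
exporters:
  otlp/example:                   # hypothetical exporter name
    endpoint: collector.example.com:4317   # hypothetical backend
    sending_queue:
      enabled: true
      num_consumers: 10           # workers that send queued data
      queue_size: 1000            # queue capacity (otelcol_exporter_queue_capacity)
```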

Metric labels

To reduce cardinality in the Gloo telemetry pipeline, only the following labels are collected for each metric.

Metric group	Labels
Istio	["cluster", "collector_pod", "connection_security_policy", "destination_cluster", "destination_principal", "destination_service", "destination_workload", "destination_workload_id", "destination_workload_namespace", "gloo_mesh", "namespace", "pod_name", "reporter", "response_code", "source_cluster", "source_principal", "source_workload", "source_workload_namespace", "version", "workload_id"]
Telemetry pipeline	["app", "cluster", "collector_name", "collector_pod", "component", "exporter", "namespace", "pod_template_generation", "processor", "service_version"]
Hubble	["app", "cluster", "collector_pod", "component", "destination", "destination_cluster", "destination_pod", "destination_workload", "destination_workload_id", "destination_workload_namespace", "k8s_app", "namespace", "pod", "protocol", "source", "source_cluster", "source_pod", "source_workload", "source_workload_namespace", "subtype", "type", "verdict", "workload_id"]
Cilium*	["action", "address_type", "api_call", "app", "arch", "area", "cluster", "collector_pod", "component", "direction", "endpoint_state", "enforcement", "equal", "error", "event_type", "family", "k8s_app", "le", "level", "map_name", "method", "name", "namespace", "operation", "outcome", "path", "pod", "pod_template_generation", "protocol", "reason", "return_code", "revision", "scope", "source", "source_cluster", "source_node_name", "status", "subsystem", "target_cluster", "target_node_ip", "target_node_name", "target_node_type", "type", "valid", "value", "version"]
eBPF*	["app", "client_addr", "cluster", "code", "collector_pod", "component", "destination", "local_addr", "namespace", "pod", "pod_template_generation", "remote_identity", "server_identity", "source"]

* Collected only if the corresponding pipeline is enabled in the Gloo telemetry pipeline.
