About the telemetry pipeline
Learn about the Gloo telemetry pipeline architecture, its components, and default pipelines that you can choose from.
You can gain insights into the health and performance of your cluster components by using the Gloo telemetry pipeline. Built on top of the OpenTelemetry open source project, the Gloo telemetry pipeline helps you to collect and export telemetry data, such as metrics, logs, and Gloo insights, and to visualize this data by using Gloo observability tools.
Review the information on this page to learn more about the Gloo telemetry pipeline and how to use it in your cluster.
Setup
The Gloo telemetry pipeline is set up by default if you followed one of the installation guides:
- Use a meshctl installation profile, such as in the Get started guide.
- Use Helm to install Gloo Network.
To see the receivers, processors, and exporters that are set up by default for you, run the following commands:
kubectl get configmap gloo-telemetry-gateway-config -n gloo-mesh -o yaml
kubectl get configmap gloo-telemetry-collector-config -n gloo-mesh -o yaml
Disable the telemetry pipeline
If you want to disable the Gloo telemetry pipeline, follow the Upgrade guide and add the following configuration to your Helm values file:
telemetryCollector:
  enabled: false
telemetryGateway:
  enabled: false
Disabling the Gloo telemetry pipeline removes the Gloo telemetry gateway and collector agent pods from your cluster. If you previously collected telemetry data and did not export it to a different observability tool, all telemetry data is removed. To keep telemetry data, consider exporting it to other observability tools, such as Prometheus, Jaeger, or your own, before you disable the telemetry pipeline.
Customize the pipeline
You can customize the Gloo telemetry pipeline and set up additional receivers, processors, and exporters in your pipeline. The Gloo telemetry pipeline is set up with pre-built pipelines that use a variety of receivers, processors, and exporters to collect and store telemetry data in your cluster. You can enable and disable these pipelines as part of your Helm installation.
Because the Gloo telemetry pipeline is built on top of the OpenTelemetry open source project, you also have the option to add your own custom receivers, processors, and exporters to the pipeline. For more information, see the pipeline architecture information in the OpenTelemetry documentation.
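As a rough orientation, the following sketch shows how receivers, processors, and exporters are wired into a pipeline in the generic upstream OpenTelemetry Collector configuration format. It is not the configuration that Gloo generates for you. The pipeline name metrics/example, the endpoint, and the limiter values are placeholders chosen for illustration.

```yaml
# Generic OpenTelemetry Collector configuration sketch (not Gloo's generated config).
receivers:
  otlp:                          # accepts telemetry data over OTLP
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317   # placeholder listen address
processors:
  memory_limiter:                # refuses data when memory runs low
    check_interval: 1s
    limit_percentage: 80         # illustrative values
    spike_limit_percentage: 25
  batch: {}                      # groups data before export
exporters:
  debug: {}                      # writes telemetry to the collector log
service:
  pipelines:
    metrics/example:             # hypothetical pipeline name
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [debug]
```

Each pipeline lists components by the names that are defined in the top-level receivers, processors, and exporters sections, so a component must be declared before a pipeline can reference it.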
To add more telemetry data to the Gloo telemetry pipeline, see Customize the pipeline.
Architecture
The Gloo telemetry pipeline is decoupled from the Gloo agents and management server core functionality, and consists of two main components: the Gloo telemetry collector agent and telemetry gateway.
Flip through the cards to see how these components are set up in a single and multicluster environment.
The diagram shows the default ports that are added as prometheus.io/port: "<port_number>" pod annotations to the workloads that expose metrics. This port is automatically used by the Gloo collector agent, Gloo telemetry gateway, and Prometheus to scrape the metrics from these workloads. You can change the port by changing the pod annotation. However, keep in mind that changing the default scraping ports might lead to unexpected results, because Gloo Network processes might depend on the default setting.
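To illustrate where the annotation lives, the following sketch shows a Kubernetes Deployment whose pod template carries the prometheus.io/port annotation. The workload name, namespace, image, and port 9091 are hypothetical placeholders, not Gloo defaults.

```yaml
# Hypothetical workload; names, image, and port are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app
  namespace: example-ns
spec:
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
      annotations:
        # Port that is used to scrape metrics from this pod.
        # Annotation values must be strings, so the port is quoted.
        prometheus.io/port: "9091"
    spec:
      containers:
        - name: example-app
          image: example.com/example-app:latest
          ports:
            - containerPort: 9091   # must match the annotated port
```

The annotation belongs on the pod template, not on the Deployment metadata, because the pods are what get scraped.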
Learn more about the telemetry data that is collected in the Gloo telemetry pipeline.
Built-in telemetry pipelines
The Gloo telemetry pipeline is set up with default pipelines that you can enable to collect telemetry data in your cluster.
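As a sketch only, enabling a built-in pipeline in your Helm values file might look like the following. The telemetryCollectorCustomization.pipelines layout is an assumption about the chart structure and <pipeline_name> is a placeholder, so verify both against the Helm reference for your version before you use them.

```yaml
# Assumed Helm values layout; confirm the keys in the Helm reference.
telemetryCollectorCustomization:
  pipelines:
    <pipeline_name>:   # placeholder for a built-in pipeline name
      enabled: true
```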
Default metrics in the pipeline
By default, the Gloo telemetry pipeline is configured to scrape the metrics that are required for the Gloo UI from various workloads in your cluster by using the metrics/ui and metrics/prometheus pipelines. The built-in Prometheus server is configured to scrape metrics from the Gloo collector agent (single cluster), or Gloo telemetry gateway and collector agent (multicluster). To reduce cardinality in the Gloo telemetry pipeline, only a few labels are collected for each metric. For more information, see Metric labels.
Review the metrics that are available in the Gloo telemetry pipeline. You can set up additional receivers to scrape other metrics, or forward the metrics to other observability tools, such as Datadog, by creating your own custom exporter for the Gloo telemetry gateway. To find an example setup, see Forward metrics to Datadog.
Cilium metrics
Metric | Description |
---|---|
cilium_bpf_map_pressure | The ratio of the required map size compared to its configured size. Values that are greater than or equal to 1.0 indicate that the map is full. |
cilium_drop_count_total | The total number of dropped packets. |
cilium_endpoint_regeneration_time_stats_seconds | The total time in seconds that the Cilium agent needed to generate Cilium endpoints. |
cilium_identity | The number of identities that are currently allocated. |
cilium_node_connectivity_status | The connectivity status of each node in the cluster. |
cilium_operator_ipam_ips | The total number of IP addresses that are currently in use. |
cilium_policy_endpoint_enforcement_status | The number of endpoints that are labeled by the policy enforcement status. |
cilium_unreachable_nodes | The number of nodes that are not reachable. |
hubble_flows_processed_total | The total number of network flows that were processed by the Cilium agent. |
hubble_drop_total | The total number of packets that were dropped by the Cilium agent. |
Gloo management server metrics
Metric | Description |
---|---|
gloo_mesh_build_snapshot_metric_time_sec | The time in seconds for the Gloo management server to generate an output snapshot for connected Gloo agents. |
gloo_mesh_garbage_collection_time_sec | The time in seconds that the garbage collector needs to clean up unused resources, such as after the custom resource translation. |
gloo_mesh_reconciler_time_sec_bucket | The time the Gloo management server needs to sync with the Gloo agents in the workload clusters to apply the translated resources. This metric is captured in seconds for the following intervals (buckets): 1, 2, 5, 10, 15, 30, 50, 80, 100, and 200. |
gloo_mesh_redis_relation_err_total | The number of errors that occurred during a read or write operation of relationship data to Redis. |
gloo_mesh_redis_sync_err_total | The number of times the Gloo management server could not read from or write to the Gloo Redis instance. |
gloo_mesh_redis_write_time_sec | The time it takes in seconds for the Gloo management server to write to the Redis database. |
gloo_mesh_relay_client_delta_pull_time_sec | The time in seconds that it takes for a Gloo agent to receive a delta output snapshot from the Gloo management server. |
gloo_mesh_relay_client_delta_pull_err | The number of errors that occurred while sending a delta output snapshot to a connected Gloo agent. |
gloo_mesh_relay_client_delta_push_time_sec | The time in seconds that it takes for a Gloo agent to send a delta input snapshot to the Gloo management server. |
gloo_mesh_relay_client_delta_push_err | The number of errors that occurred while sending a delta input snapshot from the Gloo agent to the Gloo management server. |
gloo_mesh_snapshot_upserter_op_time_sec | The time in seconds that it takes for a snapshot to be updated or inserted in the Gloo management server local memory. |
gloo_mesh_safe_mode_active | Indicates whether safe mode is enabled in the Gloo management server. For more information, see Redis safe mode options. |
gloo_mesh_translation_time_sec_bucket | The time the Gloo management server needs to translate Gloo resources into Istio or Envoy resources. This metric is captured in seconds for the following intervals (buckets): 1, 2, 5, 10, 15, 20, 25, 30, 45, 60, and 120. |
gloo_mesh_translator_concurrency | The number of translation operations that the Gloo management server can perform at the same time. |
object_write_fails_total | The number of times the Gloo agent tried to write invalid Istio configuration to the cluster and the configuration was rejected by the Istio control plane, istiod. |
relay_pull_clients_connected | The number of Gloo agents that are connected to the Gloo management server. |
relay_push_clients_warmed | The number of Gloo agents that are ready to accept updates from the Gloo management server. |
solo_io_gloo_gateway_license | The number of minutes until the Gloo Mesh Gateway license expires. To prevent your management server from crashing when the license expires, make sure to upgrade the license before expiration. |
solo_io_gloo_mesh_license | The number of minutes until the Gloo Mesh Enterprise license expires. To prevent your management server from crashing when the license expires, make sure to upgrade the license before expiration. |
solo_io_gloo_network_license | The number of minutes until the Gloo Network for Cilium license expires. To prevent your management server from crashing when the license expires, make sure to upgrade the license before expiration. |
translation_error | The number of translation errors that were reported by the Gloo management server. |
translation_warning | The number of translation warnings that were reported by the Gloo management server. |
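The metrics in this table can feed Prometheus rules. The following sketch assumes that you manage your own Prometheus rules file. The group, alert, and record names, the 7-day threshold, and the 1-hour duration are illustrative choices, not Gloo defaults.

```yaml
# Hypothetical Prometheus rules file; all rule names and thresholds are illustrative.
groups:
  - name: gloo-management-server-examples
    rules:
      # The license metric reports minutes until expiration.
      # 7 days * 24 hours * 60 minutes = 10080 minutes.
      - alert: GlooNetworkLicenseExpiringSoon
        expr: solo_io_gloo_network_license < 10080
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "The Gloo Network license expires in less than 7 days"
      # Estimates the 99th percentile translation time over 5 minutes
      # from the histogram buckets.
      - record: gloo_mesh:translation_time_sec:p99_5m
        expr: histogram_quantile(0.99, sum by (le) (rate(gloo_mesh_translation_time_sec_bucket[5m])))
```

Because the quantile is interpolated from the bucket boundaries, the result is an estimate whose precision depends on how closely the buckets bracket the actual translation times.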
Gloo telemetry pipeline metrics
Metric | Description |
---|---|
otelcol_processor_refused_metric_points | The number of metrics that were refused by the Gloo telemetry pipeline processor. For example, metrics might be refused to prevent collector agents from being overloaded in the case of insufficient memory resources. |
otelcol_receiver_refused_metric_points | The number of metrics that were refused by the Gloo telemetry pipeline receiver. For example, metrics might be refused to prevent collector agents from being overloaded in the case of insufficient memory resources. |
otelcol_processor_refused_spans | The number of spans that were refused by the memory_limiter in the Gloo telemetry pipeline to prevent collector agents from being overloaded. |
otelcol_exporter_queue_capacity | The amount of telemetry data that can be stored in memory while waiting on a worker in the collector agent to become available to send the data. |
otelcol_exporter_queue_size | The amount of telemetry data that is currently stored in memory. If the size is equal to or larger than otelcol_exporter_queue_capacity, new telemetry data is rejected. |
otelcol_loadbalancer_backend_latency | The time the collector agents need to export telemetry data. |
otelcol_exporter_send_failed_spans | The number of telemetry data spans that could not be sent to a backend. |
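To act on these metrics before telemetry data is rejected, you might alert on queue saturation and refused data. The following Prometheus rules are a sketch. The rule names, the 80% threshold, and the durations are illustrative. The sketch also assumes that the queue size and capacity series carry identical label sets, and that the refused metric is exposed as a counter.

```yaml
# Hypothetical Prometheus rules file; names, thresholds, and durations are illustrative.
groups:
  - name: gloo-telemetry-pipeline-examples
    rules:
      # Warns when an exporter queue is more than 80% full.
      # Assumes both series share the same labels so that they match one-to-one.
      - alert: TelemetryExporterQueueNearlyFull
        expr: otelcol_exporter_queue_size / otelcol_exporter_queue_capacity > 0.8
        for: 10m
        labels:
          severity: warning
      # Warns when a receiver keeps refusing metric points.
      # Assumes the metric is a counter, so rate() applies.
      - alert: TelemetryMetricPointsRefused
        expr: rate(otelcol_receiver_refused_metric_points[5m]) > 0
        for: 10m
        labels:
          severity: warning
```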
Metric labels
To reduce cardinality in the Gloo telemetry pipeline, only the following labels are collected for each metric.
Metric group | Labels |
---|---|
Istio | [“cluster”, “collector_pod”, “connection_security_policy”, “destination_cluster”, “destination_principal”, “destination_service”, “destination_workload”, “destination_workload_id”, “destination_workload_namespace”, “gloo_mesh”, “namespace”, “pod_name”, “reporter”, “response_code”, “source_cluster”, “source_principal”, “source_workload”, “source_workload_namespace”, “version”, “workload_id”] |
Telemetry pipeline | [“app”, “cluster”, “collector_name”, “collector_pod”, “component”, “exporter”, “namespace”, “pod_template_generation”, “processor”, “service_version”] |
Hubble | [“app”, “cluster”, “collector_pod”, “component”, “destination”, “destination_cluster”, “destination_pod”, “destination_workload”, “destination_workload_id”, “destination_workload_namespace”, “k8s_app”, “namespace”, “pod”, “protocol”, “source”, “source_cluster”, “source_pod”, “source_workload”, “source_workload_namespace”, “subtype”, “type”, “verdict”, “workload_id”] |
Cilium* | [“action”, “address_type”, “api_call”, “app”, “arch”, “area”, “cluster”, “collector_pod”, “component”, “direction”, “endpoint_state”, “enforcement”, “equal”, “error”, “event_type”, “family”, “k8s_app”, “le”, “level”, “map_name”, “method”, “name”, “namespace”, “operation”, “outcome”, “path”, “pod”, “pod_template_generation”, “protocol”, “reason”, “return_code”, “revision”, “scope”, “source”, “source_cluster”, “source_node_name”, “status”, “subsystem”, “target_cluster”, “target_node_ip”, “target_node_name”, “target_node_type”, “type”, “valid”, “value”, “version”] |
eBPF* | [“app”, “client_addr”, “cluster”, “code”, “collector_pod”, “component”, “destination”, “local_addr”, “namespace”, “pod”, “pod_template_generation”, “remote_identity”, “server_identity”, “source”] |
* If enabled in the Gloo telemetry pipeline.