Observability pipeline
Debug the Gloo Mesh Enterprise observability pipeline.
If you run into an issue with the Gloo telemetry pipeline, you can use the following sections to start debugging the issue.
In addition, check out the following resources from the upstream OpenTelemetry project:
- Upstream OpenTelemetry troubleshooting guide
- Recommended metrics and alerts to monitor the pipeline’s health
Change the default log level
To start troubleshooting issues in your pipeline and to inspect the data that is processed by your collectors, you can change your pipeline log level to debug mode. The debug log level gives detailed information about the data that is received, processed, and exported by your pipeline.
Add the following log level settings to your Helm values file. For single-cluster setups, these sections are configured in the same values file for your installation Helm release. For multicluster setups, configure these sections in separate values files for the management and workload clusters.
- Helm release for the management server:
telemetryGatewayCustomization:
  telemetry:
    logs:
      level: "debug"
- Helm release for the agent:
telemetryCollectorCustomization:
  telemetry:
    logs:
      level: "debug"
Follow the Upgrade guide to apply the changes in your environment.
Verify that the configmap for the telemetry gateway or collector agent pods is updated with the values you set in the values file.
kubectl get configmap gloo-telemetry-gateway-config -n gloo-mesh -o yaml --context ${MGMT_CONTEXT}
kubectl get configmap gloo-telemetry-collector-config -n gloo-mesh -o yaml --context ${REMOTE_CONTEXT}
Perform a rollout restart of the gateway deployment or the collector daemon set to force your configmap changes to be applied in the telemetry gateway or collector agent pods.
kubectl rollout restart -n gloo-mesh deployment/gloo-telemetry-gateway --context ${MGMT_CONTEXT}
kubectl rollout restart -n gloo-mesh daemonset/gloo-telemetry-collector-agent --context ${REMOTE_CONTEXT}
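After the restart, one way to confirm that the debug level reached the rendered collector configuration is to search the configmap for the logs block. The following is a minimal sketch, not part of the Gloo tooling. The helper name show_log_level is hypothetical, and the sketch assumes that the rendered configuration places a level: key on the line directly after a logs: key.

```shell
# Print the "level:" line that directly follows a "logs:" key in YAML read from stdin.
# Assumption: the rendered collector config keeps "level:" on the line right after "logs:".
show_log_level() {
  grep -A 1 'logs:' | grep 'level:' | sed 's/^[[:space:]]*//'
}

# Example usage (requires cluster access):
#   kubectl get configmap gloo-telemetry-gateway-config -n gloo-mesh -o yaml \
#     --context ${MGMT_CONTEXT} | show_log_level
```

If the helper prints nothing, the configmap does not contain a logs block in the assumed layout, and you can inspect the full configmap output instead.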
Monitor the health of receivers, exporters, and processors
The Gloo OpenTelemetry pipeline comes with built-in metrics that you can use to monitor the health of your pipeline. For example, you can use these metrics to verify that metrics are being scraped by the telemetry collector agents and sent to the telemetry gateway. You can also verify that the telemetry gateway exposes the metrics that were sent from the collector agents. Pipeline metrics are automatically populated in the Gloo operations dashboard.
Verify that metrics are scraped by the Gloo telemetry collector agents
The Gloo telemetry collector agents record statistics for the metrics that they scrape. You can access these statistics on the /metrics path on port 8888. To verify that metrics are being scraped by the collector agents, monitor the following metrics:
- otelcol_receiver_accepted_metric_points: The number of metric points that were successfully pushed into the telemetry pipeline.
- otelcol_receiver_refused_metric_points: The number of metric points that could not be pushed into the telemetry pipeline.
To access these metrics:
Get the name of a Gloo telemetry collector agent pod in your cluster.
OTEL_AGENT_NAME=$(kubectl get pods -n gloo-mesh --context ${REMOTE_CONTEXT} \
  -l component=agent-collector \
  -o jsonpath="{.items[0].metadata.name}")
Log in to the pod and retrieve the otelcol_receiver* metrics from the /metrics endpoint. In your CLI output, verify that the number of metric points that were successfully scraped (otelcol_receiver_accepted_metric_points) is higher than the number that failed to be scraped (otelcol_receiver_refused_metric_points).
kubectl -n gloo-mesh debug --context ${REMOTE_CONTEXT} -q -i $OTEL_AGENT_NAME \
  --image=nicolaka/netshoot -- \
  curl localhost:8888/metrics | grep otelcol_receiver
Example output:
# HELP otelcol_receiver_accepted_metric_points Number of metric points successfully pushed into the pipeline.
# TYPE otelcol_receiver_accepted_metric_points counter
otelcol_receiver_accepted_metric_points{receiver="prometheus",service_instance_id="bcd42792-3ded-40ba-bbc8-e10bb8551b96",service_name="gloo-otel-collector",service_version="2.3.0",transport="http"} 86996
# HELP otelcol_receiver_refused_metric_points Number of metric points that could not be pushed into the pipeline.
# TYPE otelcol_receiver_refused_metric_points counter
otelcol_receiver_refused_metric_points{receiver="prometheus",service_instance_id="bcd42792-3ded-40ba-bbc8-e10bb8551b96",service_name="gloo-otel-collector",service_version="2.3.0",transport="http"} 0
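If you want to script the comparison of accepted and refused metric points instead of reading the counters by eye, the following sketch sums each counter across all label sets in a saved copy of the /metrics output. The helper names sum_metric and check_receiver are hypothetical, and the sketch assumes that each sample line ends with its value (no trailing timestamp).

```shell
# Sum every sample of one metric from Prometheus text exposition on stdin.
# Comment lines (# HELP, # TYPE) are skipped; the value is taken from the last field.
sum_metric() {
  awk -v name="$1" '
    /^#/ { next }
    {
      i = index($1, "{")                            # position of the label set, if any
      metric = (i > 0) ? substr($1, 1, i - 1) : $1  # metric name without labels
      if (metric == name) total += $NF
    }
    END { printf "%.0f\n", total }
  '
}

# Print both totals for a saved metrics file and succeed only if
# accepted points exceed refused points.
check_receiver() {
  accepted=$(sum_metric otelcol_receiver_accepted_metric_points < "$1")
  refused=$(sum_metric otelcol_receiver_refused_metric_points < "$1")
  echo "accepted=$accepted refused=$refused"
  [ "$accepted" -gt "$refused" ]
}

# Example usage: save the curl output from the previous step to a file,
# then run: check_receiver metrics.txt
```

Because check_receiver returns a non-zero exit status when refused points match or exceed accepted points, you can use it as a gate in a larger health-check script.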
Verify that metrics are sent to the telemetry gateway
If metrics were successfully scraped by the Gloo telemetry collector agents, you can verify that these metrics were successfully sent to the telemetry gateway by using the following metrics:
- otelcol_exporter_sent_metric_points: The number of metric points that were successfully sent to the Gloo telemetry gateway.
- otelcol_exporter_send_failed_metric_points: The number of metric points that could not be sent to the Gloo telemetry gateway.
To access these metrics:
Get the name of a Gloo telemetry collector agent pod in your cluster.
OTEL_AGENT_NAME=$(kubectl get pods -n gloo-mesh --context ${REMOTE_CONTEXT} \
  -l component=agent-collector \
  -o jsonpath="{.items[0].metadata.name}")
Log in to the pod and retrieve the otelcol_exporter* metrics from the /metrics endpoint. In your CLI output, verify that the number of metric points that were successfully sent to the Gloo telemetry gateway (otelcol_exporter_sent_metric_points) does not equal 0. If this number equals 0, no metrics were sent to the telemetry gateway. If a collector agent cannot send metrics to the telemetry gateway, the telemetry gateway endpoint might be incorrectly configured in the collector agents.
kubectl -n gloo-mesh debug --context ${REMOTE_CONTEXT} -q -i $OTEL_AGENT_NAME \
  --image=nicolaka/netshoot -- \
  curl localhost:8888/metrics | grep "otelcol_exporter.*metric_points"
Example output:
# HELP otelcol_exporter_enqueue_failed_metric_points Number of metric points failed to be added to the sending queue.
# TYPE otelcol_exporter_enqueue_failed_metric_points counter
otelcol_exporter_enqueue_failed_metric_points{exporter="otlp",service_instance_id="bcd42792-3ded-40ba-bbc8-e10bb8551b96",service_name="gloo-otel-collector",service_version="2.3.0"} 0
# HELP otelcol_exporter_send_failed_metric_points Number of metric points in failed attempts to send to destination.
# TYPE otelcol_exporter_send_failed_metric_points counter
otelcol_exporter_send_failed_metric_points{exporter="otlp",service_instance_id="bcd42792-3ded-40ba-bbc8-e10bb8551b96",service_name="gloo-otel-collector",service_version="2.3.0"} 0
# HELP otelcol_exporter_sent_metric_points Number of metric points successfully sent to destination.
# TYPE otelcol_exporter_sent_metric_points counter
otelcol_exporter_sent_metric_points{exporter="otlp",service_instance_id="bcd42792-3ded-40ba-bbc8-e10bb8551b96",service_name="gloo-otel-collector",service_version="2.3.0"} 9380
If the number of metrics that were successfully sent to the telemetry gateway (otelcol_exporter_sent_metric_points) equals 0, verify that the correct gateway endpoint was configured in the collector agents.
Get the telemetry gateway endpoint that was configured in the collector agents.
kubectl -n gloo-mesh get cm gloo-telemetry-collector-config -o yaml --context ${REMOTE_CONTEXT} | \
  grep endpoint -C 5
Example output:
    limit_percentage: 85
    spike_limit_percentage: 10
exporters:
  otlp:
    endpoint: gloo-telemetry-gateway.gloo-mesh:4317
    tls:
      ca_file: /etc/otel-certs/ca.crt
      server_name_override: gloo-telemetry-gateway.gloo-mesh
extensions:
Verify that the telemetry gateway port that you retrieved in the previous step matches the port that the telemetry gateway listens on.
kubectl -n gloo-mesh get svc gloo-telemetry-gateway --context ${REMOTE_CONTEXT}
Example output:
NAME                     TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
gloo-telemetry-gateway   ClusterIP   10.235.245.101   <none>        4317/TCP   5d23h
If the ports match, get the logs of the collector agent pod.
kubectl -n gloo-mesh --context ${REMOTE_CONTEXT} logs $OTEL_AGENT_NAME | grep error | grep grpc
Look for connection errors, such as the following.
2023-01-09T17:52:21.936Z INFO grpclog/grpclog.go:37 [core]pickfirstBalancer: UpdateSubConnState: 0xc001412990, {IDLE connection error: desc = "transport: Error while dialing dial tcp 172.18.0.2:32278: connect: connection refused"} {"system": "grpc", "grpc_log": true}
If you find that the endpoint is incorrectly configured, update that endpoint in the collector agent configmap.
Open the collector agent configmap.
kubectl edit configmap gloo-telemetry-collector-config -n gloo-mesh --context ${REMOTE_CONTEXT}
Update the endpoint and save your changes.
Perform a rolling restart of the collector agent pods.
kubectl rollout restart -n gloo-mesh --context ${REMOTE_CONTEXT} daemonset/gloo-telemetry-collector-agent
Follow the steps to get the Helm chart values of your current installation and make sure to save the telemetry endpoint in the telemetryCollector.config.exporters.otlp.endpoint field for future upgrades and installations.
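Two of the checks in the preceding steps, comparing the endpoint port and finding the address that the agent actually dialed, can be sketched as small shell helpers. The names endpoint_port and dial_targets are hypothetical, and dial_targets assumes IPv4 targets in the "dial tcp" form shown in the example log line.

```shell
# Extract the port from a host:port endpoint such as the exporter endpoint.
# If the argument contains no colon, the whole argument is printed unchanged.
endpoint_port() {
  printf '%s\n' "${1##*:}"
}

# List the unique IPv4 address:port targets from "dial tcp" errors read on stdin.
dial_targets() {
  sed -n 's/.*dial tcp \([0-9.]*:[0-9]*\).*/\1/p' | sort -u
}

# Example usage (requires cluster access):
#   endpoint_port gloo-telemetry-gateway.gloo-mesh:4317
#   kubectl -n gloo-mesh get svc gloo-telemetry-gateway --context ${REMOTE_CONTEXT} \
#     -o jsonpath='{.spec.ports[*].port}'
#   kubectl -n gloo-mesh --context ${REMOTE_CONTEXT} logs $OTEL_AGENT_NAME | dial_targets
```

If the port printed by endpoint_port does not appear in the service port list, or if dial_targets reports an address that you do not expect, the endpoint in the collector agent configmap is a likely cause.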
Verify that the Gloo telemetry gateway exposes scraped metrics
If metrics are being successfully scraped from your pods and sent to the telemetry gateway, you can access these metrics on the /metrics path on port 9091 of the telemetry gateway.
Get the name of the telemetry gateway pod.
TELEMETRY_GATEWAY_NAME=$(kubectl get pods --context ${MGMT_CONTEXT} -n gloo-mesh -l component=standalone-collector -o jsonpath="{.items[0].metadata.name}")
Log in to the telemetry gateway pod and retrieve the metrics that you are interested in. For example, to see the number of requests that were received by the ingress gateway, you can look at the istio_requests_total metric.
kubectl -n gloo-mesh debug --context ${MGMT_CONTEXT} -q -i $TELEMETRY_GATEWAY_NAME --image=nicolaka/netshoot -- curl localhost:9091/metrics | grep istio_requests_total
Example output:
istio_requests_total{cluster="cluster-1",connection_security_policy="unknown",destination_workload_id="unknown.unknown.unknown",install_operator_istio_io_owning_resource="unknown",instance="10.232.0.52:15020",istio="ingressgateway",istio_io_rev="1-20",job="mesh-workloads",namespace="gloo-mesh-gateways",operator_istio_io_component="IngressGateways",pod_name="istio-ingressgateway-1-20-6cfb94798b-dt6fv",reporter="source",response_code="429",revision="1-20",sidecar_istio_io_inject="true",workload_id="istio-ingressgateway-1-20.gloo-mesh-gateways.graham0-portal"} 6
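The gateway usually exposes many istio_requests_total series, so a per-response-code summary can be easier to read than the raw output. The following sketch totals the samples by the response_code label. The helper name requests_by_code is hypothetical, and the sketch assumes that each sample line ends with its value.

```shell
# Sum istio_requests_total samples per response_code from metrics read on stdin.
requests_by_code() {
  awk '
    index($0, "istio_requests_total{") == 1 {
      if (match($0, /response_code="[^"]*"/)) {
        # The match covers response_code="VALUE"; skip the 15-character
        # prefix and the closing quote to keep only VALUE.
        code = substr($0, RSTART + 15, RLENGTH - 16)
        total[code] += $NF
      }
    }
    END { for (code in total) printf "%s %.0f\n", code, total[code] }
  ' | sort
}

# Example usage: pipe the curl output from the previous step into the helper
# instead of into grep:
#   ... curl localhost:9091/metrics | requests_by_code
```

Each output line contains a response code followed by the total request count for that code across all series.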
View telemetry pipeline metrics in the operations dashboard
You can use the operations dashboard to quickly access metrics for the Gloo telemetry pipeline.
- Import the Gloo operations dashboard.
- In the Gloo Telemetry Pipeline card, look for increasing timeouts, failures, or other errors for any of the pipeline receivers, processors, or exporters.
For an overview of recommended metrics and alerts that you can use to monitor the pipeline’s health, see the OpenTelemetry documentation.
Changes in the telemetry gateway or collector agent configmap are not applied
The upstream OpenTelemetry project currently does not support reloading configuration changes dynamically and applying them in the gateway or collector agent pods. If you updated the configmap of the telemetry gateway or the collector agents, and the changes are not applied in the respective pods, you must perform a rollout restart of the gateway deployment or collector daemon set to apply the new changes.
kubectl rollout restart -n gloo-mesh --context ${MGMT_CONTEXT} deployment/gloo-telemetry-gateway
kubectl rollout restart -n gloo-mesh --context ${REMOTE_CONTEXT} daemonset/gloo-telemetry-collector-agent
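A restart returns immediately, so it can help to wait until the new pods are ready before you check whether the configuration change took effect. The following sketch pairs the restart with kubectl rollout status. The helper name restart_and_wait and the 120s default timeout are assumptions for illustration.

```shell
# Restart a workload in the gloo-mesh namespace and block until the rollout
# completes or the timeout expires.
# Arguments: kube context, resource (for example deployment/NAME), optional timeout.
restart_and_wait() {
  kubectl rollout restart -n gloo-mesh --context "$1" "$2" &&
    kubectl rollout status -n gloo-mesh --context "$1" "$2" --timeout="${3:-120s}"
}

# Example usage (requires cluster access):
#   restart_and_wait "${MGMT_CONTEXT}" deployment/gloo-telemetry-gateway
#   restart_and_wait "${REMOTE_CONTEXT}" daemonset/gloo-telemetry-collector-agent 300s
```

The function returns a non-zero exit status if either the restart or the rollout fails, so a calling script can stop before it runs further checks.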