If you run into an issue with the Gloo telemetry pipeline, you can use the following resources to start debugging the issue:

In addition, check out the following resources from the upstream OpenTelemetry project:

Change the default log level

To start troubleshooting issues in your pipeline and to inspect the data that is processed by your collectors, you can change your pipeline log level to debug mode. The debug log level gives detailed information about the data that is received, processed, and exported by your pipeline.

  1. Add the following log level settings to your Helm values file. For single-cluster setups, these sections are configured in the same values file for your installation Helm release. For multicluster setups, configure these sections in separate values files for the management and workload clusters.

    • Helm release for the management server:
        telemetryGatewayCustomization:
          telemetry:
            logs:
              level: "debug"
        
    • Helm release for the agent:
        telemetryCollectorCustomization:
          telemetry:
            logs:
              level: "debug"
        
  2. Follow the Upgrade guide to apply the changes in your environment.

  3. Verify that the configmap for the telemetry gateway or collector agent pods is updated with the values you set in the values file.

      kubectl get configmap gloo-telemetry-gateway-config -n gloo-mesh -o yaml --context ${MGMT_CONTEXT}
      kubectl get configmap gloo-telemetry-collector-config -n gloo-mesh -o yaml --context ${REMOTE_CONTEXT}
      
  4. Perform a rollout restart of the gateway deployment or the collector daemon set to force your configmap changes to be applied in the telemetry gateway or collector agent pods.

      kubectl rollout restart -n gloo-mesh deployment/gloo-telemetry-gateway --context ${MGMT_CONTEXT}
      
      kubectl rollout restart -n gloo-mesh daemonset/gloo-telemetry-collector-agent --context ${REMOTE_CONTEXT}
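
To check the configmap from step 3 without reading the whole file, you can filter the rendered configuration for the log level. The following is a minimal sketch, not part of the Gloo tooling: the helper name collector_log_level is hypothetical, and it assumes that the rendered configuration follows the upstream OpenTelemetry Collector layout, in which the level is a key inside a logs: block.

```shell
# Hypothetical helper: print the log level found in a collector config read from stdin.
# Assumption: the level is a "level:" key nested inside a "logs:" block, as in the
# upstream OpenTelemetry Collector setting service.telemetry.logs.level.
collector_log_level() {
  awk '
    { match($0, /^ */); indent = RLENGTH }           # indentation of the current line
    /^ *logs: *$/ { in_logs = 1; logs_indent = indent; next }
    in_logs && NF && indent <= logs_indent { in_logs = 0 }   # left the logs: block
    in_logs && $1 == "level:" {
      gsub(/"/, "", $2)                              # strip optional double quotes
      print $2
      exit
    }
  '
}
```

For example, to print the level that is set for the telemetry gateway:

      kubectl get configmap gloo-telemetry-gateway-config -n gloo-mesh -o yaml --context ${MGMT_CONTEXT} | collector_log_level

If the helper prints nothing, no level was found in a logs: block, which might mean that the Helm values were not applied.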
      

Monitor the health of receivers, exporters, and processors

The Gloo OpenTelemetry pipeline comes with built-in metrics that you can use to monitor the health of your pipeline. For example, you can use these metrics to verify that metrics are being scraped by the telemetry collector agents and sent to the telemetry gateway. You can also verify that the telemetry gateway exposes the metrics that were sent from the collector agents. Pipeline metrics are automatically populated to the Gloo operations dashboard.

Verify that metrics are scraped by the Gloo telemetry collector agents

The Gloo telemetry collector agents record statistics about the metrics that they scrape. You can access these statistics on the /metrics path on port 8888. To verify that metrics are being scraped by the collector agents, monitor the following metrics:

  • otelcol_receiver_accepted_metric_points: The number of metric points that are successfully pushed into the telemetry pipeline.
  • otelcol_receiver_refused_metric_points: The number of metric points that could not be pushed into the telemetry pipeline.

To access these metrics:

  1. Get the name of a Gloo telemetry collector agent pod in your cluster.

      OTEL_AGENT_NAME=$(kubectl get pods -n gloo-mesh --context ${REMOTE_CONTEXT} \
      -l component=agent-collector \
      -o jsonpath="{.items[0].metadata.name}")
      
  2. Log in to the pod and retrieve the otelcol_receiver* metrics from the /metrics endpoint. In your CLI output, verify that the number of metric points that were successfully scraped (otelcol_receiver_accepted_metric_points) is higher than the number of metric points that failed to be scraped (otelcol_receiver_refused_metric_points).

      kubectl -n gloo-mesh debug --context ${REMOTE_CONTEXT} -q -i $OTEL_AGENT_NAME \
      --image=nicolaka/netshoot -- \
      curl localhost:8888/metrics | grep otelcol_receiver
      

    Example output:

       # HELP otelcol_receiver_accepted_metric_points Number of metric points successfully pushed into the pipeline.
       # TYPE otelcol_receiver_accepted_metric_points counter
       otelcol_receiver_accepted_metric_points{receiver="prometheus",service_instance_id="bcd42792-3ded-40ba-bbc8-e10bb8551b96",service_name="gloo-otel-collector",service_version="2.3.0",transport="http"} 86996
       # HELP otelcol_receiver_refused_metric_points Number of metric points that could not be pushed into the pipeline.
       # TYPE otelcol_receiver_refused_metric_points counter
       otelcol_receiver_refused_metric_points{receiver="prometheus",service_instance_id="bcd42792-3ded-40ba-bbc8-e10bb8551b96",service_name="gloo-otel-collector",service_version="2.3.0",transport="http"} 0
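
When an agent runs several receivers, each counter appears once per label set, so comparing the totals by eye becomes tedious. The following sketch sums one counter across all label sets. The helper name sum_metric is hypothetical, and the sketch assumes that each sample line ends with the value and has no trailing timestamp, as in the example output above.

```shell
# Hypothetical helper: sum the values of one counter across all of its label sets.
# Reads Prometheus text-format output (such as the /metrics output) from stdin.
# Assumption: the sample value is the last field on the line (no timestamp).
sum_metric() {
  awk -v name="$1" '
    /^#/ { next }                        # skip HELP and TYPE comment lines
    {
      metric = $1
      sub(/\{.*/, "", metric)            # drop the label set to get the bare name
      if (metric == name) total += $NF   # add the sample value
    }
    END { printf "%d\n", total }
  '
}
```

For example, to print the total number of refused metric points for the agent:

      kubectl -n gloo-mesh debug --context ${REMOTE_CONTEXT} -q -i $OTEL_AGENT_NAME \
      --image=nicolaka/netshoot -- \
      curl -s localhost:8888/metrics | sum_metric otelcol_receiver_refused_metric_points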
       

Verify that metrics are sent to the telemetry gateway

If metrics were successfully scraped by the Gloo telemetry collector agents, you can verify that these metrics were successfully sent to the telemetry gateway by using the following metrics:

  • otelcol_exporter_sent_metric_points: The number of metric points that were successfully sent to the Gloo telemetry gateway.
  • otelcol_exporter_send_failed_metric_points: The number of metric points that could not be sent to the Gloo telemetry gateway.

To access these metrics:

  1. Get the name of a Gloo telemetry collector agent pod in your cluster.

      OTEL_AGENT_NAME=$(kubectl get pods -n gloo-mesh --context ${REMOTE_CONTEXT} \
      -l component=agent-collector \
      -o jsonpath="{.items[0].metadata.name}")
      
  2. Log in to the pod and retrieve the otelcol_exporter* metrics from the /metrics endpoint. In your CLI output, verify that the number of metric points that were successfully sent to the Gloo telemetry gateway (otelcol_exporter_sent_metric_points) is greater than 0. If this number equals 0, no metrics were sent to the telemetry gateway, and the telemetry gateway endpoint might be incorrectly configured in the collector agents.

      kubectl -n gloo-mesh debug --context ${REMOTE_CONTEXT} -q -i $OTEL_AGENT_NAME \
      --image=nicolaka/netshoot -- \
      curl localhost:8888/metrics | grep "otelcol_exporter.*metric_points"
      

    Example output:

       # HELP otelcol_exporter_enqueue_failed_metric_points Number of metric points failed to be added to the sending queue.
       # TYPE otelcol_exporter_enqueue_failed_metric_points counter
       otelcol_exporter_enqueue_failed_metric_points{exporter="otlp",service_instance_id="bcd42792-3ded-40ba-bbc8-e10bb8551b96",service_name="gloo-otel-collector",service_version="2.3.0"} 0
       # HELP otelcol_exporter_send_failed_metric_points Number of metric points in failed attempts to send to destination.
       # TYPE otelcol_exporter_send_failed_metric_points counter
       otelcol_exporter_send_failed_metric_points{exporter="otlp",service_instance_id="bcd42792-3ded-40ba-bbc8-e10bb8551b96",service_name="gloo-otel-collector",service_version="2.3.0"} 0
       # HELP otelcol_exporter_sent_metric_points Number of metric points successfully sent to destination.
       # TYPE otelcol_exporter_sent_metric_points counter
       otelcol_exporter_sent_metric_points{exporter="otlp",service_instance_id="bcd42792-3ded-40ba-bbc8-e10bb8551b96",service_name="gloo-otel-collector",service_version="2.3.0"} 9380
       

  3. If the number of metrics that were successfully sent to the telemetry gateway (otelcol_exporter_sent_metric_points) equals 0, verify that the correct gateway endpoint was configured in the collector agents.

    1. Get the telemetry gateway endpoint that was configured in the collector agents.

        kubectl -n gloo-mesh get cm gloo-telemetry-collector-config -o yaml --context ${REMOTE_CONTEXT} | \
        grep endpoint -C 5
        

      Example output:

                  limit_percentage: 85
                  spike_limit_percentage: 10
      
              exporters:
                otlp:
                  endpoint: gloo-telemetry-gateway.gloo-mesh:4317
                  tls:
                    ca_file: /etc/otel-certs/ca.crt
                    server_name_override: gloo-telemetry-gateway.gloo-mesh
      
              extensions:
            

    2. Verify that the telemetry gateway port that you retrieved in the previous step matches the port that the telemetry gateway listens on.

        kubectl -n gloo-mesh get svc gloo-telemetry-gateway --context ${MGMT_CONTEXT}
        

      Example output:

        NAME                     TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
        gloo-telemetry-gateway   ClusterIP   10.235.245.101   <none>        4317/TCP   5d23h
        
    3. If the ports match, get the logs of the collector agent pod.

        kubectl -n gloo-mesh --context ${REMOTE_CONTEXT} logs $OTEL_AGENT_NAME | grep error | grep grpc
        
    4. Look for connection errors, such as the following.

        2023-01-09T17:52:21.936Z	INFO	grpclog/grpclog.go:37	[core]pickfirstBalancer: UpdateSubConnState: 0xc001412990, {IDLE connection error: desc = "transport: Error while dialing dial tcp 172.18.0.2:32278: connect: connection refused"}	{"system": "grpc", "grpc_log": true}
        
  4. If you find that the endpoint is incorrectly configured, update that endpoint in the collector agent configmap.

    1. Open the collector agent configmap.

        kubectl edit configmap gloo-telemetry-collector-config -n gloo-mesh --context ${REMOTE_CONTEXT} 
        
    2. Update the endpoint and save your changes.

    3. Perform a rolling restart of the collector agent pods.

        kubectl rollout restart -n gloo-mesh --context ${REMOTE_CONTEXT} daemonset/gloo-telemetry-collector-agent 
        
    4. Follow the steps to get the Helm chart values (/gloo-network/main//setup/upgrade/) of your current installation, and make sure to save the telemetry endpoint in the telemetryCollector.config.exporters.otlp.endpoint field for future upgrades and installations.
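
The port comparison in step 3 can be scripted. The following is a minimal sketch with two hypothetical helpers, exporter_endpoint_port and port_is_served. It assumes that the first endpoint: key after the exporters: line in the configmap belongs to the otlp exporter, as in the example output above. Adjust the helper if your configuration lists other exporters first.

```shell
# Hypothetical helper: print the port of the first "endpoint: host:port" entry
# that follows the "exporters:" line in a collector config read from stdin.
# Assumption: that first entry is the otlp exporter that targets the gateway.
exporter_endpoint_port() {
  awk '
    /^ *exporters: *$/ { in_exporters = 1; next }
    in_exporters && $1 == "endpoint:" {
      port = $2
      gsub(/"/, "", port)     # strip optional double quotes
      sub(/.*:/, "", port)    # keep only what follows the last colon
      print port
      exit
    }
  '
}

# Hypothetical helper: succeed if port $1 appears in the space-separated list $2.
port_is_served() {
  case " $2 " in
    *" $1 "*) return 0 ;;
    *) return 1 ;;
  esac
}
```

For example, with the gateway service read from the cluster where the gateway deployment runs (the restart commands in this guide address it with ${MGMT_CONTEXT}):

        configured=$(kubectl -n gloo-mesh get cm gloo-telemetry-collector-config -o yaml --context ${REMOTE_CONTEXT} | exporter_endpoint_port)
        served=$(kubectl -n gloo-mesh get svc gloo-telemetry-gateway --context ${MGMT_CONTEXT} -o jsonpath='{.spec.ports[*].port}')
        port_is_served "$configured" "$served" || echo "Configured port $configured is not exposed by the gateway service"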

Verify that the Gloo telemetry gateway exposes scraped metrics

If metrics are being successfully scraped from your pods and sent to the telemetry gateway, you can access these metrics on the /metrics path on port 9091 of the telemetry gateway.

  1. Get the name of the telemetry gateway pod.

      TELEMETRY_GATEWAY_NAME=$(kubectl get pods --context ${MGMT_CONTEXT} -n gloo-mesh -l component=standalone-collector -o jsonpath="{.items[0].metadata.name}")
      
  2. Log in to the telemetry gateway pod and retrieve the metrics that you are interested in.
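
One way to carry out step 2 is sketched below. It mirrors the debug container commands that this guide uses for the collector agents and is an assumption rather than a documented procedure: the function names gateway_metrics and metric_names are hypothetical, and the port and path are taken from the description above.

```shell
# Sketch: fetch the raw metrics from the telemetry gateway pod.
# Mirrors the debug container approach used for the collector agents;
# expects MGMT_CONTEXT and TELEMETRY_GATEWAY_NAME to be set as in step 1.
gateway_metrics() {
  kubectl -n gloo-mesh debug --context "${MGMT_CONTEXT}" -q -i "${TELEMETRY_GATEWAY_NAME}" \
    --image=nicolaka/netshoot -- \
    curl -s localhost:9091/metrics
}

# Hypothetical helper: list the unique metric names in Prometheus text-format input.
metric_names() {
  grep -Ev '^(#|$)' | sed 's/[{ ].*//' | sort -u
}
```

To see which metrics the gateway exposes, run gateway_metrics | metric_names. To inspect a single metric, pipe the output of gateway_metrics to grep with the name of that metric.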

View telemetry pipeline metrics in the operations dashboard

You can use the operations dashboard to quickly access metrics for the Gloo telemetry pipeline.

  1. Import the Gloo operations dashboard (/gloo-network/main//telemetry/grafana/operations-dashboard/).
  2. In the Gloo Telemetry Pipeline card, look for increasing timeouts, failures, or other errors for any of the pipeline receivers, processors, or exporters.

For an overview of recommended metrics and alerts that you can use to monitor the pipeline’s health, see the OpenTelemetry documentation.

Changes in the metrics or collector agent configmap are not applied

The upstream OpenTelemetry project currently does not support reloading configuration changes dynamically and applying them in the gateway or collector agent pods. If you updated the configmap of the telemetry gateway or the collector agents, and the changes are not applied in the respective pods, you must perform a rollout restart of the gateway deployment or collector daemon set to apply the new changes.

  kubectl rollout restart -n gloo-mesh --context ${MGMT_CONTEXT} deployment/gloo-telemetry-gateway 
  
  kubectl rollout restart -n gloo-mesh --context ${REMOTE_CONTEXT} daemonset/gloo-telemetry-collector-agent
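
A restart returns before the new pods are ready, so it can help to wait for the rollout to finish before you check whether the change took effect. The following sketch wraps both steps. The helper name restart_and_wait and the 120 second timeout are illustrative choices, not values from the Gloo documentation.

```shell
# Hypothetical helper: restart a workload in gloo-mesh and wait until the rollout finishes.
# Arguments: <kube-context> <resource, for example deployment/gloo-telemetry-gateway>
restart_and_wait() {
  kubectl rollout restart -n gloo-mesh --context "$1" "$2" &&
    kubectl rollout status -n gloo-mesh --context "$1" "$2" --timeout=120s
}
```

For example:

  restart_and_wait ${MGMT_CONTEXT} deployment/gloo-telemetry-gateway

  restart_and_wait ${REMOTE_CONTEXT} daemonset/gloo-telemetry-collector-agent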