You are viewing the documentation for Solo Enterprise for Istio, formerly known as Gloo Mesh (OSS APIs).

Observability pipeline

Debug the Solo Enterprise for Istio observability pipeline.

If you run into an issue with the telemetry pipeline, use the following sections to start debugging. The upstream OpenTelemetry project also publishes troubleshooting resources for the collector.

Change the default log level

To start troubleshooting your pipeline and to inspect the data that your collectors process, set the pipeline log level to debug. The debug log level gives detailed information about the data that the pipeline receives, processes, and exports.

  1. Add the following log level settings to your Helm values file. For single-cluster setups, these sections are configured in the same values file for your installation Helm release. For multicluster setups, configure these sections in separate values files for the management plane and data plane releases.

    • Helm release for the management plane:
      
      telemetryGatewayCustomization:
        telemetry:
          logs:
            level: "debug"
    • Helm release for the data plane:
      
      telemetryCollectorCustomization:
        telemetry:
          logs:
            level: "debug"
  2. Follow the Upgrade guide to apply the changes in your environment.

  3. Verify that the configmap for the telemetry gateway or collector agent pods is updated with the values you set in the values file.

    kubectl get configmap gloo-telemetry-gateway-config -n gloo-mesh -o yaml --context ${context1}
    kubectl get configmap gloo-telemetry-collector-config -n gloo-mesh -o yaml --context ${context2}
  4. Perform a rollout restart of the gateway deployment or the collector daemon set to force your configmap changes to be applied in the telemetry gateway or collector agent pods.

    kubectl rollout restart -n gloo-mesh deployment/gloo-telemetry-gateway --context ${context1}
    kubectl rollout restart -n gloo-mesh daemonset/gloo-telemetry-collector-agent --context ${context2}
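
As a quick check for step 3, the following sketch reads configmap YAML on standard input and prints the configured log level. The function name `telemetry_log_level` is hypothetical, and the sketch assumes that the rendered collector configuration nests the level under `telemetry` and `logs`, in the same way as the Helm values above.

```shell
# Print the log level found under telemetry -> logs -> level.
# Reads YAML on stdin. Assumption: the rendered config keeps the same
# telemetry/logs/level nesting that the Helm values use.
telemetry_log_level() {
  awk '
    /^[ \t]*telemetry:/            { in_telemetry = 1; next }
    in_telemetry && /^[ \t]*logs:/ { in_logs = 1; next }
    in_logs && /^[ \t]*level:/     { gsub(/"/, "", $2); print $2; exit }
  '
}
```

For example, you might pipe the output of one of the `kubectl get configmap ... -o yaml` commands from step 3 into `telemetry_log_level` and confirm that it prints `debug`.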

Monitor the health of receivers, exporters, and processors

The OpenTelemetry pipeline comes with built-in metrics that you can use to monitor the health of your pipeline. For example, you can use these metrics to verify that metrics are being scraped by the telemetry collector agents and sent to the telemetry gateway. You can also verify that the telemetry gateway exposes the metrics that were sent from the collector agents.

Verify that metrics are scraped by the telemetry collector agents

The telemetry collector agents record statistics for the metrics that they scrape. You can access these statistics under the /metrics path on port 8888. To verify that metrics are being scraped by the collector agents, monitor the following metrics:

  • otelcol_receiver_accepted_metric_points: The number of metric points that are successfully pushed into the telemetry pipeline.
  • otelcol_receiver_refused_metric_points: The number of metric points that could not be pushed into the telemetry pipeline.

To access these metrics:

  1. Get the name of a telemetry collector agent pod in your cluster. For multicluster setups, run this command in a connected workload cluster.

    OTEL_AGENT_NAME=$(kubectl get pods -n gloo-mesh --context ${context2} \
      -l component=agent-collector \
      -o jsonpath="{.items[0].metadata.name}")
  2. Log in to the pod and retrieve the otelcol_receiver* metrics from the /metrics endpoint. In your CLI output, verify that the number of metric points that were accepted (otelcol_receiver_accepted_metric_points) is higher than the number that were refused (otelcol_receiver_refused_metric_points).

    kubectl -n gloo-mesh debug --context ${context2} -q -i $OTEL_AGENT_NAME \
      --image=nicolaka/netshoot -- \
      curl localhost:8888/metrics | grep otelcol_receiver

    Example output:

    # HELP otelcol_receiver_accepted_metric_points Number of metric points successfully pushed into the pipeline.
    # TYPE otelcol_receiver_accepted_metric_points counter
    otelcol_receiver_accepted_metric_points{receiver="prometheus",service_instance_id="bcd42792-3ded-40ba-bbc8-e10bb8551b96",service_name="gloo-otel-collector",service_version="2.3.0",transport="http"} 86996
    # HELP otelcol_receiver_refused_metric_points Number of metric points that could not be pushed into the pipeline.
    # TYPE otelcol_receiver_refused_metric_points counter
    otelcol_receiver_refused_metric_points{receiver="prometheus",service_instance_id="bcd42792-3ded-40ba-bbc8-e10bb8551b96",service_name="gloo-otel-collector",service_version="2.3.0",transport="http"} 0
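
To compare the two counters without reading the raw output, a small helper can total them. The following is a minimal sketch, not part of the product tooling: the function name `summarize_receiver_points` is hypothetical, and it assumes that the sample value is the last field on each line, as in the example output above.

```shell
# Sum accepted and refused metric points across all receivers.
# Reads Prometheus text-format metrics on stdin.
# Assumption: the sample value is the last whitespace-separated field.
summarize_receiver_points() {
  awk '
    /^#/ { next }   # skip HELP and TYPE comment lines
    /^otelcol_receiver_accepted_metric_points/ { accepted += $NF }
    /^otelcol_receiver_refused_metric_points/  { refused  += $NF }
    END { printf "accepted=%d refused=%d\n", accepted, refused }
  '
}
```

To use it, replace the `grep otelcol_receiver` part of the command in step 2 with `summarize_receiver_points`. A refused count that is greater than 0 or that keeps growing is worth a closer look in the collector agent logs.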

Verify that metrics are sent to the telemetry gateway

In multicluster setups, you can verify that the metrics that were successfully scraped by the telemetry collector agents were also successfully sent to the telemetry gateway by using the following metrics:

  • otelcol_exporter_sent_metric_points: The number of metric points that were successfully sent to the telemetry gateway.
  • otelcol_exporter_send_failed_metric_points: The number of metric points that could not be sent to the telemetry gateway.

To access these metrics:

  1. Get the name of a telemetry collector agent pod in a connected workload cluster.

    OTEL_AGENT_NAME=$(kubectl get pods -n gloo-mesh --context ${context2} \
      -l component=agent-collector \
      -o jsonpath="{.items[0].metadata.name}")
  2. Log in to the pod and retrieve the otelcol_exporter* metrics from the /metrics endpoint. In your CLI output, verify that the number of metric points that were successfully sent to the telemetry gateway (otelcol_exporter_sent_metric_points) is greater than 0. A value of 0 means that no metrics reached the telemetry gateway, which can happen when the telemetry gateway endpoint is incorrectly configured in the collector agents.

    kubectl -n gloo-mesh debug --context ${context2} -q -i $OTEL_AGENT_NAME \
      --image=nicolaka/netshoot -- \
      curl localhost:8888/metrics | grep "otelcol_exporter.*metric_points"

    Example output:

    # HELP otelcol_exporter_enqueue_failed_metric_points Number of metric points failed to be added to the sending queue.
    # TYPE otelcol_exporter_enqueue_failed_metric_points counter
    otelcol_exporter_enqueue_failed_metric_points{exporter="otlp",service_instance_id="bcd42792-3ded-40ba-bbc8-e10bb8551b96",service_name="gloo-otel-collector",service_version="2.3.0"} 0
    # HELP otelcol_exporter_send_failed_metric_points Number of metric points in failed attempts to send to destination.
    # TYPE otelcol_exporter_send_failed_metric_points counter
    otelcol_exporter_send_failed_metric_points{exporter="otlp",service_instance_id="bcd42792-3ded-40ba-bbc8-e10bb8551b96",service_name="gloo-otel-collector",service_version="2.3.0"} 0
    # HELP otelcol_exporter_sent_metric_points Number of metric points successfully sent to destination.
    # TYPE otelcol_exporter_sent_metric_points counter
    otelcol_exporter_sent_metric_points{exporter="otlp",service_instance_id="bcd42792-3ded-40ba-bbc8-e10bb8551b96",service_name="gloo-otel-collector",service_version="2.3.0"} 9380
  3. If the number of metrics that were successfully sent to the telemetry gateway (otelcol_exporter_sent_metric_points) equals 0, verify that the correct gateway endpoint was configured in the collector agents.

    1. Get the telemetry gateway endpoint that was configured in the collector agents.

      kubectl -n gloo-mesh get cm gloo-telemetry-collector-config -o yaml --context ${context2} | \
      grep endpoint -C 5

      Example output:

            limit_percentage: 85
            spike_limit_percentage: 10
      
        exporters:
          otlp:
            endpoint: gloo-telemetry-gateway.gloo-mesh:4317
            tls:
              ca_file: /etc/otel-certs/ca.crt
              server_name_override: gloo-telemetry-gateway.gloo-mesh
      
        extensions:
    2. Verify that the telemetry gateway port that you retrieved in the previous step matches the port that the telemetry gateway listens on.

      kubectl -n gloo-mesh get svc gloo-telemetry-gateway --context ${context2}

      Example output:

      NAME                     TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
      gloo-telemetry-gateway   ClusterIP   10.235.245.101   <none>        4317/TCP   5d23h
    3. If the ports match, get the logs of the collector agent pod.

      kubectl -n gloo-mesh --context ${context2} logs $OTEL_AGENT_NAME | grep error | grep grpc
    4. Look for connection errors, such as the following.

      2023-01-09T17:52:21.936Z	INFO	grpclog/grpclog.go:37	[core]pickfirstBalancer: UpdateSubConnState: 0xc001412990, {IDLE connection error: desc = "transport: Error while dialing dial tcp 172.18.0.2:32278: connect: connection refused"}	{"system": "grpc", "grpc_log": true}
  4. If the endpoint is incorrectly configured, update that endpoint in the collector agent configmap.

    1. Open the collector agent configmap.

      kubectl edit configmap gloo-telemetry-collector-config -n gloo-mesh --context ${context2}
    2. Update the endpoint and save your changes.

    3. Perform a rolling restart of the collector agent pods.

      kubectl rollout restart -n gloo-mesh --context ${context2} daemonset/gloo-telemetry-collector-agent 
    4. Follow the steps to get the Helm chart values of your current installation and make sure to save the telemetry endpoint in the telemetryCollector.config.exporters.otlp.endpoint field for future upgrades and installations.
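
The checks in steps 2 and 3 can be sketched as a few shell helpers. The function names `sent_metric_points`, `exporter_endpoint`, and `endpoint_port` are hypothetical. The sketch assumes that the first `endpoint:` key after `exporters:` in the configmap belongs to the otlp exporter, as in the example output above.

```shell
# Print the total of otelcol_exporter_sent_metric_points across all exporters.
# Reads Prometheus text-format metrics on stdin.
sent_metric_points() {
  awk '
    /^otelcol_exporter_sent_metric_points/ { sent += $NF }
    END { printf "%d\n", sent }
  '
}

# Print the first endpoint that is listed after the exporters key.
# Reads configmap YAML on stdin.
# Assumption: that endpoint belongs to the otlp exporter.
exporter_endpoint() {
  awk '
    /^[ \t]*exporters:/                { in_exporters = 1; next }
    in_exporters && /^[ \t]*endpoint:/ { print $2; exit }
  '
}

# Print the port part of a host:port endpoint passed as the first argument.
endpoint_port() {
  printf '%s\n' "${1##*:}"
}
```

For example, you might pipe the curl output from step 2 into `sent_metric_points`, and pipe the output of `kubectl -n gloo-mesh get cm gloo-telemetry-collector-config -o yaml --context ${context2}` into `exporter_endpoint`. You can then compare the result of `endpoint_port` with the port shown in the PORT(S) column of the gloo-telemetry-gateway service.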

Verify that the telemetry gateway exposes scraped metrics

In multicluster setups, if metrics are being successfully scraped from your pods and sent to the telemetry gateway, you can access these metrics on the /metrics path on port 9091 of the telemetry gateway.

  1. Get the name of the telemetry gateway pod in the cluster where the management server is deployed.

    TELEMETRY_GATEWAY_NAME=$(kubectl get pods --context ${context1} -n gloo-mesh -l component=standalone-collector -o jsonpath="{.items[0].metadata.name}")
  2. Log in to the telemetry gateway pod and retrieve the metrics that you are interested in. For example, to see the number of requests that were received by the ingress gateway, you can look at the istio_requests_total metric.

    kubectl -n gloo-mesh debug --context ${context1} -q -i $TELEMETRY_GATEWAY_NAME --image=nicolaka/netshoot -- curl localhost:9091/metrics | grep istio_requests_total

    Example output:

    istio_requests_total{cluster="cluster-1",connection_security_policy="unknown",destination_workload_id="unknown.unknown.unknown",install_operator_istio_io_owning_resource="unknown",instance="10.232.0.52:15020",istio="ingressgateway",istio_io_rev="main",job="mesh-workloads",namespace="istio-ingress",operator_istio_io_component="IngressGateways",pod_name="istio-ingressgateway-main-6cfb94798b-dt6fv",reporter="source",response_code="429",revision="main",sidecar_istio_io_inject="true",workload_id="istio-ingressgateway-main.istio-ingress.mycluster-portal"} 6
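
Because the gateway can expose many istio_requests_total series, it can help to aggregate them before you read the numbers. The following is a minimal sketch that totals request counts per response code. The function name `requests_by_response_code` is hypothetical, and the sketch assumes that each series ends with its sample value and has no trailing timestamp, as in the example output above.

```shell
# Total istio_requests_total per response_code label.
# Reads Prometheus text-format metrics on stdin and prints "<code> <count>" lines.
# Assumption: the sample value is the last field (no trailing timestamp).
requests_by_response_code() {
  awk '
    index($0, "istio_requests_total{") == 1 {
      code = "unknown"
      if (match($0, /response_code="[^"]*"/)) {
        # Skip the 15-character response_code=" prefix and drop the closing quote.
        code = substr($0, RSTART + 15, RLENGTH - 16)
      }
      totals[code] += $NF
    }
    END { for (code in totals) printf "%s %d\n", code, totals[code] }
  ' | sort
}
```

To use it, replace the `grep istio_requests_total` part of the command in step 2 with `requests_by_response_code`.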

Changes in the metrics or collector agent configmap are not applied

The upstream OpenTelemetry project currently does not support reloading configuration changes dynamically and applying them in the gateway or collector agent pods. If you updated the configmap of the telemetry gateway or the collector agents, and the changes are not applied in the respective pods, you must perform a rollout restart of the gateway deployment or collector daemon set to apply the new changes.

kubectl rollout restart -n gloo-mesh --context ${context1} deployment/gloo-telemetry-gateway 
kubectl rollout restart -n gloo-mesh --context ${context2} daemonset/gloo-telemetry-collector-agent