Multicluster management server and agents
In a multicluster setup, debug the management server, agents in connected clusters, and the relay connection between these components.
Debug the management server
Debug the Solo Enterprise for Istio management server.
Verify that the management server pod is running.
kubectl get pods -n gloo-mesh -l app=gloo-mesh-mgmt-server --context ${context1}
If not, describe the pod and look for error messages. If you have multiple replicas, check each pod.
kubectl describe pod -n gloo-mesh -l app=gloo-mesh-mgmt-server --context ${context1}
Optional: To increase the verbosity of the logs, you can patch the management server deployment. This way, you can review logs at the debug level, such as translation errors.
kubectl patch deploy -n gloo-mesh gloo-mesh-mgmt-server --context ${context1} --type "json" -p '[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--verbose=true"}]'
Check the logs of the management server pod. To view logs recorded since a relative duration such as 5s, 2m, or 3h, you can specify the --since <duration> flag.
meshctl logs mgmt -l error --kubecontext ${context1} [--since DURATION]
Optionally, you can format the output with jq or save it in a local file so that you can read and analyze the output more easily.
meshctl logs mgmt -l error --kubecontext ${context1} | jq > mgmt-server-logs.json
In the logs, look for error messages. For example, you might see a message similar to the following.

Message: json: cannot unmarshal array into Go struct field
Description: The configuration of a Solo Enterprise for Istio custom resource, such as an InsightsConfig resource, does not match the expected configuration in the custom resource definition. The management server cannot apply the configuration.
Steps to resolve: Review the configuration of the resource against the API reference, and try debugging the resource. For example, a field might be missing or have an incorrect value such as the wrong cluster name. If you recently upgraded the management server version, make sure that you reapply the CRDs.

Message: License is invalid or expired, crashing - license expired
Description: Your provided Solo license is expired, and your management server is in a crash loop.
Steps to resolve: See Update your license.

You can also check the management server logs for all other log levels, such as warn, debug, or info.
meshctl logs mgmt --kubecontext ${context1} [--since DURATION]
You can optionally generate a .tar.gz file of the operational information for your Solo Enterprise for Istio and Istio service mesh to help debug issues in your environment. For more information, see the CLI reference.
meshctl debug report
- To generate a file for multiple clusters:
  meshctl debug report --kubecontext ${context1},${context2}
- To include information from application namespaces in addition to the gloo-mesh and Istio namespaces:
  meshctl debug report --include-namespaces app1,app2
- To upload the debug information to a secure repo owned by Solo.io, you can set a folder structure that makes it easy to identify your upload.
  meshctl debug report --upload true --upload-dir <your_name>/<issue_name>
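When the management server runs several replicas, the pod check at the start of this section can be scripted. The following is a minimal sketch rather than part of the product tooling: the helper name not_ready_pods is hypothetical, and it assumes the default kubectl get pods column layout (NAME, READY, STATUS, and so on).

```shell
# Hypothetical helper: reads `kubectl get pods` output on stdin and prints
# the name of every pod that is not Running or not fully ready.
# Assumes the default column layout: NAME READY STATUS RESTARTS AGE.
not_ready_pods() {
  awk 'NR > 1 {
    split($2, ready, "/")                       # READY looks like "1/1"
    if ($3 != "Running" || ready[1] != ready[2]) print $1
  }'
}

# Example usage against the management cluster:
# kubectl get pods -n gloo-mesh -l app=gloo-mesh-mgmt-server --context ${context1} | not_ready_pods
```

Empty output means that every listed replica is running and ready. Any pod name that is printed is a candidate for the kubectl describe pod step.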
Debug the agent
The agent reports the state of Istio and other resources in the workload cluster to the management server.
Verify that the agent pod is running.
kubectl get pods -n gloo-mesh --context ${context2}
If not, describe the pod and look for error messages.
kubectl describe pod -n gloo-mesh -l app=gloo-mesh-agent --context ${context2}
Check the agent logs. To view logs recorded since a relative duration such as 5s, 2m, or 3h, you can specify the --since <duration> flag.
meshctl logs agent -l error --kubecontext ${context2} [--since DURATION]
Optionally, you can format the output with jq or save it in a local file so that you can read and analyze the output more easily.
meshctl logs agent -l error --kubecontext ${context2} | jq > agent-logs.json
In the logs, look for "err", Err:, or Error messages. For example, you might see a message similar to the following.

Message: Err: connection error: desc = \"transport: Error while dialing dial tcp: missing address\"
Description: The agent does not have the correct address set for the management server.
Steps to resolve: In your Helm settings file for the data plane release, compare the value for the serverAddress setting with the IP address and port of the management server. If necessary, upgrade the data plane release in each connected cluster with the correct address.

Message: Waited for <time> due to client-side throttling, not priority and fairness, request
Description: Solo Enterprise for Istio experienced a timeout when sending a request to the Kubernetes API server. For example, the Kubernetes etcd might be overloaded by the number of resources in the cluster.
Steps to resolve: Wait to see if the error resolves as your Kubernetes cluster load reduces.

Message: Error: getting initial relay connection: context deadline exceeded
Description: The management server cannot set up a relay connection with the agent. The connection can fail for several reasons, such as pods or the service mesh being in an unhealthy state.
Steps to resolve: Try debugging the management server and relay connection.

Message: transport: authentication handshake failed: x509: certificate signed by unknown authority
Description: Your Solo Enterprise for Istio installation might have multiple certificates with different CAs. For example, you might have performed a Helm upgrade while using OpenTelemetry without disabling token and certificate regeneration.
Steps to resolve: Review the Solo Support Center article (requires login).

You can also check the agent logs for all other log levels, such as warn, debug, or info.
meshctl logs agent --kubecontext ${context2} [--since DURATION]
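To pick out the messages described in this section from a long agent log, a small filter can help. This is only a sketch: the function name known_agent_errors is hypothetical, and the patterns are substrings of the four example messages above, so they might not match every variation that the agent logs.

```shell
# Hypothetical helper: reads agent log lines on stdin and keeps only the
# lines that contain one of the error messages described in this section.
known_agent_errors() {
  grep -E 'missing address|client-side throttling|getting initial relay connection|certificate signed by unknown authority'
}

# Example usage:
# meshctl logs agent --kubecontext ${context2} | known_agent_errors
```

If the filter prints nothing, none of the listed messages are present, and the full error-level logs are still worth reviewing for other problems.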
Debug the relay connection
Verify the relay connection between the management server and agents in connected clusters.
Verify that the management server and agent pods are running. If not, try troubleshooting the management server or agent.
kubectl get pods -n gloo-mesh --context ${context1}
kubectl get pods -n gloo-mesh --context ${context2}
Verify that the connected clusters are successfully identified by the management plane. This check might take a few seconds to ensure that the expected agent is running and is connected to the relay server in the management cluster.
meshctl check --kubecontext ${context1}
Example output:
...
🟢 Mgmt server connectivity to workload agents
Cluster  | Registered | Connected Pod
cluster2 | true       | gloo-mesh/gloo-mesh-mgmt-server-676f4b9945-2pngd
...
Check that the relay connection between the management server and agents is healthy.
- Forward port 9091 of the gloo-mesh-mgmt-server pod to your localhost.
  kubectl port-forward -n gloo-mesh --context ${context1} deploy/gloo-mesh-mgmt-server 9091
- In your browser, connect to http://localhost:9091/metrics.
- In the metrics UI, look for the following lines. If the values are 1, the agents in the connected workload clusters are successfully registered with the management server. If the values are 0, the agents are not successfully connected.
  relay_pull_clients_connected{cluster="cluster1"} 1
  relay_pull_clients_connected{cluster="cluster2"} 1
  relay_push_clients_connected{cluster="cluster1"} 1
  relay_push_clients_connected{cluster="cluster2"} 1
  relay_push_clients_warmed{cluster="cluster1"} 1
  relay_push_clients_warmed{cluster="cluster2"} 1
- Take snapshots in case you want to refer to the logs later, such as to open a Support issue.
  curl localhost:9091/snapshots/input -o input_snapshot.json
  curl localhost:9091/snapshots/output -o output_snapshot.json
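Instead of reading the metrics page in a browser, the relay metrics can also be checked from the command line. The following sketch assumes that the port-forward on port 9091 is still running. The function name relay_disconnected is hypothetical, and it matches only the three metric names shown in this section.

```shell
# Hypothetical helper: reads Prometheus text-format metrics on stdin and
# prints each relay client metric (with its cluster label) whose value is 0.
relay_disconnected() {
  awk '/^relay_(pull|push)_clients_(connected|warmed)[{]/ && $NF == 0 { print $1 }'
}

# Example usage:
# curl -s localhost:9091/metrics | relay_disconnected
```

Empty output means that all reported relay metrics are non-zero. A printed line such as relay_push_clients_connected{cluster="cluster2"} points to the cluster whose agent is not connected.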
Check that the management services are running.
Send a gRPC request to the management server.
kubectl get secret --context ${context1} -n gloo-mesh relay-root-tls-secret -o json | jq -r '.data["ca.crt"]' | base64 -d > ca.crt
grpcurl -authority gloo-mesh-mgmt-server.gloo-mesh --cacert=./ca.crt $MGMT_SERVER_NETWORKING_ADDRESS list
Verify that the following services are listed.
envoy.service.accesslog.v3.AccessLogService
envoy.service.metrics.v2.MetricsService
envoy.service.metrics.v3.MetricsService
grpc.reflection.v1alpha.ServerReflection
relay.multicluster.skv2.solo.io.RelayCertificateService
relay.multicluster.skv2.solo.io.RelayPullServer
relay.multicluster.skv2.solo.io.RelayPushServer
Check the logs for the gloo-mesh-mgmt-server pod for communication from the agent on the workload cluster.
meshctl logs mgmt -l error --kubecontext ${context1} | grep ${cluster2}
Example output:
{"level":"debug","ts":1616160185.5505846,"logger":"pull-resource-deltas","msg":"recieved request for delta: response_nonce:\"1\"","metadata":{":authority":["gloo-mesh-mgmt-server.gloo-mesh.svc.cluster.local:11100"],"content-type":["application/grpc"],"user-agent":["grpc-go/1.34.0"],"x-cluster-id":["remote.cluster"]},"peer":"10.244.0.17:40074"}
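The comparison of the grpcurl service list against the expected relay services can be automated. The following is a minimal sketch, assuming that grpcurl prints one service name per line. The function name missing_relay_services is hypothetical, and it checks only the three relay services named in the list above.

```shell
# Hypothetical helper: reads the output of `grpcurl ... list` on stdin and
# prints each expected relay service that does not appear in it.
missing_relay_services() {
  listed=$(cat)
  for svc in \
    relay.multicluster.skv2.solo.io.RelayCertificateService \
    relay.multicluster.skv2.solo.io.RelayPullServer \
    relay.multicluster.skv2.solo.io.RelayPushServer
  do
    # -x matches the whole line, -F treats the name as a fixed string.
    printf '%s\n' "$listed" | grep -qxF "$svc" || echo "$svc"
  done
}

# Example usage:
# grpcurl -authority gloo-mesh-mgmt-server.gloo-mesh --cacert=./ca.crt $MGMT_SERVER_NETWORKING_ADDRESS list | missing_relay_services
```

Empty output means that all three relay services are registered on the management server.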