Generate relay certificates with ACM

Set up the relay root and intermediate certificate authorities (CAs) to generate the relay server certificate and relay agent client certificates. These certificates are required to secure communication between the Gloo Mesh management and data planes.

The following steps contain example configurations to generate each relay certificate manually. You can use these example configurations as a starting point to create CAs in your own public key infrastructure (PKI), such as AWS Certificate Manager (ACM).

This setup is recommended for production use because the CA private key is stored in ACM, outside of your cluster, which enhances the security of your setup.

Before you begin

  1. Install cert-manager.

    Example command to install cert-manager in the management cluster (this setup also requires the AWS Private CA issuer plugin for cert-manager; see the note after these steps):

    kubectl --context ${MGMT_CONTEXT} apply -f https://github.com/jetstack/cert-manager/releases/download/v1.5.4/cert-manager.yaml
    
  2. Save the kubeconfig contexts for your clusters. Run kubectl config get-contexts, look for your cluster in the CLUSTER column, and get the context name in the NAME column. Note: Do not use context names with underscores. The context name is used as a SAN specification in the generated certificate that connects workload clusters to the management cluster, and underscores in SAN are not FQDN compliant. You can rename a context by running kubectl config rename-context "<oldcontext>" <newcontext>.
    export MGMT_CLUSTER=<mgmt-cluster-name>
    export REMOTE_CLUSTER=<remote-cluster-name>
    export MGMT_CONTEXT=<management-cluster-context>
    export REMOTE_CONTEXT=<remote-cluster-context>
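
Note: The AWSPCAIssuer resource that is created later in this guide is provided by the AWS Private CA issuer plugin for cert-manager (aws-privateca-issuer), which is installed separately from cert-manager itself. The following is a minimal installation sketch that assumes the plugin's upstream Helm repository and default chart values; check the plugin documentation for the chart version that matches your cert-manager release.

helm repo add awspca https://cert-manager.github.io/aws-privateca-issuer
helm repo update
helm install aws-pca-issuer awspca/aws-privateca-issuer \
  --namespace cert-manager \
  --kube-context ${MGMT_CONTEXT}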
    

Generate the relay root CA

Create and securely store the relay root CA in AWS Certificate Manager Private Certificate Authority (AWS ACM PCA). This method is recommended for production use.

  1. If it doesn't already exist, create the gloo-mesh namespace.

    kubectl create namespace gloo-mesh --context $MGMT_CONTEXT
    
  2. Create the root CA in ACM PCA.

    # Generate the CA config file
    cat << EOF > ca_config.json
    {
       "KeyAlgorithm":"RSA_2048",
       "SigningAlgorithm":"SHA256WITHRSA",
       "Subject":{
          "CommonName":"relay-root-ca"
       }
    }
    EOF
    # Create CA in ACM PCA
    REGION=us-east-1
    CA_ARN=$(aws acm-pca create-certificate-authority \
         --certificate-authority-configuration file://ca_config.json \
         --certificate-authority-type "ROOT" \
         --idempotency-token 01234567 \
         --region $REGION \
         --tags Key=Name,Value=relay-root-ca | jq -r '.CertificateAuthorityArn')
    # Example response payload 
    # {
    #   "CertificateAuthorityArn": "arn:aws:acm-pca:us-east-1:123456789:certificate-authority/123456789-debf-4513-89f7-c1834d5ffbd5"
    # }
    
    # download Root CA CSR from AWS
    aws acm-pca get-certificate-authority-csr \
        --region $REGION \
        --certificate-authority-arn $CA_ARN \
        --output text > relay-root-ca.csr
    
    # Issue Root Certificate
    ISSUE_CERTIFICATE_RESPONSE=$(aws acm-pca issue-certificate \
        --certificate-authority-arn $CA_ARN \
        --csr fileb://relay-root-ca.csr \
        --region $REGION \
        --signing-algorithm "SHA256WITHRSA" \
        --template-arn arn:aws:acm-pca:::template/RootCACertificate/V1 \
        --validity Value=3650,Type="DAYS" \
        --idempotency-token 1234567 \
        --output json | jq -r '.CertificateArn')
    
    CERTARN=$ISSUE_CERTIFICATE_RESPONSE
    
    # Download Certificate
    aws acm-pca get-certificate \
        --certificate-authority-arn $CA_ARN \
        --certificate-arn $CERTARN \
        --region $REGION \
        --output text > relay-root-ca.pem
    
    # Import the signed certificate into the CA
    aws acm-pca import-certificate-authority-certificate \
        --certificate-authority-arn $CA_ARN \
        --region $REGION \
        --certificate fileb://relay-root-ca.pem
    
  3. Create an AWS user for the relay root CA.

    aws iam create-user --user-name gloo-mesh-acm
    
  4. Create an IAM policy for the relay root CA.

    # Write the IAM policy document, substituting the CA ARN
    cat << EOF > relay-ca-policy.json
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "awspcaissuer",
          "Action": [
            "acm-pca:DescribeCertificateAuthority",
            "acm-pca:GetCertificate",
            "acm-pca:IssueCertificate"
          ],
          "Effect": "Allow",
          "Resource": "$CA_ARN"
        }
      ]
    }
    EOF

    POLICY_ARN=$(aws iam create-policy \
        --policy-name GlooMeshRelayCA \
        --policy-document file://relay-ca-policy.json \
        | jq -r '.Policy.Arn')
    
    • Note: For cross-account ACM management, see Resource-based policies in the AWS docs.
    • For an example template, download this PCACrossAccountPolicy.json file and apply it via the CLI.
      aws acm-pca put-policy --region ${region} --resource-arn arn:aws:acm-pca:us-east-1:${control_plane_account}:certificate-authority/${cert_authority_id} --policy file://PCACrossAccountPolicy.json
      
  5. Add the policy to the gloo-mesh-acm user.

    aws iam attach-user-policy --policy-arn $POLICY_ARN --user-name gloo-mesh-acm
    
  6. Generate an access keypair for the user, and add the keypair to your cluster.

    aws iam create-access-key --user-name gloo-mesh-acm
    # Example response
    # {
    #     "AccessKey": {
    #         "UserName": "Bob",
    #         "Status": "Active",
    #         "CreateDate": "2015-03-09T18:39:23.411Z",
    #         "SecretAccessKey": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYzEXAMPLEKEY",
    #         "AccessKeyId": "AKIAIOSFODNN7EXAMPLE"
    #     }
    # }
    
    # Upload the credentials to the management cluster as a Kubernetes secret
    
    AWS_ACCESS_KEY_ID=<key_id>
    SECRET_ACCESS_KEY=<secret>
    
    kubectl create secret generic gloo-mesh-acm-credentials \
      --namespace gloo-mesh \
      --from-literal=AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID \
      --from-literal=AWS_SECRET_ACCESS_KEY=$SECRET_ACCESS_KEY \
      --context $MGMT_CONTEXT
    
  7. Create a cert-manager issuer for the CA.

    cat << EOF | kubectl apply --context $MGMT_CONTEXT -f -
    apiVersion: awspca.cert-manager.io/v1beta1
    kind: AWSPCAIssuer
    metadata:
      name: relay-root-ca
      namespace: gloo-mesh
    spec:
      arn: $CA_ARN
      region: $REGION
      secretRef:
        namespace: gloo-mesh
        name: gloo-mesh-acm-credentials
    EOF
    

Create the networking server certificates

To generate the gloo-mesh-mgmt-server certificate, create a cert-manager Certificate resource that references the ACM PCA issuer.

cat << EOF | kubectl apply --context $MGMT_CONTEXT -f -
kind: Certificate
apiVersion: cert-manager.io/v1
metadata:
  name: gloo-mesh-mgmt-server
  namespace: gloo-mesh
spec:
  commonName: gloo-mesh-mgmt-server
  dnsNames:
    - "*.gloo-mesh"
  # 1 year life
  duration: 8760h0m0s
  issuerRef:
    group: awspca.cert-manager.io
    kind: AWSPCAIssuer
    name: relay-root-ca
  renewBefore: 8736h0m0s
  secretName: relay-server-tls-secret
  usages:
    - server auth
    - client auth
  privateKey:
    algorithm: "RSA"
    size: 4096
EOF

Create the agent certificates

Generate a gloo-mesh-agent client certificate for each workload cluster. Be sure to repeat these steps for each workload cluster that you plan to register with Gloo Mesh.

  1. Create a cert-manager Certificate resource that references the ACM PCA issuer.

    CLUSTER_NAME=$REMOTE_CLUSTER
    CLUSTER_CONTEXT=$REMOTE_CONTEXT
    
    cat << EOF | kubectl apply --context $MGMT_CONTEXT -f -
    kind: Certificate
    apiVersion: cert-manager.io/v1
    metadata:
      name: gloo-mesh-agent-$CLUSTER_NAME
      namespace: gloo-mesh
    spec:
      commonName: gloo-mesh-agent-$CLUSTER_NAME
      dnsNames:
        # Must match the cluster name used in the helm chart install
        - "$CLUSTER_NAME"
      # 1 year life
      duration: 8760h0m0s
      issuerRef:
        group: awspca.cert-manager.io
        kind: AWSPCAIssuer
        name: relay-root-ca
      renewBefore: 8736h0m0s
      secretName: gloo-mesh-agent-$CLUSTER_NAME-tls-cert
      usages:
        - server auth
        - client auth
      privateKey:
        algorithm: "RSA"
        size: 4096
    EOF
    
  2. Copy the TLS secret to the workload cluster.

    kubectl get secret gloo-mesh-agent-$CLUSTER_NAME-tls-cert \
      --namespace gloo-mesh \
      --output json \
      --context $MGMT_CONTEXT \
      | jq 'del(.metadata.creationTimestamp,.metadata.resourceVersion,.metadata.uid)' \
      | kubectl apply --context $CLUSTER_CONTEXT -f -
    

Verify the cert-manager resources

For clusters that have cert-manager installed, verify that your cert-manager issuer and certificate resources are ready. If the READY column says False for any of the following resources, describe the resource for more details and resolve the issue before continuing.

kubectl get issuer -n gloo-mesh --context $MGMT_CONTEXT
kubectl get certificates -n gloo-mesh --context $MGMT_CONTEXT
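
Example output for the certificates; the names and ages are illustrative. Note that with this ACM PCA setup, the issuer is an AWSPCAIssuer rather than a cert-manager Issuer, so if kubectl get issuer returns nothing, check your AWSPCAIssuer resource instead.

NAME                          READY   SECRET                               AGE
gloo-mesh-agent-cluster-1     True    gloo-mesh-agent-cluster-1-tls-cert   5m
gloo-mesh-mgmt-server         True    relay-server-tls-secret              10m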

Now that your custom certificates are created, continue to the next section to modify your Gloo Mesh deployment to use these certificates.

Modifying the Gloo Mesh installation Helm charts

You must modify the Gloo Mesh installation and registration Helm charts to use your custom CAs and certificates instead of the default certificates that are generated and managed by Gloo Mesh.

If you already installed Gloo Mesh via Helm, you can upgrade the Helm installation to use these Helm values instead.

gloo-mesh-enterprise Helm chart

Modify and install the gloo-mesh-enterprise Helm chart in your management cluster.

  1. Prepare the Helm installation settings for the gloo-mesh-enterprise chart. Note that you might need to update the values, depending on your certificate setup. You can add the settings to a Helm values file, as in the following example, or pass them as --set flags, as shown after the example.

    
       # Set to true to permit unencrypted and unauthenticated communication between management plane and data planes.
       insecure: false
       # Name of the management plane cluster.
       mgmtClusterName: $MGMT_CLUSTER
       global:
          cluster: $MGMT_CLUSTER
       # Set up details for the relay certificates.
       relay:
         # Set to true to disable the default self-signed relay CA and instead use your own certificates.
         disableCa: true
         # Set to true to disable automatically generating self-signed certificates.
         disableCaCertGeneration: true
         # Reference to a Secret containing TLS Certificates used to secure the Gloo Mesh gRPC relay server with TLS.
         tlsSecret:
           name: relay-server-tls-secret

    You can pass the equivalent settings as --set flags when you install or upgrade the Helm release:

       --set glooMeshLicenseKey=${GLOO_MESH_LICENSE_KEY} \
       --set glooMeshMgmtServer.relay.tlsSecret.name=relay-server-tls-secret \
       --set glooMeshMgmtServer.relay.disableCaCertGeneration=true \
       --set glooMeshMgmtServer.relay.disableCa=true
       

  2. Modify the Gloo Mesh Enterprise Helm chart installation with the updated settings.
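
    For example, a minimal helm upgrade sketch. The release name gloo-mesh-enterprise, the chart reference gloo-mesh-enterprise/gloo-mesh-enterprise, and the values-mgmt-plane.yaml file name are assumptions; adjust them to match your installation and chart version.

    helm upgrade --install gloo-mesh-enterprise gloo-mesh-enterprise/gloo-mesh-enterprise \
      --namespace gloo-mesh \
      --kube-context ${MGMT_CONTEXT} \
      --values values-mgmt-plane.yaml \
      --set glooMeshLicenseKey=${GLOO_MESH_LICENSE_KEY}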

gloo-mesh-agent Helm chart

Modify and install the gloo-mesh-agent Helm chart in each workload cluster.

  1. With the Kubernetes context still set to your management cluster, get the external address and port of the gloo-mesh-mgmt-server service.

    MGMT_INGRESS_ADDRESS=$(kubectl get svc -n gloo-mesh gloo-mesh-mgmt-server --context ${MGMT_CONTEXT} -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
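    # Note: Some load balancers publish a DNS hostname instead of an IP address.
    # In that case, read .status.loadBalancer.ingress[0].hostname instead.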
    MGMT_INGRESS_PORT=$(kubectl -n gloo-mesh get service gloo-mesh-mgmt-server --context ${MGMT_CONTEXT} -o jsonpath='{.spec.ports[?(@.name=="grpc")].port}')
    MGMT_SERVER_NETWORKING_ADDRESS=${MGMT_INGRESS_ADDRESS}:${MGMT_INGRESS_PORT}
    
  2. Create a values.yaml file to upgrade your gloo-mesh-agent Helm chart installation. Note that you might need to update the values, depending on your certificate setup. You can add the settings to the values file, as in the following example, or pass them as --set flags, as shown after the example.

    
       # Set to true to permit unencrypted and unauthenticated communication between management plane and data planes.
       insecure: false
       relay:
         # SNI name used to connect to the relay forwarding server; must match the server certificate CommonName (DO NOT CHANGE)
         authority: gloo-mesh-mgmt-server.gloo-mesh
         # gloo-mesh-mgmt-server IP address
         serverAddress: $MGMT_SERVER_NETWORKING_ADDRESS
         # Reference to a secret containing the client TLS certs used to identify the relay agent to the relay server.
         clientTlsSecret:
           name: gloo-mesh-agent-$REMOTE_CLUSTER-tls-cert
           namespace: gloo-mesh
         # Reference to a secret containing a root TLS cert used to verify the relay server cert. 
         # The secret can also optionally specify a 'tls.key' which will be used to generate the agent client cert.
         rootTlsSecret:
           name: relay-root-tls-secret
           namespace: gloo-mesh

    You can pass the equivalent settings as --set flags when you install or upgrade the Helm release:

       --set relay.serverAddress=${MGMT_SERVER_NETWORKING_ADDRESS} \
       --set relay.clientTlsSecret.name=gloo-mesh-agent-$REMOTE_CLUSTER-tls-cert \
       --set relay.clientTlsSecret.namespace=gloo-mesh \
       --set relay.rootTlsSecret.name=relay-root-tls-secret \
       --set relay.rootTlsSecret.namespace=gloo-mesh
       

  3. Modify the gloo-mesh-agent Helm chart installation with the updated settings in your values.yaml file. For an example helm upgrade command, see the sketch after these steps.

  4. Allow the Gloo Mesh management plane to use the relay certificates to connect to the agents. The steps vary depending on whether you are installing Gloo Mesh for the first time, or upgrading an existing installation.

    If you are installing Gloo Mesh for the first time, create a KubernetesCluster object in the management cluster for your workload cluster.

    kubectl apply --context $MGMT_CONTEXT -f- <<EOF
    apiVersion: admin.gloo.solo.io/v2
    kind: KubernetesCluster
    metadata:
      name: ${REMOTE_CLUSTER}
      namespace: gloo-mesh
    spec:
      clusterDomain: cluster.local
    EOF
    

    If you are upgrading an existing installation, reload the gloo-mesh-mgmt-server and gloo-mesh-agent pods to pick up the new certificates.

    Restarting the Gloo Mesh pods does not impact your running apps. However, you cannot change the configuration of Gloo Mesh resources, such as traffic policies, until the pods are healthy again.

    1. Get the name of the gloo-mesh-mgmt-server pod in your management cluster.
      kubectl get pods -n gloo-mesh --context $MGMT_CONTEXT
      
    2. Restart the gloo-mesh-mgmt-server pod.
      kubectl delete pod -n gloo-mesh --context $MGMT_CONTEXT <gloo-mesh-mgmt-server-pod>
      
    3. Get the name of the gloo-mesh-agent pod in your workload cluster.
      kubectl get pods -n gloo-mesh --context $REMOTE_CONTEXT
      
    4. Restart the gloo-mesh-agent pod.
      kubectl delete pod -n gloo-mesh --context $REMOTE_CONTEXT <gloo-mesh-agent-pod>
      

  5. Repeat these steps for each workload cluster.
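
As referenced in step 3, the following is a minimal helm upgrade sketch for the gloo-mesh-agent chart. The release name gloo-mesh-agent, the chart reference gloo-mesh-agent/gloo-mesh-agent, and the values-data-plane.yaml file name are assumptions; adjust them, and add any other values that your chart version requires, such as the workload cluster name.

helm upgrade --install gloo-mesh-agent gloo-mesh-agent/gloo-mesh-agent \
  --namespace gloo-mesh \
  --kube-context ${REMOTE_CONTEXT} \
  --values values-data-plane.yaml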

Verifying your relay certificate setup

  1. Check that the relay connection between the management server and workload agents is healthy.
    1. Forward port 9091 of the gloo-mesh-mgmt-server pod to your localhost.
      kubectl port-forward -n gloo-mesh --context $MGMT_CONTEXT deploy/gloo-mesh-mgmt-server 9091
      
    2. In your browser, connect to http://localhost:9091/metrics.
    3. In the metrics UI, look for the following lines (or check the same metrics from the command line, as shown in the sketch after this list). If the values are 1, the agents in the workload clusters are successfully registered with the management server. If the values are 0, the agents are not successfully connected.
      relay_pull_clients_connected{cluster="cluster-1"} 1
      relay_pull_clients_connected{cluster="cluster-2"} 1
      # HELP relay_push_clients_connected Current number of connected Relay push clients (Relay Agents).
      # TYPE relay_push_clients_connected gauge
      relay_push_clients_connected{cluster="cluster-1"} 1
      relay_push_clients_connected{cluster="cluster-2"} 1
      
  2. Review the Gloo Mesh UI. Check that the Overall Mesh Status is healthy and that your remote clusters are registered without any configuration issues.
    meshctl dashboard --kubecontext $MGMT_CONTEXT
    
  3. If the setup is unsuccessful, continue to Troubleshooting.
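
To check the same relay metrics from the command line instead of the browser, you can run the following sketch, assuming the port-forward from the previous steps is still running and curl is installed locally:

curl -s http://localhost:9091/metrics | grep relay_pull_clients_connected
curl -s http://localhost:9091/metrics | grep relay_push_clients_connected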

Troubleshooting relay certificates

  1. Review the health of your Gloo Mesh pods in the management and remote clusters.

    1. Check that the gloo-mesh-mgmt-server and gloo-mesh-agent pods are running.

      kubectl get pods -n gloo-mesh --context ${MGMT_CONTEXT}
      kubectl get pods -n gloo-mesh --context ${REMOTE_CONTEXT}
      
    2. If the pods are not running, describe the pods and check the State and Last State sections for error messages and reasons why the pod might not be healthy. For example, the following error messages in the gloo-mesh-mgmt-server and gloo-mesh-agent pods indicate that the secret is misnamed or missing. Check the secrets and names, upgrade your Helm installation, and try again.

      • Example error message for gloo-mesh-mgmt-server pod:
      Message:   3 errors occurred:
          * no tls secret found for grpc server: Secret "relay-server-tls-secret" not found
          * could not find forwarding server token: no token secret found: Timeout: failed waiting for *v1.Secret Informer to sync
          * no tls secret found for grpc server: Secret "relay-server-tls-secret" not found
      
      • Example error message for gloo-mesh-agent pod:
      Events:
        Type     Reason       Age                    From     Message
        ----     ------       ----                   ----     -------
        Normal   Created      84m (x25 over 3h28m)   kubelet  Created container gloo-mesh-agent
        Normal   Pulled       84m (x24 over 3h28m)   kubelet  Container image "gcr.io/gloo-mesh/gloo-mesh-agent:1.2.3" already present on machine
        Normal   Started      84m (x25 over 3h28m)   kubelet  Started container gloo-mesh-agent
        Warning  FailedMount  84m                    kubelet  MountVolume.SetUp failed for volume "kube-api-access-zlr9b" : [failed to fetch token: Post "https://kind2-control-plane:6443/api/v1/namespaces/gloo-mesh/serviceaccounts/gloo-mesh-agent/token": read tcp 172.18.0.5:47314->172.18.0.5:6443: use of closed network connection, failed to sync configmap cache: timed out waiting for the condition]
        Warning  FailedMount  84m                    kubelet  MountVolume.SetUp failed for volume "kube-api-access-zlr9b" : [failed to fetch token: Post "https://kind2-control-plane:6443/api/v1/namespaces/gloo-mesh/serviceaccounts/gloo-mesh-agent/token": read tcp 172.18.0.5:57262->172.18.0.5:6443: use of closed network connection, failed to sync configmap cache: timed out waiting for the condition]
        Warning  FailedMount  83m                    kubelet  MountVolume.SetUp failed for volume "kube-api-access-zlr9b" : [failed to fetch token: Post "https://kind2-control-plane:6443/api/v1/namespaces/gloo-mesh/serviceaccounts/gloo-mesh-agent/token": http2: client connection force closed via ClientConn.Close, failed to sync configmap cache: timed out waiting for the condition]
        Warning  BackOff      72s (x522 over 3h28m)  kubelet  Back-off restarting failed container
      
  2. Check the Kubernetes logs for the gloo-mesh-mgmt-server and gloo-mesh-agent pods in each cluster. Look for errors related to the gRPC connection.

    • For example, the following error message indicates that the gloo-mesh-mgmt-server load balancer IP address was set incorrectly for the agent during the Helm installation.
    {"level":"warn","ts":"2021-11-02T19:56:42.197Z","caller":"zap/grpclogger.go:85","msg":"[core] grpc: addrConn.createTransport failed to connect to {34.145.184.106:9900:9900 gloo-mesh-mgmt-server.gloo-mesh <nil> <nil>}. Err: connection error: desc = \"transport: Error while dialing dial tcp: address 34.145.184.106:9900:9900: too many colons in address\""}
    
    • The following gloo-mesh-agent pod error indicates that you need to follow the steps in the ca.crt known issue.
    {"level":"fatal","ts":1640102555.6522746,"msg":"secrets \"relay-root-tls-secret\" not found","version":"1.3.0-beta6","stacktrace":"runtime.main\n\t/usr/local/go/src/runtime/proc.go:255"}
    
    • The following errors indicate that the server or client TLS certificate is expired. Regenerate the certificate, restart the pods, and try again.
    {"level":"error","ts":1650047047.6682806,"logger":"translator.reconcile-42","caller":"translator/reconciler.go:195","msg":"translation for parent object failed","parent":"istio-ingressgateway-istio-system-cluster1~gloo-mesh~cluster1~internal.gloo.solo.io/v2, Kind=DiscoveredGateway","err":"Gateway istio-ingressgateway.istio-system in cluster cluster1 not found in snapshot.","errVerbose":"Gateway istio-ingressgateway.istio-system in cluster cluster1 not found in snapshot.\n\ttranslator.(*translator).TranslateOutputs.func1:/src/pkg/translator/translator.go:163\n\ttranslator.(*translator).translateParallel:/src/pkg/translator/translator.go:189\n\tsets.(*discoveredGatewaySet).UnsortedList:/src/pkg/api/internal.gloo.solo.io/v2/sets/sets.go:999\n\tsets.(*resourceSet).UnsortedList:/go/pkg/mod/github.com/solo-io/skv2@v0.22.11/contrib/pkg/sets/sets.go:118\n\tsets.(*discoveredGatewaySet).UnsortedList.func1:/src/pkg/api/internal.gloo.solo.io/v2/sets/sets.go:994\n\ttranslator.(*translator).translateParallel.func1:/src/pkg/translator/translator.go:191\n\ttranslator.getValidEastWestIngressGateway:/src/pkg/translator/translator.go:426","stacktrace":"github.com/solo-io/gloo-mesh-enterprise/pkg/translator.(*reconciler).reconcilePrimary.func1\n\t/src/pkg/translator/reconciler.go:195\ngithub.com/solo-io/gloo-mesh-enterprise/pkg/utils/syncutils.(*workQueue).Execute.func1\n\t/src/pkg/utils/syncutils/parallel.go:52"}
    
    {"level":"info","ts":1650046690.815508,"caller":"grpclog/grpclog.go:37","msg":"[core]pickfirstBalancer: UpdateSubConnState: 0xc00111c9d0, {TRANSIENT_FAILURE connection error: desc = \"transport: authentication handshake failed: x509: certificate has expired or is not yet valid: current time 2022-04-15T18:18:10Z is after 2022-04-15T14:28:30Z\"}","system":"grpc","grpc_log":true}
    
  3. For gloo-mesh-agent pods, make sure that the cluster name matches the registered cluster name.

    1. Check the KubernetesCluster resources in the management cluster to get registered cluster names.
      kubectl get kubernetesclusters --context $MGMT_CONTEXT
      
    2. Check that the registered cluster name matches the name in the client certificate that is issued by the root CA, specifically the DNS SAN extension (see the sketch after this troubleshooting list for one way to decode the certificate).
    3. If the cluster names do not match, update the KubernetesCluster to have the same name, or re-issue the client certificate with the same name.
  4. If you still have issues, review the Known issues.
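
One way to inspect the DNS SAN in the agent client certificate, as mentioned in the cluster name checks above. This sketch assumes openssl is installed locally and that the certificate secret uses the name created earlier in this guide:

kubectl get secret gloo-mesh-agent-${REMOTE_CLUSTER}-tls-cert \
  --namespace gloo-mesh \
  --context ${REMOTE_CONTEXT} \
  --output jsonpath='{.data.tls\.crt}' \
  | base64 -d \
  | openssl x509 -noout -text \
  | grep -A1 "Subject Alternative Name"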

Known issues

ca.crt

Although the ca.crt is included in the gloo-mesh-agent certificate secret, the gloo-mesh-agent still expects it to exist in a separate relay-root-tls-secret in the remote cluster. To create this secret from the agent certificate secret that you previously copied to the remote cluster, you can run the following command. Make sure to set CLUSTER_NAME to your remote cluster name.

CLUSTER_NAME=$REMOTE_CLUSTER
CLUSTER_CONTEXT=$REMOTE_CONTEXT

kubectl get secret gloo-mesh-agent-$CLUSTER_NAME-tls-cert \
  --namespace gloo-mesh \
  --output json \
  --context $CLUSTER_CONTEXT \
  | jq 'del(.metadata.creationTimestamp,.metadata.resourceVersion,.metadata.uid,.data."tls.key",.data."tls.crt",.metadata.annotations)' \
  | sed "s/gloo-mesh-agent-$CLUSTER_NAME-tls-cert/relay-root-tls-secret/" \
  | kubectl apply --context $CLUSTER_CONTEXT -f -

Certificate renewal

Currently, gloo-mesh-agent cannot automatically pick up certificates when they are renewed. You must restart the gloo-mesh-agent pod so that it picks up the new certificates, as shown in the sketch below. For this reason, it is currently recommended that you give the relay certificates a long time to live.
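
For example, a minimal sketch that restarts the agent, assuming it runs as a Deployment named gloo-mesh-agent in the gloo-mesh namespace:

kubectl rollout restart deployment/gloo-mesh-agent \
  --namespace gloo-mesh \
  --context ${REMOTE_CONTEXT}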