Troubleshooting a two data center deployment

What to check when a 2DCDR deployment is not starting or not replicating data.

Typical causes of 2DCDR failures are network problems between data centers, TLS certificate mismatches, or environmental problems on one of the data centers.

Note for OpenShift users: The steps that are detailed in this topic use the Kubernetes kubectl command. On OpenShift, use the equivalent oc command in its place.

Check that API Connect pods are running

Confirm that all the pods are ready and running:
  • To check portal, run:
    kubectl get ptl -n <namespace>
    
    
    NAME                         READY                                   STATUS    VERSION    RECONCILED VERSION   MESSAGE          AGE
    <portal instance name>       <ready pods>/<total number of pods>     Running   10.0.5.x   10.0.5.x-xxxx        Serving 1 site   14d
    
  • To check management, run:
    kubectl get mgmt -n <namespace>
    
    NAME                         READY                                   STATUS    VERSION    RECONCILED VERSION   AGE
    <management instance name>   <ready pods>/<total number of pods>     Running   10.0.5.x   10.0.5.x-xxxx       14d
The number of <ready pods> must match the <total number of pods>. If not all of the pods are ready, run
kubectl get <mgmt or ptl> -n <namespace> -o yaml
and check the status.conditions output, for example:
  conditions:
  - lastTransitionTime: "2022-09-27T21:37:21Z"
    message: "Management installation in progress. Not all services are ready, pending services: analytics-proxy, apim, client-downloads-server, juhu, ldap, ..."
    reason: na
    status: "True"
    type: Pending
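To view only the conditions rather than the full CR, a jsonpath query such as the following can be used (a sketch; substitute the instance name from the earlier output):
kubectl get <mgmt or ptl> <instance name> -n <namespace> -o jsonpath='{.status.conditions}'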
Also, check which pods are not running with:
kubectl get pods -n <namespace>
and review their logs:
kubectl logs <pod name> -n <namespace>
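If the namespace contains many pods, a field selector can narrow the output to pods that are not in the Running phase (a quick filter; note that completed pods are also listed):
kubectl get pods -n <namespace> --field-selector=status.phase!=Running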
Note: The warm-standby management subsystem has fewer pods than the active management subsystem. The active and warm-standby portal subsystems have the same number of pods.

Check the multiSiteHA sections of the management and portal CRs

Run kubectl describe on your management and portal CRs to confirm that the multiSiteHA specification and status sections of the CR are as expected. For example, on the warm-standby management CR you should see:
kubectl describe mgmt -n <namespace>

...
spec:
  Multi Site HA:
    Mode:  passive
    Replication Endpoint:
      Annotations:
        cert-manager.io/issuer:  ingress-issuer
      Hosts:
        Name:               mgmt-replication.apps.example.com
        Secret Name:        mgmt-replication-server
    Replication Peer FQDN:  mgmt-replication.apps.example.com
    Tls Client:
      Secret Name:  mgmt-replication-client
...
Status:
...
  Ha Mode:                               passive
...
  Multi Site HA:
    Remote Site Deployment Name:  management
    Remote Site Name:             passive
    Remote UUID:                  c736b702-b1ab-4fe2-b132-9c9d3b3a3bd3.9f986be1-14fc-4b3a-8eee-094217ce361e
    Replication Peer:             https://mgmt-replication.apps.example.com/


Check that the Hosts and Replication Peer properties are set correctly in the multiSiteHA section, and that the hostnames resolve correctly in each data center.
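For example, to confirm that the replication endpoint hostname resolves in each data center, you can run nslookup or dig from a machine at each site (the hostname shown is the example value from the CR output above; substitute your own Replication Peer FQDN):
nslookup mgmt-replication.apps.example.com
dig +short mgmt-replication.apps.example.com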

Check the ingress-ca X.509 certificates match

Replication fails if the two data centers do not have the same X.509 certificate in their ingress-ca secrets. Run the following command in both data centers to view the SHA256 fingerprint of the ingress-ca X.509 certificate, and check that the output is the same in both:
openssl x509 -noout -fingerprint -sha256 -in <(kubectl get secret ingress-ca -n <namespace> -o yaml | grep "^  tls.crt:" | awk '{print $2}' | base64 -d)
If you do not have the openssl command available, you can instead run only the kubectl part, which produces a larger output:
kubectl get secret ingress-ca -n <namespace> -o yaml | grep "^  tls.crt:" | awk '{print $2}' | base64 -d
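To compare the two outputs without checking them by eye, one approach is to save the fingerprint from each data center to a file, copy one of the files to the other data center, and run diff (the file names dc1-fingerprint.txt and dc2-fingerprint.txt are illustrative):
openssl x509 -noout -fingerprint -sha256 -in <(kubectl get secret ingress-ca -n <namespace> -o yaml | grep "^  tls.crt:" | awk '{print $2}' | base64 -d) > dc1-fingerprint.txt
diff dc1-fingerprint.txt dc2-fingerprint.txt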
If the outputs are different, follow these steps to synchronize the certificates:
  1. If you installed API Connect on Kubernetes or OpenShift using individual subsystem CRs, determine which data center has the ingress-ca Kubernetes cert-manager certificate object:
    kubectl get certificates -n <namespace> | grep ingress-ca
    This is your source data center.
    Note: If no certificates are returned when you run:
    kubectl get certificates -n <namespace> | grep ingress-ca
    on either data center, then a new ingress-ca Kubernetes cert-manager certificate object can be created on the active data center by using the ingress-ca certificate section of the installation YAML file helper_files/ingress-issuer-v1.yaml:
    apiVersion: cert-manager.io/v1
    kind: Certificate
    metadata:
      name: ingress-ca
      labels: {
        app.kubernetes.io/instance: "management",
        app.kubernetes.io/managed-by: "ibm-apiconnect",
        app.kubernetes.io/name: "ingress-ca"
      }
    spec:
      secretName: ingress-ca
      commonName: "ingress-ca"
      usages:
      - digital signature
      - key encipherment
      - cert sign
      isCA: true
      duration: 87600h # 10 years
      renewBefore: 720h # 30 days
      privateKey:
        rotationPolicy: Always
      issuerRef:
        name: selfsigning-issuer
        kind: Issuer

    The ingress-ca certificate can be lost in certain scenarios, such as redeployment of one of the data centers as warm-standby after a failure.

  2. If you installed API Connect on Cloud Pak for Integration, use the current active data center as the source data center, and the warm-standby as the target data center.
  3. Extract the ingress-ca secret from your source data center to a file called new-ca-issuer-secret.yaml:
    kubectl get secret ingress-ca -o yaml -n <namespace>  > new-ca-issuer-secret.yaml
  4. Edit the new-ca-issuer-secret.yaml file and remove the creationTimestamp, resourceVersion, uid, namespace, and managedFields properties, and remove the labels and annotations sections completely (a scripted alternative is sketched after these steps). The resulting contents should include the ingress-ca X.509 certificate data and the secret name:
    apiVersion: v1
    data:
      ca.crt: <long cert string>
      tls.crt: <long cert string>
      tls.key: <long cert string>
    kind: Secret
    metadata:
      name: ingress-ca
    type: kubernetes.io/tls
    
  5. Copy the new-ca-issuer-secret.yaml to the target data center.
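If you have the yq (version 4) command available, the manual editing in step 4 can instead be scripted; this is a sketch only, and you should verify the resulting file before copying it to the target data center:
kubectl get secret ingress-ca -n <namespace> -o yaml | yq 'del(.metadata.creationTimestamp) | del(.metadata.resourceVersion) | del(.metadata.uid) | del(.metadata.namespace) | del(.metadata.managedFields) | del(.metadata.labels) | del(.metadata.annotations)' > new-ca-issuer-secret.yaml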
Follow these steps to apply the extracted ingress-ca X.509 certificate on your target data center:
  1. To apply the new-ca-issuer-secret.yaml file, run:
    kubectl apply -f new-ca-issuer-secret.yaml -n <namespace>
  2. Regenerate all ingress-ca end-entity certificates:
    kubectl get secrets -n <namespace> -o custom-columns='NAME:.metadata.name,ISSUER:.metadata.annotations.cert-manager\.io/issuer-name' --no-headers=true | grep ingress-issuer | awk '{ print $1 }' | xargs kubectl delete secret -n <namespace>
    All affected pods should automatically restart. For more information about regenerating certificates, see: Renewing cert-manager controlled certificates.
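    To verify that the end-entity secrets were recreated and that the pods restarted, you can rerun the custom-columns query and check the pod list again (secret and pod names vary by deployment):
    kubectl get secrets -n <namespace> -o custom-columns='NAME:.metadata.name,ISSUER:.metadata.annotations.cert-manager\.io/issuer-name' --no-headers=true | grep ingress-issuer
    kubectl get pods -n <namespace>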

Check the operator and tunnel pod logs

The API Connect operator pod manages the 2DCDR deployment. The tunnel pods manage the communication between data centers.

Check the logs of the API Connect operator pod and search for the text multi to see any errors that are related to 2DCDR. For example:
kubectl logs <ibm-apiconnect operator pod> -n <operator namespace> | grep -i multi
The <ibm-apiconnect operator pod> has ibm-apiconnect in its name, and might be in a different namespace to your API Connect operand pods.
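If you are unsure which namespace contains the operator, a quick way to locate the pod is to search all namespaces:
kubectl get pods -A | grep ibm-apiconnect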

Check the logs of your API Connect tunnel pods. These pods always have tunnel in their name. For example:
kubectl logs portal-tunnel-0 -n <namespace> --since=10m

...
[ ws-tunnel stderr]   400 0eec87:fb8309:21aa2d 2022-12-02 12:27:33: 2022-12-02T12:27:33.786Z	INFO	tls	incoming request	{"remote-addr": "10.254.20.144:55042", "uri": "/portal-active-db-0/3060"}
[ ws-tunnel stderr]   400 0eec87:fb8309:21aa2d 2022-12-02 12:27:33: 2022-12-02T12:27:33.786Z	INFO	tls	connect to upstream	{"remote-addr": "10.254.20.144:55042", "uri": "/portal-active-db-0/3060"}
[ ws-tunnel stderr]   400 0eec87:fb8309:21aa2d 2022-12-02 12:27:33: 2022-12-02T12:27:33.817Z	INFO	tls	closing connection	{"remote-addr": "10.254.20.144:55042", "uri": "/portal-active-db-0/3060"}
[ ws-tunnel stderr]   400 0eec87:fb8309:21aa2d 2022-12-02 12:27:33: 10.254.20.144 - - [02/Dec/2022:12:27:33 +0000] "GET /portal-active-db-0/3060 HTTP/1.1" 101 0
Note: It is normal for the management tunnel pod to repeatedly log:
2022/12/02 12:29:17 http: TLS handshake error from 10.254.16.1:44812: EOF
This message can be filtered out with grep -v:
kubectl logs management-tunnel-574bdcd865-48zh6 -n <namespace> | grep -v "http: TLS handshake error from"