Troubleshooting a two data center deployment
What to check when a 2DCDR deployment is not starting or not replicating data.
Typical causes of 2DCDR failures are network problems between data centers, TLS certificate mismatches, or environmental problems on one of the data centers.
Note: For OpenShift users: The steps that are detailed in this topic use the Kubernetes kubectl command. On OpenShift, use the equivalent oc command in its place.

Check that API Connect pods are running
Confirm that all the pods are ready and running:
- To check portal, run:
  kubectl get ptl -n <namespace>

  NAME                     READY                                 STATUS    VERSION    RECONCILED VERSION   MESSAGE          AGE
  <portal instance name>   <ready pods>/<total number of pods>   Running   10.0.5.x   10.0.5.x-xxxx        Serving 1 site   14d
- To check management, run:
  kubectl get mgmt -n <namespace>

  NAME                         READY                                 STATUS    VERSION    RECONCILED VERSION   AGE
  <management instance name>   <ready pods>/<total number of pods>   Running   10.0.5.x   10.0.5.x-xxxx        14d
The number of <ready pods> must match the <total number of pods>. If not all the pods are in the ready state, run kubectl get <mgmt or ptl> -n <namespace> -o yaml and check the status.conditions output, for example:

conditions:
- lastTransitionTime: "2022-09-27T21:37:21Z"
  message: "Management installation in progress. Not all services are ready, pending services: analytics-proxy, apim, client-downloads-server, juhu, ldap, ..."
  reason: na
  status: "True"
  type: Pending
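If you prefer a more targeted query, you can extract just the conditions with a jsonpath expression; a minimal sketch, assuming a single management instance in the namespace:

# Print only the status conditions of the first management CR in the namespace.
kubectl get mgmt -n <namespace> -o jsonpath='{.items[0].status.conditions}'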
Also, check which pods are not running with:

kubectl get pods -n <namespace>

and review their logs:

kubectl logs <pod name> -n <namespace>
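To narrow the output to only the problem pods, a quick filter like the following can help (a convenience sketch, not an exhaustive readiness check):

# List pods that are not in the Running or Completed state.
kubectl get pods -n <namespace> --no-headers | grep -vE 'Running|Completed'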
Note: The warm-standby management subsystem has fewer pods than the active management subsystem. The active and warm-standby portal subsystems have the same number of pods.
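Because the portal pod counts should match on both sides, a simple count in each data center can confirm this; a sketch, assuming your portal pods include the portal instance name in their names:

# Count portal pods in this data center; run the same command in the other
# data center and compare the totals, which should be equal.
kubectl get pods -n <namespace> --no-headers | grep <portal instance name> | wc -l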
Check the multiSiteHA sections of the management and portal CRs
Run kubectl describe on your management and portal CRs to confirm that the multiSiteHA specification and status sections of the CR are as expected. For example, on the warm-standby management CR you should see:

kubectl describe mgmt -n <namespace>
...
Spec:
  Multi Site HA:
    Mode:  passive
    Replication Endpoint:
      Annotations:
        cert-manager.io/issuer:  ingress-issuer
      Hosts:
        Name:         mgmt-replication.apps.example.com
        Secret Name:  mgmt-replication-server
    Replication Peer FQDN:  mgmt-replication.apps.example.com
    Tls Client:
      Secret Name:  mgmt-replication-client
...
Status:
  ...
  Ha Mode:  passive
  ...
  Multi Site HA:
    Remote Site Deployment Name:  management
    Remote Site Name:             passive
    Remote UUID:                  c736b702-b1ab-4fe2-b132-9c9d3b3a3bd3.9f986be1-14fc-4b3a-8eee-094217ce361e
    Replication Peer:             https://mgmt-replication.apps.example.com/
Check that the Hosts and Replication Peer properties are set correctly in the multiSiteHA section, and that the hostnames resolve correctly in each data center.
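To verify that the hostnames resolve, you can run a DNS lookup from each data center; for example, using the hostname from the CR above:

# Confirm that the replication endpoint resolves from this data center.
nslookup mgmt-replication.apps.example.com

# Alternatively, with dig:
dig +short mgmt-replication.apps.example.com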
Check the ingress-ca X.509 certificates match
Replication fails if both data centers do not have the same X.509 certificate in their ingress-ca secrets. Run the following command in both data centers to see the X.509 certificate, and check that the output is the same:

openssl x509 -noout -fingerprint -sha256 -in <(kubectl get secret ingress-ca -n <namespace> -o yaml | grep "^  tls.crt:" | awk '{print $2}' | base64 -d)
If you do not have the openssl command available, you can instead run only the kubectl part, which produces a larger output:

kubectl get secret ingress-ca -n <namespace> -o yaml | grep "^  tls.crt:" | awk '{print $2}' | base64 -d
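If your kubeconfig contains a context for each data center, you can compare the fingerprints in a single step. A sketch, assuming hypothetical context names dc1 and dc2 (substitute your own):

# Print the SHA-256 fingerprint of the ingress-ca certificate from each
# data center; the two lines should be identical.
# "dc1" and "dc2" are hypothetical kubeconfig context names.
for ctx in dc1 dc2; do
  echo -n "$ctx: "
  kubectl --context "$ctx" get secret ingress-ca -n <namespace> -o jsonpath='{.data.tls\.crt}' \
    | base64 -d \
    | openssl x509 -noout -fingerprint -sha256
done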
If the outputs are different, follow these steps to synchronize the certificates:

- If you installed API Connect on Kubernetes or OpenShift using individual subsystem CRs, determine which data center has the ingress-ca Kubernetes cert-manager certificate object: this is your source data center.

  kubectl get certificates -n <namespace> | grep ingress-ca

  Note: If no certificates are returned when you run kubectl get certificates -n <namespace> | grep ingress-ca on either data center, then create a new ingress-ca Kubernetes cert-manager certificate object on the active data center by using the ingress-ca certificate section of the installation YAML file helper_files/ingress-issuer-v1.yaml:
  apiVersion: cert-manager.io/v1
  kind: Certificate
  metadata:
    name: ingress-ca
    labels: {
      app.kubernetes.io/instance: "management",
      app.kubernetes.io/managed-by: "ibm-apiconnect",
      app.kubernetes.io/name: "ingress-ca"
    }
  spec:
    secretName: ingress-ca
    commonName: "ingress-ca"
    usages:
    - digital signature
    - key encipherment
    - cert sign
    isCA: true
    duration: 87600h # 10 years
    renewBefore: 720h # 30 days
    privateKey:
      rotationPolicy: Always
    issuerRef:
      name: selfsigning-issuer
      kind: Issuer
  The ingress-ca certificate can be lost in certain scenarios, such as redeployment of one of the data centers as warm-standby after a failure.

- If you installed API Connect on Cloud Pak for Integration, use the current active data center as the source data center, and the warm-standby as the target data center.
- Extract the ingress-ca secret from your source data center to a file called new-ca-issuer-secret.yaml:

  kubectl get secret ingress-ca -o yaml -n <namespace> > new-ca-issuer-secret.yaml
- Edit the new-ca-issuer-secret.yaml file and remove the creationTimestamp, resourceVersion, uid, namespace, and managedFields properties. Remove the labels and annotations sections completely. The resulting contents should include the ingress-ca X.509 certificate, and the secret name:

  apiVersion: v1
  data:
    ca.crt: <long cert string>
    tls.crt: <long cert string>
    tls.key: <long cert string>
  kind: Secret
  metadata:
    name: ingress-ca
  type: kubernetes.io/tls
- Copy the new-ca-issuer-secret.yaml file to the target data center.
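Before applying the file on the target data center, you can first validate that it parses cleanly with a client-side dry run; a minimal sketch:

# Validate the edited secret file without changing anything on the cluster.
kubectl apply -f new-ca-issuer-secret.yaml -n <namespace> --dry-run=client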
Follow these steps to apply the extracted ingress-ca X.509 certificate on your target data center:

- To apply the new-ca-issuer-secret.yaml file, run:

  kubectl apply -f new-ca-issuer-secret.yaml -n <namespace>
- Regenerate all ingress-ca end-entity certificates:

  kubectl get secrets -n <namespace> -o custom-columns='NAME:.metadata.name,ISSUER:.metadata.annotations.cert-manager\.io/issuer-name' --no-headers=true | grep ingress-issuer | awk '{ print $1 }' | xargs kubectl delete secret -n <namespace>

  All affected pods should automatically restart. For more information about regenerating certificates, see Renewing cert-manager controlled certificates.
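After the pods restart, re-run the fingerprint check from the start of this section in both data centers to confirm that the ingress-ca certificates now match:

# The fingerprints reported by the two data centers should now be identical.
openssl x509 -noout -fingerprint -sha256 -in <(kubectl get secret ingress-ca -n <namespace> -o yaml | grep "^  tls.crt:" | awk '{print $2}' | base64 -d)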
Check the operator and tunnel pod logs
The API Connect operator pod manages the 2DCDR deployment. The tunnel pods manage the communication between data centers.
Check the logs of the API Connect operator pod and search for the text multi to see any errors that are related to 2DCDR. For example:

kubectl logs <ibm-apiconnect operator pod> -n <operator namespace> | grep -i multi
The <ibm-apiconnect operator pod> has ibm-apiconnect in its name, and might be in a different namespace from your API Connect operand pods.
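If you are unsure which namespace the operator runs in, you can search all namespaces for it; a quick sketch:

# Locate the API Connect operator pod across all namespaces.
kubectl get pods --all-namespaces | grep ibm-apiconnect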
Check the logs of your API Connect tunnel pods. These pods always have tunnel in their name. For example:

kubectl logs portal-tunnel-0 -n <namespace> --since=10m
...
[ ws-tunnel stderr] 400 0eec87:fb8309:21aa2d 2022-12-02 12:27:33: 2022-12-02T12:27:33.786Z INFO tls incoming request {"remote-addr": "10.254.20.144:55042", "uri": "/portal-active-db-0/3060"}
[ ws-tunnel stderr] 400 0eec87:fb8309:21aa2d 2022-12-02 12:27:33: 2022-12-02T12:27:33.786Z INFO tls connect to upstream {"remote-addr": "10.254.20.144:55042", "uri": "/portal-active-db-0/3060"}
[ ws-tunnel stderr] 400 0eec87:fb8309:21aa2d 2022-12-02 12:27:33: 2022-12-02T12:27:33.817Z INFO tls closing connection {"remote-addr": "10.254.20.144:55042", "uri": "/portal-active-db-0/3060"}
[ ws-tunnel stderr] 400 0eec87:fb8309:21aa2d 2022-12-02 12:27:33: 10.254.20.144 - - [02/Dec/2022:12:27:33 +0000] "GET /portal-active-db-0/3060 HTTP/1.1" 101 0
Note: It is normal for the management tunnel pod to repeatedly log:

2022/12/02 12:29:17 http: TLS handshake error from 10.254.16.1:44812: EOF

This message can be filtered out with grep -v:

kubectl logs management-tunnel-574bdcd865-48zh6 -n <namespace> | grep -v "http: TLS handshake error from"
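To watch the tunnel log live while suppressing the same noise, you can combine the -f flag with the filter:

# Stream the tunnel log, filtering out the benign TLS handshake messages.
kubectl logs -f management-tunnel-574bdcd865-48zh6 -n <namespace> | grep -v "http: TLS handshake error from"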