Upgrading a two data center deployment

How to upgrade a two data center disaster recovery (DR) deployment on Kubernetes and OpenShift.

Full instructions about how to upgrade your API Connect deployment are contained in the main upgrade topic for your platform. However, there are some important additional instructions if you have a two data center DR deployment; see the following sections:

For general information about two data center disaster recovery in API Connect, see A two data center deployment strategy on Kubernetes and OpenShift.

Key points when upgrading a two data center DR deployment

Both data centers must run the same version of API Connect, down to the specific fix pack and iFix level. Each data center is effectively an identical deployment instance, so they must be kept at an identical code level. Therefore, both data centers must be upgraded in the same maintenance window.

You must upgrade each subsystem in a specific order: on the passive (standby) data center first, and then on the active (primary) data center. This ordering is important because replication between the primary and standby subsystems might fail if the passive subsystem is running an older version than the active subsystem. Use the following instructions to upgrade your subsystems, paying particular attention to the detailed instructions for the Management subsystem:
  1. The Management subsystem cluster:
    1. Apply the v10.0.1.next CRDs on the passive cluster. For example:
      kubectl apply -f ibm-apiconnect-crds.yaml
    2. Update the passive ibm-apiconnect operator image to v10.0.1.next (update only the operator at this stage). For example:
      kubectl apply -f ibm-apiconnect.yaml
      Note: If your two data center deployment used custom namespaces during the installation, you must ensure that those same namespaces are used in all of the files that are needed for the upgrade that require namespaces, for example the ibm-apiconnect operator yaml file.
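      If you use a custom namespace, the following is a minimal sketch of the fields to check in ibm-apiconnect.yaml before you apply it (the namespace value here is illustrative; substitute your own):
        # Each namespaced resource in the operator manifest must reference your custom namespace
        metadata:
          name: ibm-apiconnect
          namespace: my-custom-namespace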
    3. Wait for the Management subsystem status to become Running. Note that the new operator should reconcile the old version operand, and this might take some time. For example:
      kubectl get mgmt -n passive_mgmt_cluster
      NAME         READY   STATUS    VERSION              RECONCILED VERSION       AGE
      management   7/7     Running   10.0.1.2-ifix2-eus   10.0.1.2-ifix2-100-eus   4h43m
    4. Apply the v10.0.1.next CRDs on the active cluster. For example:
      kubectl apply -f ibm-apiconnect-crds.yaml
    5. Update the active ibm-apiconnect operator image to v10.0.1.next (update only the operator at this stage). For example:
      kubectl apply -f ibm-apiconnect.yaml
      Note: If your two data center deployment used custom namespaces during the installation, you must ensure that those same namespaces are used in all of the files that are needed for the upgrade that require namespaces, for example the ibm-apiconnect operator yaml file.
    6. Wait for the Management subsystem status to become Running. Note that the new operator should reconcile the old version operand, and this might take some time. For example:
      kubectl get mgmt -n active_mgmt_cluster
      NAME         READY   STATUS    VERSION              RECONCILED VERSION       AGE
      management   7/7     Running   10.0.1.2-ifix2-eus   10.0.1.2-ifix2-100-eus   4h43m
    7. Remove the multi-site HA CR section from the passive Management subsystem. As soon as the multi-site HA configuration is removed, the passive Management subsystem converts itself to an independent cluster where you can see all the microservices.
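      For example, a sketch of removing the section by editing the CR in place (the CR name management is illustrative; for the shape of the multiSiteHA section, see the example later in this topic):
        kubectl -n passive_mgmt_cluster edit mgmt management
        # In the editor, delete the entire spec.multiSiteHA block, then save and exit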
    8. Wait for the Management subsystem status to become Running on the passive Management subsystem. For example:
      kubectl get mgmt -n passive_mgmt_cluster
      NAME         READY   STATUS    VERSION              RECONCILED VERSION       AGE
      management   7/7     Running   10.0.1.2-ifix2-eus   10.0.1.2-ifix2-100-eus   4h43m
    9. Remove the multi-site HA CR section from the active Management subsystem. As soon as the multi-site HA configuration is removed, the active Management subsystem converts itself to an independent cluster where you can see all the microservices.
    10. Wait for the Management subsystem status to become Running on the active Management subsystem. For example:
      kubectl get mgmt -n active_mgmt_cluster
      NAME         READY   STATUS    VERSION              RECONCILED VERSION       AGE
      management   7/7     Running   10.0.1.2-ifix2-eus   10.0.1.2-ifix2-100-eus   4h43m
      At this point you have two independent Management clusters.
    11. Now upgrade both clusters to 10.0.1.<next>-eus in parallel. For example:
      kubectl -n $NAMESPACE edit mgmt mgmt_cluster
      Then set the version field in the spec, for example:
      version: 10.0.1.5-eus
    12. For both clusters, wait for the Management subsystem status to become Running. For example:
      kubectl get mgmt -n mgmt_cluster
      NAME         READY   STATUS    VERSION        RECONCILED VERSION   AGE
      management   16/16   Running   10.0.1.5-eus   10.0.1.5-xxx-eus    5h46m
    13. Now, add the multi-site HA CR section back into the previously active Management subsystem.
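      For example, a sketch of the section to add back, with the mode set to active (the host names and secret names are illustrative, mirroring the passive example later in this topic):
        multiSiteHA:
          mode: active
          replicationEndpoint:
            annotations:
              cert-manager.io/issuer: ingress-issuer
            hosts:
            - name: mgrreplicationdallas.cluster1.example.com
              secretName: dallas-mgr-replication-worker-1
          replicationPeerFQDN: mgrreplicationraleigh.cluster2.example.com
          tlsClient:
            secretName: mgr-replication-client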
    14. Wait for the Management subsystem status to become Running on the active Management subsystem. For example:
      kubectl get mgmt -n active_mgmt_cluster
      NAME         READY   STATUS    VERSION        RECONCILED VERSION   AGE
      management   17/17   Running   10.0.1.5-eus   10.0.1.5-xxxx-eus    5h56m
    15. Add the multi-site HA CR section back into the previously passive Management subsystem.
    16. Wait for the Management subsystem status to become Running on the passive Management subsystem. For example:
      kubectl get mgmt -n passive_mgmt_cluster
      NAME         READY   STATUS    VERSION              RECONCILED VERSION       AGE
      management   7/7     Running   10.0.1.5-eus         10.0.1.5-xxxx-eus        5h46m
      You should see a limited set of microservices on the passive Management subsystem, for example 7/7.
    17. Run the following command on the active Management subsystem to check that the API Manager service is ready for traffic:
      kubectl describe ServiceName
      where ServiceName is the name of your Management subsystem CR (management in the earlier examples). The service is ready for traffic when the Ha mode part of the Status object is set to active. For example:
      Status:
        ...
        Ha mode:                            active
        ...
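      You can also script this check; the following is a sketch, assuming that the status field that kubectl describe renders as "Ha mode" is haMode (the CR name management is illustrative):
        kubectl -n active_mgmt_cluster get mgmt management -o jsonpath='{.status.haMode}'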
  2. The Gateway subsystem cluster.
  3. The Portal subsystem cluster; passive followed by active.
    Note: If you upgrade from a version earlier than 10.0.1.1-eus to Version 10.0.1.1 or later, you must upgrade the passive and active sites in parallel. With this method, both sites go down and then restart their pods at the new version in parallel, so that they can cluster together.
  4. The Analytics subsystem cluster.

You must verify that each subsystem has been updated successfully on both the passive and the active data centers before upgrading the next subsystem.
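For example, a sketch of checking all of the API Connect subsystems in a namespace at once, assuming that the installed CRDs register the apic category (the namespace name is illustrative):
  kubectl get apic -n mgmt_cluster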

The Developer Portal upgrade takes place in two stages. First, the system is upgraded, which updates the internal deployment scripts, packages, and so on. Then, when the system upgrade is complete and all of the pods in both data centers are at the new version, the individual Portal sites are upgraded to the new level of the Portal UI code. The site upgrades start automatically, a few at a time, spread across both data centers so that the upgrade can run in parallel.
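To confirm that the first stage is complete in each data center, you can check that the Portal subsystem reports the new version; a sketch, assuming the Portal CR short name is ptl (the namespace name is illustrative):
  kubectl get ptl -n portal_cluster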

How to upgrade when one data center is down

In this scenario, one data center is down and you have failed over to the other data center in your deployment. You are now running a two data center deployment, but only one data center is online, in active mode. However, a critical fix must be applied before you can recover the failed data center. In this case, you must apply the upgrade to the remaining active data center. Then, when you are ready to bring the failed data center back online, you must first remove the multi-site HA section from the failed data center's CR, and then add it back in again, making that data center passive. This triggers a re-sync with the active data center, and ensures that the passive data center contains the latest data.

On Kubernetes, this involves the following steps:
  1. Remove the multiSiteHA section from the subsystem CR on the passive data center.
  2. Apply the updated CR file to the subsystem on the passive data center, for example:
    kubectl apply -f subsystem_cr.yaml -n <namespace>
    Where subsystem_cr.yaml is the file name of the subsystem CR, and <namespace> is the target installation namespace in the Kubernetes cluster for the passive data center.
  3. Add the multiSiteHA section back into the subsystem CR on the passive data center, for example:
    multiSiteHA:
      mode: passive
      replicationEndpoint:
        annotations:
          cert-manager.io/issuer: ingress-issuer
        hosts:
        - name: mgrreplicationraleigh.cluster2.example.com
          secretName: raleigh-mgr-replication-worker-1
      replicationPeerFQDN: mgrreplicationdallas.cluster1.example.com
      tlsClient:
        secretName: mgr-replication-client
    
  4. Apply the updated CR file to the subsystem on the passive data center again, for example:
    kubectl apply -f subsystem_cr.yaml -n <namespace>
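After you apply the CR, the re-sync with the active data center starts. As elsewhere in this procedure, you can confirm that the passive data center has settled by waiting for its subsystem status to become Running; for the Management subsystem, for example:
  kubectl get mgmt -n <namespace>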