Recovering from a failover of a two data center deployment

How to recover a two data center disaster recovery deployment on VMware.

Before you begin

Ensure that you understand the concepts of a two data center disaster recovery deployment in API Connect. For more information, see A two data center deployment strategy on VMware.

About this task

After a failure has been resolved, the failed data center can be brought back online and re-linked to the currently active data center. It is important to do this as soon as possible, in order to reinstate disaster recovery. Otherwise, if another failure occurs before the failed data center is brought back online, you could experience a complete outage.

The failed data center can be brought back online as the new active primary data center, or it can be brought back as a passive secondary data center while the currently active data center is kept as the primary one. This decision depends on your company's policies for recovering from a failure.

The amount of data that is sent when recovering from a failover depends on how long the outage lasted and how much activity there was during that period.

Important:
  • If the active data center is offline, do not change the multi-site-ha-enabled setting to false on the passive data center, as this action deletes all the data from the databases. If you want to revert to a single data center topology, you must change the multi-site-ha-enabled setting to false on the active data center, and then redeploy the passive data center. If the active data center is offline, you must first change the passive data center to be active, and then change the multi-site-ha-enabled setting to false on the now active data center.
  • If the passive data center is offline for more than 24 hours, there can be disk space issues on the active data center, so you must revert your deployment to a single data center topology. To revert to a single data center topology, you must change the multi-site-ha-enabled setting to false on the active data center, as shown in the example after this list. When the passive data center has been redeployed, the multi-site-ha-enabled setting can be reapplied to the active site.
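
  For example, reverting the active data center to a single data center topology might look like the following sketch, where mgmt_dallas is the example management service name that is used later in this topic (your service name might differ):
    apicup subsys set mgmt_dallas multi-site-ha-enabled=false
    apicup subsys install mgmt_dallas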

Procedure

  • Recovering from an API Manager failover

    The following instructions show how to bring data center one (DC1) back as the primary (active) data center, and assume that DC1 is now online and in the passive state. If you prefer, you can keep DC1 as the secondary data center, and leave data center two (DC2) as the primary (active) data center.

    1. Set DC2 (in this example, located in Raleigh) to be passive.
      Run apicup to change the multi-site-ha-mode property to passive on the Raleigh data center, for example:
      apicup subsys set mgmt_raleigh multi-site-ha-mode=passive
      where
      • mgmt_raleigh is the name of the management service on DC2.
    2. Run apicup to update the settings on DC2, for example:
      apicup subsys install mgmt_raleigh
    3. Log in to the virtual machine management subsystem on DC2 by using an SSH tool, and run the following command to check that the API Manager service is ready for traffic:
      kubectl describe ServiceName
      The service is ready for traffic when the Ha mode part of the Status object is set to passive. For example:
      Status:
        ...
        Ha mode:                            passive
        ...
      You must also ensure that the database on DC2 has fully replicated before you proceed.
    4. Set DC1 (in this example, located in Dallas) to be active.
      Run apicup to change the multi-site-ha-mode property to active on the Dallas data center, for example:
      apicup subsys set mgmt_dallas multi-site-ha-mode=active
      where
      • mgmt_dallas is the name of the management service on DC1.
    5. Run apicup to update the settings on DC1.
      For example:
      apicup subsys install mgmt_dallas
    6. Update the dynamic router to redirect all traffic to DC1 instead of DC2. Note that until the DC1 site becomes active, the UIs might not be operational.
    7. Log in to the virtual machine management subsystem on DC1 by using an SSH tool, and run the following command to check that the API Manager service is ready for traffic (a polling sketch that automates this check follows this procedure):
      kubectl describe ServiceName
      The service is ready for traffic when the Ha mode part of the Status object is set to active. For example:
      Status:
        ...
        Ha mode:                            active
        ...
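
    In steps 3 and 7, you might need to run the describe command several times while the Ha mode value changes. A minimal polling sketch, such as the following, can save some typing; ServiceName is the same placeholder as in the steps above, and the 30-second interval is an arbitrary choice:
      # Print the Ha mode line of the service status every 30 seconds.
      # Stop with Ctrl+C when the expected value (passive or active) appears.
      while true; do
        kubectl describe ServiceName | grep 'Ha mode'
        sleep 30
      done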
  • Recovering from a Developer Portal failover

    When you recover from a Developer Portal failover, the amount of data that is sent from the now passive data center (in this example, DC2) back to the newly recovered active data center (in this example, DC1) depends on how much time has passed and how many changes were made in the meantime.

    If possible, the Portal sends an Incremental State Transfer (IST), which is a replay of the transactions that have happened since the failed data center, DC1, became unreachable. The IST can grow to be up to 3 GB in size. If the IST grows larger than 3 GB, the Portal sends a State Snapshot Transfer (SST) instead, which is a full copy of the whole database. How long this process takes depends on how many Portal sites there are, how much content there is in each of those sites, how active they are (in terms of the number of transactions), and the network speed between the data centers. However, it is likely to take in the range of 5 - 60 minutes.

    You can check whether the Developer Portal servers are in sync by running the status command in the portal-www pod admin container. If all servers show as primary, then they are in sync and traffic can be directed wherever needed. For information about running the status command, see Verifying deployment of the Developer Portal subsystem.
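
    For example, from the Portal subsystem virtual machine, the check might look like the following sketch; the pod name is a placeholder, and the exact invocation is described in Verifying deployment of the Developer Portal subsystem:
      # Run the status command in the admin container of a portal-www pod.
      kubectl exec -it portal-www-pod-name -c admin -- bash -ic status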

    Note:
    • If you want to monitor the recovery process, you can log in to the virtual machine Portal subsystem by using an SSH tool, and run the following command to check the status of the Portal pods in the PortalService CR file:
      kubectl describe ServiceName
      For more details about the status information in a PortalCluster CR file, see the Example section. A short monitoring sketch also follows this note.
    • Do not edit the PortalService CR file during the recovery process.
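
    For example, a quick way to monitor progress from the Portal subsystem virtual machine might look like the following sketch; ServiceName is the placeholder that is used in the note above:
      # Check that the Portal pods are running and ready.
      kubectl get pods
      # Inspect the service status, including the Ha mode value.
      kubectl describe ServiceName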

    The following instructions show how to bring DC1 back as the active primary data center, and switch DC2 to be the passive secondary data center. Again, if you prefer, you can keep DC2 as the active primary data center and leave DC1 as the passive secondary data center.

    1. Set DC2 (in this example, located in Raleigh) to be passive.
      Run apicup to change the multi-site-ha-mode property to passive on the Raleigh data center, for example:
      apicup subsys set port_raleigh multi-site-ha-mode=passive
      where
      • port_raleigh is the name of the Portal service on DC2.
    2. Run apicup to update the settings on the services on DC2.
      For example:
      apicup subsys install port_raleigh --skip-health-check
      Important: Do not wait for the Status on the currently active data center (DC2) to become passive before you change DC1 to be active. Change the DC2 multi-site-ha-mode setting to passive, then run the kubectl describe ServiceName command, and as soon as the Ha mode value reports progressing to passive, change DC1 to be active. The sketch after this procedure shows this sequence end to end.
    3. Set DC1 (in this example, located in Dallas) to be active.
      Run apicup to change the multi-site-ha-mode property to active on the Dallas data center, for example:
      apicup subsys set port_dallas multi-site-ha-mode=active
      where
      • port_dallas is the name of the Portal service on DC1.
    4. Run apicup to update the settings on the services on DC1.
      For example:
      apicup subsys install port_dallas --skip-health-check
    5. Update the dynamic router to redirect all traffic to DC1 instead of DC2.
    6. Log in to the virtual machine Portal subsystem on DC1 by using an SSH tool, and run the following command to check that the Developer Portal service is ready for traffic:
      kubectl describe ServiceName
      The service is ready for traffic when the Ha mode part of the Status object is set to active. For example:
      Status:
        ...
        Ha mode:                            active
        ...
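
    To see the timing-sensitive part of this procedure end to end, the following sketch strings the example commands together. The apicup commands run from your project directory, and the describe command runs on the DC2 Portal subsystem virtual machine (ServiceName is the placeholder that is used in the steps above):
      # 1. Make DC2 passive; do not wait for the change to complete.
      apicup subsys set port_raleigh multi-site-ha-mode=passive
      apicup subsys install port_raleigh --skip-health-check
      # 2. On the DC2 Portal subsystem VM, watch for "progressing to passive".
      kubectl describe ServiceName | grep 'Ha mode'
      # 3. As soon as that state appears, make DC1 active.
      apicup subsys set port_dallas multi-site-ha-mode=active
      apicup subsys install port_dallas --skip-health-check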