Asynchronous mirroring
Asynchronous mirroring allows the local site to be updated immediately and the remote site to be updated as bandwidth allows. The information is cached and sent later, as network resources become available. While this can greatly improve application response time, there is some risk of data loss.
Network bandwidth
When synchronous mirroring is used, you need to provide enough network bandwidth to handle the data mirroring workload at its peak, in order to ensure acceptable response time. However, when asynchronous mirroring is used, you may only need to provide enough network bandwidth for slightly more than the average data mirroring workload. How much you need depends on how far the peak differs from the average, and on whether the production site cache is large enough to hold the excess write requests during peak periods. In most cases, asynchronous mirroring requires a less expensive, lower-bandwidth network than synchronous mirroring. For example, if a synchronous solution requires a network that is only 10 percent utilized most of the time, but the same workload can be mirrored asynchronously over a low-bandwidth network that is 75 percent utilized most of the time, then asynchronous mirroring may be a better choice than synchronous mirroring.
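The relationship between peak workload, average workload, and cache size can be illustrated with a small calculation. The following Python sketch is not part of AIX or PowerHA SystemMirror; the function name, the sample write rates, and the link speed are hypothetical values chosen only to show how a backlog builds up during a peak and drains afterward.

```python
def peak_backlog_mb(write_rates_mb_s, link_mb_s, interval_s=1.0):
    """Return the largest backlog (in MB) that builds up in the cache.

    write_rates_mb_s: application write rate for each interval (MB/s).
    link_mb_s: bandwidth available for mirroring (MB/s).
    Both inputs are illustrative assumptions, not measured values.
    """
    backlog = 0.0
    peak = 0.0
    for rate in write_rates_mb_s:
        # New writes arrive while the link drains what it can;
        # the backlog can never drop below zero.
        backlog = max(0.0, backlog + (rate - link_mb_s) * interval_s)
        peak = max(peak, backlog)
    return peak


# Hypothetical workload: a short burst of 40 MB/s inside a 5 MB/s baseline.
workload = [5, 5, 40, 40, 5, 5, 5, 5]
average = sum(workload) / len(workload)      # 13.75 MB/s
print(average)
print(peak_backlog_mb(workload, 15.0))       # link slightly above average: 50.0
print(peak_backlog_mb(workload, 40.0))       # link sized for the peak: 0.0
```

In this example, a link sized slightly above the average workload is sufficient only if the production site cache can hold at least 50 MB of outstanding writes.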
Network latency
Asynchronous mirroring allows data mirroring at the disaster recovery site to lag behind application writes that occur at the production site. This can greatly improve application response time by having AIX® LVM tell the application that its write has completed after the data has been written to the local disks, but without having to wait for the data to be written to the remote disks. The remote physical volume write requests are cached at the production site and mirrored to the disaster recovery site over a longer period of time, effectively removing the effects of network latency, which in turn allows the sites to be much farther apart without impacting application response time.
If the remote data mirroring keeps up with demand well enough to prevent the cache from filling up, then there may not be a noticeable delay in application response time. However, once the caching limit is reached, application writes have to wait until there is space in the cache. In a write-intensive application workload, the remote mirroring would quickly reach the cache limit and application response time would degrade. In such an environment, asynchronous mirroring does not offer any improvement over synchronous mirroring and, because of the risk of data loss, is not the best choice for mirroring.
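The effect of the cache limit can be sketched in the same way. The following Python example is a simplified model, not the actual LVM caching algorithm: it assumes fixed intervals, counts the intervals in which the cache is full, and does not model how delayed writes are reissued later. All names and numbers are hypothetical.

```python
def count_stalled_intervals(write_rates_mb_s, link_mb_s, cache_mb, interval_s=1.0):
    """Count intervals in which application writes must wait for cache space.

    Simplified model (an assumption for illustration): when the backlog
    would exceed cache_mb, the interval is counted as stalled and the
    backlog is held at the cache limit.
    """
    backlog = 0.0
    stalled = 0
    for rate in write_rates_mb_s:
        backlog = max(0.0, backlog + (rate - link_mb_s) * interval_s)
        if backlog > cache_mb:
            # The cache is full, so some writes wait on the network
            # just as they would with synchronous mirroring.
            stalled += 1
            backlog = float(cache_mb)
    return stalled, backlog


# Hypothetical write-intensive workload: sustained 40 MB/s over a 15 MB/s link.
sustained = [40] * 5
print(count_stalled_intervals(sustained, 15.0, cache_mb=60))    # (3, 60.0)
print(count_stalled_intervals(sustained, 15.0, cache_mb=200))   # (0, 125.0)
```

With the smaller cache, most intervals stall, which corresponds to the case where asynchronous mirroring offers no response time advantage. A larger cache only postpones the stall if the workload stays above the link bandwidth.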
Preventing data loss
Asynchronous mirroring creates the possibility of some amount of data loss from a production site disaster. If the remote site mirroring lags behind the local site, then you run the risk of losing that cached data in the event of a disaster. You need to determine how much data you are willing to risk losing.
Remote physical volume write requests are cached in permanent storage at the production site until they are written to disk at the disaster recovery site. After a node crash, you can recover these write requests. For example, suppose that a node crashes while it has the volume group varied online. You can recover the crashed node, bring the volume group back online, and have the asynchronous mirroring pick up where it stopped, with no more data loss than when using ordinary volume groups.
If you stop the application workload and take a volume group offline, all outstanding remote physical volume writes are written to disk at the remote site. For example, if you take the production site down for planned maintenance, you do not want the volume group to be brought online at the disaster recovery site while there are still outstanding writes sitting in the cache at the production site. By forcing the remote site to be brought up to date at the time that the volume group is taken offline, the application workload avoids accessing back-level data by mistake. Additionally, graceful PowerHA® SystemMirror® failover of asynchronously geographically mirrored volume groups from the production site to the disaster recovery site can take place without any data loss. The drawback to this approach is that it will take longer for the volume group to be taken offline, when the cache contains a backlog of remote physical volume write requests. Depending on how big the backlog is, it can take a very long time for all of the writes in the cache to be written to the disks at the remote site. And this, in turn, can cause all types of graceful failovers, whether they are local peer or site failovers, to take a very long time.
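As a rough guide to how long the volume group might take to go offline, the backlog can be divided by the bandwidth available for mirroring. The following Python sketch shows only that arithmetic; the function name and the sample figures are hypothetical, and the estimate ignores protocol overhead and competing network traffic.

```python
def drain_time_seconds(backlog_mb, link_mb_s):
    """Estimate the time needed to write the cached backlog to the remote site.

    backlog_mb: outstanding remote physical volume writes in the cache (MB).
    link_mb_s: bandwidth available for mirroring (MB/s).
    Both values are assumptions supplied by the caller.
    """
    if link_mb_s <= 0:
        raise ValueError("link bandwidth must be greater than zero")
    return backlog_mb / link_mb_s


# Hypothetical example: 9000 MB of cached writes over a 15 MB/s link.
seconds = drain_time_seconds(9000, 15)
print(seconds / 60)   # 10.0 minutes
```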
The only time when data loss may occur while using asynchronous mirroring, beyond what would be expected when using ordinary volume groups, is when the entire production site suddenly fails before the mirroring to the disaster recovery site has had a chance to catch up. Whether or not the data is really lost depends on the circumstances of the failure, and in some cases on how you want to deal with those circumstances. For example, a flood or fire can destroy all of the hardware at the production site. In that scenario, data loss would almost certainly occur. The lost data would consist of all the non-mirrored remote physical volume writes that were in the cache on the production site at the time of the failure. In another situation, a power outage can bring down the entire production site without destroying any hardware. In this scenario, the data is still there, but it cannot be accessed until the power is restored and the system is brought back online. You can choose to wait for the production site to be recovered, so that you avoid losing the non-mirrored data, or you can move your application workload to the disaster recovery site, with some amount of data loss.
Data divergence
Data divergence is a state where each site's disks contain data updates that have not been mirrored to the other site. For example, if a disaster destroys the disks at the production site, then the only copy of the data exists at the disaster recovery site. With asynchronous mirroring, the possibility exists that this data is back level, due to data caching. However, it is possible for the production site to fail without hardware damage. In this case, the data is still there, but cannot be accessed until the production site is brought back online. You can either wait for the production site to be brought back online or move the application workload over to the disaster recovery site. If you move the application workload over to the disaster recovery site, you risk data divergence as the application begins to use the back-level data on the disaster recovery site disks. You need to determine what action PowerHA SystemMirror should take if the production site goes down while the disaster recovery site contains back-level data.
Once data divergence occurs, you have to decide how you want to recover. If very few or no transactions are made at the disaster recovery site, you should be able to move back to the production site with few complications. However, if your disaster recovery site has been running your applications for a long time, you cannot simply go back to your production site without risking some of your data.