A closer look at business continuity and disaster recovery planning when using Databases for PostgreSQL
Everyone loves building the next cool new thing, right? But an often-overlooked component of delivering and maintaining new applications and solutions is understanding how your system behaves and responds when disaster strikes and the order of operations to return to a healthy status.
This exercise is called Business Continuity and Disaster Recovery Planning, and it should be on every team's checklist to be ticked off before going to production. For more background on disaster recovery and business continuity planning, see "Disaster Recovery: An Introduction."
In this article, I'll show you an example of how a system can respond during disruption of normal operations with IBM Cloud Databases for PostgreSQL. We will discuss high availability, read replicas, back-up architectures, and methods for global failover. By the end, you should have a good understanding of Databases for PostgreSQL's availability architecture and capabilities that enable your team to act with confidence during an outage.
Key concepts in business continuity planning
- Recovery Point Objective (RPO): The point in time (in the past) to which your environment recovers, which indicates the amount of potential data loss or age of data that must be recovered from the disaster recovery backups for normal operations to resume.
- Recovery Time Objective (RTO): The elapsed time between the disaster being declared and the restoration of your production environment service.
- High Availability: Database deployment architecture that provides the ability to withstand all outages (planned, unplanned, and disasters) and to provide continuous processing for all important applications.
- Write Ahead Logging (WAL): One way that PostgreSQL ensures data integrity. Effectively, changes to data files are only flushed after they have been logged as committed to permanent storage.
How high availability works in Databases for PostgreSQL
With Databases for PostgreSQL, you get high availability on your database from the moment of creation. Our high availability implementation uses a leader-follower pattern with asynchronous replication and a distributed consensus mechanism, backed by etcd, to maintain cluster state and handle failover.
In an IBM Cloud Multi-Zone Region (MZR), the leader and follower are always deployed in two separate data centers with 99.99% SLA. Databases for PostgreSQL provides one endpoint and seamlessly handles failover between leader and follower in the case of a failover event. For anyone developing cloud native applications, you should make sure to embed error detection and retry logic on the client side to handle pesky transient errors in your system.
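The client-side retry advice above can be sketched in a few lines. This is an illustration only: TransientError stands in for whatever exception your driver raises on a dropped connection (for psycopg2, that would typically be OperationalError), and the backoff parameters are example values, not recommendations.

```python
import random
import time

class TransientError(Exception):
    """Placeholder for a driver's transient connection error."""
    pass

def with_retries(operation, attempts=5, base_delay=0.5):
    """Run `operation`, retrying on transient errors with exponential
    backoff plus jitter, so a brief failover doesn't surface to users."""
    for attempt in range(attempts):
        try:
            return operation()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of retries: surface the error to the caller
            # back off roughly 0.5s, 1s, 2s, ... with jitter to avoid
            # every client retrying at the same instant after a failover
            time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))
```

You would wrap each query or transaction in `with_retries` so that the brief interruption during an automatic failover looks like a slow request rather than an error.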
High availability is great for helping users manage disruptions in a single region, but what other tooling and capabilities do you have as a Databases for PostgreSQL consumer on the IBM Cloud to get your application healthy if an entire region is offline?
Good news! Databases for PostgreSQL offers Read Replicas and Point-in-Time-Recovery (PITR). Let's take a deep dive into how these features actually work and then evaluate their usage through the lens of disaster scenarios.
Databases for PostgreSQL Backups 101
Databases for PostgreSQL automatically configures and runs backups on your database. This data is stored in Cross-Regional IBM Cloud Object Storage, which means that the fault domains are spread across geographically distinct regions of the IBM Cloud. For example, a Databases for PostgreSQL instance in US South would be backed up to the US Cross-Regional Object Storage that spans San Jose, Dallas, and Washington D.C.
Databases for PostgreSQL backups are automatically configured to be able to perform PITR for any time in the last 7 days. The deployment performs continuous archiving and can replay transactions to restore a new deployment from a backup to any point within that 7-day window. Neat, right?
The type of each scheduled backup is driven by the number of successful backups of a given type, not by the number of elapsed days. Under normal operation, with scheduled backups succeeding daily, this works out to a seven-day cycle: a full backup is taken on Day 1 and differential backups are taken on Days 2 to 7, while on-demand backups initiated by a user are incremental, capturing changes since the last differential. We then repeat the cycle on Day 8 with another full backup.
One thing to keep in mind—if backups fail in that cycle, that schedule may be pushed forward. For instance, let's say backups fail for 3 days. Day 1 is a full backup, but on Days 2 to 4, backups fail. This means that on Days 5 to 10, differential backups will be taken; then on Day 11, the backup cycle will begin again with a full backup.
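The count-driven schedule above can be modeled in a few lines. This is a sketch under the stated assumptions (one full backup followed by six differentials per cycle, failed attempts not advancing the count), not the actual backup tooling:

```python
def scheduled_backup_type(successful_in_cycle):
    """Return the type of the next scheduled backup, given how many
    scheduled backups have already succeeded. The cycle is one full
    backup followed by six differentials; failures don't advance it."""
    return "full" if successful_in_cycle % 7 == 0 else "differential"

def simulate(days_failing):
    """Walk a daily schedule where the given day numbers fail, and
    return a day -> backup-type map through the next full backup."""
    schedule = {}
    succeeded = 0
    day = 1
    while succeeded < 8:  # stop after the next cycle's full backup
        if day in days_failing:
            schedule[day] = "failed"
        else:
            schedule[day] = scheduled_backup_type(succeeded)
            succeeded += 1
        day += 1
    return schedule
```

Running `simulate({2, 3, 4})` reproduces the failure scenario described below: a full backup on Day 1, failures on Days 2 to 4, differentials on Days 5 to 10, and a new full backup on Day 11.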
Restoring a backup
Now that we understand how the backups get taken, let's discuss how they get restored.
Restoring a backup requires one of the following, depending on requirements:
- Restore the last full backup.
- Restore the last full backup + 1 differential backup.
- Restore the last full backup + n-incremental backups.
- Restore the last full backup + 1 differential backup + n-incremental backups.
Restores need to fetch and replay all the transaction logs from the duration of the backup. The performance of a restore will be dictated by the size and write activity of the database. For example, a 500GB database that is under low write load during the backup will take less time to restore than a 500GB database under heavy write load during the backup.
What happens when you click "Restore"
The restoration procedure will begin based on the section above. The newest backup that is before the requested restore point (timestamp or end of archived transaction logs) will be automatically determined and restored. After that point in time is determined, the subsequent transaction logs will then be retrieved and applied to restore the database to the requested restore point.
Keep in mind that restore performance is dictated by the same constraints as the standard restores above, plus the additional time to retrieve and replay the subsequent transaction logs after the initially required backup is restored. This time is dictated by the time since the last completed backup prior to the restore point, plus the level of write activity on the database since then. Therefore, higher write workloads will create more WAL files and increase the time to restore.
How this impacts RPO and RTO
The point in time to which you can recover is effectively tied to WAL file rotation and archive time. WAL files are rotated every 16MB of data or at the archive_timeout value (set to 30 minutes on Databases for PostgreSQL), whichever happens first. That means RPO will be, at worst, 30 minutes under normal operation, and it will be less for any database with a non-trivial write load. Changing the value of archive_timeout allows further configurability here: decreasing it will reduce RPO to a minimum of five minutes, but may subsequently increase your RTO.
The time it takes for your backup to fully restore is tied to performance of the restores as earlier discussed, but there are ways to reduce RTO. For example, one can take on-demand backups to reduce the amount of time needed to replay WAL files (as it can decrease the time since the last backup).
It is advisable to use caution with this approach, though; backing up your database every hour can actually have a negative effect on your RTO. For instance:
- Taking too many backups slows our backup tooling down (both backups and restores).
- Restoring a database that has a large number of incremental backups may take longer than simply restoring a backup from several hours back and replaying the transaction logs.
Tip: We advise you to test the RTO of your system. If it's satisfactory, you can consider adding more on-demand backups in a 24-hour period. If it's unsatisfactory, try reducing your usage of on-demand backups to improve RTO.
RTO for Databases for PostgreSQL with low write loads will be longer on PostgreSQL 9.4, 9.5, and 9.6 than on PostgreSQL version 10 and above. This is due to internal improvements in how files are archived on low write databases. If you are sensitive to reducing RTO for your database, we highly recommend being on PostgreSQL 10 or greater.
All about Read Replicas
You are now officially a backup expert. Let's talk about Read Replicas!
Databases for PostgreSQL allows for the creation of up to five geo-distributed Read Replicas. These replicas of your database will receive asynchronous updates from your source database and are "read only." A Read Replica can be programmatically promoted to a standalone instance of Databases for PostgreSQL that becomes available for read and write, but severs the relationship to its original source database. A team might use Read Replicas to create a data mart for their data science department or, in the case of our discussion, to have a copy of their data in another region.
At a technical level, a Read Replica is a single-member server subscribed as a PostgreSQL streaming replica to another database instance, which we'll call the "source database." The Read Replica and source database can be in the same region or in different regions:
- Attempts to write to a Read Replica will fail at the PostgreSQL level.
- Read Replicas can be promoted in the event of losing the source database. This will generally be much faster than a backup/restore during a disaster.
- Not helpful for the "Oops, something deleted some critical data by mistake" case, as that would be replicated. More on this below!
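Because writes to a Read Replica fail at the PostgreSQL level, it can be useful for an application or health check to detect whether it is talking to a replica. pg_is_in_recovery() is a standard PostgreSQL function that returns true on a streaming replica; the sketch below works with any DB-API connection (for example, one from psycopg2):

```python
def is_read_replica(conn):
    """Return True if the connected PostgreSQL server is a streaming
    replica (and therefore read-only). Works with any DB-API
    connection object, e.g. one created by psycopg2.connect()."""
    cur = conn.cursor()
    # pg_is_in_recovery() is true while the server is applying WAL
    # streamed from its source database
    cur.execute("SELECT pg_is_in_recovery()")
    return cur.fetchone()[0]
```

A deployment script might call this before routing write traffic, failing fast instead of letting INSERTs error out against the replica.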
Because pictures can say a thousand words, we have mapped the whole journey out for you:
Evaluating disaster recovery strategies under different failure modes
Different types of disasters call for different types of responses. Below, we evaluate some potential disruptions that you might want to think about during the course of solution architecture and how we would recommend mitigating them.
Networking failure in one zone
In the event of a networking failure in one zone of a Multi-Zone Region in the IBM Cloud, Databases for PostgreSQL will automatically handle failover of the database, if necessary, to a newly promoted leader in a second zone. Your application may see an interruption that should last less than 30 seconds. No action is required from you other than making sure your client library or application employs retry logic and error handling.
Data corruption/data loss
Let's say that you drop a table or roll out an application change that corrupts data in your database. Since a Read Replica asynchronously and automatically ingests changes from the affected deployment, it will contain the same corruption; instead, we recommend restoring a backup from the last point in time when the database was known to be healthy. For example, if you dropped a table at 12:05 PM, restore your backup to 12:04 PM. Databases for PostgreSQL can either restore the most recent available point in time or a specific historical point in time. Make sure to check out the documentation to understand how to validate the restore procedure.
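Picking the PITR target for this scenario is simple arithmetic, but it's easy to get wrong under pressure. A tiny helper like the one below, with an illustrative one-minute safety margin (the margin and function name are assumptions, not part of the product), makes the runbook step explicit:

```python
from datetime import datetime, timedelta, timezone

def pitr_target(incident_time, safety_margin=timedelta(minutes=1)):
    """Pick a point-in-time-recovery target just before a known
    incident: e.g. a table dropped at 12:05 PM -> restore to 12:04 PM.
    Returns an ISO 8601 timestamp suitable for a restore request."""
    return (incident_time - safety_margin).isoformat()
```

For a table dropped at 12:05 PM UTC, `pitr_target(datetime(2021, 1, 1, 12, 5, tzinfo=timezone.utc))` yields a 12:04 PM restore point.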
Region unavailability with prioritization on database uptime
If your system architecture prizes availability over data integrity, perhaps because you have a separate microservice that allows you to easily replay data into PostgreSQL, we recommend running Databases for PostgreSQL with a Read Replica in another Multi-Zone Region. For example, if all of US South goes offline, you could easily promote your Read Replica in US East into a read/write database and point your application at your database in US East. If you are really in a hurry, we recommend selecting the option "skip_initial_backup" during Read Replica promotion; keep in mind that this means you might not have a backup or be able to perform a PITR restore of your new database for up to 24 hours or until you initiate an on-demand backup.
Region unavailability with prioritization on data fidelity
Your system may instead prefer data fidelity over uptime. Perhaps you are a commercial airline and really don't want to lose an order that has already been confirmed by your email reservation system; in this case, you would rather have requests fail for a while than have confirmed requests be lost. Even when a region goes down, all of our backups are stored across multiple regions in the IBM Cloud. This means that when you want to save the most data possible (and it's OK to be down for a bit longer), you should restore your Databases for PostgreSQL backup into another region, using the parameter for the latest available point in time.
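The two regional-outage strategies above boil down to one decision in a failover runbook. The sketch below is purely illustrative: the step descriptions are prose for an operator, not API calls, and only skip_initial_backup corresponds to a real option discussed earlier.

```python
def regional_outage_plan(prioritize_uptime):
    """Return an ordered runbook for a full-region outage, depending
    on whether the system prioritizes uptime or data fidelity.
    Steps are operator instructions, not API calls."""
    if prioritize_uptime:
        return [
            "promote the Read Replica in the standby region "
            "(optionally with skip_initial_backup to save time)",
            "repoint application connection strings at the promoted database",
            "replay any writes lost to asynchronous replication lag",
        ]
    return [
        "restore the cross-regional backup into the standby region, "
        "using the latest available point in time",
        "verify the restored data before accepting traffic",
        "repoint application connection strings at the restored deployment",
    ]
```

Encoding the decision this way (or in an equivalent runbook document) means the choice between uptime and fidelity is made calmly in advance, not during the outage.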
Get started with IBM Cloud Databases for PostgreSQL
Thanks for taking the time to review how Databases for PostgreSQL is designed to provide you maximum uptime and data integrity during many different types of disruptions or disasters. Now let's go build something awesome!
You can get started in minutes with Databases for PostgreSQL in the IBM Cloud catalog.