Best Practices for MongoDB on the IBM Cloud
This post provides best-practice recommendations for using the IBM Cloud Databases for MongoDB Standard and Enterprise plans.
We hope it will be useful for developers and administrators who are going to use IBM Cloud Databases for MongoDB. It summarises our experience with questions our customers commonly have when running MongoDB deployments on the IBM Cloud. This document assumes that the reader is familiar with MongoDB, as it mainly deals with the differences between a self-hosted MongoDB cluster and one deployed and managed by IBM.
The recommendations are at the bottom of the blog post, but we highly recommend reading the whole post to understand the underlying context around our recommendations. We also recommend using our Getting-to-Production Checklist to make sure you are incorporating all best practices for adopting IBM Cloud Databases.
IBM Cloud Databases for MongoDB
Databases for MongoDB deployments consist of a three-node replica set deployed across three different availability zones. There is one primary in one zone and one secondary member in each of two other zones. Either secondary can become the primary in an election.
These replica sets provide additional fault tolerance and High Availability. IBM offers a Standard Plan and an Enterprise Plan for MongoDB deployments. You can read more about the features of the different plans here.
Best practice guidelines
Durability vs. accessibility
Like all other distributed databases, MongoDB presents application developers with architectural decisions that involve some trade-offs between data durability (how permanent a data write is) and data availability (how quickly your application can write and read data). The purpose of this document is to examine these trade-offs only in the context of the IBM Cloud Databases deployments of MongoDB.
Generally speaking, data is more durable (i.e., less likely to be lost) if you have more copies of it. In normal circumstances, all data will always end up in all nodes. But your application can decide how long it wants to wait for confirmation of writes before proceeding. This is called a "write concern." If you set a "write concern" of 1 (w:1), then once the data is written to the first node (primary) the write is acknowledged and your application can proceed (see here for more information about write concerns).
In the case of the Databases for MongoDB deployment, where there is a three-node replica set, you could theoretically set a "write concern" of 3 (w:3). In that case, your application would have to wait for all nodes (primary and two secondaries) to acknowledge the write before proceeding.
By doing this you can, in effect, get very close to an RPO (Recovery Point Objective) of 0, because it is highly unlikely that any data write will be lost (you would have to suffer a simultaneous loss of all three nodes, which are located in three separate physical locations).
However, this is not a workable solution: with a write concern of 3, your application writes would be blocked whenever any single node is unavailable, whether during expected maintenance events or during unexpected failures on secondaries.
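The trade-off between the two write concerns can be illustrated with a little arithmetic. The sketch below is a simplified model, not MongoDB's implementation: the function names are hypothetical, and it assumes every node is a voting, data-bearing member, so a majority of three nodes is two.

```python
def majority(voting_members: int) -> int:
    """Acknowledgements needed for a majority of voting members."""
    return voting_members // 2 + 1


def write_can_be_acknowledged(required_acks: int, healthy_nodes: int) -> bool:
    """In this simplified model, a write is acknowledged only if enough nodes are up."""
    return healthy_nodes >= required_acks


NODES = 3  # three-node replica set, as described above

for healthy in (3, 2):
    w3 = write_can_be_acknowledged(3, healthy)
    wmaj = write_can_be_acknowledged(majority(NODES), healthy)
    print(f"healthy={healthy}: w:3 acknowledged={w3}, majority acknowledged={wmaj}")
```

With one node down, the model shows a write concern of 3 blocking while a majority write concern still succeeds.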
Similarly, when reading data, there are also trade-offs. You can read from the primary, which normally guarantees the latest data but puts more pressure on the primary (which is already handling all the writes). Or you can read from the secondaries, but you have a higher chance of getting stale data. Refer to the documentation to decide on the best read pattern for your application.
Nodes can become unavailable for a number of reasons, including hardware failure or networking problems. Distributed systems are designed to deal with these failures. In the case of MongoDB, if the primary node becomes unavailable, then the remaining nodes enter an automatic "election process," where they elect a new primary node and continue to operate. But this election process can take a number of seconds, during which writing to the database is blocked (see the docs for more information).
In addition, IBM performs regular security and feature updates to keep your database safe, compliant and exciting to use. Upgrades will not cause write or read blocks if you follow the recommendation below regarding write concerns. You can learn more about these in our Application High Availability section.
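Because an election or a maintenance event can interrupt writes for a few seconds, applications usually wrap database calls in retry logic. The following is a minimal, driver-agnostic sketch: `TransientDatabaseError` and `with_retries` are hypothetical names, and in a real application you would catch your driver's own connection exceptions instead (with PyMongo, for example, that would likely be `pymongo.errors.AutoReconnect`).

```python
import random
import time


class TransientDatabaseError(Exception):
    """Hypothetical stand-in for a driver's connection or failover exception."""


def with_retries(operation, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Run operation(), retrying with exponential backoff and jitter on transient errors."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientDatabaseError:
            if attempt == max_attempts:
                raise  # give up and surface the error to the caller
            # Double the delay on each attempt; jitter keeps clients from retrying in lockstep.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 2))


# Demonstration: a simulated write that fails twice, then succeeds.
attempts = {"count": 0}


def flaky_write():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TransientDatabaseError("primary unavailable during election")
    return "acknowledged"


print(with_retries(flaky_write, base_delay=0.01))
```

A retried write may be applied more than once if an earlier attempt actually succeeded, so the wrapped operation should be safe to repeat.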
Based on the above, we recommend the following:
- Application design: There may be cases where connection exceptions occur (as described above). We recommend you catch these within your application and execute a reconnect and reissue cycle. In other words, design your applications with retry/reconnect logic. Expect and handle occasional write failures.
- Durability (1): Your application's write commands should all apply a write concern of majority. In a three-node replica set, this means that the primary and one secondary must acknowledge the write before it is deemed successful. This is a reasonable compromise between durability and preventing your application from being blocked when a node is unavailable. Note that the default write concern on an IBM Cloud MongoDB deployment is currently 1 (i.e., only the primary needs to acknowledge a write for it to be deemed successful). To override this default, your application should specify the write concern with every write command.
- Durability (2): Your application writes should also include a suitable timeout (the wtimeout parameter) to avoid getting blocked when writing is unavailable (and see the first bullet point above about retry logic). However, it is worth remembering that a timeout does not mean a failure to write, only that the write was not acknowledged within the specified time limit. So a write command that timed out may still eventually succeed. Your application will need to account for that.
- Reading data: Your application readPreference should be set to primaryPreferred. Reads then go to the primary when it is available and fall back to a secondary when it is not, which avoids read connections being blocked when there is a failover event and a different node becomes the primary.
- Scaling (1): Use auto-scaling for storage (disk), as "disk full" errors will crash the service. Disk auto-scaling has no impact on the running database deployment.
- Scaling (2): For RAM and/or CPU, we still recommend capacity planning for critical services with unpredictable demand and growth profiles. This requires more forethought but will better prepare you for growth in your application that may require more IOPS, RAM or vCPU.
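The write concern, write timeout and read preference recommended above can also be expressed as standard MongoDB connection string options (w, wtimeoutMS and readPreference). The sketch below only builds such a string. The host names, database name, replica set name and the 5-second timeout are placeholder assumptions, and credentials and TLS settings are deliberately left out; take the real values from your deployment's connection details.

```python
from urllib.parse import urlencode

# Placeholder values for illustration only; use your deployment's own endpoints.
HOSTS = ["host-1.example.com:27017", "host-2.example.com:27017", "host-3.example.com:27017"]

OPTIONS = {
    "replicaSet": "replset",               # assumed replica set name
    "w": "majority",                       # primary plus one secondary must acknowledge
    "wtimeoutMS": 5000,                    # assumed limit: stop waiting for acknowledgement after 5 s
    "readPreference": "primaryPreferred",  # read from the primary, fall back to a secondary
}


def build_mongodb_uri(hosts, database, options):
    """Assemble a mongodb:// connection string from hosts, database and options."""
    return f"mongodb://{','.join(hosts)}/{database}?{urlencode(options)}"


uri = build_mongodb_uri(HOSTS, "appdb", OPTIONS)
print(uri)
```

Setting these options once in the connection string is one way to apply them to every operation; most drivers also allow them to be set per client, per collection or per command.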
Backup and restore
The IBM deployment of MongoDB (Standard and Enterprise offerings) takes automatic backups for you as part of being a database-as-a-service. These backups are written to IBM Cloud Object Storage (COS) every 24 hours. Therefore, the current Recovery Point Objective (RPO) is 24 hours. Customers can use the CLI and API to automate on-demand backups and potentially achieve a lower RPO.
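To see how on-demand backups affect the RPO, treat the worst-case RPO as the largest gap between two consecutive backups. The sketch below is illustrative arithmetic only: the function name, dates and six-hour schedule are assumptions, and it does not call any IBM Cloud API.

```python
from datetime import datetime, timedelta


def worst_case_rpo(backup_times):
    """Largest gap between consecutive backups, i.e. the most recent data that could be lost."""
    ordered = sorted(backup_times)
    return max(later - earlier for earlier, later in zip(ordered, ordered[1:]))


start = datetime(2024, 1, 1)  # arbitrary example date

# Automatic backups only: one every 24 hours.
daily_only = [start, start + timedelta(hours=24)]

# Automatic backups plus hypothetical on-demand backups every 6 hours.
with_on_demand = [start + timedelta(hours=h) for h in (0, 6, 12, 18, 24)]

print(worst_case_rpo(daily_only))      # 1 day, 0:00:00
print(worst_case_rpo(with_on_demand))  # 6:00:00
```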
The restore process for Standard and Enterprise Offerings can be triggered via the user console or via API and restores into a new MongoDB Instance. The secondaries are then built from this new instance and a new cluster is created.
The length of this process (and therefore the Recovery Time Objective - RTO) depends on the amount of data being restored.
We highly recommend testing your business continuity and disaster recovery processes before going to production.
Refer to these documents for information on major upgrades and versioning policy:
Refer to this document for guidance on getting to production and other best practices: