This post offers advice and best-practice recommendations for using IBM Cloud Databases for Redis.
It summarises the questions our customers ask when running Redis deployments on IBM Cloud, and we hope it will be useful for developers and administrators who plan to use IBM Cloud Databases for Redis. This document assumes that the reader is familiar with Redis, as it deals mainly with the differences between a self-hosted Redis cluster and one deployed and managed by IBM Cloud.
The blog post is structured around various features of Redis, with recommendations in each section. We also recommend using our Getting-to-Production Checklist to make sure you are incorporating all best practices for adopting IBM Cloud Databases.
- A Databases for Redis deployment consists of one Leader Node and one Replica Node, located in two separate availability zones.
- For high availability, three Sentinel monitoring nodes are deployed on three separate availability zones to provide quorum and Leader election in cases of failover (see below).
- All Databases for Redis connections use TLS/SSL encryption for data-in-transit.
- All data is encrypted at-rest by default.
- Databases for Redis can be set up with access via public and/or private endpoints.
- Additionally, IP allowlists can be set up to further restrict allowable access.
- Access to the database is secured through the standard access controls provided by the database.
- More details on security and compliance can be found here.
Recommendation: We recommend you use Redis version 6 because it has better user management and ACL (Access Control List) capabilities (learn more about version 6 here). We also recommend this CLI for easier TLS access via the command line.
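Since every connection uses TLS, it can help to check a connection string before handing it to a client. The following is a minimal sketch that assumes the connection string uses the conventional rediss:// scheme for TLS; the host, port and credentials shown are placeholders, not real deployment values.

```python
from urllib.parse import urlparse


def describe_redis_url(url: str) -> dict:
    """Split a Redis connection string and flag whether it requests TLS."""
    parts = urlparse(url)
    if parts.scheme not in ("redis", "rediss"):
        raise ValueError(f"not a Redis URL: scheme {parts.scheme!r}")
    return {
        "tls": parts.scheme == "rediss",  # "rediss" is the TLS variant of "redis"
        "host": parts.hostname,
        "port": parts.port,
        "user": parts.username,
    }


# Hypothetical connection string; every component is a placeholder.
info = describe_redis_url("rediss://admin:secret@redis.example.com:30000/0")
print(info["tls"], info["host"], info["port"])
```

A check like this can fail fast at application start-up if a plain redis:// string is configured by mistake.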
Redis is very fast because it keeps all of its data in-memory. However, a server restart (because of a crash, loss of power or other event) will mean a total loss of data. If you are running Redis as a cache, a loss of data may be acceptable (the cache can be rebuilt from another source). If, however, data loss needs to be avoided, Redis offers a number of persistence options where data is written to disk periodically.
In an IBM Cloud Databases for Redis deployment, data persistence is enabled and your data is written to disk by default. The data persistence model uses preamble snapshots (RDB) and AOF (Append Only File) mechanisms. The interval for Redis to write to disk (fsync) is set to once every second. You can read more about these options here.
Recommendation: If you want persistence, we recommend you retain the default settings described above, as they offer a good balance between speed and durability/recoverability.
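To make the defaults concrete, here is a small sketch that compares a set of configuration values against the persistence settings described above. The mapping of the description onto the Redis directives appendonly, aof-use-rdb-preamble and appendfsync is our reading of the text, and the sample input is invented. Whether your user may run CONFIG GET against a managed deployment is also an assumption, so the function simply takes a dictionary.

```python
# Settings implied by the description: AOF on, RDB preamble on, fsync every second.
EXPECTED_PERSISTENCE = {
    "appendonly": "yes",
    "aof-use-rdb-preamble": "yes",
    "appendfsync": "everysec",
}


def persistence_drift(config: dict) -> dict:
    """Return {setting: (expected, actual)} for every value that differs."""
    return {
        name: (expected, config.get(name))
        for name, expected in EXPECTED_PERSISTENCE.items()
        if config.get(name) != expected
    }


# Invented sample, shaped like the mapping a client returns for CONFIG GET.
sample = {"appendonly": "yes", "aof-use-rdb-preamble": "yes", "appendfsync": "always"}
print(persistence_drift(sample))
```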
Persistence can be turned off if you want to use Redis as a cache. This can also increase database availability, as the Redis process doesn't have to replay the transaction logs in case of a failover. For more information on deploying Redis as a cache see here.
In an IBM Cloud Databases for Redis deployment, you are provided with one connection string that automatically routes to the Leader node. The Replica node is not directly reachable; it exists to provide high availability.
Recommendations on managing connections:
- Failover or switchover events do not redirect existing connections to the new Leader. Therefore, make sure your application is designed to reconnect and retry in case of connection failures. Check out this blog to learn how to do this with Node.js and Redis: "Error detection and handling with Redis"
- If your application has more than 10-20 connections, consider connection pooling to optimise performance. See here or here for examples.
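The reconnect-and-retry advice above can be sketched as a small wrapper that retries an operation with exponential backoff and jitter. This is a generic illustration, not a feature of any particular client: the function name, the default delays and the number of attempts are all hypothetical. With the redis-py client you would likely pass its own exception classes (for example redis.exceptions.ConnectionError) as retry_on, since they do not derive from Python's built-in ConnectionError.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retry(
    operation: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    attempts: int = 5,
    base_delay: float = 0.1,
    max_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation, retrying with exponential backoff and jitter."""
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay * random.uniform(0.5, 1.0))  # jitter spreads out reconnects
    raise ValueError("attempts must be at least 1")


# Simulated operation that fails twice (as during a failover), then succeeds.
calls = {"n": 0}


def flaky_get() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("connection reset during failover")
    return "value"


result = call_with_retry(flaky_get, sleep=lambda _: None)
print(result, calls["n"])
```

Capping the delay and adding jitter keeps many application instances from reconnecting at the same instant once the new Leader is available.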
Databases for Redis deployments contain a cluster with two data members in a Leader plus Replica configuration, and they are kept in sync by using asynchronous replication. The Leader will send all commands to the Replica to keep it in sync. If the connection between the nodes fails, the replica will attempt to sync the data it is missing from the Leader upon reconnection. You can read more about replication here.
It is possible to approximate synchronous replication with the WAIT command. This blocks the calling client until a given number of replicas have acknowledged the preceding writes, or until a timeout expires.
Recommendation: We recommend keeping the default replication settings, as synchronous replication will cause higher latency from the Redis deployment and significant service degradation. If your deployment contains "source-of-truth" data (i.e., data whose durability is so important that it requires synchronous replication) it may be advisable to revisit your system architecture and use a different type of database.
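For illustration only, given the recommendation above, this is roughly what a write followed by WAIT looks like from an application. The helper name and the fake client are hypothetical; the set and wait methods mirror the shape of the redis-py client's methods of the same name. Note that WAIT covers writes sent on the same connection, and that a write is not rolled back if too few replicas acknowledge it in time.

```python
def set_and_confirm(client, key: str, value: str, replicas: int = 1, timeout_ms: int = 100) -> bool:
    """Write a key, then block until `replicas` replicas acknowledge or the timeout expires."""
    client.set(key, value)
    acked = client.wait(replicas, timeout_ms)  # WAIT numreplicas timeout
    return acked >= replicas  # False means the write exists but is unconfirmed


class FakeClient:
    """Stand-in for a real client so the sketch runs without a server."""

    def __init__(self, acks: int):
        self.acks = acks
        self.data = {}

    def set(self, key, value):
        self.data[key] = value
        return True

    def wait(self, num_replicas, timeout):
        return self.acks  # pretend this many replicas acknowledged


print(set_and_confirm(FakeClient(acks=1), "order:1", "paid"))
print(set_and_confirm(FakeClient(acks=0), "order:1", "paid"))
```

Every confirmed write pays at least one extra network round trip to the Replica, which is where the added latency comes from.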
The Leader can and will drop messages to the Replica when it cannot replicate fast enough. This can happen when an application writes so much into the Leader that the network cannot keep up.
Recommendation: This behaviour is reflected in the logs (see below for Logging & Monitoring). Watch for it and, when it occurs, throttle your application's writes to the Leader so that the Replica can catch up.
The Databases for Redis deployment is kept highly available by deploying a cluster of Sentinel nodes that monitor the health of the data nodes. If the Leader node fails, the Sentinel process will automatically decide to promote a healthy Replica to be the new Leader and demote the old Leader to a Replica. When unreachable nodes come back online, the replication process described above ensures that all nodes eventually have the same data again. More information on Sentinel can be found here.
Recommendation: Failovers and/or switchovers will NOT re-route traffic of existing connections (see above for connection management). Your application should be designed to reconnect and retry when connection failures occur.
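To see why three Sentinel nodes in three availability zones are enough, recall that Redis Sentinel needs a majority of Sentinels to authorise a failover. The sketch below is plain arithmetic on that rule; it does not reflect IBM's actual Sentinel configuration, which this post does not state.

```python
def majority(sentinels: int) -> int:
    """Smallest number of Sentinels that forms a majority."""
    return sentinels // 2 + 1


def can_authorize_failover(total_sentinels: int, reachable: int) -> bool:
    """A failover needs agreement from a majority of all Sentinels."""
    return reachable >= majority(total_sentinels)


print(majority(3))
print(can_authorize_failover(3, 2))  # one zone lost, two Sentinels remain
print(can_authorize_failover(3, 1))  # two zones lost
```

With three Sentinels, losing a single availability zone still leaves a majority, so a failover can proceed.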
You may need to scale your deployment if you are running out of memory, disk space or available IOPS. You can change these for your IBM Cloud Databases for Redis deployment either manually or automatically, based on certain triggers.
By default, deployments are configured with a noeviction policy. All data is kept in memory until the maxmemory limit is reached and Redis returns an error if the memory limit is exceeded. The maxmemory is set to 80% of a data node's available memory, so your node doesn't run out of system resources.
You can scale the amount of memory to accommodate more data, and you can configure the maxmemory setting to tune memory usage. This needs to be set via the exposed configuration API, not via the DB connection. Otherwise, IBM will rewrite the setting when there are maintenance events. Please see our documentation for changing configuration parameters.
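As a worked example of the 80% rule, the sketch below computes the default maxmemory for a node and wraps a value in a JSON body of the kind a configuration update might take. The node size is hypothetical, and the shape of the JSON body is an assumption; check the configuration documentation for the exact format.

```python
import json


def default_maxmemory_bytes(node_memory_mb: int, percent: int = 80) -> int:
    """maxmemory as a percentage of node RAM, in bytes (80% per the text)."""
    return node_memory_mb * 1024 * 1024 * percent // 100


def configuration_body(maxmemory_bytes: int) -> str:
    """JSON body for a configuration update; the exact shape is an assumption."""
    return json.dumps({"configuration": {"maxmemory": maxmemory_bytes}})


limit = default_maxmemory_bytes(1024)  # a hypothetical data node with 1024 MB of RAM
print(limit)
print(configuration_body(limit))
```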
The Redis documentation has some good information on memory behaviour and tuning.
The RAM provisioned to your deployment stays allocated for your future needs until you scale down your deployment manually.
Disk space and IOPS
The number of Input-Output Operations per second (IOPS) is limited by the type of storage volume. Storage volumes for IBM Cloud Databases for Redis deployments are provisioned on Block Storage Endurance Volumes in the 10 IOPS per GB tier (see here). It's possible for very busy databases to exceed the IOPS for the disk size, and increasing disk can alleviate a performance bottleneck. Note that disk cannot be scaled down.
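Because IOPS scale with disk size in this tier, the arithmetic is simple. A minimal sketch, using only the 10 IOPS per GB figure from the text; the disk sizes and IOPS targets are invented examples.

```python
import math

IOPS_PER_GB = 10  # storage tier stated in the text


def provisioned_iops(disk_gb: int) -> int:
    """IOPS available for a given disk size."""
    return disk_gb * IOPS_PER_GB


def disk_gb_for_iops(target_iops: int) -> int:
    """Smallest whole-GB disk size that provides at least target_iops."""
    return math.ceil(target_iops / IOPS_PER_GB)


print(provisioned_iops(20))
print(disk_gb_for_iops(455))
```

Remember that disk cannot be scaled down, so it is worth sizing against a measured IOPS peak rather than a guess.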
Recommendation: We recommend that critical services with unpredictable demand and growth profiles adopt a capacity planning process based on manual scaling rather than auto-scaling. This requires more planning but will avoid unpredictable auto-scaling events. There is further information on scaling here.
Recommendation: We recommend that you monitor your disk usage and scale accordingly to avoid "disk full" database failures. See below for Monitoring.
Availability and/or downtime
In the context of IBM Cloud Databases for Redis, service availability and/or downtime can be affected by several factors:
- Loss of connectivity: As with any cloud deployment, network events can affect access to the service.
- Node failure: In the case of a Leader node failure, the cluster will fail over to the Replica. The whole process (from deciding that a Leader is down to promoting the Replica) can take several seconds, during which the service will be unavailable.
- Planned maintenance: IBM performs regular (weekly) security and other updates to keep infrastructure safe and compliant. When performing maintenance on a Leader node, a switchover will occur to the Replica, which could result in short (seconds) connectivity interruptions.
Recommendation: You should design your applications to handle a temporary loss in connectivity to your deployment or to IBM Cloud. Many Redis clients have features for error checking and handling. See here for some examples.
Backup and restore
Automatic backups are performed daily and kept with a simple retention schedule of 30 days. You can trigger on-demand backups from the UI, or via the CLI or API. Restoring is done into a new IBM Cloud Databases for Redis deployment.
For more information on backups and restoring see here.
Recommendation: As a general best practice, after performing a restore from backup, check the validity and integrity of your restored data.
Logging and monitoring
You can monitor your database metrics (and add alerts for specific thresholds) using IBM's monitoring service. See here for more details.
Recommendation: Monitor your disk usage, IOPS, RAM and CPU (if you have dedicated cores) and create high-usage alerts that give you enough time to scale your deployment if required.
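The idea of high-usage alerts can be sketched as a simple threshold check. The thresholds and the sample readings below are hypothetical; in practice you would configure equivalent alerts in the monitoring service itself and pick values that leave enough time to scale.

```python
# Hypothetical alert thresholds, as fractions of provisioned capacity.
THRESHOLDS = {"disk": 0.80, "memory": 0.80, "iops": 0.90, "cpu": 0.85}


def breached(usage: dict, thresholds: dict = THRESHOLDS) -> list:
    """Return the sorted names of metrics at or above their alert threshold."""
    return sorted(
        name for name, limit in thresholds.items() if usage.get(name, 0.0) >= limit
    )


# Invented sample readings; metrics that are not reported count as zero usage.
print(breached({"disk": 0.91, "memory": 0.42, "iops": 0.95}))
```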
You can also monitor your database logs using LogDNA. See here for more details.
Recommendation: Set log alerts for log entries that are "Critical" so that you can react to unusual events, for example when the Leader node is dropping messages to the Replica because it cannot replicate fast enough (see above in the Replication section).
Minor version upgrades are done as part of the regular IBM maintenance cycles. See note above about availability.
Major version upgrades have to be done manually by clients. For more information see here.
When provisioning a Redis deployment, you can select dedicated cores. Selecting none means your deployment runs in a multi-tenant environment. Moving to dedicated cores means that your database is guaranteed, at minimum, the number of vCPUs selected. If you do select dedicated cores, you should monitor your CPU usage and adjust by scaling if necessary (see above for monitoring and scaling).
Refer to these documents for information on major upgrades and versioning policy:
Refer to this document for guidance on getting to production and other best practices: Getting to production for Cloud Databases.