"The Right Tool for the Job": An increasingly challenging decision
Choosing technology for applications—new or legacy—is hard work. Gone are the days of the 1990s, when a few vendors dominated the software market; and gone, too, are the days of the 2000s, when LAMP stacks signaled the incursion of open source into "enterprise" application stacks. Developers today face a bewildering proliferation of technologies to choose from, and these choices aren't trivial:
- Does this software do what it says it does?
- Is this framework going to be dead in five years?
- Can I reliably hire programmers that know this language?
Today's software is largely built on open source, and open source—both the software itself, and the communities around it—moves quickly.
Furthermore, microservices architectures compound these concerns—now it's not just one application's worth of decisions that need to be made. Instead, many organizations find themselves in a Groundhog's Day of tools selection imposed on them by this Conway's Law and this newfound freedom to choose the right tool for the job.
If you'd like an even deeper dive into this subject, check out our webinar: "How to Choose a Database on IBM Cloud."
Not all technical decisions are created equal—they vary widely both in difficulty and in impact. Programming languages debates, for example, easily devolve into aesthetics and Turing completeness (i.e., there's typically no right answer, and as long as the team is reasonably sharp, it's not likely to be a make-or-break decision).
Choosing a database is a different class of challenge. The ability to make a rigorous, informed choice of database technology requires a trifecta of the following:
- Deep understanding of the application problem at hand.
- Knowledge of database semantics, including distributed computing fundamentals.
- Skill in seeing through enterprise software marketing.
This isn't a set of talents easily found or obtained. Unfortunately, making a non-rigorous, uninformed choice of database technology can be a terminal mistake for an application.
It's also common for developers to offload these decisions to abstractions—"the framework handles all of that for me"—and ignore the details of the database underneath. This, too, can be a terminal mistake; (good) abstractions are powerful tools, but unfortunately most database abstractions, such as ORMs, leak profusely. That's not to say they're not useful, but it does imply that they can't be blindly trusted. Even when working through a framework, understanding the semantics of the underlying database is essential.
Modern cloud platforms provide a counterbalance here through managed database-as-a-service ("DBaaS") offerings. By using DBaaS products, rather than self-hosting databases, application developers can limit the risk of this critical decision in two ways:
- Most of the databases available from the major cloud platforms are good databases. These vendors have already done most of the heavy lifting of eliminating low-quality software from the decision space. Cloud platforms have an incentive to make their customers successful, and offering them subpar software isn't a good way to do that.
- By offloading operational management to a cloud platform, application developers offload a significant risk and cost. The price of a database-as-a-service will deliver not only hosting and 24x7 management, but also automated high-availability, backup/restore, version upgrades, security, and so on. Many of the best databases in the world are "free" but the knowledge to implement a production-quality installation isn't. By utilizing a DBaaS platform, developers minimize the learning curve and can base their decisions on requirements and value, rather than cost.
At IBM, we present a wide range of DBaaS offerings, from traditional relational stores like Db2 and PostgreSQL, to more esoteric systems such as etcd and Redis. We do this because we believe these are good technologies, and each has its place in someone's application architecture. We've narrowed the field, but the decision remains—how does one choose the right database for an application?
Josh Mintz does a great job of summarizing in "A Brief Overview of the Database Landscape."
Problems of this subtlety and complexity warrant more than a simple flowchart or decision-tree style process—the space is too complicated to wrangle without some sort of decomposition. What one needs are lenses: tools for looking at the problem from different angles and making rigorous, informed decisions about those local problems that can be combined to inform global decisions.
What follows are some example lenses for making decisions about databases. These are by no means exhaustive, but they're far better than nothing. Again, the goal is to make better decisions, not perfect ones.
Primary and auxiliary data stores
First off, something that might be obvious. Not all databases are general purpose databases. Database engineering is like any other engineering practice; it's mostly about tradeoffs. Some databases—call them "primary" data stores, for lack of a better term—really are "jack of all trades" systems, designed to solve 80% of any problem thrown at them, but they're really not all that common. These database feature general purpose data modeling tools (typically relational or document-based) and general purpose query languages, such as SQL. They take data integrity seriously. They're flexible enough to solve most problems, and that's a powerful feature. As an application's requirements evolve over time, "primary" data stores are more likely than not going to be able to support that evolution.
Most databases, on the other hand, are "T-shaped"—they excel at a few specialized tasks and fall flat when applied to problems outside of their domain. This isn't to say they're bad databases—again, this is about engineering tradeoffs. These T-shaped databases—call them "auxiliary"—typically have restricted data modeling tools (key-value, graph, columnar) and restricted query languages. They're trading off generality, and often data integrity, for exceptional performance in a limited domain.
When an application's requirements line up with the strong suit of an auxiliary-type database, the result can be tremendous. When they don't, the result can be, well, tremendous. Without a clearly understood reason otherwise, primary-type databases should be the default choice for most applications.
Write-optimized vs. read-optimized stores
This is a very complex topic, with a rich literature, far beyond the scope of this article. In short: databases must make tradeoffs between write performance and read performance. To a first-order approximation this is usually obvious: Db2 is a write-optimized OLTP store; Db2 Warehouse is a read-optimized OLAP store. Don't mix them up!
Indexing: Zero to full-text
Understanding the semantics of all of these different options and their tradeoffs is no easy feat, of course. The most important point here is not re-inventing the wheel. A home-grown indexing implemetation is unlikely to be semantically correct, let alone as efficient as a database-native indexing mechanism. In short, use real indexes for applications that need indexes.
Gigabytes and terabytes
How much data does an application need? How will it grow as the application grows? Does it really need "infinite" scalability, or can the data be logically partitioned in some way? There's no avoiding the fact that distributed systems are hard; and, again, database engineering is about tradeoffs. Single-node systems are generally simpler to reason about and simpler to program against, and that's important, even if someone else is operating the database.
What happens when it fails?
Consistency and availability
Twenty years on, Brewer's CAP theorem is standard distributed systems lore, but understanding its implications up the software stack is still a nontrivial exercise. The reality is much more complicated than "C, A, P, pick two, and you have to pick P." Many databases include flexible consistency controls that allow application developers to change CAP semantics on an operation-by-operation basis; others provide "CP" semantics on some operations and "AP" semantics on others.
Databases (or parts of them) are going to fail. Developers need to understand the impact of a partial database failure on their application and select a database with semantics that fit their needs. For example, a social network probably doesn't need strong consistency, but a banking application probably does.
On the other hand, programming against an eventually-consistent system is typically more challenging than the alternative. It's not always natural for developers to internalize the idea that their application might not be able to read its writes.
Note, too, that distributed systems have a tendency to sneak in to even "simple" single node systems. PostgreSQL is a non-distributed database in the naive sense, but a highly-available PostgreSQL with some sort of replication between members is most definitely distributed.
Database high availability is a critical component of application high availability, but it's not the complete story. Some situations (e.g., regional disasters or systemic corruption) require proper backup and restore mechanisms. And again, not all databases offer the same level of functionality here.
Elasticsearch, for example, exploits its immutable on-disk representation to implement an efficient incremental backup mechanism. This allows very frequent backups of the largest clusters with very little impact to the performance of the system.
Elasticsearch does not, however, represent data in a manner suitable for continuous archiving. PostgreSQL, on the other hand, does, which allows for even more frequent backups and point-in-time restoration functionality. This comes at a cost, however—replaying transactions for a PostgreSQL point-in-time restore will likely take longer than restoring a snapshot of a similarly-sized Elasticsearch cluster. In other words, there's a tradeoff (surprise!) here between recovery point objectives (i.e., RPO, i.e, how "old" is the data being restored) and recovery time objectives (i.e., RTO, i.e., how long it takes to restore). Backup implementation decisions vary widely based on platform and the implementation details of the database in question; RPO and RTO are generally good high-level abstractions to use when evaluating them.
Putting it all together
In the end, this all roughly boils down to Software Engineering 101 crib notes, specialized on databases:
- Engineering is about tradeoffs.
- Understanding the impact of these tradeoffs is critical to building software.
- Learning how to understand the impact is not trivial.
While there's no exhaustive set of lenses for looking at these problems, it's important to get a foothold in to the space. Without one, (3) is insurmountable—databases are too complicated for intuitive reasoning. At IBM Cloud, we've done a lot of the work already through careful selection of our database portfolio, but there are still decisions to be made, and we're ultimately better off when our customers are happy with the databases they've chosen for their applications.
Most of the state of the art in databases derives from the academic literature, either directly or indirectly; anyone looking to learn more shouldn't hesitate to dive in there. Readings in Database Systems, or the "Red Book" is a great resource for learning about databases from the ground up. More comprehensive treatments of transactional systems can be found in Gray and Date.
There are innumerable smart individuals blogging about databases and distributed systems on the web. To name a few: Kyle Kingsbury, Henry Robinson, and Martin Kleppmann are all worth reading, for starters.