Exploring the different factors to consider while trying to determine the best database options when refactoring to a microservices approach.
In a prior article, we touched upon a few of the different aspects that can lead you to refactor your code to a microservices-based approach. One of the last aspects we touched upon was the question of what to do with your data—in large-scale enterprise applications, that is often the thorniest issue and one that is worth a more in-depth treatment. What we have seen is that, sometimes, it’s hard to determine when you are looking at a coding issue or a data modeling issue that is masquerading as a coding issue.
We’ll look at several of these cases and talk about data modeling choices you can make to simplify and improve your refactored code. But first, we need to address what is often the initial question asked by teams refactoring existing enterprise applications into microservices.
Starting slow: One big database or many little databases?
If your application is like most of the applications that we see that are starting down the road to be refactored into microservices, we can probably safely assume that it’s working with a single, large relational database. What’s more, there’s a nearly even chance that this database is an Oracle database—all the rest of the relational databases (DB2, SQL Server, Informix, or even an open source database like MySQL or Postgres) split the remainder of the share. In fact, moving off of the (usually costly) enterprise relational database is one of the benefits often promoted for refactoring to microservices.
Now, there are very good reasons to pick other types of databases—either NewSQL or NoSQL for many microservices. However, abandoning your current relational database all at once is often a foolhardy decision. Instead, you may want to look at a more incremental approach to changing your database, just like we advocate an incremental approach to refactoring your existing Java code.
Just like the incremental approach for coding that we have advocated, however, the biggest problem in following an incremental approach for database refactoring is deciding where to start. The first decision you have to make once you decide upon an incremental approach is whether you should go with one big database or many little databases. Now, at first, this sounds like nonsense—of course, you don’t want one big database, that’s what you have in your monolith! But let’s explain what we mean first.
Essentially, the place to begin is making a distinction between a database server and a database schema. For those of you who are familiar with enterprise-scale databases like Oracle or Db2, this is second nature because enterprises will generally have one big Oracle server (or a RAC, which is a big server made up many smaller servers) on which multiple teams will host their own separate databases (each represented by a separate schema). The reason this is done is that licensing is often by CPU, and the company wants to maximize the amount of use they can get for their money. Putting multiple teams together in one big server is a way to do this. For those of you who are more familiar with open source databases like MySQL or PostgreSQL, this is a little less common because the distinction is less often needed.
This matters because when we’re talking about building databases for microservices, it's important that we reduce or eliminate coupling at the database—but that really means coupling at the database schema level. Problems arise when you have two different microservices that use the same information—the same tables within the same schema. You see what we mean when you look at Figure 1.
The database equivalent of a monolithic application (AKA the “Big Ball of Mud,” where everything connects to everything else) is having one large schema that connects every table together. Whenever that happens, the amount of untangling required to separate the tables is enormous.
However, as you transition to microservices, you must realize that there are many fewer problems caused by sharing hardware and licenses at the server level. In fact, when doing the initial refactoring to microservices, there are a lot of advantages to keeping the new, more cleanly separated schemas in the same enterprise servers because companies usually already have procedures in place for database backup and restore and for database server updates, which the teams can take advantage of.
In a sense, what the enterprise is offering through providing the hardware and software and management of an enterprise database is a limited version of a Database-as-a-Service (DBaaS). This especially fits well with an approach that begins with more cleanly separating parts of your monolith by functional area, starting out with a “Modular Monolith” approach, as shown in Figure 2.
In this example (meant to show a refactoring work-in-progress), you can see how the database has been broken up by separating out tables corresponding to three new schemas (A, B, and C) that correspond to specific modules in the refactored application. Once they have been separated like this, they can be cleanly broken out into distinct microservices. However, D and E are still being refactored—they still share a single schema with interconnected tables.
Eventually, even the linking at the database-server level may become a problem. For instance, you are limited to the available features of an enterprise database if that is what you choose. Even in a relational model, not all schemas need all of those features, or they may need features that are better supported through a different server (e.g., better sharding is often brought up as a reason for using a NewSQL database). Likewise, upgrading a database server shared by multiple microservices could take multiple services down at once.
But the thing is that this is a decision that can be put off—it doesn’t have to be made immediately at the start of the project. When teams begin initially refactoring, keeping them in the same database server for at least the initial stages of a refactoring project is a way of making incremental changes while the team gains necessary experience in code and database refactoring.
Considering non-relational models
As the last section hinted, as you proceed in the refactoring process, you then may want to think a bit more carefully about your database options—not all sets of data are perfectly suited for a relational model. Deciding what the best approach is for managing those data sets often comes down to the question of “what are you actually storing in your database?”
We have spent years helping companies build, maintain, and often torture Object-relational mapping frameworks, but the reality of the issue was that in several cases, the data that was being stored did not map well at all to a relational data model. Whenever that occurred, we found ourselves having to either “twist” the relational model to fit, or more likely, to jump through hoops in the programs in order to force the code to match the relational data store.
Now that we’ve moved on to an era of polyglot persistence choices, we can re-examine some of these decisions and make some better ones. In particular, we want to look at four different cases where it’s obvious that the relational model was not the best option and then consider a case where the relational model was the best option and putting the data in another form would not have been the right approach.
Handling Blob storage
Many times, we’ve looked through the persistence code of enterprise systems only to find, to our surprise, that what they’ve actually been storing in their relational databases are binary representations of serialized Java objects. These are stored in “Binary Large Object” or “Blob” columns and are usually the result of a team throwing up their hands at the complexity of trying to map their Java objects into relational tables and columns. Blob storage has certain disadvantages: It can never be queried on a column basis; it is often slow; and it is sensitive to changes in the structure of the Java objects themselves—older data may not be readable if the object structure changes significantly.
If your application (or, more likely, a subset of your application) is using Blob storage in a relational database, that’s already a pretty good sign that you might be better off using a key-value store like Memcached or Redis. On the other hand, you may want to take a step back and think a little bit about what you’re storing. If it turns out to be just a structured Java object (perhaps deeply structured, but not natively binary) then you may be better off using a Document store like Cloudant or MongoDB. What’s more, with a little bit of effort spent into how you store your documents (for instance, both of the above-mentioned databases are JSON document stores, and JSON Parsers are both widely available and easy to customize) you can handle any “schema drift” issues much more easily than you could with a Blob store approach, which is much more opaque in its storage mechanism.
Flat objects and the Active Record pattern
Years ago, when Martin Fowler was writing “Patterns of Enterprise Application Architectures,” we had an active correspondence and several lively review meetings about many of the patterns. One in particular always stuck out as an odd duck—the Active Record pattern. It was odd because coming from the Java community, we had never encountered it (although Martin assured us that it was common in the Microsoft .NET programming community). But what really struck us about it—and especially when we started seeing a few Java implementations of it using open source technologies like iBatis—was that it seemed like the best case for using it was when the objects were, well, flat.
If the object you are mapping to a database is completely and totally flat—with no relationships to other objects (with the limited exception perhaps of nested objects)—then you probably aren’t taking advantage of the full capabilities of the relational model. In fact, you are much more likely to be storing a document. Very often in the cases where we’ve seen this, teams are literally storing electronic versions of paper documents, be they customer satisfaction surveys, problem tickets, etc. In that situation, a document database like Cloudant or MongoDB is probably the best match for you. Splitting your code out into its own services that work on that type of database will result in much simpler code and will often be more easily maintained than trying to do the same as part of a large enterprise database.
Handling reference data
Another common pattern that we’ve seen in ORM systems is the combination of “reference data in a table sucked into an in-memory cache." Reference data consists of things that are not often (or ever) updated, but that are constantly read. A good example of this is the list of U.S. states or Canadian provinces; other examples include medical codes or standard parts lists. This kind of data is often used to populate drop-downs in GUIs.
The common pattern is to start off by reading the list from a table (usually a flat two or, at most, three-column table) each time it is used. However, you will then find that the performance of doing that every time is prohibitive, so instead, the application reads it into an in-memory cache like EhCache at startup.
Whenever you have this problem, it is begging to be refactored into a simpler, faster caching mechanism. Again, this is a situation where Memcached or Redis would be perfectly reasonable. If the reference data is independent of the rest of your database structure (and it often is or is, at most, loosely coupled) then splitting the data and its services away from the rest of your system can help.
The query from hell
In one customer system that we worked on, we were doing complex financial modeling that required very complicated queries (on the order of six- or seven-way joins) just to create the objects that the program was manipulating. Updates were even more complicated, as we had to combine several different levels of optimistic locking checks just to find out what had changed and if what was in the database still matched the structure we had created and manipulated.
In retrospect, it’s clear that what we were doing would have been more naturally modeled as a graph. Situations like this (in this case, we were modeling tranches of funds, each made up of different types of equities and debt obligations, and each of those priced in different currencies and due at different times, with different rules surrounding each valuation) are something that almost begs for a data structure that will allow you to easily do what you really want to do—traverse up and down the graph and move parts of the graph around at will.
This is where a solution like Apache Tinkerpop or Neo4J would be a good approach. By modeling the solution directly as a graph, we could have avoided a lot of complicated Java and SQL code, and at the same time, probably significantly improved our runtime performance.
Back to the future; or, when traditional SQL or NewSQL shines
Even though there are many cases where NoSQL databases are the right logical approach for particular data structures, it is often hard to beat the flexibility and power of the relational model. The great thing about relational databases is that you can very effectively “slice and dice” the same data into different forms for different purposes. Tricks like Database Views allow you to create multiple mappings of the same data—something that is often useful when implementing a set of related queries of a complex data model.
For a deeper dive into a comparison of NoSQL and SQL see "SQL vs. NoSQL: What's the Difference?"
As we discussed in a previous article, if the “lower bound” of a microservice is a set of Entities grouped into an Aggregate together with the associated set of Services that operate on that data, implementing the views to represent a set of related queries is often easiest and most straightforward with SQL.
We first saw that in the customer example that led to our simple Account/LineItem example in the previous article. In this case, there was a bank that was trying very hard to make a simple model just like that work in a Document-oriented NoSQL database, only to be defeated by the CAP theorem. However, the team had chosen that data model for all the wrong reasons. They had chosen to use the Document-oriented database for all of their diverse microservices out of a misplaced desire for architectural consistency.
In this case, they needed all of the attributes of the ACID mode, but they did not need sharding (e.g., partitioning); their existing system had functioned for years at an entirely acceptable level of performance on a relational model, and they were not anticipating enormous growth. But, although the core of the system needed ACID transactions and did not need partitioning, that was not necessarily true of all of the different parts of their system. There are some things that SQL databases are great at, and ACID transactions are one of them. In microservices systems, grouping those ACID transactions together around the smallest set of data they operate on is usually the right approach.
SQL does not necessarily mean traditional SQL Databases, however. It can, and there is certainly a place for that in many microservices architectures, but SQL is also implemented in at least two other types of databases that can be useful choices for many teams implementing microservices. The first is “small SQL,” which is the domain of open source databases like MySQL and Postgres. For many teams implementing microservices, these databases are a perfectly reasonable choice for many reasons:
- Teams that are refactoring existing applications into microservices usually have experience with both SQL and object-relational mapping frameworks like Hibernate or Spring Persistence. Leveraging that knowledge while staying within the bounds of a microservices approach is certainly reasonable.
- These databases are very well supported, both by the open source community and by many vendors that provide support for them. It is relatively easy to find documentation and tutorials for them and to find developers skilled in them.
- These databases are small and lightweight enough to containerize easily and deploy and update through the same GitOps mechanisms you use for your application code.
- These databases are supported in SaaS versions from most or all of the hyperscalar public clouds.
The major drawback of using these “small SQL” databases is that they often do not support the same level of scale (particularly with regard to sharding) that enterprise databases can support. However, this has been admirably addressed by a new set of database vendors and open source projects that combine the best attributes of small SQL databases and the scalability of NoSQL databases. Often called “NewSQL” databases, these include CockroachDB, Apache Trofidion, and Clustrix.
Whenever you need all of ACID transactions, a SQL engine that fully supports the relational model, and also wide scaling and sharding capabilities, NewSQL databases can be a good choice for teams. That choice comes at a cost though–these databases are often more complex to set up and manage than the older “small SQL” solutions. It is also harder to find people with skills in these databases. Regardless, though, they still need to be carefully considered when working through your options.
Where do you go from here?
We’ve taken a whirlwind tour through a number of different database modeling issues that can masquerade as coding issues. If you find you have one or more of these particular issues, then you may be better off splitting away from your existing enterprise data store and re-envisioning it with a different type of data store. Our advice is, in any case, do this slowly and incrementally. Be aware that not all applications need all types of data models and that you need to build skills and gain an understanding of the operational capabilities of your chosen databases over time before you begin implementing them at a wide scale.
Finally, be cognizant of the fact that the entire microservices approach is based on the idea that the teams building each microservice can make their own decisions on what data store is right for them. The biggest single problem we’ve encountered (as many of these stories have hinted) is trying to make a single data representation and storage approach work for all situations.
Many times I’ve seen teams have to try and deal with the realities of the CAP theorem in situations where they shouldn’t have even been using a NoSQL database. Likewise, trying to do complex reporting from a NoSQL database is often an exercise in frustration. On the other hand, the situations we’ve shown point out that the relational model isn’t all-encompassing. Other models have their places as well. The one best piece of advice we can give is to ensure that your teams have the necessary autonomy to pick the right model for each microservice.
IBM Garage is built for moving faster, working smarter, and innovating in a way that lets you disrupt disruption.