A Brief Overview of the Database Landscape
10 June 2019
7 min read
A closer look at the database landscape through licensing and data modeling.

If you’re just starting to explore the world of databases, you probably know two things already—using data effectively is lucrative, and picking a database to manage that data can be overwhelming. 

It’s true, data’s ability to improve top-line revenue is ever increasing. It can optimize user experiences and power machine learning. But that means there are hundreds of vendors fighting to store and analyze it for you. How do you choose? Well, as with most things in life, knowledge is power.

Check out this overview of the database landscape, specifically how to put databases into a business context. We’ll start with a deep dive into licensing and data modeling.

 
The spectrum of software licenses—demystified

Software licensing is, to put it mildly, complex. I won’t cover intellectual property rights in depth here (for that, I suggest checking Besen & Raskind’s “An Introduction to the Law and Economics of Intellectual Property” and Rosen’s Open Source Licensing: Software Freedom and Intellectual Property Law). Instead, I’ll focus on why you should care about a license and on trends in the open source database landscape that have implications for your business.

Ready?

First, let’s talk licenses

All software licenses carry rules for how you may use the technology, and you have to follow them. That means the licenses of the software you adopt can have a tangible impact on how you do business. Ignoring or violating these rules can expose you to legal risk and financial loss and, frankly, tarnish your company’s reputation. Whether you are purchasing software or adopting open source technologies, the license will ultimately constrain the usage of the code in some capacity. All this to say, be aware of these constraints as you develop your product to help mitigate longer-term legal risk.

Next, keep in mind that licenses aren’t fixed. In fact, right now, many companies that back open source database projects are in the process of changing their database licenses to become more restrictive. Depending on your use case, a database you’ve been using for free may now expose you to legal action. That’s not meant to scare you; it’s a reminder to stay vigilant. As these changes come about, it’s important to react appropriately. In some cases, a license change may require re-architecting a service, adopting a different database, or entering into a commercial agreement with the vendor.

Let’s explore this problem space a bit more

Those who frequent Hacker News or TechCrunch won’t be strangers to the conversation around open source and commercial database software. Here’s the gist: In the past three years, a debate has erupted due to a confluence of factors like the growth of major public cloud vendors and the market success, or lack thereof, of open source-centric database providers.

That being said, the relationship between free software and proprietary software is not binary—it is, by all means, a spectrum:

Note: Distances in the illustration are not to scale; the relative position of each license is what matters.

Looking at the spectrum above, at the far left, there is commercially licensed, or proprietary, database software, like Oracle, IBM Db2, and Microsoft SQL Server. These are powerful, feature-rich technologies that power workloads across every industry vertical. When purchasing this software from a vendor, or as a cloud service, you are paying a premium to get access to the following:

  • Code-level support

  • A robust ecosystem of tooling

  • Professional services and consultants

  • Visibility and influence into the roadmap of that database

On the right, there is public domain software. This software is under no copyright at all, meaning that it can be modified, distributed, or sold without restriction. Projects near the right end of the spectrum are often governed by the standards of an impartial and unbiased third party, such as the Apache Software Foundation or The Linux Foundation.

The Open Source Initiative (OSI) maintains a generally accepted list of what is and what isn’t an open source license. In general, open source software is characterized by the ability to “fork the code.” This means that if the direction of the project (software) is at odds with what you need or want, you are welcome to modify or edit the code as you see fit.

Using an open source technology is particularly compelling due to zero licensing costs, greater development transparency, and innovation that comes from a diversity of stakeholders, maintainers, and problem spaces. Compared to commercial software, with open source software you give up roadmap influence, guarantees around bug fixes or security patches, and contracts. In return, you get freedom from vendor lock-in and improved line-of-business flexibility. (You can see that’s a trade-off you and your team need to consider carefully.)

Between the two ends of the chart above, there are licenses with varying levels of permissiveness, such as Apache 2.0, MPL, and GPL 3.0.

Examples of databases mapped to licenses

  • Apache 2.0: Apache Cassandra, Apache CouchDB

  • Mozilla Public License: RabbitMQ

  • BSD: Redis

  • GPL 3.0: Neo4j

  • Proprietary: IBM Db2, Microsoft SQL Server
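
If you want to act on a mapping like this, one option is to keep it in code and check the databases a project depends on against a list of licenses your company has approved. Here is a minimal sketch of that idea. The mapping mirrors the list above as of this writing, while the `APPROVED_LICENSES` set and the `flag_unapproved` helper are hypothetical names I made up for illustration.

```python
# License of each project, copied from the list above (as of this article).
DATABASE_LICENSES = {
    "Apache Cassandra": "Apache-2.0",
    "Apache CouchDB": "Apache-2.0",
    "RabbitMQ": "MPL",
    "Redis": "BSD",
    "Neo4j": "GPL-3.0",
    "IBM Db2": "Proprietary",
    "Microsoft SQL Server": "Proprietary",
}

# Hypothetical allow-list; a real one would come from your legal team.
APPROVED_LICENSES = {"Apache-2.0", "BSD", "MPL"}


def flag_unapproved(databases):
    """Return (database, license) pairs that need review before adoption."""
    flagged = []
    for name in databases:
        # Anything missing from the mapping is treated as "Unknown" and flagged.
        license_name = DATABASE_LICENSES.get(name, "Unknown")
        if license_name not in APPROVED_LICENSES:
            flagged.append((name, license_name))
    return flagged


if __name__ == "__main__":
    print(flag_unapproved(["Redis", "Neo4j", "IBM Db2", "SomeNewDB"]))
```

Because licenses change over time, a check like this is only as good as the mapping behind it, so treat the mapping as something to re-verify rather than a source of truth.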

A bit of history for context

In the late 2000s, most nascent database vendors were heading to market as “open source” in order to garner easy access to adoption and developer mindshare. You may know companies in this camp, like MongoDB Inc., Redis Labs, and Elastic. These companies developed community projects like MongoDB, Redis, and Elasticsearch but looked to monetize that investment with enterprise-licensed versions, managed cloud implementations, or professional services.

However, the paradigm shift of cloud computing has made this business model precarious because major vendors can easily provide these technologies as a first-class, platform-native managed service. These offerings are delivered with compelling integrations for security, compliance, monitoring, and logging on their respective clouds, without providing guaranteed compensation to the creators of the software.

In recent years, companies have reevaluated their route to market. Now, we’re seeing them adopt licensing models that protect their development investments. For example, MongoDB, Redis Labs, and Confluent, with varying degrees of severity, have all changed the licenses of portions of code to prevent other companies from running them as a service without compensation.

“So, Josh, what’s your advice?” Great question.

Look, there are good reasons to use both commercial and open source databases. The important thing is that you know what you’re getting yourself and your company into. Review the license before you build an application to ensure your project is compliant, and if you’re instead looking to pick a license for an open source project, check out GitHub’s “Choose a License” webpage.

So, licenses are one important consideration because no one wants to get sued. Data model families are the other, but for a different reason. When building an application, knowing your way around data model types will help you pick the right tool for the job.

Data model families: The Fab 5

Now that you’ve got a handle on licenses, let’s talk about another critical consideration when selecting your database—data models.

When I first started at IBM, I needed to get up to speed fast, so I turned to NoSQL Distilled by Pramod Sadalage and Martin Fowler.

In that book, and in the industry at large, people tend to categorize databases into five “data model” families: document, key-value, graph, relational, and wide columnar. Here’s a quick overview of each one, including use cases and database-specific examples. This will help you determine, based on your data sets and business needs, which database you need.

1. Document

In this case, data is modeled in JSON-like documents, rather than rows and columns. Many of these databases favor availability over transactional consistency. Document databases lend themselves to simplicity and scalability, as well as fast iteration in development.
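
To make the document model concrete, here is a minimal sketch that uses plain Python dictionaries as stand-ins for JSON-like documents. The product fields, the `catalog` collection, and the `find_by_attribute` helper are all hypothetical; a real document database would provide its own query API.

```python
import json

# A product stored as one self-contained, JSON-like document.
product = {
    "_id": "sku-1001",
    "name": "Trail running shoe",
    "price": 89.99,
    "attributes": {"color": "blue", "sizes": [8, 9, 10]},
}

# A second product with different attributes; no schema change is needed.
other_product = {
    "_id": "sku-2002",
    "name": "Water bottle",
    "price": 12.50,
    "attributes": {"volume_ml": 750},
}

# Hypothetical in-memory "collection" keyed by document id.
catalog = {doc["_id"]: doc for doc in (product, other_product)}


def find_by_attribute(collection, key, value):
    """Return ids of documents whose attributes contain key == value."""
    return [
        doc_id
        for doc_id, doc in collection.items()
        if doc.get("attributes", {}).get(key) == value
    ]


if __name__ == "__main__":
    print(json.dumps(catalog["sku-1001"], indent=2))
    print(find_by_attribute(catalog, "color", "blue"))
```

Notice that the two documents carry different attributes without any table definition to update, which is where the fast iteration comes from.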

Business use cases:

  • Mobile apps that require fast iterations

  • Event logging, online shopping, content management and in-depth analytical processing

  • Retail catalogs with product attributes

Examples:

2. Key-Value 

This type of model represents the most basic type of non-relational database, where each item in the database is stored as an attribute name (referred to as a key) with its corresponding value.
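
Here is a toy sketch of the key-value idea, assuming nothing more than an in-memory dictionary. The `KeyValueStore` class and the `cart:user42` key format are hypothetical; real stores add persistence, expiry, replication, and more.

```python
class KeyValueStore:
    """Toy in-memory key-value store: every operation goes through a key."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # Overwrites any existing value stored under the same key.
        self._data[key] = value

    def get(self, key, default=None):
        # Lookups are by exact key only; there is no query language.
        return self._data.get(key, default)

    def delete(self, key):
        # Deleting a missing key is a no-op.
        self._data.pop(key, None)


if __name__ == "__main__":
    store = KeyValueStore()
    store.put("cart:user42", ["sku-1001", "sku-2002"])
    print(store.get("cart:user42"))
```

The store never inspects the value, which is why this model is so simple to scale and also why anything beyond lookup-by-key is left to the application.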

Business use cases:

  • User preference and profile stores

  • Product recommendations based on browsing data

  • Shopping carts

Examples:

  • DynamoDB

  • Redis

  • etcd

3. Graph

Data here is modeled as vertices and edges (entities and the connections between them). Much like how people think and process information, graph databases store the relationships between discrete units of data alongside the data itself. These databases make the persistence, exploration, and visualization of data and relationships more intuitive.
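
As a rough illustration of vertices and edges, the sketch below stores labeled edges in memory and walks them to find accounts that share a device, the kind of relationship question that comes up in fraud detection. The `Graph` class, the `USES` label, and the `accounts_sharing_device` helper are hypothetical; a real graph database would expose a query language for this.

```python
from collections import defaultdict


class Graph:
    """Toy undirected graph with labeled edges."""

    def __init__(self):
        # vertex -> set of (label, neighbor) pairs
        self._edges = defaultdict(set)

    def add_edge(self, source, label, target):
        # Store the edge in both directions so it can be walked either way.
        self._edges[source].add((label, target))
        self._edges[target].add((label, source))

    def neighbors(self, vertex, label):
        """Return vertices connected to `vertex` by an edge with `label`."""
        return {
            neighbor
            for edge_label, neighbor in self._edges.get(vertex, set())
            if edge_label == label
        }


def accounts_sharing_device(graph, account):
    """Two-hop traversal: account -> devices -> other accounts."""
    shared = set()
    for device in graph.neighbors(account, "USES"):
        shared |= graph.neighbors(device, "USES")
    shared.discard(account)
    return shared


if __name__ == "__main__":
    g = Graph()
    g.add_edge("alice", "USES", "phone-1")
    g.add_edge("bob", "USES", "phone-1")
    print(accounts_sharing_device(g, "alice"))
```

The traversal follows relationships directly instead of joining tables, which is the main appeal of the model for connected data.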

Business use cases:

  • Fraud detection

  • Real-time recommendation engines

  • Master data management

  • Network and IT operations

  • Identity and access management

Examples:

4. Relational 

The relational model, introduced by E.F. Codd while here at IBM, is the titan of the industry. Data is stored in tables as rows and columns, and relational databases often have sophisticated query engines for analytics and exploration. Relational databases support transactional guarantees and ACID (atomicity, consistency, isolation, and durability) compliance, whereas many databases in the other four families trade some of those guarantees for eventual consistency.
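
To show what a transactional guarantee looks like in practice, here is a small sketch using SQLite through Python’s built-in `sqlite3` module. The `accounts` table, the sample balances, and the `make_db`, `transfer`, and `balances` helpers are hypothetical and chosen only for illustration.

```python
import sqlite3


def make_db():
    """Create an in-memory database with two hypothetical accounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts ("
        "id TEXT PRIMARY KEY, "
        "balance INTEGER NOT NULL CHECK (balance >= 0))"
    )
    conn.executemany(
        "INSERT INTO accounts (id, balance) VALUES (?, ?)",
        [("alice", 100), ("bob", 50)],
    )
    conn.commit()
    return conn


def transfer(conn, source, target, amount):
    """Move funds atomically: both updates commit, or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?",
                (amount, target),
            )
            # Fails the CHECK constraint if the source would go negative.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?",
                (amount, source),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    """Return {account id: balance} for every row in the table."""
    return dict(conn.execute("SELECT id, balance FROM accounts ORDER BY id"))


if __name__ == "__main__":
    db = make_db()
    print(transfer(db, "alice", "bob", 30), balances(db))
    print(transfer(db, "alice", "bob", 500), balances(db))
```

When the second transfer fails, the credit that was already applied to the target account is rolled back too, so the data never ends up in a half-updated state.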

Business use cases:

  • E-commerce

  • Enterprise resource planning

  • Customer relationship management

Examples:

5. Wide Columnar

Column family stores enable very quick data access using a row key, column name, and cell timestamp. The flexible schema of these types of databases means that the columns don’t have to be consistent across records, and you can add a column to specific rows without having to add it to every single record. Wide columnar stores are derived from Google’s Bigtable paper. These data models shouldn’t be confused with column-oriented storage models, which are more relevant to data warehousing technologies and analytical access patterns due to improved compression of data on disk and more efficient use of CPU.
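
The addressing scheme described above (row key, column name, cell timestamp) can be sketched with nested dictionaries. This is a toy in-memory model, not how any particular wide columnar database is implemented, and the `WideColumnTable` class and the sensor data are hypothetical.

```python
from collections import defaultdict


class WideColumnTable:
    """Toy table whose cells are addressed by (row key, column, timestamp)."""

    def __init__(self):
        # row key -> column name -> list of (timestamp, value) versions
        self._rows = defaultdict(lambda: defaultdict(list))

    def put(self, row_key, column, value, timestamp):
        # Each write adds a new timestamped version of the cell.
        self._rows[row_key][column].append((timestamp, value))

    def get(self, row_key, column, as_of=None):
        """Return the newest value at or before `as_of` (latest if None)."""
        versions = self._rows.get(row_key, {}).get(column, [])
        candidates = [
            (ts, value)
            for ts, value in versions
            if as_of is None or ts <= as_of
        ]
        if not candidates:
            return None
        return max(candidates, key=lambda pair: pair[0])[1]

    def columns(self, row_key):
        """Columns present for this row; other rows may have different ones."""
        return sorted(self._rows.get(row_key, {}))


if __name__ == "__main__":
    table = WideColumnTable()
    table.put("sensor-1", "temp", 20.5, timestamp=100)
    table.put("sensor-1", "temp", 21.0, timestamp=200)
    table.put("sensor-1", "humidity", 40, timestamp=100)
    table.put("sensor-2", "temp", 19.0, timestamp=150)
    print(table.get("sensor-1", "temp"), table.columns("sensor-2"))
```

Here `sensor-1` has a `humidity` column that `sensor-2` lacks, which is the flexible schema at work, and each cell keeps its timestamped versions.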

Business use cases:

  • Security and stock market analytics

  • Click stream analytics

  • IoT and telemetry

Examples:

  • Apache Cassandra

  • DataStax Enterprise

  • Google Cloud Bigtable

The long and short of it is this: there are advantages and disadvantages to each primary data model (and we barely scratched the surface here). But when in doubt, go with something battle-tested and ubiquitous like PostgreSQL. To learn more about each data model family, check out NoSQL Distilled by Pramod Sadalage and Martin Fowler, particularly chapters 8-11.

Ready to learn more about databases?

Phew! I covered a bit of ground here, but if you are itching to learn more, here are some suggestions based on time investment:

Looking to get building? The IBM Cloud has a wide range of managed database services to help your team get moving fast.

Author
Josh Mintz Program Director