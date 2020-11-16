JM: You talked about the variety of different databases that were out there at the time of DataStax’s creation. That problem (or rather, I don’t necessarily want to call it a problem), that situation still exists today and has probably increased by some order of magnitude. Right? When I look at the industry, I see a wide, wide variety of database vendors to pick from or open source database technologies to use. So, from your perspective, having been there since Day Zero with Cassandra and DataStax, why should someone use Cassandra versus the myriad other databases out there to pick from?

JE: A friend of mine named Andy Pavlo — he’s a professor at Carnegie Mellon — runs a web page called the dbdb: The Database of Databases. He just kind of collects database products to preserve a little about them for history because not all of them last very long. Last I checked, he has over 700 entries there. So, there’s definitely a whole lot of options if you want to look for a database to use for your next project. Realistically, there’s only a handful [of options] that are mature enough to solve general purpose problems. If you look at those 700 options, some of them are really specifically targeted at a very narrow niche and others are kind of hobbyist projects.

If you want to narrow it down to industry standards, you’re probably better off looking at the top 10 from the DB-Engines ranking, where they look at how many people are using these databases, how many jobs are being created around them, and those types of signals.

If I were to evangelize Cassandra today, I’d point at a couple of things. First of all, all of those top-10 databases have one thing in common: they are not brand new. It takes time to stabilize a database and get the bugs beaten out of it, especially if you’re talking about a distributed system. There is just no replacement for getting a lot of people to use it and shaking it down. Eric Raymond famously said that “with enough eyes, all bugs are shallow.” That’s not literally true, but there’s a lot of truth to the spirit of it. I don’t think there’s a shortcut to getting that kind of quality without getting that many eyes on it.

The traditional Cassandra strengths over the years have been around bulletproof stability, performance and scale, and Java development experience.

From a stability point of view, I talked to a DataStax customer recently that has been a customer for over six years, and they’ve had zero downtime. This wasn’t a toy use case. The database is doing hundreds of thousands of operations a day, and it has been through multiple releases during that time with zero downtime. That’s a realistic achievement with Cassandra. It’s not like these customers are in some 99th percentile special case. Zero downtime is a real thing you can expect to get with Cassandra.

The second one would be the performance and scale, especially multi-region scale. I think Cassandra was kind of one of the first to really focus on scaling across multiple regions and is still the best at that.

The third one is that we have always had a really good Java driver, and we recently added reactive streams support. It’s a really productive and pleasant experience as a Java developer. So, if you’ve heard horror stories about how hard Cassandra is to use, you might be surprised by how productive you feel with it.

Those are kind of the traditional answers. I would also say that in the past year or so we’ve started making improvements in places where we’ve traditionally not been as strong. Our main constituency has been enterprise Java developers, but we want to make Cassandra available to everyone and not just that specific set of developers. As part of that emphasis on enterprise development, we’ve focused on development for big projects and the big processes that go with them. One specific way that has manifested is that, unlike a lot of other options in the NoSQL space, Cassandra has very strong opinions about schema: you should have a schema, and your data must conform to that schema. That’s a positive thing in the enterprise space, because if you have six different teams accessing the same database, then having that schema is a way for them to find common ground and avoid a lot of problems. The flip side is that if you’re a young startup and you’re exploring a problem space, it’s just way more productive to use schemaless JSON documents and evolve that schema on the fly as you’re building.
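To make the schema trade-off described above concrete, here is a minimal Python sketch. It is purely illustrative and is not Cassandra, CQL, or Stargate code: `USER_SCHEMA`, `SchemaTable`, and `DocumentCollection` are hypothetical names invented for this example. The schema-first table rejects rows that do not conform, while the document collection accepts any JSON-serializable dictionary.

```python
import json

# Hypothetical schema for illustration only: column name -> expected Python type.
USER_SCHEMA = {"id": int, "name": str, "email": str}


def validate_row(row, schema):
    """Reject rows that use unknown columns or values of the wrong type."""
    unknown = set(row) - set(schema)
    if unknown:
        raise ValueError(f"unknown columns: {sorted(unknown)}")
    for column, value in row.items():
        if not isinstance(value, schema[column]):
            raise TypeError(f"{column} must be {schema[column].__name__}")
    return row


class SchemaTable:
    """Schema-first style: every insert is checked against a shared schema."""

    def __init__(self, schema):
        self.schema = schema
        self.rows = []

    def insert(self, row):
        self.rows.append(validate_row(row, self.schema))


class DocumentCollection:
    """Schemaless style: any JSON-serializable dict is accepted as-is."""

    def __init__(self):
        self.docs = []

    def insert(self, doc):
        json.dumps(doc)  # the only requirement is that the document is valid JSON
        self.docs.append(doc)
```

With several teams sharing one `SchemaTable`, a misspelled column fails loudly at insert time. With `DocumentCollection`, a new field can be added without any coordination, which suits early exploration but pushes consistency checks into application code.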

What we did to address this problem is we created a project called Stargate that adds a schemaless JSON document API along with REST and GraphQL endpoints. Our goal is to meet developers where they are and not insist, “You have to use, you know, Java with CQL (Cassandra Query Language),” but rather, “Hey, if you want to use Node.js, Go, or JSON docs instead of CQL, go ahead.” We want to say yes to all of those, whereas in the past we would have said, “No, you should do it this way instead.”

Our broader goal is to democratize running Cassandra clusters in the same way that we’re democratizing building applications against Cassandra. The tip of the spear there for running Cassandra is Kubernetes. That’s where everyone has collectively decided they want to standardize their operations.

And Cassandra has been a little bit late to that. I would say only a little bit, though, because of course Kubernetes grew up around stateless workloads, and it’s only really been suitable for databases for a couple of years now. As part of Astra, our “Cassandra as a service,” we created a Kubernetes operator that we run Astra with. We open sourced it and we’re contributing it formally to the Apache project as part of that democratization effort.

JM: Sidebar, real quick: When you talked about running a database through the wringer to find the bugs, I’ve always been super impressed with how FoundationDB approached that with the deterministic simulation in their testing framework. I’d be curious to hear your thoughts on that type of model for running a database through its paces.

JE: On the DataStax side, we built a tool called Fallout that basically allows you to compose different workloads and scenarios against a distributed system. In the Cassandra case, maybe I’m adding a new node to the cluster while it’s running a repair (an anti-entropy sync) at the same time as I’m throwing a bunch of reads and writes at it. Maybe I’m also doing a backup at the same time, so I’m taking a snapshot. With Fallout, you can compose these scenarios and run them in parallel, and Fallout will check, “Am I getting the results out of the database that I should be expecting?” We actually open sourced that this year as well. And, you know, I think we’ve barely scratched the surface in terms of the benefits that taking a structured approach to that can bring.
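The idea of composing workloads, running them in parallel, and then checking the results can be sketched in a few lines of Python. This is only an illustration of the general pattern and does not use Fallout’s actual API or configuration format: `ToyStore`, the workload functions, and `run_scenario` are all hypothetical names, and the "database" is a lock-protected in-memory dictionary.

```python
import threading


class ToyStore:
    """In-memory stand-in for the system under test (not a real database)."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def write(self, key, value):
        with self._lock:
            self._data[key] = value

    def read(self, key):
        with self._lock:
            return self._data.get(key)

    def snapshot(self):
        with self._lock:
            return dict(self._data)


def write_workload(store, n):
    for i in range(n):
        store.write(f"k{i}", i)


def read_workload(store, n, errors):
    # A key may not be written yet (None), but it must never hold a wrong value.
    for i in range(n):
        value = store.read(f"k{i}")
        if value is not None and value != i:
            errors.append((f"k{i}", value))


def snapshot_workload(store, snapshots):
    snapshots.append(store.snapshot())


def run_scenario(n=1000):
    """Run the composed workloads concurrently, then check the outcome."""
    store = ToyStore()
    errors, snapshots = [], []
    threads = [
        threading.Thread(target=write_workload, args=(store, n)),
        threading.Thread(target=read_workload, args=(store, n, errors)),
        threading.Thread(target=snapshot_workload, args=(store, snapshots)),
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

    # Checker: all writes landed, no read saw a wrong value, and every
    # snapshot agrees with the final state on the keys it captured.
    final = store.snapshot()
    expected = {f"k{i}": i for i in range(n)}
    snapshots_ok = all(snap[k] == final[k] for snap in snapshots for k in snap)
    return not errors and final == expected and snapshots_ok
```

Calling `run_scenario()` returns `True` when every check passes. A real harness would swap the toy store for a live cluster and add operational steps such as adding nodes or running repairs as further composable workloads.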