Database Deep Dives: Cassandra


Welcome back to the latest installment of Database Deep Dives. In this interview, we caught up with Jonathan Ellis, co-founder of DataStax and Apache Cassandra PMC member.


In this session, we talk about the origins of DataStax, when you should or should not use Cassandra, and the future of C* (Cassandra).

Joshua Mintz (JM): Hi, Jonathan. Thanks for joining. I would love it if you could tell me a little bit about yourself and how you got involved in the Apache Cassandra community and DataStax, Inc.

Jonathan Ellis (JE): Thanks, Josh. It's great to be here. The medium-length version of the story is that dipping my toe into the world of big data happened in 2006, when I built a storage backend for a backup provider called Mozy. Effectively, we built a version of S3 (Amazon Simple Storage Service) that was specialized for doing backups, which meant that what we mostly cared about was write throughput, and we definitely didn't prioritize read latency very highly. Making those tradeoffs let us offer a very cost-effective solution. But one of the things that was tricky about it was we wanted to do single-instance storage. What I mean by that is, if you and I both back up a copy of the same file — like maybe we both have Microsoft Word installed — I want to have only one instance of those files in my storage system and not, you know, a hundred thousand of them.

So, the complexity there — we were doing content-addressable storage, so actually deduplicating it on the storage backend was not the problem. The problem was how do we do garbage collection? If I've got 100,000 people who backed up the same file, how do I know when the very last one of them deletes it and it's safe to actually remove those bits from my storage system? This is the kind of problem that you would traditionally use a database for — but there wasn't, in 2006, a database that you could use off the shelf to handle mapping billions of files to millions of user accounts. I realized that this was going to be a generational problem that the industry was going to run into as people started to move toward mobile-first development, cloud-first development, and away from the kind of software where you install it for one company onsite and that's how you use it.

A good example — if you look at Atlassian, they're a family of products. Jira is used ubiquitously in the software development world. Back in the day, you used to install Jira on your own servers and maintain that. And they've moved to a cloud-first model, and they recently switched to a cloud-only model. So, if you want to use Jira, they're not selling new licenses for you to run onsite anymore, you need to go to jira.atlassian.com.

The main scaling problem isn't the stateless services — scaling a stateless service is easy. You just scale horizontally, throw more Nginx boxes at it. But scaling the database part, that's the hard part, that's the interesting part.

So, after Mozy, Rackspace asked me to join them in San Antonio, Texas, to build a next-generation scalable database for them to use internally as they built out their cloud services. This was in the fall of 2008. So that was super interesting for me. I got to evaluate what was out there — at the time, there was MongoDB, there was Cassandra, which had just been open sourced, and there were a couple of others that you don't hear much about anymore, like Voldemort from LinkedIn and Dynomite from Powerset.

The thing about Cassandra that made it challenging was that the guys at Facebook had built it, but they were open sourcing it not from the view of, “Hey, let's create a community,” but more from the view of, “Hey, this is cool tech we wrote. And if it's useful to you, enjoy; and if it's not, like don't complain to us, we've got other things to worry about.”

That was the main thing that gave me pause when looking at Cassandra. We were at the very beginning of this era of scalable databases — it wasn't even called NoSQL at the time. I decided I'd rather prioritize getting the foundational technology right and optimize for the long game. I thought Cassandra really had the best foundation to build on. That's what I put my weight behind.

Facebook's last bit of involvement was contributing it to the Apache Foundation. I became the first committer on the Apache project, and then the project chair when it graduated from incubation. In April of 2010, I started DataStax to commercialize it.

The breadth and complexity of the database industry from the Cassandra point of view

JM: You talked about a variety of different databases that were out there at the time of DataStax's creation. That problem (or I don't necessarily want to call it a problem) — that situation still exists today and has probably increased by an order of magnitude, right? When I look at the industry, I see a wide, wide variety of database vendors and open source database technologies to pick from. So, from your perspective, having been there since Day Zero with Cassandra and DataStax, why should someone use Cassandra versus the myriad of other databases out there?

JE: A friend of mine named Andy Pavlo — he's a professor at Carnegie Mellon — runs a website called dbdb: the Database of Databases. He just kind of collects database products to preserve a little about them for history, because not all of them last very long. Last I checked, he had over 700 entries there. So, there are definitely a whole lot of options if you're looking for a database to use for your next project. Realistically, there's only a handful that are mature enough to solve general-purpose problems. If you look at those 700 options, some of them are really targeted at a very narrow niche and others are kind of hobbyist projects.

If you want to narrow it down to industry standards, you're probably better off looking at the top 10 from the DB-Engines ranking, where they look at how many people are using these databases, how many jobs are being created around them, and those types of signals.

If I were to evangelize Cassandra today, I'd point at a couple of things. First of all, all of those top-10 databases have one thing in common — they are not brand new. It takes time to stabilize a database and get the bugs beat out of it, especially if you're talking about a distributed system. There is just no replacement for getting a lot of people to use it and shake it down. Eric Raymond famously said that “given enough eyeballs, all bugs are shallow.” That's not literally true, but there's a lot of truth to the spirit of it. I don't think there's a shortcut to getting that kind of quality without getting that many eyes on it.

The traditional Cassandra strengths over the years have been around bulletproof stability, performance and scale, and Java development experience.

From a stability point of view, I talked to a DataStax customer recently that has been a customer for over six years, and they've had zero downtime. This wasn't a toy use case: the database is doing hundreds of thousands of operations a day, and it has been through multiple releases during that time with zero downtime. That's a realistic achievement with Cassandra — it's not like this customer is in some 99th-percentile special case. Zero downtime is a real thing you can expect to get with Cassandra.

The second one would be the performance and scale, especially multi-region scale. I think Cassandra was kind of one of the first to really focus on scaling across multiple regions and is still the best at that.

The third one is that we have always had a really good Java driver, and we recently added reactive streams support. It's a really productive and pleasant experience as a Java developer. So, if you've heard horror stories about how hard Cassandra is to use, you might be surprised by how productive you feel with it.

Those are kind of the traditional answers. I would also say that in the past year or so, we started making improvements in places where we'd traditionally not been as strong. Our main constituency has been enterprise Java developers, but we want to make Cassandra available to everyone, not just that specific set of developers. As part of that emphasis on enterprise development, we've focused on development for big projects and the big processes that go with them. One of the specific ways that manifested is that, unlike a lot of other options in the NoSQL space, Cassandra has very strong opinions about schema — you should have a schema, and your data must conform to that schema. That's a positive thing in the enterprise space, because if you have six different teams accessing the same database, having that schema is a way for them to find common ground and avoid a lot of problems. The flip side is that if you're a young startup exploring a problem space, it's just way more productive to use schema-less JSON documents and evolve that schema on the fly as you're building.
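To make that schema-first stance concrete, here is what declaring a table looks like in CQL — every column and its type is stated up front, and writes that don't conform are rejected (the keyspace, table, and columns here are hypothetical, purely for illustration):

```sql
-- Hypothetical CQL schema: columns and types are fixed at table-creation time,
-- so every team writing to this table agrees on the same shape of the data.
CREATE TABLE IF NOT EXISTS shop.orders (
    user_id   uuid,
    order_id  timeuuid,
    total     decimal,
    status    text,
    PRIMARY KEY (user_id, order_id)
);
```

An `INSERT` with a column not declared here fails at the database, which is exactly the common ground across teams that Ellis describes — and exactly the friction a schema-less JSON document model avoids.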

What we did to address this problem is we created a project called Stargate that adds schema-less JSON, REST, and GraphQL endpoints. Our goal is to meet developers where they are and not insist, “You have to use, you know, Java with CQL (Cassandra Query Language),” but rather, “Hey, if you want to use Node.js, Go, or JSON docs instead of CQL” — we want to say yes to all of those, whereas in the past we would have said, “No, you should do it this way instead.”

Our broader goal is that we want to democratize running Cassandra clusters in the same way that we're democratizing building applications against Cassandra. The tip of the spear there for running Cassandra is Kubernetes — that's where everyone has collectively decided they want to standardize their operations.

And Cassandra has been a little bit late to that — though not unreasonably so, because Kubernetes grew up around stateless workloads, and it's only really been suitable for databases for a couple of years now. As part of Astra, our “Cassandra as a service,” we created a Kubernetes operator that we run Astra with. We open sourced it, and we're contributing it formally to the Apache project as part of that democratization effort.

JM: Sidebar, real quick: when you talked about running a database through the wringer to find the bugs, I've always been super impressed with how FoundationDB approached that with their deterministic simulation testing framework. I'd be curious to hear your thoughts on that type of model for putting a database through its paces.

JE: On the DataStax side, we built a tool called Fallout to basically allow you to compose different workloads and scenarios against a distributed system. In the Cassandra case, maybe I'm adding a new node to the cluster while it's running a repair (anti-entropy sync) at the same time as I'm throwing a bunch of reads and writes at it. Maybe I'm also doing a backup at the same time, so I'm taking a snapshot. With Fallout, you can compose these scenarios and run them in parallel, and Fallout will check, “Am I getting the results out of the database I should be expecting?” We actually open sourced that this year as well. And, you know, I think we've barely scratched the surface in terms of the benefits that taking a structured approach like that can bring.

Broader industry trends and exciting new features for DataStax and Cassandra  

JM: In the last few years, we have seen the industry rotate eventually consistent databases back toward transactional-consistency models — whether it's MongoDB introducing multi-document transactions, CouchDB adopting FoundationDB in 4.0, or managed services like DynamoDB and Azure Cosmos DB letting folks toggle consistency modes. Does Cassandra see itself introducing anything similar?

JE: Yeah, that's a really perceptive observation. Cassandra was, in some ways, early to that trend. We introduced what we call lightweight transactions — I think it was in 2015 — which allow you to opt into a stronger consistency model on a per-query basis.
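That per-query opt-in surfaces in CQL as conditional writes: adding an `IF` clause to a single statement triggers a consensus round for just that operation, giving compare-and-set semantics. A sketch (the table and columns here are hypothetical):

```sql
-- Lightweight transaction: insert succeeds only if no row with this
-- primary key already exists (atomic check-and-insert).
INSERT INTO users (username, email)
VALUES ('jdoe', 'jdoe@example.com')
IF NOT EXISTS;

-- Lightweight transaction: update applies only if the current value
-- matches the expected one (compare-and-set).
UPDATE users
SET email = 'new@example.com'
WHERE username = 'jdoe'
IF email = 'jdoe@example.com';
```

Both statements return an `[applied]` column indicating whether the condition held, and unconditioned queries on the same table keep their usual eventually consistent, lower-overhead path — which is the per-query tradeoff Ellis describes.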

If I'm completely honest, I would say that we're a little bit behind right now. We did that in 2015 and then we stopped there. That gets you serializable consistency for the queries that opt in, but if you need 50% or 90% of your operations to have that, then it's not a good fit. Now, I do still think that requiring that level of consistency is uncommon, but it is true that we have that gap to make up.

And so, if I were to look in my crystal ball, then I would say, yeah, Cassandra is probably going to move its lightweight transaction implementation from the raw Paxos that it’s built on now to either a multi-Paxos or Raft implementation. What that's going to do is it's going to reduce the contention and the overhead of doing that versus a normal eventually consistent operation and make that something that you can reach for more often without having to think about the tradeoffs quite so hard.

JM: I want to say, like, five or six years ago, I personally saw a phase where a lot of companies were trying to put their whole stack onto Cassandra. They were aiming to replace everything and wanted Cassandra to be the source of truth, the only database, the database to rule them all. And I personally saw some interesting failures and successes. I'd be curious if you could share any anonymous stories of particularly bad or good fits.

JE: I’ve kind of always taken the standpoint of — I want to make Cassandra the best tool for the job for as many people as possible, but with the recognition that there's still going to be this category of people for whom it's not the best tool for the job. If it’s not, then I don't want you to use it, right? I want you to use what's going to make you successful. If that's not Cassandra, then you know, don't use it. The bright line for me has been, if your data fits on a single machine, then don't use Cassandra. Cassandra makes a bunch of tradeoffs to be as scalable as it is and to get to that multi-region kind of replication. Those tradeoffs are not valuable if you're just using kind of a single-image database.

There's a whole category of problems where it's like, yeah, keep using MySQL, keep using PostgreSQL. You're going to be happier. Your Ops team is going to be happier. Everything's going to be simpler if it fits your use case. There was that NoSQL hype wave in 2011 or so where Cassandra was the hot new thing, so maybe there was some resume padding going on, with technologists looking at, “What can I work on that's hot and cool?” and cramming projects onto it that weren't a great fit.

 I think we're past that now. I think we're into kind of that more mainstream phase of crossing the chasm and not in that innovators’ phase. And so, there are some upsides and downsides to being that more mature enterprise product, but, on the whole, I’m happy with where it’s at.

I do remember distinctly that we had — this was at one of the first Cassandra summits, so it was probably at that 2011 hype wave stage I just mentioned — we had a gentleman give a talk called “Cassandra for Small Data” that really challenged my assumptions. He said: “People were telling me not to use Cassandra if you don't need the scale, but all of my data fits on a single machine. And I'm actually really happy with Cassandra because I'm replicating across seven data centers. And that's my use case. And Cassandra actually does that really well.”  So now I like to add that little footnote there about my bright line with data fitting on one machine.

JM: That's a very fair footnote. Because personally, as a product manager for IBM Cloud® Databases, that style of use case is not rare. The one where you need global scale, whether it be for latency purposes for your users or regulatory requirements. Customers often need to have that data availability over different AZs (availability zones), different regions, or even different clouds. So, I look forward to seeing how that evolves, especially with the proliferation of legislation around data residency.

Last question. Here at IBM Cloud, we're running DataStax as a managed service. But that doesn't mean we aren't as excited about Cassandra 4.0 as everyone else. Could you talk a little bit about the operational stability improvements and other big feature drops that you're excited about with this release?

JE: The headline for 4.0 is stability. We fixed over a thousand bugs. There's been a big emphasis here — DataStax open sourced Fallout and Apple open sourced Harry. Harry is a fuzz testing tool and it basically generates, you know, sort of random data to throw at your Cassandra cluster to turn up edge cases and expose bugs that people might not have thought of yet. In terms of the features you can expect, they are primarily around, how do I run my Cassandra cluster more successfully and how do I operate it more successfully?

One of those is around virtual tables, or a synonym for virtual tables would be system views. The traditional way of exposing what's going on inside your Cassandra cluster is through a Java API called JMX, and JMX is reasonably good at what it does. It basically gives you introspection into what's going on inside the JVM for a very low effort on the developer’s part. The downside is it doesn't play super nicely with anything that's not Java. And so, like with Kubernetes taking over the world and it being very Go-oriented, people are looking at Cassandra and saying, “Well, what do I do with all this JMX stuff?” We are taking two approaches to that.

One is virtual tables, which exposes information about the system as it happens over CQL. You can use any CQL client to pull data out of those virtual tables. The other is we've created a REST-based sidecar specifically to work nicely with Kubernetes operators. So, both of those are happening concurrently.
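Concretely, in Cassandra 4.0 the virtual tables live in node-local keyspaces such as `system_views`, and any CQL client can read them like ordinary tables — no JMX required (table names as of 4.0; check your version's documentation for the full list):

```sql
-- Who is currently connected to this node, over plain CQL.
SELECT * FROM system_views.clients;

-- The node's live configuration settings, readable without JMX.
SELECT * FROM system_views.settings;
```

Because these are just CQL reads, a Go-based Kubernetes operator or any other non-Java tool can monitor the cluster with the same driver it already uses for data.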

In the vein of making things easier for operators, there's a new set of audit-logging functionality, so you can see who's created tables, who's dropped tables — you know, get that logged securely. On a related but different note, there's also full-query logging. That's not so much for auditing as for when I want to capture some of my workload and replay it against another cluster for testing.
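Both features can be switched on in `cassandra.yaml`. The fragment below is a sketch of the 4.0-era options (the log directory path is a hypothetical example; consult the shipped defaults for the complete option set):

```yaml
# Audit logging: securely record schema changes, auth events, and queries.
audit_logging_options:
    enabled: true
    logger:
      - class_name: BinAuditLogger

# Full-query logging: capture the workload to a binary log so it can be
# replayed against another cluster for testing.
full_query_logging_options:
    log_dir: /var/log/cassandra/fql
```

Both can also be toggled at runtime per node via `nodetool`, which is handy when you only want to capture a short window of traffic for replay.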

Besides those operational improvements, there have also been some improvements on the performance side. Cassandra has mostly focused on throughput — how do I do as many ops per second as possible — but 4.0 also focuses on consistent p99 latencies. That is, for the 1% of my slowest queries, how do I reduce the gap between those slowest queries and the fast ones?

Learn more

At IBM Cloud, we offer IBM Cloud Databases for DataStax, a highly available, production-ready, managed database-as-a-service with a 99.99% SLA. It provides automatic backups, security and performance patches, and deep integration with IBM Cloud's monitoring, logging, auditing, and encryption key management technologies. It is PCI-DSS-, ISO 27001-, 27017-, 27018-, and HIPAA-ready.

If you enjoyed this content, you can check out our other Database Deep Dives:
