Welcome to Database Deep Dives, a series of interviews where we talk with builders, engineers, and leaders across the wide world of databases.
We recently had the pleasure of catching up with Adam Kocoloski (@kocolosk) and Jan Lehnardt (@janl) from the Apache CouchDB Project Management Committee. Check out the interview below to learn more about the strengths and weaknesses of Apache CouchDB, where the project is heading, and their expert advice for people looking to run CouchDB on Kubernetes.
Tell us a little about yourself and what you are working on today?
Adam Kocoloski (AK): Sure, my name is Adam Kocoloski, and what I’m working on these days is the technical strategy for databases and data services in the IBM Cloud. We have this whole portfolio of databases and associated things for schlepping data around, analyzing it, and extracting insights from it. My job is to try and sort of guide that into a coherent portfolio that meets the needs of our clients and puts us on a strong competitive footing with the rest of the market.
Jan Lehnardt (JL): I’m Jan Lehnardt. I have been doing open source stuff for about 20 years, CouchDB for 12 years. I started professionalizing my work, offering services and support for Apache CouchDB, with my company, Neighbourhoodie, about five years ago. When I’m not running that, I’m working on CouchDB 3.0 and 4.0, which we decided to do at the same time, which is very exciting. I’m also working on Opservatory, which we recently announced, a continuous observation tool for CouchDB installations.
How did you get involved with CouchDB?
AK: I got involved with CouchDB because my colleagues and I at MIT were toying with the idea of starting a company. One of the ideas we had was in the database space. We saw, back in 2007 and 2008, a diversity of people exploring different types of databases beyond what had traditionally been the one-size-fits-all database that you used to back your LAMP stack application. As practicing physicists, we had a whole bunch of pragmatic experience dealing with non-traditional data sets and non-traditional ways of managing them. We thought that this might be an interesting way for us to apply our skills.
We came across CouchDB because it seemed to share a lot of the same kinds of principles we had in terms of its approach to data distribution, its focus on data safety and durability, and in terms of its embrace of the web as the right medium for interacting with your data as you were building your application. We figured that rather than sitting down and building something that would end up looking and feeling like CouchDB, that it would be more effective for us to go and get involved in the community and see if we could contribute there. That would be the foundation for what we wanted to build in our company, which was Cloudant.
JL: I was working on the LAMP stack, mainly as a consultant. I eventually specialized in getting teams to think “scalable,” using all the benefits of the shared-nothing architecture in PHP, and rudimentary best practices. I found CouchDB on a blog somewhere and looked into it a little bit. Within a week, I got existentially panicked because if CouchDB were to catch on, I would be out of a job real quick. I thought I better figure out how to get proficient with it and see where it was going. It didn’t pan out the way I thought it would, with CouchDB replacing everything. But I was not wrong that something which looks and feels like CouchDB is now a major player in databases. That’s basically how I started, and I never left.
In your opinion, what are the strengths and weaknesses of CouchDB?
JL: There is one thing that makes CouchDB unique. If you go through the motions of building a database system, you will be asked “why are you doing this?” The reason for CouchDB’s existence is its unique replication capabilities, which range from low-level peer-to-peer replication (like IoT or mobile devices collecting data and talking to each other) to full multi-region cluster-to-cluster replication syncing data around. It’s the same technology that enables all of these use cases, and no other database really has it in that shape or form. Specifically, replication in CouchDB works more like Git than like MySQL replication.
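Part of what makes that Git-like replication converge without coordination is that every replica applies the same deterministic rule for picking a “winner” among conflicting document revisions: the leaf revision with the highest generation number wins, with ties broken by comparing the revision strings. The sketch below illustrates that rule in Python; the revision IDs follow CouchDB’s `<generation>-<hash>` shape, but the data and function are invented for illustration.

```python
def pick_winner(leaf_revs):
    """Pick the winning revision the way CouchDB's deterministic rule does:
    the leaf with the highest generation number wins, and ties are broken
    by comparing the revision suffixes, highest sorting last."""
    def sort_key(rev):
        gen, _, suffix = rev.partition("-")
        return (int(gen), suffix)
    return max(leaf_revs, key=sort_key)

# Two replicas that received the same conflicting edits in different
# orders still agree on the same winner, which is what makes sync
# convergent without any coordination between peers.
conflicts = ["3-aaa111", "3-bbb222", "2-ccc333"]
assert pick_winner(conflicts) == pick_winner(list(reversed(conflicts)))
print(pick_winner(conflicts))  # → 3-bbb222
```

The losing revisions are not discarded; CouchDB keeps them as conflicts for the application to resolve, which is exactly the Git-like part of the model.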
Its main strength compared to other databases is its data durability; like Adam mentioned earlier, it just never loses any data! It’s really hard to mess that up. When designing CouchDB, we were opting for safe and sound by default, which leads to a lot of operational simplicity. It’s easy to get up and running with CouchDB, and it will do what you need it to do if you don’t want to learn too much about it.
We also spent a couple of years thinking through the REST API. That is still paying dividends in terms of being approachable and useful in the long term, and it has worked well for us. Personally, I think programming against document databases is more natural than programming against SQL databases. That might just be my opinion.
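One thing that makes the document model feel natural is CouchDB’s optimistic concurrency: every update carries the `_rev` it was based on, and a write against a stale revision is rejected (an HTTP 409 in the real REST API). Here is a toy in-memory model of that rule; `TinyDocStore` and its revision scheme are illustrative inventions, not CouchDB’s implementation.

```python
class TinyDocStore:
    """Toy model of CouchDB's update rule: writes must name the _rev
    they were based on, or they are rejected as conflicts."""

    def __init__(self):
        self.docs = {}

    def put(self, doc_id, doc, rev=None):
        current = self.docs.get(doc_id)
        # Reject the write if it was based on anything but the latest _rev.
        if current is not None and current["_rev"] != rev:
            raise ValueError("conflict: document was updated concurrently")
        gen = int(rev.split("-")[0]) + 1 if rev else 1
        new = dict(doc, _id=doc_id, _rev=f"{gen}-{abs(hash(str(doc))) % 10**8:08d}")
        self.docs[doc_id] = new
        return new["_rev"]

db = TinyDocStore()
rev1 = db.put("order:1", {"status": "new"})
rev2 = db.put("order:1", {"status": "paid"}, rev=rev1)  # ok: based on latest
try:
    db.put("order:1", {"status": "void"}, rev=rev1)     # stale _rev: rejected
except ValueError as e:
    print(e)
```

In the real API the same dance is just two HTTP calls: a `GET` that returns the document with its `_rev`, and a `PUT` that sends the updated document with that `_rev` included.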
AK: I think I have to echo most of Jan’s comments there. As someone that had the temerity to start a database-as-a-service company in 2009 with this technology, I am grateful we placed a premium on data safety and durability. CouchDB just figures out how to get things back to health on its own 99 times out of 100, and we really do appreciate that. Clearly, replication enables a set of use cases that are so obviously the right tool for the job when you have a problem that’s shaped in the way that CouchDB thinks about the world and the way data ought to be replicated and synchronized among peers.
Those are definitely good strengths of the system. When it comes to weaknesses: it’s not easy to make CouchDB a low-latency system out of the box. You do have to be cognizant of all the things that go along with performant access to a web service. We see some clients make a new TLS connection with every request. There are a lot of handshakes going back and forth that are really unnecessary, but because it’s a web service, if you want to do that, CouchDB will happily negotiate those connections with you. It will keep on running and staying available, but it’s hard to make that fast.
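The fix for the handshake overhead Adam describes is simply connection reuse: open one keep-alive connection and send many requests over it, instead of paying the (TLS) setup cost per request. A minimal sketch using only the Python standard library, demonstrated against a throwaway local HTTP server; with CouchDB you would point any keep-alive HTTP client at its REST API the same way.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # keep-alive, so the connection survives

    def do_GET(self):
        body = b'{"ok": true}'
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the demo output quiet
        pass

# Throwaway local server standing in for a CouchDB endpoint.
server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# One persistent connection, many requests: no per-request setup cost.
conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
bodies = []
for _ in range(3):
    conn.request("GET", "/")
    bodies.append(conn.getresponse().read())
conn.close()
server.shutdown()
print(bodies)
```

Most HTTP client libraries do this for you when you reuse a session or client object across requests, which is usually all it takes to avoid the per-request handshake.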
I would also say that, as a community, we have probably invested a little less in client libraries than some of the other projects out there might have. In part that’s because we say: “Hi, we got this nice REST API. It’s a web service, if you know how to talk to a web service you know how to talk to your database.”
While that’s true, we have seen from time to time that it has made the learning curve steeper than it could have been. Some databases have poured a ton of time, money, [and people] into building out a diverse set of client libraries that mask some of the details of the API from the end user and present a more idiomatic programming model in each of the popular frameworks and ecosystems. I think it’s something we have learned from in recent years and have put more energy into. Nevertheless, it’s one place I’ll say we haven’t been as strong.
What are some of the coolest or most interesting use cases you have seen with Couch?
AK: One thing I should say before getting to the cool, exotic stuff, is that we are really happy that CouchDB serves as the foundation for a huge number of use cases in the IBM Cloud. It powers the IBM Cloud at a fundamental level, and the cloud would go down if CouchDB were to go down someday. I think that’s cool, even if most of the use cases might be garden-variety, boring applications in front of a database. It just works, and it scales, and it meets our needs.
We have seen people build a lot of mobile gaming stuff and mobile wallets. We have seen people take a collection of relational databases with a whole bunch of stored procedures and say: “My god, we can’t make any changes to this stack anymore, we've got stored procedures that refer to a company that went out of business 20 years ago! How do we take this all out and give our teams a development environment that lets us move at the required speed for the business?” A document database like CouchDB is a good fit there. It simplifies the data model and can accommodate quirks that might have emerged in disparate systems over many years.
JL: I have to echo Adam. My tagline for Couch is that “it’s a fine general-purpose database for 80% of applications that can use any database, so why not use Couch?” We have seen some cool stuff, though.
There is a company that is involved with shipping things around the planet. Anything bigger than what a person could carry in their hand is managed through a multi-region CouchDB cluster. Another is a company that does in-flight entertainment systems and has 3,000 planes in the air with CouchDB at any given time. We also have been involved in humanitarian crises, like the Ebola relief effort in 2016, when four or five West African countries were basically stopped in their tracks by the outbreak.
We built offline-capable first-responder tooling to help manage that crisis faster than they could with pen and paper, which is the usual method when infrastructure like power, edge networks, or 3G is missing. None of this would have been possible without CouchDB.
We also helped build vaccination trial software for that environment that led to the first Ebola vaccine. It’s extremely humbling from a personal perspective. If we look at the big human achievements of the past 100 years, the first Ebola vaccine has to be on that list. To be personally involved, or even if I just worked on CouchDB, that’s awesome.
What are you looking forward to in CouchDB 3.0 and 4.0?
JL: We decided to do CouchDB 3.0 and 4.0 at the same time for a good reason. They will ship in sequence, but we are thinking about and working on both at the same time. The main change with 4.0 is a major technology shift in the underpinnings of CouchDB. But before I get there: on the existing underpinnings, 3.0 will be the best CouchDB we have ever made. Not in “latest and greatest” marketing terms, but in the sense that there is a top 10 of things that people either run into and complain about or ask about right away, and CouchDB 3.0 will simply address all of them.
We are grateful to IBM for open sourcing a bunch of the stuff they built for Cloudant, which we can now add to the project, and we are also building a bunch of other things to make sure we deliver the best Couch platform. It includes all our learnings from the past 10-15 years with CouchDB, giving people who need to stay on that platform for a long time a very solid foundation. For CouchDB 4.0, maybe, Adam, you want to talk about that one.
AK: Yeah, sure. So, the 4.0 release adopts a distributed transactional key-value storage system called FoundationDB. This is something that I’m personally super excited about, and anyone who gets close enough to the project gets excited about the potential here, too.
FoundationDB was a commercial piece of software that was acquired by Apple and shut down, then subsequently open sourced a couple of years later. What this gives us is the ability to run CouchDB environments that scale out within a single region while still providing full strong consistency for updates. If you look at what we have done in Couch 2.0, and what we will preserve in 3.0, there is a difference in how we handle those updates, with individual replicas coordinating amongst each other.
And, while that’s been a durable and reasonably scalable design, it has left us with a few shortcomings that are tough to program around. It’s possible to get your database into a state where you have edit conflicts in a single region, despite the fact that you have one editor just looping and writing to the database. That’s a behavior we are looking forward to eliminating with Couch 4.0 and the adoption of FoundationDB.
Another is the way we do our view indexes in CouchDB, with a scatter-gather mechanism. It lets people deliver an awful lot of indexing throughput on a high-write database: the data is sharded out, and each shard builds its view of the index in parallel. But querying those things is a reasonably expensive proposition. As the database gets larger, the cost of a query continues to go up.
In contrast, when we have the views running against FoundationDB, we will have them reorganized and redistributed across a cluster of machines, so that querying a view at scale is as inexpensive as retrieving individual documents. I think that opens up a whole host of additional high-throughput, high-scale use cases for people adopting CouchDB. It’s a big shift. It really means that the CouchDB logic becomes a stateless application layer deployed on top of FoundationDB, which takes care of all the statefulness, the persistence, the materialization of the data.
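The core idea behind keeping views in an ordered key-value store is that once the emitted rows are maintained in key order, a range query is just a binary search plus a short scan, regardless of how the source documents were sharded. A minimal Python sketch of that access pattern, with invented data; real FoundationDB-backed views involve much more machinery than this.

```python
import bisect

# Emitted view rows, kept sorted by key, the way an ordered
# key-value store keeps its keyspace.
rows = sorted([
    ("2019-01-03", "doc-a"),
    ("2019-01-05", "doc-b"),
    ("2019-02-01", "doc-c"),
    ("2019-02-14", "doc-d"),
])

def query(start_key, end_key):
    """Return rows with start_key <= key < end_key.
    Binary search finds both boundaries, so cost grows with the
    result size, not with the total size of the index."""
    lo = bisect.bisect_left(rows, (start_key,))
    hi = bisect.bisect_left(rows, (end_key,))
    return rows[lo:hi]

print(query("2019-01-01", "2019-02-01"))  # the two January rows
```

Contrast this with scatter-gather, where every shard must be asked for its slice of the same range and the results merged, which is what makes large-cluster view queries expensive in the pre-4.0 design.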
It’s not something we are taking lightly. We always place a huge premium on data safety, data durability, and doing right by people’s data. We took that same perspective into an analysis of FoundationDB and we are happy to say that their project shares the same sort of emphasis and focus as CouchDB does.
JL: I have spent time on stage talking about CouchDB and explaining to people what CouchDB is. Questions often come up about eventual consistency, replication, and the Dynamo-model clustering in 2.0 and 3.0. The downsides of these are worth it, because if you wanted to build a distributed, strongly consistent database yourself, you would have to take 10 of the top-tier distributed-systems engineers on the planet and give them 5-10 years dedicated to doing the right thing. No one has that time.
It turns out that what FoundationDB, the company, built before it was acquired by Apple, and the needs that Apple had, were aligned with this. The [open] question for us was: “Should we translate CouchDB into something that looks like FoundationDB, or just build CouchDB on top of FoundationDB as a leapfrog?”
They have done the unthinkable, basically, and they have done it well. There is enough proof in the wild that they have done it right. It’s just very, very nice to see and we are lucky that it’s open sourced.
One final note—it’s a matter of procedure—but the CouchDB project has not officially said that 4.0 is what we just talked about. It will likely do that rather soon. In case any CouchDB developers are reading, there is some procedure involved in making it official.
It's exciting to hear about advancements in the core codebase, but the ecosystem around CouchDB isn’t standing still. With the advent of Kubernetes and friends, what advice would you give people who want to deploy CouchDB on containers?
AK: So, I guess I would say that choosing to deploy CouchDB on containers, and especially orchestrating CouchDB in a Kubernetes environment, is a decision that will set you up to receive improvements going forward.
This is a place where we are investing time and energy to ensure it’s an out-of-the-box experience that works and handles the needs of a diverse array of applications. It is a moving target, not just because of where CouchDB is in its support of Kubernetes and container-based deployments, but the industry as a whole.
We see that one day everyone is building Helm charts to stand up databases, and the next day everyone is building operators to manage the lifecycle of those databases. It can be a lot to keep track of because the space moves so quickly. You have to pay attention and stay engaged with the community if this is something you want to do in production. But, it is a place we’re interested in driving support. I think it offers potential for delivering an experience where most of the configuration gets done correctly the first time.
Some of the complexities of setting up a distributed system are things we can do a better job of auto-detecting and automating in a constrained environment like Kubernetes, rather than VMs over in a VMware environment somewhere.
JL: Yeah, I would definitely agree with the potential part. But, I would mention that we do professional services for Couch, and one of the areas we make good money with right now is moving people away from running CouchDB in Docker. I think that’s where the downside comes in.
For certain workloads, the technology isn’t quite there yet. I’m not saying Docker is the only container solution or Kubernetes the only orchestration option, but CouchDB, as a distributed database, makes prime use of low-latency, high-bandwidth networking and extremely fast I/O throughput and disk access.
In certain scenarios, which are becoming rarer, both Kubernetes and Docker get in the way of that. At that point, CouchDB gets slow, or you get timeout errors that you can’t explain. In the grand scheme of things, if you build an app that knows how to retry things, it’s not a big problem, but it is a moving target.
I would suggest that unless you have a dedicated team responsible for keeping the services on your Kubernetes cluster happy, especially databases, it’s out of scope for most use cases right now. Putting some automation around cloud VMs on IBM Cloud, Microsoft, or Google gets you most of the way there.
Generally, from a workload perspective, moving the data around is the expensive part. So, if you scale up more database workers for an hour’s worth of spike, you will wait half a day for the data to arrive on the new node before you can address that spike. So there is limited use in that [scale-out] scenario, which is what draws many people in the first place from an application perspective (where it works fantastically today).
When it comes to what Adam just said: make sure that deployment and cluster or distribution management all work well and are done declaratively instead of functionally. That’s where you will get the most use out of it. If you have low resource requirements but want the benefits of running inside a container or Kubernetes, sure, go for it; just know the [limits of certain metrics] and that you may be better off moving off again unless you have a dedicated team. It’s a mixed bag today, but I can’t wait for when that’s all settled and the direction is clear. I’m not saying this is never going to happen; we are just waiting for the underlying technologies to catch up with reality.
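The declarative approach Jan describes is what a Helm values file captures: you state the desired cluster shape and let the chart reconcile it. A hedged sketch of a `values.yaml` for the Apache CouchDB Helm chart; the parameter names below follow that chart’s conventions at the time of writing, but verify them against the chart version you actually deploy, and note the `uuid` shown is a placeholder.

```yaml
# Hypothetical values.yaml for the Apache CouchDB Helm chart.
clusterSize: 3            # one pod per CouchDB cluster node
persistentVolume:
  enabled: true           # database files must outlive the pod
  size: 10Gi
couchdbConfig:
  couchdb:
    uuid: REPLACE-WITH-A-UNIQUE-UUID   # stable identifier for this cluster
```

With a file like this, standing up or reshaping the cluster is a single `helm install`/`helm upgrade` against the chart, rather than a sequence of imperative node-setup steps.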
What advice do you have for people that might want to get involved in the community but haven’t worked with Erlang before?
AK: I think, in general, a comment about getting involved with open source projects: there can be a perception that you need to figure out how the deepest, darkest parts of the system work and deliver a crazy PR in order to get involved. That’s just not true. If nothing else, we can benefit from contributions all across the spectrum. Part of our job as members of the Project Management Committee is to encourage that diversity of contributions and ensure that we have a well-rounded, inclusive community for people to come in and get involved. The advice is: raise your hand. We will absolutely find ways to include you and make the best use of what you can bring to the table.
JL: That’s all very good advice. I would only add that while Erlang might seem daunting, it isn’t as bad or as hard as it looks at first sight. It might look weird, but if you open your mind to it, you can submit productive patches right away after going through a bunch of tutorials. One thing that helps is that our web-based HTTP JSON API can be learned in an hour.
Then, all you have to do is look behind the scenes to see how the HTTP request is handled and how the JSON is parsed, and you are off to the races, going deep into the stack. We have helped people land a minor CouchDB feature in an afternoon without any prior Erlang experience. Of course, we guided them, but with a bit more patience it might take you a weekend on your own.
AK: Good point, Jan. I don’t know about you, but I learned Erlang by reading the CouchDB source code and working with Couch.
Thanks to our interviewees for taking the time to share their knowledge. If you'd like more of a broad overview of CouchDB, check out Adam's video, "CouchDB Explained":
Check out the other installments in this series: