Cloudant advice to new users, from the people who design the product and run the service
So you’re new to IBM Cloudant, but you’re not new to database systems. As a Cloudant offering manager and engineer for the past four years, I’ve had the chance to see the product from all angles: the customers who use it, the engineers who run it, and the folks who support and sell it.
In many ways, this article is inspired by Dimagi’s perspective as users. I’d like to add our perspective as providers of the Cloudant database service and summarize the best—and worst!—practices we see most often in the field.
Rule 0: Understand the API you are targeting
You may use Java, Node.js, or some other use-case specific language or platform that likely comes with convenient client-side libraries that integrate Cloudant access nicely, following the conventions you expect for your tools. This is great for programmer efficiency, but it also hides the API from view.
This abstraction is what you want, of course (the whole reason for using a client library is to save yourself repeated, tedious boilerplate), but understanding the underlying API is vital when it comes to troubleshooting and reporting problems. When you report a suspected problem to Cloudant, we can help you much faster if you provide a way for us to reproduce it.
This doesn’t mean cutting and pasting a hefty chunk of your application’s Java source verbatim into a support ticket, as we’re probably not able to build it. Application code also introduces uncertainty as to where the problem may be: your side or our side?
Cloudant’s support teams will usually ask you to provide the set of API calls, ideally as a set of curl commands that they can run, to demonstrate the issue. Adopting this approach to troubleshooting as a general rule also makes it easier for you to pinpoint where things are failing. If your code is behaving unexpectedly, try reproducing the problem using only direct access to the API. If you can’t, the problem most likely isn’t with the Cloudant service itself.
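One way to get into this habit is to log each API call your application makes in a form that can be replayed. The following is a minimal sketch, not part of any Cloudant library: the helper name to_curl, the example.cloudant.com host, and the orders database are all made up for illustration. It renders a single HTTP call as a curl command line.

```python
import json
import shlex


def to_curl(method, url, body=None):
    """Render one HTTP API call as a curl command line (hypothetical helper)."""
    parts = ["curl", "-X", method.upper(), shlex.quote(url)]
    if body is not None:
        # JSON bodies need an explicit content type.
        parts += ["-H", shlex.quote("Content-Type: application/json")]
        parts += ["-d", shlex.quote(json.dumps(body))]
    return " ".join(parts)


# Hypothetical account host and database name.
command = to_curl(
    "post",
    "https://example.cloudant.com/orders",
    {"type": "order", "customer_id": "user:abc"},
)
print(command)
```

Remember to strip credentials from anything you paste into a support ticket.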
If you suspect that a problem you’ve encountered lies with an officially supported client library, then try to construct a small, self-contained code example that demonstrates the issue with as few other dependencies as possible. If you’re using Java, it is helpful to us if you can use a minimal test harness to highlight library issues.
Occasionally, Cloudant receives support tickets stating that “Cloudant is broken because my application is slow” without much in terms of supporting evidence. These tickets are nearly always traced back to issues in the application code on the client side or misconceptions about how Cloudant works.
Not always, but nearly always.
By understanding the API better, you also gain experience in how Cloudant behaves, especially in terms of performance.
Cloudant API docs.
Rule 1: Documents should group data that mostly change together
When you start to model your data, sooner or later, you’ll run into the issue of how your documents should be structured. You’ve gleaned that Cloudant doesn’t enforce any kind of normalization and that it has no transactions of the type you’re used to from, say, Postgres, so the temptation can be to cram as much as possible into each document, given that this would also save on HTTP overhead.
This is often a bad idea.
If your model groups information together that doesn’t change together, you’re more likely to suffer from update conflicts.
Consider a situation where you have users, each having a set of orders associated with them. One way might be to represent the orders as an array in the user document.
To add a new order, I need to fetch the complete document, unmarshal the JSON, add the item, marshal the new JSON, and send it back as an update. If I’m the only one doing so, it may work for a while. If the document is being updated concurrently or being replicated, we’ll likely see update conflicts.
Instead, keep orders separate as their own document type, referencing the customer id. Now the model is immutable. To add a new order, I simply create a new order document in the database, which cannot generate conflicts.
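The difference between the two models can be sketched with a toy in-memory store that rejects writes carrying a stale revision. This is only a stand-in to illustrate the behavior described above, not Cloudant’s implementation; ToyStore, ConflictError, and the document ids and fields are all hypothetical.

```python
import copy
import uuid


class ConflictError(Exception):
    """Stands in for the HTTP 409 that a stale update receives."""


class ToyStore:
    """In-memory stand-in with revision checks. Not Cloudant's implementation."""

    def __init__(self):
        self.docs = {}

    def get(self, doc_id):
        return copy.deepcopy(self.docs[doc_id])

    def put(self, doc):
        current = self.docs.get(doc["_id"])
        # An update must carry the revision it was based on.
        if current is not None and doc.get("_rev") != current["_rev"]:
            raise ConflictError(doc["_id"])
        generation = 1 if current is None else int(current["_rev"].split("-")[0]) + 1
        stored = copy.deepcopy(doc)
        stored["_rev"] = f"{generation}-{uuid.uuid4().hex}"
        self.docs[doc["_id"]] = stored
        return stored["_rev"]


store = ToyStore()
store.put({"_id": "user:abc", "type": "user", "orders": []})

# Embedded model: two writers read the same revision, then both write back.
writer_a = store.get("user:abc")
writer_b = store.get("user:abc")
writer_a["orders"].append({"order_id": "o1"})
writer_b["orders"].append({"order_id": "o2"})
store.put(writer_a)
try:
    store.put(writer_b)
    embedded_conflict = False
except ConflictError:
    embedded_conflict = True  # writer_b must re-fetch, re-apply, and retry

# Separate model: each order is a brand-new document with its own id.
store.put({"_id": "order:o1", "type": "order", "customer_id": "user:abc"})
store.put({"_id": "order:o2", "type": "order", "customer_id": "user:abc"})
print(embedded_conflict, sorted(store.docs))
```

In the embedded model the second writer loses and has to redo its work; in the separate model both writes are plain creates.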
To be able to retrieve all orders for a given customer, we can employ a view, which we’ll cover later.
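As a rough sketch of what such a view could look like, the design document below indexes order documents by customer id. The design document name, view name, and the type and customer_id fields are assumptions carried over for illustration. The map function itself is JavaScript, stored as a string.

```python
import json
from urllib.parse import quote

# Hypothetical design document; the map function is JavaScript stored as a string.
ORDERS_DDOC = {
    "_id": "_design/orders",
    "views": {
        "by_customer": {
            "map": (
                "function (doc) {\n"
                "  if (doc.type === 'order') {\n"
                "    emit(doc.customer_id, null);\n"
                "  }\n"
                "}"
            )
        }
    },
}


def orders_for_customer_path(db, customer_id):
    """Build the request path that asks the view for one customer's orders."""
    # View keys are JSON values, so the string key is JSON-encoded first.
    key = quote(json.dumps(customer_id), safe="")
    return f"/{db}/_design/orders/_view/by_customer?key={key}&include_docs=true"


print(orders_for_customer_path("shop", "user:abc"))
```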
Where possible, avoid constructs that rely on updates to parts of existing documents. Bad data models are often extremely hard to change once you’re in production.
Cloudant guide to data modeling.
Rule 2: Keep documents small
Cloudant imposes a max doc size of 1 MB. This does not mean that a close-to-1-MB document size is a good idea. On the contrary, if you find you are creating documents that exceed single-digit KB in size, you should probably revisit your model. Several things in Cloudant become less performant as documents grow. JSON decoding is costly, for example.
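A cheap guard in application code can flag documents that are drifting upward in size. The sketch below is only an approximation: it measures the compact JSON encoding on the client, which may not match exactly how the service counts bytes, and both the helper names and the 10 KB warning threshold are assumptions chosen for illustration.

```python
import json

MAX_DOC_BYTES = 1024 * 1024   # the 1 MB limit, assumed here to mean 2**20 bytes
WARN_DOC_BYTES = 10 * 1024    # illustrative "revisit your model" threshold


def doc_size_bytes(doc):
    """Size of the compact UTF-8 JSON encoding of a document."""
    return len(json.dumps(doc, separators=(",", ":")).encode("utf-8"))


def check_doc_size(doc):
    """Classify a document as 'ok', 'review_model', or 'too_large'."""
    size = doc_size_bytes(doc)
    if size > MAX_DOC_BYTES:
        return "too_large"
    if size > WARN_DOC_BYTES:
        return "review_model"
    return "ok"


print(check_doc_size({"_id": "order:o1", "type": "order", "total": 12.5}))
```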
Rule 3: Avoid using attachments
There are a few things to consider before using attachments in Cloudant today, especially if you’re looking at larger assets such as images and videos:
Cloudant is expensive as a block store.
Cloudant’s internal implementation is not efficient in handling large amounts of binary data.
So—slow and expensive.
It’s ok for small assets and/or occasional use, but as a rule, if you need to store binary data alongside Cloudant documents, it’s better to use a separate solution more suited for this purpose and store only the attachment metadata in the Cloudant document. Yes, that means writing some extra code: upload the attachment to a suitable block store of your choice, verify that the upload succeeded, and only then store the token or URL to the attachment in the Cloudant document.
Your databases will be smaller, cheaper, faster, and easier to replicate.
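The upload, verify, then reference sequence might look roughly like this. ToyObjectStore is an in-memory placeholder for whatever external store you pick, and the URL scheme, the attachment_ref field, and the helper name are invented for this sketch.

```python
import hashlib


class ToyObjectStore:
    """In-memory placeholder for an external block store (hypothetical)."""

    def __init__(self):
        self.blobs = {}

    def upload(self, key, data):
        self.blobs[key] = data
        return f"https://objects.example.com/{key}"  # made-up URL scheme

    def exists(self, key):
        return key in self.blobs


def attach_by_reference(store, doc, filename, data, content_type):
    """Upload the binary first, verify it, then return a doc holding only metadata."""
    key = f"{doc['_id']}/{filename}"
    url = store.upload(key, data)
    if not store.exists(key):
        raise RuntimeError("upload not verified; leaving the document untouched")
    updated = dict(doc)
    updated["attachment_ref"] = {
        "url": url,
        "content_type": content_type,
        "length": len(data),
        "sha256": hashlib.sha256(data).hexdigest(),
    }
    return updated


image = b"\x89PNG placeholder bytes"
object_store = ToyObjectStore()
user_doc = attach_by_reference(
    object_store, {"_id": "user:abc", "type": "user"}, "avatar.png", image, "image/png"
)
print(user_doc["attachment_ref"]["url"])
```

Only the small metadata object ends up in the document that gets written to Cloudant.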
Rule 4: Fewer databases are better than many
If you can, limit the number of databases per Cloudant account to 500 or fewer. While there is nothing magical about this particular number (Cloudant can safely handle more), there are several use cases that are adversely affected by large numbers of databases in an account.
The replicator scheduler has a limited number of simultaneous replication jobs it is prepared to run. That means that as the number of databases grows, the replication latency is likely to increase if you try to replicate everything contained in an account.
There is an operational aspect which is the flip side of the same coin: Cloudant’s operations team also relies on replication in order to move accounts around. By maintaining a smaller number of databases, you help us help you (should you need to shift your account from one location to another).
So, when should you use a single database and distinguish between different document types using views, and when should you use multiple databases to model your data? Cloudant can’t federate views across multiple databases, so if you have data that is unrelated to the extent that it will never be “joined” or queried together, then that data could be a candidate for splitting across multiple databases.
Rule 5: Avoid the “database per user” anti-pattern like the plague
If you’re building out a multi-user service on top of Cloudant, it is tempting to let each user store their data in a separate database under the application account. That works well, mostly, if the number of users is small.
Now add the need to derive cross-user analytics. The way you do that is to replicate all the user databases into a single analytics DB. All good. Now, this app suddenly becomes successful, and the number of users grows from 150 to 20,000. Now we have 20,000 replications just to keep the analytics DB current. If we also want to run in an active-active DR setup, we add another 20,000 replications and basically the system will stop functioning.
Instead, multiplex user data into fewer databases or shard users into a set of databases or accounts, or both. That way there is no need to replicate to provide an analytics DB, but auth becomes more complicated, as Cloudant only enforces permissions at the database level.
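One possible way to shard users into a fixed set of databases is to hash the user id. This is a sketch under stated assumptions, not a Cloudant feature: the function name, the userdata prefix, and the shard count of 16 are arbitrary choices for illustration.

```python
import hashlib


def database_for_user(user_id, shard_count=16, prefix="userdata"):
    """Map a user id to one of a fixed number of databases (hypothetical scheme)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Use a stable hash, not Python's built-in hash(), which varies per process.
    shard = int.from_bytes(digest[:8], "big") % shard_count
    return f"{prefix}-{shard:02d}"


# 20,000 users end up in at most 16 databases rather than 20,000.
databases = {database_for_user(f"user{i}") for i in range(20000)}
print(len(databases))
```

Note that changing shard_count later moves users between databases, so pick it with growth in mind. Your application also has to make sure each user only sees their own documents, since they now share a database.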
It’s worth stating that the “database-per-user” approach is tempting because Cloudant permissions are “per database”—it’s not really the users’ fault that this pattern has emerged.
Rule 6: Avoid writing custom JavaScript reduce functions
If you find yourself writing reduce functions, stop and consider whether you could reorganize your data so that this isn’t necessary or so that you’re able to rely on the built-in reducers.
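As an illustration of leaning on the built-in reducers, the design document below uses _sum and _count in place of hand-written reduce functions. The design document name, view names, and the type, customer_id, and total fields are assumptions for this sketch.

```python
# Hypothetical design document that relies only on built-in reducers.
SALES_DDOC = {
    "_id": "_design/sales",
    "views": {
        "total_by_customer": {
            # The map function is JavaScript stored as a string.
            "map": "function (doc) { if (doc.type === 'order') { emit(doc.customer_id, doc.total); } }",
            "reduce": "_sum",    # built-in: sums the emitted values
        },
        "count_by_customer": {
            "map": "function (doc) { if (doc.type === 'order') { emit(doc.customer_id, null); } }",
            "reduce": "_count",  # built-in: counts the emitted rows
        },
    },
}


def grouped_view_path(db, view_name):
    """Build the request path for one reduced row per distinct key."""
    return f"/{db}/_design/sales/_view/{view_name}?group=true"


print(grouped_view_path("shop", "total_by_customer"))
```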
Cloudant docs on reduces.
Rule 7: Understand the trade-offs in emitting data or not into a view
As the document referenced by a view is always available using include_docs=true, it is possible to emit only a key, with no value, and still look up whole documents by that key.
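A key-only view of this kind could be sketched as follows. The email field, the design document name, and the view name are hypothetical, and the map function is JavaScript stored as a string.

```python
import json
from urllib.parse import urlencode

# Hypothetical key-only view: the value is null, so the index stays small.
LOOKUP_DDOC = {
    "_id": "_design/lookup",
    "views": {
        "by_email": {
            "map": "function (doc) { if (doc.email) { emit(doc.email, null); } }"
        }
    },
}


def lookup_query(email):
    """Query string that fetches the full document for one key."""
    # View keys are JSON values, so the string is JSON-encoded before URL-encoding.
    return urlencode({"key": json.dumps(email), "include_docs": "true"})


print("/users/_design/lookup/_view/by_email?" + lookup_query("a@example.com"))
```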
This has advantages and disadvantages:
The index is compact—this is good, as index size contributes to storage costs.
The index is robust—as the index does not store the document, you can access any field without thinking ahead of what to store in the index.
The disadvantage is that getting the document back is more costly than the alternative of emitting data into the index itself because the database first has to look up the requested key in the index and then read the associated document. Also, if you’re reading the whole document, but actually need only a single field, you’re making the database read and transmit data you don’t need.
This also means that there is a potential race condition here—the document may have been deleted or changed between the index and document read (although unlikely in practice).
Emitting data into the index (a so-called “projection” in relational algebra terms) means that you can fine-tune the exact subset of the document that you actually need. In other words, you don’t need to emit the whole document. Emit only a value that represents the data you need in the app, such as a cut-down object with minimal details.
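A projection might look something like the sketch below, where the view emits a small object instead of null. The name and city fields, like the design document and view names, are made up for illustration.

```python
import json
from urllib.parse import urlencode

# Hypothetical projecting view: the emitted value carries only what the app needs.
PROJECTION_DDOC = {
    "_id": "_design/contacts",
    "views": {
        "card_by_email": {
            "map": (
                "function (doc) {\n"
                "  if (doc.email) {\n"
                "    emit(doc.email, { name: doc.name, city: doc.city });\n"
                "  }\n"
                "}"
            )
        }
    },
}


def card_query(email):
    """Query string for the projection; include_docs is deliberately left out."""
    return urlencode({"key": json.dumps(email)})


print("/users/_design/contacts/_view/card_by_email?" + card_query("a@example.com"))
```

The query is answered from the index alone, at the cost of a larger index.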
Of course, if you change your mind on what fields you want to emit, the index will need rebuilding.
Cloudant Query’s (CQ) JSON indexes use views this way under the hood. CQ can be a convenient replacement for some types of view queries, but not all. Do take the time to understand when to use one or the other.
Rule 8: Never rely on the default behavior of Cloudant Query’s no-indexing
It’s tempting to rely on CQ’s ability to query without creating explicit indexes. This is extremely costly in terms of performance because every lookup is a full scan of the database rather than an indexed lookup. If your data is small, this won’t matter. But as the dataset grows, this will become a problem for you and for the cluster as a whole. It is likely that we will limit this facility in the near future. The Cloudant dashboard allows you to create indexes in an easy way.
Creating indexes and crafting CQs that take advantage of them requires some flair. To identify which index is being used by a particular query, send a POST to the _explain endpoint for the database with the query as data.
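The request bodies involved can be sketched like this: one to create a JSON index through the _index endpoint and one to ask _explain which index a query would use. The helper names, the shop database, and the customer_id and created fields are assumptions; the sketch only builds the requests and does not send them.

```python
import json


def index_request(db, fields, name):
    """Method, path, and body for creating a JSON index on the given fields."""
    body = {"index": {"fields": list(fields)}, "name": name, "type": "json"}
    return "POST", f"/{db}/_index", body


def explain_request(db, selector, use_index=None):
    """Method, path, and body for asking which index a query would use."""
    body = {"selector": selector}
    if use_index is not None:
        body["use_index"] = use_index  # optionally pin the query to one index
    return "POST", f"/{db}/_explain", body


# Hypothetical database and field names.
create = index_request("shop", ["customer_id", "created"], "by-customer-created")
explain = explain_request("shop", {"customer_id": "user:abc", "created": {"$gt": "2018-01-01"}})
print(json.dumps(create[2]))
print(json.dumps(explain[2]))
```

If the response to the _explain call does not name the index you expected, adjust the selector or the index before the dataset grows.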
Cloudant Query docs.
Keep going with Part 2
To continue reading, check out Part 2 of the Cloudant Best (and Worst) Practices.