Cloudant best practices for a multi-tenant service

Organize documents and optimize performance

IBM acquired Cloudant in February of 2014, and since then, it has proven to be a useful document-based storage system. Cloudant provides an intuitive dashboard with many of the tools necessary for managing big data, and a variety of client-side libraries make it easy to leverage Cloudant in your own application. However, as with any great tool, there are best practices one should follow and pitfalls one should avoid.

This article provides best practices for Cloudant around the organization of documents/databases, querying and deletion of documents, and performance optimization. We share the lessons we learned using Cloudant to store configuration data in a multi-tenant cloud service. We would not describe ourselves as Cloudant experts, but we have been working with Cloudant for over a year. We've learned a lot through trial and error, and since we couldn't find many real-world stories about using Cloudant to build multi-tenant enterprise systems in the field, we thought it might be useful to share our experiences with other architects and developers.

What we did

We used Cloudant to store configuration data for approximately 1,000 tenants. Most of these tenants relied on a small volume of configuration data (using around 200 documents each), while a few tenants relied on a large volume of configuration data (using around 500,000-2.5 million documents each). We encapsulated data stored in Cloudant by providing a REST API implemented in Node.js. End users did not have direct access to Cloudant. The REST API was hosted on IBM Cloud as a scalable set of microservices.

What we learned, and how to use this article

There are lots of great articles that teach you the basics of CouchDB and Cloudant. The goal of this article is not to cover those basics, but rather to provide some guidance on using Cloudant in a multi-tenant environment. If you're not already familiar with the basics, check out the "Related topics" section at the end of this article for links to helpful articles. We recommend that you read those before proceeding here.

After we learned the basics, we weren't quite sure how to structure our data or design our search indexes. We share what we learned below as a loosely ordered set of best practices. Feel free to skip ahead if the later sections interest you more. We hope that some of the lessons we learned will be helpful to you in your next project!

Organizing documents

It's a good idea to organize documents as a large number of small databases rather than a small number of large databases. For our project, we initially created one Cloudant database for every tenant. All documents related to that tenant were stored in its tenant-specific database. As the tenant databases grew, we ran into indexing performance issues. Later, we reorganized the documents by creating one database per document type per tenant, and performance improved as a result. (We discuss our lessons learned in the "Optimizing index performance" section below.)

When working with various document types, separate each document type into a separate database. Many views and indexes only apply to specific types of documents. Using smaller databases with similar documents allows indexes to be smaller, and smaller indexes reduce the overall work performed by the Cloudant cluster when re-indexing. For example, a view that calculates the average "item" price will probably not examine "address" documents, so it's a good idea to create one database for items and another database for addresses.
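
As an illustration of the "one database per document type per tenant" layout, database names can be derived in a single place. This is a minimal sketch, not the naming scheme used in the project: the function name, the "t-" prefix, and the restriction of each part to [a-z0-9_] are assumptions made for the example. The only rule it relies on is that Cloudant database names must start with a lowercase letter and use a limited character set.

```javascript
// Hypothetical helper: derive one database name per tenant and document type.
// The "t-" prefix and "-" separator are illustrative choices, not Cloudant rules.
function databaseNameFor(tenantId, docType) {
  // Keep each part to a conservative subset of the characters Cloudant allows
  // in database names, so the result is always a valid name.
  const part = /^[a-z0-9_]+$/;
  if (!part.test(tenantId) || !part.test(docType)) {
    throw new Error("tenantId and docType must match [a-z0-9_]+");
  }
  return `t-${tenantId}-${docType}`;
}

console.log(databaseNameFor("acme", "item"));    // t-acme-item
console.log(databaseNameFor("acme", "address")); // t-acme-address
```

Rejecting unexpected characters, rather than silently rewriting them, avoids two different tenant IDs collapsing into the same database name.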

Deleting documents

When you delete a document in Cloudant, Cloudant does not actually remove the entire document from storage. Instead, it leaves a tombstone with very basic information about the document. The tombstone contains the fields _deleted (boolean flag), _id, and _rev. Try to avoid a pattern of deleting a large number of documents, because the tombstones continue to consume storage space for as long as the database exists. If you must delete a large number of documents frequently, then you should also delete the database that contains those tombstones on a regular schedule.

For example, consider an application that stores a large amount of data in Cloudant that only needs to exist for one month. If you simply delete the documents after they are no longer needed, the number of tombstones can, over time, easily outgrow the number of live documents. Instead, you should create a new database for every month of data, and delete the whole database when the data expires.
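
The monthly rotation described above can be sketched as two small helpers. This is only an illustration under assumed conventions: the "prefix-YYYY-MM" naming pattern and both function names are hypothetical. The code only computes names; actually removing a database is a separate HTTP DELETE on the database itself.

```javascript
// Hypothetical helpers for time-boxed data: one database per calendar month.
function monthlyDatabaseName(prefix, date) {
  const year = date.getUTCFullYear();
  const month = String(date.getUTCMonth() + 1).padStart(2, "0");
  return `${prefix}-${year}-${month}`;
}

// Given the existing database names, return the monthly databases that fall
// outside the retention window (monthsToKeep includes the current month).
function expiredDatabases(names, prefix, now, monthsToKeep) {
  const cutoff = new Date(
    Date.UTC(now.getUTCFullYear(), now.getUTCMonth() - monthsToKeep + 1, 1)
  );
  const oldestKept = monthlyDatabaseName(prefix, cutoff);
  return names.filter((name) => {
    if (!name.startsWith(`${prefix}-`)) return false;
    const suffix = name.slice(prefix.length + 1);
    // Zero-padded YYYY-MM suffixes sort correctly as plain strings.
    return /^\d{4}-\d{2}$/.test(suffix) && name < oldestKept;
  });
}
```

Dropping a whole database removes its live documents and its tombstones together, which is the point of the pattern.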

Querying documents

Cloudant provides multiple options for querying and retrieving documents. As you might expect, each option has benefits and drawbacks, so the best option for a specific use case depends on your usage scenario. In this section we'll discuss the query options and the scenarios where each worked better for us.

Primary index (search by document ID)

The primary index is the fastest way to retrieve data from your database. It comes with every Cloudant database, which means you don't have to write any code before you can use it. The primary index, often referred to as _all_docs, returns an ID, a key, and a value for every document in the database. The ID and key are the same (Cloudant makes an index keyed by doc ID), while the value is the _rev of the document. _all_docs also reports on the total number of documents and any offset used to query the index. If you know the ID of the document you want to query for, it is always more efficient to retrieve the document using it.

Pros:

  • Fastest query option
  • No extra storage overhead—keys are indexed automatically
  • Can query any number of documents at one time in a single request

Cons:

  • Least flexible query option—can only query by ID
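
For illustration, fetching several documents by ID in one request can be done by posting a list of keys to _all_docs. The sketch below only builds the request description and unpacks a response; the function names and the shape of the returned request object are assumptions for this example, and sending the request is left to whatever HTTP client you use.

```javascript
// Hypothetical helper: describe a multi-document fetch against the primary index.
function allDocsByIdRequest(dbName, ids, includeDocs = true) {
  return {
    method: "POST",
    path: `/${encodeURIComponent(dbName)}/_all_docs?include_docs=${includeDocs}`,
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ keys: ids }),
  };
}

// Pull the documents out of an _all_docs response. Rows without a usable doc
// (for example, unknown or deleted IDs) are skipped.
function docsFromAllDocsResponse(response) {
  return response.rows.filter((row) => row.doc).map((row) => row.doc);
}

const request = allDocsByIdRequest("t-acme-item", ["item-1", "item-2"]);
console.log(request.path); // /t-acme-item/_all_docs?include_docs=true
```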

Secondary index (MapReduce View)

The secondary index is a mechanism for working with document content in databases. It can selectively filter documents, speed up searching for content, and pre-process the results before they are returned to the client. There is a performance cost associated with indexing views, which we discuss in more detail in the section "Optimizing index performance."

Pros:

  • Can retrieve results for any number of documents in one request
  • Can reduce query results on the server to increase performance and reduce data transmitted over HTTP

Cons:

  • Custom JavaScript reduce functions do not always scale very well, so try not to use custom functions; use the built-in reduce functions when possible for better performance.
  • Storage overhead—keys emitted by map function are stored for every document in the database.
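
As a sketch of a view that pairs a map function with a built-in reduce, the design document below emits item prices by category and reduces them with _stats, from which an average can be computed as sum divided by count. The document fields (type, category, price), the design document name, and the runMapLocally helper are assumptions for this example; the helper exists only to try the map function outside Cloudant.

```javascript
// Map function as it would run inside Cloudant, where `emit` is provided
// by the view server. The fields type/category/price are illustrative.
const mapItemPrice = function (doc) {
  if (doc.type === "item" && typeof doc.price === "number") {
    emit(doc.category, doc.price);
  }
};

const designDoc = {
  _id: "_design/item-price-by-category",
  views: {
    "by-category": {
      map: mapItemPrice.toString(),
      reduce: "_stats", // built-in reduce: sum, count, min, max, sumsqr
    },
  },
};

// Local stand-in for the view server: runs a map function source over some
// documents and collects what it emits. For experimentation only.
function runMapLocally(mapSource, docs) {
  const rows = [];
  const emit = (key, value) => rows.push({ key, value });
  const mapFn = new Function("emit", `return (${mapSource});`)(emit);
  docs.forEach((doc) => mapFn(doc));
  return rows;
}

// Average price from a _stats reduce value.
const average = (stats) => stats.sum / stats.count;
```

Querying such a view with group=true would return one _stats value per category, so the client receives a handful of numbers instead of every matching document.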

Cloudant search (search by specific fields in the document)

Search indexes, defined in design documents, allow databases to be queried using Lucene Query Parser Syntax. Search indexes are defined by an index function, similar to a map function in MapReduce views. The index function decides what data to index and store in the index.

Pros:

  • More flexible than a Map Reduce View—any indexed field can be searched by itself or in conjunction with other indexed fields, so you don't need to do as much forward-thinking to make a search index flexible with future search requirements.

Cons:

  • Can only retrieve up to 200 documents at a time—when a search yields more than 200 results, a bookmark must be used to retrieve the next page of results in a subsequent HTTP query.
  • Storage overhead—fields are indexed for every document in the database.
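
To make the 200-result limit concrete, here is a sketch of a search index definition and a loop that follows bookmarks until all results are collected. The design document name, index name, and document fields are assumptions, and searchPage is a hypothetical caller-supplied function that performs the HTTP search request and resolves to the parsed JSON response (with rows and bookmark properties).

```javascript
// Search index definition. Inside Cloudant, `index` is provided by the
// indexer; the field names are illustrative.
const searchDesignDoc = {
  _id: "_design/person-search",
  indexes: {
    people: {
      analyzer: "standard",
      index: function (doc) {
        if (doc.firstName) index("firstName", doc.firstName, { store: true });
        if (doc.lastName) index("lastName", doc.lastName, { store: true });
      }.toString(),
    },
  },
};

// Collect every result for a query by following bookmarks page by page.
// `searchPage(params)` is hypothetical: it should send q, limit, and bookmark
// to the _search endpoint and resolve to { rows, bookmark }.
async function searchAll(searchPage, query, pageSize = 200) {
  const all = [];
  let bookmark;
  for (;;) {
    const params = { q: query, limit: pageSize };
    if (bookmark) params.bookmark = bookmark;
    const page = await searchPage(params);
    all.push(...page.rows);
    // A short (or empty) page means there is nothing left to fetch.
    if (page.rows.length < pageSize) return all;
    bookmark = page.bookmark;
  }
}
```

Each page is a separate HTTP round trip, which is why a view is usually the better fit when you routinely need far more than 200 documents at once.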

The following diagram visualizes the relationship between the different query options available in Cloudant. The primary index offers the best performance with the least flexibility, while Cloudant Search offers the most flexibility at the cost of performance.

Cloudant query options

This brief video gives a solid overview of primary index, secondary index, and Cloudant search, along with other search options.

Optimizing query performance

Rely on built-in reduce functions wherever you can; we found that custom JavaScript reduce functions do not perform well with a large number of documents. If you need to retrieve more than 200 documents with a single operation, use a secondary index instead of Cloudant Search. Use the stale option for queries that can tolerate slightly stale data. If you're paging through a large number of results from a view, use the startkey / startkey_docid options instead of the skip option, as described in the CouchDB wiki:

The skip option should only be used with small values, as skipping a large range of documents this way is inefficient (it scans the index from the startkey and then skips N elements, but still needs to read all the index values to do that). For efficient paging you'll need to use startkey and limit. If you expect to have multiple documents emit identical keys, you'll need to use startkey_docid in addition to startkey to paginate correctly. The reason is that startkey alone will no longer be sufficient to uniquely identify a row.
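
The paging approach from the quote can be sketched as follows. The idea is to request one row more than the page size and use that extra row's key and document ID as the starting point of the next page. queryView is a hypothetical caller-supplied function that runs the view over HTTP and resolves to the parsed response; the cursor object shape is also an assumption of this example.

```javascript
// Fetch one page of view results without using skip.
// `queryView(params)` is hypothetical: it should run the view with the given
// options and resolve to { rows: [{ id, key, value }, ...] }. When sent over
// HTTP, startkey has to be JSON-encoded in the query string.
async function fetchPage(queryView, pageSize, cursor) {
  // Ask for one extra row; it tells us where the next page starts.
  const params = { limit: pageSize + 1 };
  if (cursor) {
    params.startkey = cursor.startkey;
    params.startkey_docid = cursor.startkey_docid;
  }
  const { rows } = await queryView(params);
  const extra = rows.length > pageSize ? rows[pageSize] : null;
  return {
    rows: rows.slice(0, pageSize),
    // startkey_docid disambiguates rows that share the same key.
    next: extra ? { startkey: extra.key, startkey_docid: extra.id } : null,
  };
}
```

A caller keeps passing the returned next cursor back into fetchPage until it is null.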

Optimizing index performance

Whenever you create or update a design document, Cloudant populates an index for every view in that document. These indexes are important because they help your queries run faster. However, it's also important to understand how the indexing process works, because indexing a database with a large number of documents or complex views can cause unexpected side effects.

For example, indexing is a locking operation, so you won't be able to use the index while it's being built (unless you use the stale option as discussed in the "Optimizing query performance" section). In addition, the entire Cloudant cluster can become unresponsive or unstable during periods of heavy indexing, so you need to pay close attention to the performance of your indexing code.

When Cloudant indexes a database for a specific view, it runs the map function for every document in the database and stores the keys emitted by the map function for future reference. Databases with a very large number of documents take a long time to index because the map function must be run for every single document in the database, regardless of the number of keys emitted for each document. Views with a very complex map function take a long time to index as well, since that complex logic must be executed over and over for each document.

The indexing process runs in the background, and the HTTP request that you submit to create or update the design document returns before indexing has completed. You can monitor indexing progress through the Cloudant UI or monitoring API, but you cannot predict exactly how long a particular indexing task will take.

In our experience, the following practices reduced the performance cost of indexing our databases.

  • Minimize the number of views/indexes in each database.
  • Strictly define one view or index per design document. This practice allows you to update specific views/indexes without needing to re-index all of the documents in your database for all of your unchanged views/indexes.
    • Case 1—1 design document with 50 views. If you need to update one view, then you need to update the whole design document in Cloudant. This update causes Cloudant to re-index all documents in the database for all 50 views.
    • Case 2—50 design documents with 1 view each. If you need to update one view, then you only need to update one design document in Cloudant. This update causes Cloudant to re-index the changed view, but all unchanged views do not need to be re-indexed.
  • Views and indexes should apply to all (or most) of the documents in their corresponding databases. If your view/index doesn't apply to all the documents in your database, then you should either generalize the view/index to apply to all documents in the database, or move unrelated documents into separate databases. This practice reduces indexing time by running the index only on appropriate documents (which is potentially a much smaller set of documents, depending on how you structure your database).
  • For databases with a very large number of documents, you should update design documents one at a time. In other words, if you have 100 tenants with 2.5 million documents each, do not create or update the same design document in each database at the same time. Instead, create or update the design document for one database, monitor the Cloudant cluster, and wait until indexing completes—and then repeat for the next database. This will keep the cluster in a responsive state by not overloading it with re-indexing tasks at one time.
  • Never update design documents. Instead, create a new design document and later delete the old document. Many production systems must operate without any downtime, and often these systems will only promote new code while old code remains running at the same time. This practice allows the old code to work with the old view, while the new code works with the new view at the same time.
  • It's better to emit lots of different keys for the same document than it is to define multiple views/indexes for the same document. For example, if you have a database full of "Person" documents, do not define one view that emits a key for the first name (for example, "findPeopleByFirstName"), and another view that emits a key for the last name ("findPeopleByLastName"). Instead, define one view that emits keys for each attribute that you intend to query on ("findPeople"). This practice reduces indexing time because a smaller number of views/indexes must be run on every document update, but potentially increases query time since the number of keys that you query against is larger.
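
Several of the practices above can be sketched together: a single "findPeople" view that emits one key per queryable attribute, one design document per view, a version suffix so that a changed view is published as a new design document instead of an update, and a sequential rollout across databases. All names here (the document fields, the "-v2" suffix scheme, and the applyToDatabase and waitForIndexing callbacks) are assumptions for illustration.

```javascript
// One view that emits a key per queryable attribute, instead of one view
// per attribute. Inside Cloudant, `emit` is provided by the view server.
const findPeopleMap = function (doc) {
  if (doc.type !== "person") return;
  if (doc.firstName) emit(["firstName", doc.firstName], null);
  if (doc.lastName) emit(["lastName", doc.lastName], null);
};

// Build one design document per view. The version suffix means a changed
// view gets a new _id rather than overwriting the existing design document.
function buildDesignDocs(views, version) {
  return Object.entries(views).map(([viewName, view]) => ({
    _id: `_design/${viewName}-v${version}`,
    views: { [viewName]: view },
  }));
}

const designDocs = buildDesignDocs(
  { findPeople: { map: findPeopleMap.toString() } },
  2
);

// Apply a change to one database at a time and wait for indexing to settle
// before moving on. Both callbacks are hypothetical and supplied by the caller.
async function rollOut(dbNames, applyToDatabase, waitForIndexing) {
  for (const dbName of dbNames) {
    await applyToDatabase(dbName);
    await waitForIndexing(dbName);
  }
}
```

With this layout, looking up people by last name means querying the view with a key such as ["lastName", "Smith"], and the old design document can be deleted once no running code depends on it.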

Optimizing client performance

All requests to Cloudant must go over HTTP, and each HTTP request carries a performance cost. Cloudant supports a variety of client libraries that encapsulate the HTTP details, and allow you to invoke abstract operations such as running a view or creating a document. For most clients, the performance cost of the underlying HTTP request is mostly unnoticeable while working with low request volumes. However, with high request volumes typically associated with multi-tenant applications, the amount of time invested in HTTP requests can potentially cripple your application's performance.

  • Fine tune your HTTP client to be efficient with a large volume of requests. Specific settings will vary depending on the client technology you choose. We used the "nodejs-cloudant" library and found a significant performance improvement by reusing the existing HTTP connections with the HTTP Agent setting [KeepAlive = true]. HTTP Client implementations in other languages (like Java) refer to this concept as HTTP Connection Pool.
  • Update documents in bulk instead of creating or updating documents one at a time. Remember that each create, update, and delete operation for a single JSON document is also a single HTTP request. If your application only creates or updates one document every few days, then the cost of a single HTTP request is very low, and few optimizations are required. However, if your application creates or updates hundreds of documents every second, then using a separate HTTP request per document is far less efficient than using a single HTTP request for all documents in bulk. Cloudant bulk operations can handle create, update, and delete operations for any number of documents within a single HTTP request, so you can simplify your application logic by reusing the same code for all three operations. It's a good idea to implement your application code exclusively with support for bulk operations, because you can easily implement single-document functions as a special case of a bulk operations request. It's also a good idea to aggregate update requests for individual documents. For example, if 100 users of your application send requests to update 100 different documents during the same second, then it's better to send a single HTTP request to Cloudant instead of 100 HTTP requests.
  • Limit the number of records that a query can return by using the limit option. Result sets that are too large can cause out-of-memory problems in your client.
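
The first two points can be sketched in Node.js as follows. The agent uses the standard https module; the maxSockets value is an arbitrary example, and how the agent is handed to your Cloudant client depends on the library, so check its documentation. The bulkDocsBody helper is hypothetical and only builds the JSON body for a _bulk_docs request, in which deletions are expressed as documents with _deleted set to true.

```javascript
const https = require("https");

// Reuse connections across requests instead of opening a new one per call.
// maxSockets is an illustrative value; tune it for your workload.
const keepAliveAgent = new https.Agent({ keepAlive: true, maxSockets: 50 });

// Hypothetical helper: build the body for POST /{db}/_bulk_docs from mixed
// operations. Creates may omit _id and _rev; updates and deletes need both.
function bulkDocsBody({ creates = [], updates = [], deletes = [] }) {
  for (const doc of [...updates, ...deletes]) {
    if (!doc._id || !doc._rev) {
      throw new Error("updates and deletes need _id and _rev");
    }
  }
  return {
    docs: [
      ...creates,
      ...updates,
      // A delete keeps only the identifying fields plus the _deleted flag.
      ...deletes.map(({ _id, _rev }) => ({ _id, _rev, _deleted: true })),
    ],
  };
}

// A single-document update is just a bulk request with one entry.
const single = bulkDocsBody({ updates: [{ _id: "cfg-1", _rev: "1-abc", enabled: true }] });
console.log(single.docs.length); // 1
```

Because one function covers creates, updates, and deletes, requests arriving within a short window can be collected into a single call rather than sent one by one.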

Conclusion

In this article, we have discussed best practices for improving the reliability and scalability of Cloudant for a multi-tenant service. These included organizing documents and databases, querying and deleting documents, and performance considerations. Organizing documents and databases is a key factor in achieving scalability, and optimizing performance is one of the most important factors in maintaining reliability. We also provided best practices for improving client-side performance and reducing load on the service.


Related topics


Published June 13, 2016