Contents


Optimize queries in Cloudant

Comments

Applications require more data flexibility than ever. The emergence of various NoSQL databases in recent years has made up for the deficiencies of traditional relational databases. Their more flexible data model better supports the application requirements for unstructured and semi-structured data. As a type of NoSQL DB product, IBM® Cloudant offers fully-managed database services for web and mobile applications, providing abundant features such as advanced indexing techniques, custom views, full-text search, and real-time queries. This article provides several methods and experiences from multiple perspectives for you to use to optimize your queries when operating Cloudant NoSQL DB. This helps you have a more in-depth understanding of the most suitable queries for each applicable scenario, so as to provide the most efficient data query service for your application.

Cloudant introduction

Cloudant is a commercial release of Apache CouchDB that provides a full IBM Operation & Maintenance Data Management platform for web and mobile applications. As a NoSQL database built on the cloud, Cloudant is ideal for the rapid growth of web and mobile applications. It allows users to take advantage of the cloud's availability, resiliency, and reach to enable applications to scale to even greater levels and to optimize for better data availability, endurance, and mobility. Cloudant's powerful indexing capabilities and the ability to push data to the edge of the network across multiple data centers and devices speed up its access and increase its fault tolerance. It allows users to access data anytime, anywhere.

Compared to other database products, Cloudant has the following features:

  • It supports copy distribution and replication. Database read and write access can be given to each node in the cloud which enables Cloudant to run on any scale of clusters and ensure data security and consistency.
  • It has continuous data synchronization between replicas. Cloudant supports syncing between multiple replicas and supports real-time, automatic synchronization. The continuous updating of data in one copy can be automatically synchronized to other associated copies.
  • It provides a variety of methods to optimize your query, including indexes and views. In terms of supporting view, Cloudant supports user-defined views. The language it relies on is JavaScript.
  • It provides data analysis and warehousing for JSON documents. Cloudant supports integration with the IBM Db2 Warehouse for traditional data warehousing and business intelligence analytics. Cloudant also supports integration with IBM Db2 Warehouse for online reporting and business intelligence analytics.
  • It supports storing location-based data in GeoJSON format.

Cloudant query optimization with index

In databases, frequently used data and related queries are indexed to speed up queries. Cloudant supports two types of indexes:

  • "type": "text"
  • "type": "json"

There is a significant difference between these two types of indexes related to the purpose of use and the method of use.

From the perspective of purpose of use, the index of the text type focuses on the specific content of Cloudant documents, rather than the structure of the document itself. So, if a user is not familiar with the structures of the documents in the database, or the structures of the documents vary a lot and their formats are complex, text index will be preferred. In contrast, the JSON index has high requirements on the structure of the document, and it is built on one or more specific fields. Therefore, if the user is familiar with the document structure in the database, he or she can choose to create a JSON index. By explicitly specifying the appropriate fields, one can optimize all queries that contain the specified field on Cloudant. From this perspective, the concept of JSON indexes is similar to that of indexes in traditional relational databases, both of which are optimized for specific columns or fields.

From the perspective of method of use, the text index can only be used in the Cloudant data search interface, of which the supported syntax is Apache Lucene Query Parse Syntax. In contrast, the JSON index can only be used in Cloudant's data query interface, allowing users to query with a JSON object in accordance with the Cloudant query parsing syntax. Interfaces of both indexing type are powerful and support various custom queries. Even for the text-typed indexes, Cloudant allows users to precisely define the range of data to be indexed, for example, by identifying the specific fields with a conditional judgement. Therefore, in most scenarios (when the document structure is relatively simple and clear), these two can be converted into each other. However, one can only use the text index and data search interface when the document structure is complex and not clear.

In terms of the speed of indexing, the indexing of text type is slower than that of JSON type when dealing with the same amount of data. The reason is, when creating a text index, Cloudant deals with not only the specified data structure, but also its content. In contrast, the JSON index is only concerned with the structure itself. In some cases, the text index can be used for full-text search of the entire database. The code is as simple as shown below:

Example 1. Creating full-text search using text index

{
    "type": "text",
    "index": { }
}

In addition, some of Cloudant's advanced features, such as sophisticated aggregations and geospatial-based calculations, can only be used in query interfaces based on JSON index. Therefore, the key to selecting which of these two types of index to use lies in the understanding of existing data.

Query optimization based on Cloudant view

In traditional relational databases, views are used to filter data, control access, and optimize queries. In Cloudant, views are mainly used to filter data and optimize queries, having not much to do with data access control. This section explains relevant information about views and query optimizations.

View and index

Cloudant automatically creates an index for data in the view the moment it is created. Therefore, there is no need to index data included in a created view. The index associated with a view only take effect on view-based queries and takes precedence over the above two index types.

The index content is updated incrementally and automatically when any one of the following three events occurs:

  • A new document is added to the database.
  • An existing document is deleted from the database.
  • An existing document in the database is updated.

When the design document where the view definition resides in is updated, the index for that view will have a full update. Moreover, if the design document is updated, even if the view itself is not updated, it will cause full update to all the views defined by it. When the data is large, the full update of views will be time-consuming and will reduce the performance of that database, which should always be avoided. When defining your view, try to split unrelated views into different design documents.

In a production environment, new views or full updates are sometimes inevitable due to business needs. If you have a large amount of data, to avoid long occupation of database resources which will slow down the entire system, you should not directly create or update the view in the production environment of Cloudant.

One way is to create a backup of the production database in a test environment, and establish a continuous replication. Then create or update the view in the backup database and wait until it is done. The backup database then needs to be synchronized back to the production Cloudant instance. For applications that require extremely fast response time, incremental updates to views can also have an effect on database performance because Cloudant maintains three copies of a view to increase query performance and it has to maintain the consistency of them. In fact, Cloudant provides some parameters (such as stable, stale, update) for the user to decide whether to receive stale data from only one fragment, whether to accept stale data, or whether to refresh the view in real time. Users are required to select proper parameters among these three options according to their own requirements. Cloudant official documents contain relevant introduction in detail, which will not be discussed here.

Map function

The view definition in Cloudant mainly includes two parts, the map function and the reduce function. The map function's input is documents. Within the function, we can read the value of each field in a document. After adding some relevant business logic, we can feed the converted document to reduce function through the emit function. Due to the fact that views will create relevant index by default and update the data in real time, some relatively complex data-level business logic can be placed in the map function, so that data processing can be more efficient.

When using the emit function, which is used to output the result of the map function, it is important to note that we should try to output only the necessary fields, rather than those useless fields or the entire document, to speed up view creation and refresh, and the view-based query. Cloudant official document provides a clear and understandable introduction to map function; we will not go into details here.

Reduce function

The introduction to reduce function in Cloudant official document is much too brief, and lacks auxiliary instructions with practical examples. Thus, it is hard for new users to implement complex reduce function independently. Therefore, an example will be given in this section to explain the features of reduce function and its specific implementation and underlying principle.

Reduce functions are passed three arguments, "keys", "values", and "rereduce". Often, reduce functions must handle two cases:

When rereduce is false:

  • "keys" is an array whose elements are arrays of the form [key, id], where key is a key emitted by the map function, and id identifies the document from which the key was generated.
  • "values" is an array of the values emitted for the respective elements in keys.

When rereduce is true:

  • "keys" is null.
  • "values" is an array of values returned by previous calls to the reduce function

Following is a specific example. This example assumes that documents stored in the database are files created by the users in a system. Each file is placed in its corresponding folder. Therefore, each record contains user_id, folder_id, file_id, and information like description and created_timestamp. Our goal is to create a query via the view's MapReduce function, which returns all file_id's by specifying certain folder_id and user_id (this feature can also be realized by other means in Cloudant, the following example is customized just for demonstrating the MapReduce function).

Example 2. MapReduce example – Map function

function (doc) {
  if((doc.obj_type === "file")
      && doc.file_id && doc.folder_id){
        var user_id = doc.user_id;
        if(user_id){
          emit([user_id, doc.folder_id], doc.file_id);
        }   
	}     
}

Example 2 is an implementation of map function. It converts each document into a key-value pair where key is a list composed of user_id and folder_id, and value is file_id.

Example 3. MapReduce example – Reduce function

function (keys, values, rereduce) {
  if (rereduce) {
    var Rresult = "";
    for(var j in values){
      var sub_value = values[j];
      var sa = values[j].split(",");
      for(var m in sa){
        if (Rresult.indexOf(sa[m])<0){
          if(Rresult == ""){
          Rresult =  sa[m];
        }else{
          Rresult = Rresult + "," + sa[m];
        }
        }
      }
    }
    return Rresult;
  } else {
    var result ="";
    for(var i in values){
      if(result.indexOf(values[i])<0){
        if(result == ""){
          result =  values[i];
        }else{
          result = result + "," + values[i];
        }
      }
    }
    return result;
  }
}

Example 3 is an implementation of reduce function. The code handles each case respectively when the rereduce is true or false. Detailed implementation process is shown in Figure 1. Step 1, execute the procedure, in which case the rereduce is false, that is, process the data at the L0 layer in Figure 1. The data at layer L0 is the output of our map function, also the input of the reduce function here. The input data at layer L1 is the output of layer L0, where the rereduce is false. In this process, values which share the same key will be concatenated into a string, with commas as the delimiter. For example, for the leftmost key-value pair at layer L1, the value is like "File1, File2, File3". After that, codes in the case where rereduce equals to true will be implemented iteratively. That is, we split the input strings with the same key by commas at layer L1, remove the duplicates, then concatenate them together separated by commas. The new string will be sent to layer L2. If there are more than one parent node at layer L2, repeat this process.

Figure 1. The Reduce function procedure

Based on the previous business logic of MapReduce function, Cloudant will preprocess relevant data, then send the outcome of map and reduce function to the B-tree of Cloudant index. Later on, if you execute a query based on this view, the related data will be returned quickly. In the use of view-based queries, two parameters, group and group_level, are frequently used.

group and group_level in view-based queries

Cloudant provides a number of optional parameters for view-based queries, such as starting and ending key values, passing multiple key values at once, specifying whether to execute the reduce function, or limiting or skipping some records. Among these parameters, the most powerful and error-prone parameters are group and group_level.

When a view-based query has no parameters, if reduce functions was defined in the view, the result is the result of the final output of the reduce function. In the example of the previous section, if you directly execute a view-based query with no parameters, the result will be like what is shown in Example 4.

Example 4. View-based query results with no parameter

Method: GET /$DATABASE/_design/$DDOC/_view/$VIEW-NAME
Request: None
Response:
{
    "rows": [
        {
            "key": null,
            "value": "F1,F2,F3,F4,F5"
        }
    ]
}

The result is a collection of all value strings connected by commas. The key returned in the result is null because the reduce function executed at the last step is when the rereduce equals to true. But our purpose is to get all file_id items grouped by user_id and folder_id. Therefore, the result does not match our needs. The parameter "group" equals false by default. If you include this parameter in your query, specifying it to be true, the result will look like Example 5.

Example 5. View-based query results with parameter group

Method: GET /$DATABASE/_design/$DDOC/_view/$VIEW-NAME?group=true
Request: None
Response:
{
    "rows": [
        {
            "key": ["U1","Fo1"],
            "value": "F1,F2,F3,F4,F5"
        },
        {
            "key": ["U2","Fo1"],
            "value": "F1,F2,F3,F4"
        },
        {
            "key": ["U2","Fo2"],
            "value": "F1,F3,F4"
        }
	]
}

The result will be grouped by the key output of the map function, then the reduce function performs the operation on value within each group. On this basis, if we pass in the specified key value, which is a combination of user_id and folder_id, the result of the query will only return the value corresponding to the key. Till now, we almost meet the needs mentioned previously.

Two prerequisites are required when using the parameter group_level. One is that the parameter group should be equal to true. The other is that the key must be a composite key. The so-called composite key refers to the key needing to be an array rather than a single value. The key used in the previous example is a composite key consisting of user_id and folder_id. When both of these premises are met, group_level can be specified as any integer value ranging from 1 to the length of the key array. The group_level value of n means that reduce will only group the value by the first n elements of the key array. Based on the previous example, Example 6 shows the query with group_level being 1:

Example 6. View-based query results with parameter group_level being 1

Method: GET /$DATABASE/_design/$DDOC/_view/$VIEW-NAME?group=true&group_level=1
Request: None
Response:
{
    "rows": [
        {
            "key": ["U1"],
            "value": "F1,F2,F3,F4,F5"
        },
        {
            "key": ["U2"],
            "value": "F1,F2,F3,F4"
        }
	]
}

Since we specified group_level as 1, the key array in the result contains only the first element of the composite key, user_id. Values that share a common user_id will be grouped together, regardless of the folder_id.

When group_level is 2, the result of the query is the same as that in Example 5. That is, when the parameter group is specified as true, the default value of group_level is the length of the key array.

In this case, readers may wonder how to group values only by the folder_id key. Currently, Cloudant's query interface does not support this kind of assignment. If you want to group values only by folder_id, one alternative is to modify the map function of this view or add a new view, so that the parameter folder_id is placed at the leftmost position of the composite key.

Using views to efficiently manipulate document references

It is not recommended to establish a variety of references between documents in NoSQL database, rather, we recommend storing the data in the referred documents in the form of redundant information. Even though, some complicated business logic still requires document references. Once an id-like field appears in a document, an additional query is usually required to get the data in the referred documents by using this id.

To improve the efficiency of the query on a referenced document, Cloudant view supports simultaneous return of the original document and the referenced document with only one single query. In the map function, the final output of the value contains "_id" field, and specifies the corresponding document id value. So, when querying the view, as long as the parameter include_docs is specified as true, the document corresponding to the specified id will be retrieved and returned. Here is an example. Example 7 is three pieces of data in the database.

Example 7. Source data of a sample of view optimization for document reference query

{"_id":"2sof3204234u","node_name":"Node 1","parent":"3sldfjsla"}
{"_id":"5ladsfjldd","node_name":"Node 2","parent":"3sldfjsla"}
{"_id":"3sldfjsla","node_name":"Node 3","parent":""}

Node 1 and Node 2 refer to the id of Node 3 as their parent. We now want to find the relevant parent when we query each node. First, we need to create a new view. The map function is shown in Example 8.

Example 8. Using map function to process inter-document reference

function(doc) {
    if (doc.parent) {
        emit(doc.node_name, { "_id": doc.parent });
    }
}

The reduce function for this view can be left blank. In the map function, value of the final output is an object, which contains only one id attribute, here we pass in the value of its parent attribute, which is its parent's id. Then we query data based on this view, the result is shown in Example 9.

Example 9. The query results after using view to process inter-document reference

Method: GET /$DATABASE/_design/$DDOC/_view/$VIEW-NAME?include_docs=true
Request: None
Response:
{  
   "total_rows":3,
   "offset":0,
   "rows":[  
      {  
         "id":"2sof3204234u",
         "key":"Node 1",
         "value":{  
            "_id":"3sldfjsla"
         },
         "doc":{  
            "_id":"3sldfjsla",
            "node_name":"Node 3",
            "parent":""
         }
      },
      {  
         "id":"5ladsfjldd",
         "key":"Node 2",
         "value":{  
            "_id":"3sldfjsla"
         },
         "doc":{  
            "_id":"3sldfjsla",
            "node_name":"Node 3",
            "parent":""
         }
      },
      {  
         "id":"3sldfjsla",
         "key":"Node 3",
         "value":{  
            "_id":""
         },
         "doc":null
      }
   ]
}

In the result returned above, the field "doc" contains the content of the referenced document. Node 1 and Node 2 both refer to Node 3's id. By using views, Node 1 and Node 2 contain Node 3's document data when being returned.

However, this kind of processing has certain requirements on the format of the document: the documents where the reference relationship exists should have the same or similar structure. If the structure is different, this will lead to the referenced document being filtered out, thereby not able to be included in the returned results. Example 10 modified the source data in Example 7 by removing the parent field from Node 3.

Example 10. Source data of a sample of view optimization for document reference query (modified version)

{"_id":"2sof3204234u","node_name":"Node 1","parent":"3sldfjsla"}
{"_id":"5ladsfjldd","node_name":"Node 2","parent":"3sldfjsla"}
{"_id":"3sldfjsla","node_name":"Node 3"}

The definition of the map function stays the same as in Example 8. When processing this data, since Node 3 does not contain a parent field, it is excluded by the map function, causing reference failure when data from other nodes being output. The final result is shown in Example 11.

Example 11. The query results after using view to process inter-document reference (after being modified)

Method: GET /$DATABASE/_design/$DDOC/_view/$VIEW-NAME?include_docs=true
Request: None
Response:
{  
   "total_rows":2,
   "offset":0,
   "rows":[  
      {  
         "id":"2sof3204234u",
         "key":"Node 1",
         "value":{  
            "_id":"3sldfjsla"
         },
         "doc":null
      },
      {  
         "id":"5ladsfjldd",
         "key":"Node 2",
         "value":{  
            "_id":"3sldfjsla"
         },
         "doc":null
      }
   ]
}

This section mainly elaborates on the indexes and reduce functions in the view. By combining the examples and the data on the Cloudant official website, readers can have a deeper understanding of how to create complex views.

Other optimizations related to Cloudant queries

Settings for tuning request and retry timeout values

Because Cloudant is a cloud-based service, the nodes where the Cloudant instance runs are typically shared with other instances. Therefore, it is inevitable that there will be resource competition and mutual influence between the databases in the same node. For example, if an instance of a node heavily consumes the node resources for a short period of time, other instances in the same node are bound to be affected, causing requests for data not to be responded to for a fairly long time.

In order to avoid a request delay or processing failure of our own instance caused by the instantaneous resource consumption of other instances in the same node, we need to enable the request timeout and retry mechanism of Cloudant. When the request is found to be unable to be returned in time, the request needs to be timed out by the timeout mechanism. Then, use the retry mechanism to resend the request. The request timeout and retry times need to be set according to the specific performance requirements after tuning the connection between the application system and the Cloudant environment multiple times.

In the software development kit provided by Cloudant, the request timeout and retry is disabled by default. Therefore, developers are required to explicitly specify it. Taking NodeJS as an example, Example 12 shows how to set the request timeout.

Example 12. An example of setting Cloudant request timeout in NodeJS

var Cloudant = require('cloudant');
var cloudant = Cloudant({account:me, 
password:password,
requestDefaults: { "timeout": 5000 }
});

In the Cloudant initialization function, create a requestDefaults property and specify the timeout as 5000 milliseconds. After the function is initialized, it will establish a connection to the Cloudant database. All requests sent through this connection will have their timeout set to 5 seconds.

Note that the Cloudant initialization function supports passing of a plugin called retry, which also supports setting the timeout and retry times but they are different from the concepts above. One of Cloudant DBaaS's pricing details specifies the maximum number of concurrent queries per second at certain price. Queries that exceed this maximum will be rejected with an HTTP 429 message returned. The retry plug-in can only set query retry times and timeout under this circumstance. By default, queries that exceed the maximum will be retried three times, with a timeout of 500 milliseconds.

Cloudant view and Join operation

In a traditional relational database, if there is a foreign key association between two tables, you can get the information from the two tables by one piece of query with join operation. However, in Cloudant, if data of two different document types are related, it is generally required to obtain the data of these two document types with two HTTP query requests.

Example 13. Sample data of Cloudant view and join operation

{
	"id":"file1",
	"name":"how to learn JS",
	"user_id":"user1",
	"doc_type":"file"
},
{
	"id":"file2",
	"name":"Thinking in Java",
	"user_id":"user1",
	"doc_type":"file"
},
{
	"id":"user1",
	"name":"John",
	"doc_type":"user"
}

For the data in Example 13, if you want to find out the user's information and its associated file information, usually two queries are required, one for the user information, and the other for the file information. However, by using a Cloudant view, only one query is needed to complete this target. Example 14 shows the definition of this view, which only needs the map function.

Example 14. Using a view to process join operation

function(doc){
	if(doc.type=="user"){
		emit([doc.id,0],doc)
	}else if(doc.type=="file"){
		emit(doc.user_id,1),doc
	}
}

In this function, a compound key is used, the first element is the id of the user. The second element differs under two circumstances. When the type of the document is user, the second element is 0; When its type is file, the second element is 1. The corresponding value of the composite key, then contains a specific document's information.

By default, Cloudant's view-based query results will be sorted by key. To be specific, ["abc", 2] comes after ["abc"] and ["abc", 1], but precedes ["abc", 2, "xyz"]. Based on this rule, we can achieve user information and file information of a certain user via one query. This is done by specifying startkey and endkey. For example, if you want to find user information and file information of a user whose id is "abc", you need to specify startkey as ["abc"] and endkey as ["abc", 2].

Using HTTP persistent connection or connection pool

In the software development kit provided by Cloudant, the application-level protocol is the HTTP protocol when establishing a connection between the client and Cloudant. The number of requests from a client application to Cloudant can be huge in a short period of time. Enabling HTTP persistent connection reuses the connection between the client and Cloudant, avoiding the process of establishing a connection each time a request is made, thereby speeding up the query to a certain extent. Example 15 shows how to enable a HTTP persistent connection in Node.js. You need to first install the agentkeepalive module using npm.

Example 15. Enabling a HTTP persistent connection in Node.js

// create custom HTTPS agent
var HttpsAgent = require('agentkeepalive').HttpsAgent;
var myagent = new HttpsAgent({
  maxSockets: 50, 
  maxKeepAliveRequests: 0,
  maxKeepAliveTime: 30000
});

// Setup Cloudant connection
var cloudant = require('cloudant')({ 
  url: 'https://myusername:mypassword@myaccount.cloudant.com', 
  "requestDefaults": { "agent" : myagent }
});

Noted that, in some clients (such as the Apache Http Client in Java), HTTP persistent connection is achieved through the HTTP connection pool, and its role is equivalent.

Summary

As a representative document database and NoSQL database, Cloudant differs from traditional relational databases in terms of usage, especially in terms of data query and optimization. The methods mentioned in this article do not cover every situation and need to be combined with specific data and usage scenarios. As data volume increases and data complexity increases, users will have higher requirements for Cloudant's query efficiency, also, ways to optimize Cloudant queries will increase. We also encourage further discussion on this topic among our readers, so as to make queries in Cloudant more efficient.


Downloadable resources


Related topics


Comments

Sign in or register to add and subscribe to comments.

static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Data and analytics
ArticleID=1061798
ArticleTitle=Optimize queries in Cloudant
publish-date=06112018