Exploring CouchDB

A document-oriented database for Web applications

Relational databases define a strict structure and provide a rigid way to maintain data for a software application. Apache's open source CouchDB offers a different method of storing data, in what is referred to as a schema-free document-oriented database model. Instead of the highly structured data storage of a relational model, CouchDB stores data in a semi-structured fashion, using a JavaScript-based view model for generating structured aggregation and report results from these semi-structured documents. CouchDB has been developed from the ground up with Web applications as the primary focus and has its sights on becoming the de facto database for Web application development.

Joe Lennon, Software developer, Core International

Joe Lennon is a 23-year-old software developer from Cork, Ireland. Joe currently works as a Web application and Oracle PL/SQL developer for Core International, having graduated from University College Cork in 2007 with a degree in Business Information Systems. He lives in Cork with his girlfriend, Jill.



31 March 2009


What is CouchDB?

CouchDB is an open source document-oriented database-management system, accessible using a RESTful JavaScript Object Notation (JSON) API. The term "Couch" is an acronym for "Cluster Of Unreliable Commodity Hardware," reflecting the goal of CouchDB being extremely scalable, offering high availability and reliability, even while running on hardware that is typically prone to failure. CouchDB was originally written in C++, but in April 2008, the project moved to the Erlang OTP platform for its emphasis on fault tolerance.

CouchDB can be installed on most POSIX systems, including Linux® and Mac OS X. Although Windows® isn't currently officially supported, work is under way on an unofficial binary installer for the Windows platform. CouchDB can be installed from source or, where available, it can be installed using a package manager (e.g., MacPorts on Mac OS X).

CouchDB is a top-level Apache Software Foundation open source project, released under V2.0 of the Apache license. This open source license allows the source code to be used and modified for use in other software, so long as the copyright notice and disclaimer are preserved. Like most open source licenses, it allows the software to be used, modified, and distributed by users as required. Any modifications do not have to be made available under the same license, so long as a notice of use of Apache-licensed code is maintained.


Differences between a document-oriented and a relational database

For many, the concept of a document-oriented database-management system is difficult to grasp at first, especially if they have been working with relational database-management systems for a long time. The reason behind this is that similarities in the two models are few and far between.

A document-oriented database is, unsurprisingly, made up of a series of self-contained documents. This means that all of the data for the document in question is stored in the document itself, not in a related table as it would be in a relational database. In fact, there are no tables, rows, columns or relationships in a document-oriented database at all. This means that they are schema-free; no strict schema needs to be defined in advance of actually using the database. If a document needs to add a new field, it can simply include that field, without adversely affecting other documents in the database. This also means that documents do not have to store empty data values for fields they do not have a value for.

The forthcoming book CouchDB: The Definitive Guide uses the example of a business card as a "real-world document" and describes how it would be modeled in a document-oriented database as opposed to a relational one. In the relational database, you could have four or more tables to house this data: one for "Person," one for "Company," one for "Contact Details," and one for the business card itself. These would all have strictly defined columns and keys, and the data would be assembled using a series of joins.

While this provides the advantage of having a single point of truth for each piece of data, it makes the structure rigid and difficult to modify at a later stage if required. It also means that the record cannot be adapted for different circumstances. For example, one person may have a fax number, while another may not. On a business card, you would not show "Fax: None"; you would simply omit the fax details.

In a document-oriented database, each business card would be held in its own document, each of which can define the fields it wishes to use. So the person without a fax number does not need to define a fax value, whereas the person with a fax number is free to do so as they please.
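
As a rough illustration of the business card example, the sketch below models two cards as plain JavaScript objects. The field names, IDs, and the renderCard helper are all hypothetical and chosen only for this illustration; they are not taken from CouchDB or from the book.

```javascript
// Two hypothetical business-card documents; every field name here is illustrative.
const cardWithFax = {
  _id: "card-jane-doe",
  name: "Jane Doe",
  company: "Example Ltd.",
  phone: "+353 21 555 0100",
  fax: "+353 21 555 0101",
};

const cardWithoutFax = {
  _id: "card-john-smith",
  name: "John Smith",
  company: "Example Ltd.",
  phone: "+353 21 555 0102",
  // No fax field at all, rather than a fax field holding null or "None".
};

// Hypothetical helper: prints only the fields a given document actually has.
function renderCard(card) {
  return Object.keys(card)
    .filter((field) => !field.startsWith("_")) // skip metadata such as _id
    .map((field) => `${field}: ${card[field]}`)
    .join("\n");
}

console.log(renderCard(cardWithFax));
console.log(renderCard(cardWithoutFax));
```

Neither document had to declare its shape in advance, and the second card carries no placeholder for the missing fax number.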

Another difference between these types of databases is in the storage of unique identifiers. It is common for relational databases to use the concept of primary keys, generated by an auto-increment feature or by a sequence generator. Of course, these identifiers are only unique for the table or database they are used on; they can be reused by other tables and databases. If an update operation is made at the same time on two databases on separate networks, they cannot both accurately retrieve the next unique identifier. CouchDB does not come with an auto-increment or sequence feature. Instead, every document has a unique ID, and when the application does not supply one, CouchDB assigns a Universally Unique Identifier (UUID), making it almost impossible for another database to accidentally select the same unique identifier.
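
To make the contrast with sequence-generated keys concrete, here is a minimal sketch of generating an identifier from 128 random bits in Node.js. It is an assumption-laden illustration: newDocumentId is a made-up name, and the code does not claim to reproduce the algorithm CouchDB itself uses.

```javascript
const { randomBytes } = require("crypto");

// Hypothetical helper: 16 random bytes rendered as 32 hexadecimal characters.
// No coordination with any other database is needed to produce it.
function newDocumentId() {
  return randomBytes(16).toString("hex");
}

// Two "databases" on separate networks can each call this independently.
const idFromServerA = newDocumentId();
const idFromServerB = newDocumentId();

console.log(idFromServerA, idFromServerB);
```

Unlike an auto-increment counter, nothing has to be shared between the two callers for their identifiers to differ in practice.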

Another key difference between document-oriented and relational databases is that document-oriented databases do not support joins. This is a consequence of there being no primary and foreign keys in CouchDB, so there are no keys to base joins on. This does not mean you cannot retrieve a set of related data from a CouchDB database. A feature called a view allows you to create an arbitrary relation between documents that is not actually defined in the database itself. This means you can get many of the benefits of typical SQL join queries without the burden of predefining their relationships in the database layer.

It is important to note that although document-oriented databases operate in a different manner to relational databases, this does not make them a drop-in replacement. CouchDB does not set out to replace relational databases but, rather, offers an alternative for those projects where a document-oriented model is a better fit than a traditional relational database, such as wikis, blogs and document-management systems.


How CouchDB works

CouchDB is built on a powerful B-tree storage engine, which is responsible for keeping the data in CouchDB sorted and provides a mechanism for searching, inserting, and deleting in logarithmic amortized time. CouchDB uses this engine for all internal data, documents, and views.

Because of the schema-free manner in which the database is structured, CouchDB is dependent on the use of views to create arbitrary relationships between documents, and to provide aggregation and reporting features. The results of these views are computed using Map/Reduce, a model for processing and generating large data sets using distributed computing. The Map/Reduce model was introduced by Google, and can be broken up into the Map step and the Reduce step. In the map step, the document is taken by the master node and the problem is divided into subproblems. These subproblems are then distributed to worker nodes, which solve the problem and return the results to the master node. In the reduce step, the master node takes the results received from the worker nodes and combines them to get an overall result and answer to the original problem.

The Map/Reduce functions in CouchDB produce key/value pairs, allowing CouchDB to insert them into the B-tree engine, sorted by their keys. This allows for ultra-efficient lookups by key and enhances the performance of operations within the B-tree. In addition, this also means that the data can be partitioned over many nodes without interfering with the ability to query each node individually.
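
The benefit of keeping emitted pairs sorted by key can be sketched in a few lines. In the example below a sorted array stands in for the B-tree, and the sample rows, lowerBound, and range are hypothetical names introduced for illustration; a real B-tree is organized differently but offers the same kind of logarithmic lookup.

```javascript
// Rows as a view's map function might emit them: [key, value] pairs (sample data).
const emitted = [
  ["pear", 3],
  ["apple", 5],
  ["cherry", 2],
  ["banana", 7],
];

// Keep the rows ordered by key. A sorted array stands in for the B-tree here.
const index = [...emitted].sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0));

// Binary search: position of the first row whose key is >= the target key.
function lowerBound(rows, key) {
  let lo = 0;
  let hi = rows.length;
  while (lo < hi) {
    const mid = (lo + hi) >>> 1;
    if (rows[mid][0] < key) lo = mid + 1;
    else hi = mid;
  }
  return lo;
}

// All rows with startKey <= key <= endKey, found without scanning the whole index.
function range(rows, startKey, endKey) {
  const out = [];
  for (let i = lowerBound(rows, startKey); i < rows.length && rows[i][0] <= endKey; i++) {
    out.push(rows[i]);
  }
  return out;
}

console.log(range(index, "banana", "cherry")); // [ [ 'banana', 7 ], [ 'cherry', 2 ] ]
```

Because the rows are ordered, a key or key-range query only has to locate the starting position and then read forward.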

Traditional relational database management systems sometimes use locking to manage concurrency, preventing clients from accessing data while another client is updating that data. This prevents multiple clients from making changes to the same set of data at the same time, but in situations where there are many clients using the system concurrently, it is quite common that the database can get bogged down in sorting out which client should receive the lock and maintaining the order of the lock queue. In CouchDB, there is no locking mechanism. Instead, it uses a method referred to as multiversion concurrency control (MVCC), where each client is provided with a snapshot of the latest version of the database. This means that no changes are seen by other users until the transaction has been committed. Most modern databases have started to move from locking mechanisms to MVCC, including Oracle (since V7), MySQL (when used with InnoDB) and Microsoft® SQL Server 2005 and later.


Documents: The building blocks of a CouchDB database

CouchDB databases store uniquely named documents and provide a RESTful JSON API that allows applications to read and modify these documents. All data in a CouchDB database is stored in a document, and each document can be made up of any number of fields. This means that each document can have fields that are not defined in other documents. In other words, the documents aren't bound to a strict database schema.

Each document also contains metadata (data about the data), such as a unique document ID and revision numbers. Document fields can contain various types of data, such as text strings, numbers, boolean (true/false) values, etc. Fields are not bound by a size limit. Each field must be given a unique name (documents cannot have two fields with the same name).

When changes are made to a CouchDB document, the changes are not actually appended to the existing document, but rather, a new version of the entire document, called a revision, is created. This means that a full history of document modifications is maintained automatically by the database. The document-revision system works in much the same way as a wiki or Web-based document-management system manages revision control, except that it is all handled automatically at the database level.

CouchDB does not feature locking mechanisms; two clients can load and edit the same document at the same time. However, if one client saves their changes, the other client will receive an edit conflict when they try to save back their changes. This conflict can be resolved by loading the newer version of the document, reapplying the edits and trying to save again. CouchDB maintains data consistency by ensuring that document updates are all or nothing — it either works or it fails. There will never be a partially saved document in the database.
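
The edit-conflict cycle described above can be modeled with a toy in-memory store. This is only a sketch of the idea: ToyStore, its simple counter-based revisions, and the shape of its error object are assumptions made for this example, not CouchDB's implementation or its exact responses.

```javascript
// Toy in-memory store. NOT CouchDB's implementation; revisions are plain counters.
class ToyStore {
  constructor() {
    this.docs = new Map(); // id -> latest saved document (frozen)
  }

  // Hand each client its own editable copy of the latest revision.
  get(id) {
    const doc = this.docs.get(id);
    return doc ? { ...doc } : undefined;
  }

  // A save succeeds only if the caller's _rev matches the stored revision.
  // The whole document is replaced or nothing is: there is no partial save.
  put(doc) {
    const current = this.docs.get(doc._id);
    const currentRev = current ? current._rev : undefined;
    if (doc._rev !== currentRev) {
      return { error: "conflict", reason: "Document update conflict." };
    }
    const nextRev = (currentRev || 0) + 1;
    this.docs.set(doc._id, Object.freeze({ ...doc, _rev: nextRev }));
    return { ok: true, id: doc._id, rev: nextRev };
  }
}

const store = new ToyStore();
store.put({ _id: "apple", colour: "green" });

// Two clients load the same document at the same time.
const clientA = store.get("apple");
const clientB = store.get("apple");

clientA.colour = "red";
const first = store.put(clientA); // accepted

clientB.colour = "yellow";
const second = store.put(clientB); // rejected: clientB holds a stale revision

// Resolve the conflict: reload, reapply the edit, save again.
const fresh = store.get("apple");
fresh.colour = "yellow";
const retry = store.put(fresh); // accepted

console.log(first, second, retry);
```

No lock is ever taken; the second writer simply learns at save time that its copy is out of date.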


Views: Getting useful information out of CouchDB

CouchDB is unstructured in nature, and while its lack of a strict schema provides benefits in terms of flexibility and scalability, it can be quite difficult to actually use in real-world applications. When you think of a relational database, you can see that for everyday applications, the relationship between strictly defined tables is important to give meaning to the data. However, when high performance is required, materialized views are created to de-normalize the data. In many ways, the document-oriented database does things the other way around. It stores its data in a flat address space, much like a completely de-normalized data warehouse. It then provides a view model to add structure to the data so it can be aggregated to provide useful meaning.

Views in CouchDB are created on demand and can be used to aggregate, join, and report on documents in the database. They are built dynamically and have no effect on the documents in the database. Views are defined in design documents and can be replicated across instances. These design documents contain JavaScript functions that run queries using the concept of Map/Reduce. The view's map function takes the document as an argument and performs a series of computations to determine what data should be available via the view. If a view has a reduce function, it is used to aggregate the results. It is passed a set of keys and values, and it combines them into a single value.

Listing 1 is an example of the map and reduce functions of a CouchDB view that counts how many documents have an attachment and how many do not.

Listing 1. A typical CouchDB view
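// emit() and sum() are supplied by CouchDB's JavaScript view server.
map: function(doc) {
  // Emit one row per document, keyed by whether it has attachments.
  if (doc._attachments) {
    emit("with attachment", 1);
  } else {
    emit("without attachment", 1);
  }
}
// Add up the 1s emitted for the rows being reduced.
reduce: function(keys, values) {
  return sum(values);
}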

CouchDB views can be permanent views stored inside a design document, or temporary views executed on demand. Temporary views are resource-intensive and become slower as the amount of data stored in the database increases. For this reason, CouchDB views should, for the most part, be created in design documents.
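
To see what the view in Listing 1 computes, its two functions can be run over a handful of sample documents in plain JavaScript. This is a simplified sketch, not how CouchDB executes views: the emit and sum functions below are local stand-ins for the helpers CouchDB provides, the documents are invented, and the grouping loop only approximates what querying the view with grouping by key would return.

```javascript
// Local stand-ins for helpers that CouchDB's view server normally provides.
const rows = [];
function emit(key, value) {
  rows.push({ key, value });
}
function sum(values) {
  return values.reduce((total, n) => total + n, 0);
}

// The map and reduce functions from Listing 1.
function map(doc) {
  if (doc._attachments) {
    emit("with attachment", 1);
  } else {
    emit("without attachment", 1);
  }
}
function reduce(keys, values) {
  return sum(values);
}

// Hypothetical sample documents.
const docs = [
  { _id: "a", _attachments: { "photo.png": {} } },
  { _id: "b" },
  { _id: "c", _attachments: { "notes.txt": {} } },
];

// Map step: run the map function once per document.
docs.forEach(map);

// Group the emitted values by key, then run the reduce step on each group.
const groups = new Map();
for (const { key, value } of rows) {
  if (!groups.has(key)) groups.set(key, []);
  groups.get(key).push(value);
}
const result = {};
for (const [key, values] of groups) {
  result[key] = reduce(Array(values.length).fill(key), values);
}

console.log(result); // { 'with attachment': 2, 'without attachment': 1 }
```

Each document contributes exactly one row, so the reduced value for a key is simply the number of documents that emitted it.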


The RESTful JSON API

CouchDB offers an API as a means to read and modify data in the database. This API is accessible via standard HTTP requests (GET, POST, PUT, and DELETE) and returns data in the form of JavaScript objects using JSON. This makes it easy to perform database operations, whatever language your application is developed in. You could simply use a JavaScript framework with Ajax request objects, such as Prototype, jQuery or ExtJS; there's no need for a server-side language for your Web applications.

For the sake of simplicity and to illustrate the raw JSON responses issued by the API, we will use the command-line tool, curl, in this article. This allows you to issue GET, POST, PUT, and DELETE requests, and displays the raw HTTP response received from the Web server (in this case a locally installed CouchDB server).

The request curl http://127.0.0.1:5984/ returns the following response: {"couchdb":"Welcome","version":"0.8.1-incubating"}. This API call is a simple GET request, and the response tells us which version of CouchDB is installed. To explicitly define the type of request being made, we will use the -X parameter of curl, as in the following request: curl -X GET http://127.0.0.1:5984/_all_dbs. This returns the following result: [].

In this example, we have requested the URI of a special CouchDB resource that returns a list of all databases on the CouchDB server. If we had actually created any databases, this would have returned an array of database names. In this instance, it returned an empty JavaScript array. Now we will create a couple of databases, and the next time we run this request, we should see a different result: curl -X PUT http://127.0.0.1:5984/fruit.

The response received is: {"ok":true}. We then issue a second request: curl -X PUT http://127.0.0.1:5984/vegetables and receive the same response as before. Now we're going to request a list of databases once again: curl -X GET http://127.0.0.1:5984/_all_dbs. This time around, we get a better result than before: ["fruit","vegetables"].

When we created these databases, each response contained an attribute "ok" with a Boolean value of true. This indicates that the operation was successful. But what happens if things don't go according to plan? To try and trip up the CouchDB server, let's try to create a database with the same name as one of the existing databases: curl -X PUT http://127.0.0.1:5984/fruit.

This time around, we get the following response: {"error":"database_already_exists","reason":"Database \"fruit\" already exists."}.

As you can see, CouchDB has tried to create the database, but encountered an error in the process. As a result it has returned an attribute "error" with the value of the error code it encountered — in this case, "database_already_exists." Of course, in a real-world application, we would perform a check for the existence of an error attribute in all responses from the CouchDB server and display a user-friendly error message based on the error code found.
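
A check of this kind might look like the following sketch. The interpretResponse function and the friendlyMessages table are hypothetical names invented for this example, and the only error code handled is the one seen above; a real application would cover whichever codes it expects to encounter.

```javascript
// Hypothetical table mapping CouchDB error codes to user-friendly text.
const friendlyMessages = {
  database_already_exists:
    "A database with that name already exists. Please choose another name.",
};

// Parse a raw response body and report either the data or a readable error.
function interpretResponse(rawBody) {
  const body = JSON.parse(rawBody);
  const isObject = body !== null && typeof body === "object" && !Array.isArray(body);
  if (isObject && "error" in body) {
    return {
      ok: false,
      message:
        friendlyMessages[body.error] ||
        `The database reported an error: ${body.error}`,
    };
  }
  return { ok: true, data: body };
}

// The two responses received for PUT /fruit earlier in this section.
const created = interpretResponse('{"ok":true}');
const duplicate = interpretResponse(
  String.raw`{"error":"database_already_exists","reason":"Database \"fruit\" already exists."}`
);

console.log(created);
console.log(duplicate.message);
```

Array responses, such as the list returned by _all_dbs, pass through unchanged because they can never carry an error attribute.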

Let's say we no longer need the vegetables database and we want to remove it. To delete a database in CouchDB, we simply issue a DELETE HTTP request with the name of the database appended to our base URI, as in the following example: curl -X DELETE http://127.0.0.1:5984/vegetables. This gives us the same successful response as the PUT requests performed earlier. Now we will retrieve a list of databases using: curl -X GET http://127.0.0.1:5984/_all_dbs. This gives us the following response: ["fruit"].

What good is a database if it doesn't have any data? The request in Listing 2 creates a document called "apple."

Listing 2. Creating a document called apple
curl -X PUT http://127.0.0.1:5984/fruit/apple \
-H "Content-Type: application/json" -d '{}'

The server responds with the following: {"ok":true,"id":"apple","rev":"1801185866"}.

Now that we have created a document, let's retrieve it back from the database: curl -X GET http://127.0.0.1:5984/fruit/apple. CouchDB responds with the following: {"_id":"apple","_rev":"1801185866"}.

The final API call we'll make is one that retrieves information about a particular database — in this instance, the fruit database: curl -X GET http://127.0.0.1:5984/fruit. The response received from the server tells us some interesting information about the database.

Listing 3. Response from the server
{"db_name":"fruit","doc_count":1,"doc_del_count":0,"update_seq":1,
"compact_running":false,"disk_size":14263}

In this section we have explored several of the methods available to us via the RESTful JSON API that CouchDB provides. In the real world, we would not be hand-coding these HTTP requests, but our programming or scripting language of choice would do this for us.
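
As one possible shape for such a helper, the sketch below builds descriptions of the same requests issued with curl in this section. The couchRequest function and its argument layout are assumptions made for illustration, not part of CouchDB or of any client library; the base URI is the local server used throughout this article.

```javascript
// Local CouchDB server used throughout this article.
const BASE_URI = "http://127.0.0.1:5984";

// Hypothetical helper: describe a request without sending it.
// Path segments are URL-encoded; a body, if given, is sent as JSON.
function couchRequest(method, pathSegments, body) {
  const url = [BASE_URI, ...pathSegments.map(encodeURIComponent)].join("/");
  const request = { method, url, headers: {} };
  if (body !== undefined) {
    request.headers["Content-Type"] = "application/json";
    request.body = JSON.stringify(body);
  }
  return request;
}

// The requests made with curl in this section.
const listDbs = couchRequest("GET", ["_all_dbs"]);
const createDb = couchRequest("PUT", ["fruit"]);
const dropDb = couchRequest("DELETE", ["vegetables"]);
const createDoc = couchRequest("PUT", ["fruit", "apple"], {});

console.log(createDoc);

// Sending one of them, assuming a runtime with a global fetch (for example
// Node 18 or later) and a CouchDB server listening on BASE_URI:
// fetch(createDoc.url, createDoc).then((res) => res.json()).then(console.log);
```

Keeping request construction separate from sending makes the helper easy to inspect and test without a running server.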

CouchDB also includes a Web application called Futon, which can be used as a CouchDB administration tool, allowing you to maintain databases, documents and document revisions to your heart's content. If you have installed CouchDB on your local machine to the default 5984 port, you can access Futon by pointing your browser to: http://127.0.0.1:5984/_utils/. Figure 1 illustrates Futon in action.

Figure 1. The CouchDB Futon utility in action

The RESTful API provided by CouchDB may be daunting at first for those used to relational database-management systems, but it provides a unique and extremely accessible means of interacting with the database. Traditional database systems typically required a database connection to be made using some form of SQL client, which would then accept a series of SQL statements to perform Create, Read, Update, Delete (CRUD) operations. Thanks to the RESTful JSON API, users can connect to CouchDB using any software that supports HTTP. Indeed, most modern programming and scripting languages provide some form of interface to HTTP, meaning that CouchDB can be used in almost any development project.


Summary

It's still early days as far as the Apache CouchDB project is concerned. CouchDB is still very much considered alpha software. With that said, CouchDB is becoming increasingly popular in Web applications, iPhone applications, and Facebook applications. Up until now, powerful wiki, blog, discussion forum and document-management software has worked around relational databases to make them store document-oriented data as efficiently as possible. As more stable releases of CouchDB become available, however, it becomes a more attractive proposition as the underlying database option for these types of software, taking the pain out of document-revision management and continually changing schema requirements.

Generally speaking, the reaction to CouchDB until now has been mostly positive, although many repeatedly feel the need to argue on blogs and forums as to which is better: relational or document-oriented. The simple fact of the matter is CouchDB never claimed to be a replacement for relational databases or that it was going to become the new standard for database development. Of course, in many scenarios, the pure simplicity of CouchDB cannot match up to the power offered by the likes of DB2 and Oracle. But in many other scenarios, database simplicity is exactly what is needed, and traditional RDBMS offerings are far too bloated and resource-intensive for the job at hand.
