Data mining in a document world

Predictive analytics, business intelligence, and data mining in general all require the storage and processing of complex and often wildly different data structures as the information is collected, resolved, and summarized. It is highly likely, particularly for business and financial information, that a significant amount of that data comes from relational databases. These follow a strict structure and require a significant amount of up-front preparation in terms of designing your schema and data model. The new breed of NoSQL and document-based databases makes much of this processing simpler because you can create and store information in a flexible format, and then develop methods to extract that data in whatever fixed format you require. In this article, I look at how to use document-based databases for data processing and analytics as part of your overall database solution.

Martin Brown, VP of Technical Publications, Couchbase

A professional writer for over 15 years, Martin 'MC' Brown is the author of and contributor to over 26 books covering an array of topics, including the recently published Getting Started with CouchDB. His expertise spans myriad development languages and platforms: Perl, Python, Java, JavaScript, Basic, Pascal, Modula-2, C, C++, Rebol, Gawk, Shellscript, Windows, Solaris, Linux, BeOS, Microsoft® WP, Mac OS, and more. He is a former LAMP Technologies Editor for LinuxWorld magazine and is a regular contributor to ServerWatch.com, LinuxPlanet, ComputerWorld, and IBM developerWorks. He draws on a rich and varied background as a founder member of a leading UK ISP, systems manager and IT consultant for an advertising agency and Internet solutions group, technical specialist for an intercontinental ISP network, and database designer and programmer, and as a self-confessed compulsive consumer of computing hardware and software. MC is currently the VP of Technical Publications and Education for Couchbase and is responsible for all published documentation, training program and content, and the Couchbase Techzone.



12 February 2013


Document database architecture

One of the key elements of all document databases is that they can handle and work with much larger structures and datasets than a traditional RDBMS. In particular, because of their distributed nature and the different way in which they store data physically, they are ideal where there is a vast quantity of data to be processed, as is often the case in data mining.

Those benefits are obvious and documented elsewhere; this article focuses on the structure and format of the information and the techniques that are used to process and report on the stored information.


Flexible data structure

Document databases have an (almost) infinitely flexible structure that provides a number of different key areas of functionality:

  • Schema-less: Document databases do not have to predefine the structure of the data that must be stored in the database. In a traditional RDBMS, you specify the structure of the tables in which the data is stored and try to predict the content, possible values, and structure of the information. With a document database, you can store information in documents without having to worry about the structure, whether you have multiple fields, or even, in most cases, what the one-to-many and many-to-many relationships are. You can instead concentrate on the content of the information itself. This can make it much easier to store the raw material and information, even when it comes from disparate sources. More flexible still, it also means that you can combine and process information of different types and structures. For example, processing textual data is difficult to achieve with a traditional RDBMS because you need to ensure that the structure (number of sentences, paragraphs, and so on) is flexible enough to support the incoming information. More explicitly, imagine collating the data from Twitter, Facebook, and other social sources and looking for patterns. Twitter information has a fixed length, with the information embedded into a single, small string. Facebook separates out elements of the information (text, location, and individuals). Collecting that information and unifying it would take a significant amount of processing to place it into a rigid structure, whereas a document database can hold each record exactly as it arrives (see the sample documents after this list).
  • Logical objects: Most RDBMS solutions are used to model information that might normally be in a (relatively) structured format. Then SQL and joins are used to mold that information into an object that is used internally. You might individually look at different elements of the overall data structure, but often the information is combined and reported on according to the object that collates all of the data.

    From a more complex perspective, we often slice and dice the different elements of the data in different ways, even though in reality we are still just picking out elements of that overall structure. The document structure changes that perspective: rather than looking at individual, discrete data points, you look at objects as a whole. Tracking information about data collectors, for example, might require all of the information about each collector to be kept in one place, even though different data collectors might have different sensors, different numbers of sensors, and different levels of complexity.
  • Migratory structure: Data changes over time, sometimes slowly and sometimes quickly. Modifying a data structure is a complex process that not only affects the database that you use, but also requires changes to the applications that access and use that information. With a rigid, table-based structure, adapting the structure for new versions and different formats of the original data is difficult and complex. Either you need to create a new table or modify the existing table to cope with the new structure, and that means converting all of the previously created records to match the new structure. With a document database, the structure of the documents can be modified; in fact, the structure of individual documents can differ from each other. Because you are always dealing with entire documents, your application is unlikely to need changes until it has to process the new data.
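
To make the schema-less point concrete, here is a sketch of how records from two different social sources might be stored side by side exactly as they arrive. The field names are hypothetical and only loosely modeled on the real services:

{
  "source": "twitter",
  "text": "Loving the new coffee machine in the office!",
  "created": "Tue Nov 30 01:40:30 2010"
}

{
  "source": "facebook",
  "text": "Loving the new coffee machine",
  "location": "Cambridge, UK",
  "people": ["Alice", "Bob"],
  "created": "Tue Nov 30 01:41:02 2010"
}

Neither document had to be declared in advance, and a later record with additional fields can sit alongside these without any migration step.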

Given all of this, what does it mean in practice to collect, extract, and process that information?

The first thing to consider is the format of the data itself. Document databases can store any information, but the most widely used structural format is JSON, an object notation format from the JavaScript language. It allows you to store strings, numbers, arrays, and record (hash) data, and combinations of those core types.

For the purposes of understanding the basics of document processing, this article uses fairly simple data. Listing 1 is a document that tracks the water level and temperature from an external water tank:

Listing 1. Document tracking water level and temperature for an external water tank
{
  "datestring": "Tue Nov 30 01:40:30 2010",
  "hour": 1,
  "min": 40,
  "waterlevel": 96,
  "day": 30,
  "mon": 11,
  "year": 2010,
  "temperature": "28.64"
}

The individual date components have been stored for flexibility, although they are entirely optional. The water level and temperature are stored as the raw values.

A different logger tracks the temperature of a water tank at three different points and represents its data as a hash of the different values (see Listing 2):

Listing 2. Document tracking temperature of tank at three different points
{
  "datestring": "Tue Nov 30 02:06:21 2010",
  "temperature": {
    "mid": 23.2148953489378,
    "top": 23.6984348277329,
    "bot": 23.0212211444848
  }
}

Actually getting the data into a document database such as Hadoop or Couchbase Server is probably the easiest part of the process. Compared with an RDBMS, there is no processing or construction of the data, or of a structure to hold it. We don't need to analyze the data to identify a structure; we can just store the data in its raw form.
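
As a minimal sketch, storing one of those raw readings in CouchDB needs nothing more than an HTTP POST of the JSON document; the host, port, and database name (sensordata) here are assumptions, and Couchbase Server would use its own client library or REST API instead:

// Store a raw reading as-is; CouchDB assigns a document ID and returns {ok, id, rev}
const reading = {
  datestring: new Date().toString(),
  waterlevel: 96,
  temperature: 28.64
};

fetch("http://localhost:5984/sensordata", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify(reading)
})
  .then((res) => res.json())
  .then((body) => console.log("Stored document:", body))
  .catch((err) => console.error("Store failed:", err));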

It is the processing during extraction that makes document-based data mining powerful.


Exchanging data

If you have existing data within a traditional RDBMS, such as IBM DB2, you can use a document database to simplify and unify that disparate data into documents, and then process the documents together to take advantage of the common format.

You might think it's wrong to perform this operation: If it's in a database already, why move it? But RDBMS solutions have been used for many years to store textual information and different versions and revisions of tabular data. A document database can be an effective way of unifying this into a structure that can be processed using map/reduce and other techniques.

The most straightforward process is to load your objects as they are formatted and structured within the database. This is easy if you are using an ORM system to model your data into an object. Outside of that, you can perform the process by hand. The script in Listing 3 performs the operation by taking a complex component record, loaded through a function that compiles the individual SQL statements into an internal object, formatting it as JSON, and then writing it to a document database (in this case, CouchDB):

Listing 3. Operation to load your objects
# Assumes the usual setup elsewhere: %{$products} loaded from the RDBMS,
# $fw (framework handle), $ua (an LWP::UserAgent), and $base (the database URL)
use HTTP::Request;
use LWP::UserAgent;
use JSON qw(to_json);

foreach my $productid (keys %{$products})
{
    my $product = new Product($fw,$productid);

    # Use a cleaned-up version of the title as the document ID
    my $id = $product->{title};
    $id =~ s/[ ',\(\)]//g;

    my $record = {
        _id => $id,
        title => $product->{title},
        componentcount => $product->{componentcount},
        buildtime => $product->{metadata_bytag}->{totalbuildtime},
        testtime => $product->{metadata_bytag}->{totaltesttime},
        totaltime => $product->{metadata_bytag}->{totaltime},
        keywords => [keys %{$product->{keywordbytext}} ],
    };

    # Embed each component as a sub-record of the product document
    foreach my $component (@{$product->{components}})
    {
        push(@{$record->{component}},
             {
                 componentqty => $component->{'qty'},
                 component => $component->{'componentdesc'},
                 componentcode => $component->{'componentcode'},
             }
            );
    }

    # Write the document to the database as JSON
    my $req = HTTP::Request->new('POST' => $base);
    $req->header('Content-Type' => 'application/json');
    $req->content(to_json($record));

    my $res = $ua->request($req);
}

You can use similar processes with other information and document databases. In Hadoop, for example, you might create a new single file for each product record.

When combining information from multiple tables into a single format for processing, you don't have to use the same field names (differences can be resolved at processing time), but if the information is broadly the same, there is no reason not to standardize at least some of the fields (dates, data points).

During the transfer, as in the example code, you might also want to perform some pre-processing and formatting of the information. For example, you can harmonize the data to use the same measurement points or units, or combine fields that were used differently in the source data.
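
A short sketch of that kind of harmonization, with purely hypothetical incoming field names (temp_f, waterLevelPct), might look like this:

// Normalize two different source shapes into the single document format used above
function harmonize(record) {
  var doc = { datestring: record.datestring || record.logged_at };

  // Standardize on Celsius, whichever unit the source used
  if (record.temp_f !== undefined) {
    doc.temperature = (record.temp_f - 32) * 5 / 9;
  } else {
    doc.temperature = record.temperature;
  }

  // Standardize on a single field name for the water level
  if (record.waterLevelPct !== undefined) {
    doc.waterlevel = record.waterLevelPct;
  } else {
    doc.waterlevel = record.waterlevel;
  }

  return doc;
}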


Processing at extraction

With the flexible document structure in mind, processing and identifying patterns in the information happens when you extract the data, rather than being dictated by a structure enforced at the point the data is put in.

In a typical RDBMS, the structure is composed of tables and fields based upon how you want that information to be extracted later. For example, with logging information, you might associate a logging point table (containing the date) with a data point table that contains the specific log data. You know from the outset that you can perform a join to connect the logging point and its time data to the temperature and water level information, so you can track and monitor values over time (see Figure 1).

Figure 1. Performing a join

The processing of the information is done at the point of input, separating out the information so that it can be inserted into tables and then unified at the point of output by recombining the information. The process requires you to know how you want the information reported, joined, and processed when it is output. You can only write an SQL statement when you know the table structure.

With a document database, it is the processing of the raw data that creates the harmonized view of the information and enables the data to be processed, whether that is value-based data or thematic and textual data. The information is put into multiple documents, and the map/reduce system processes that information and generates a structured table from the data (see Figure 2).

Figure 2. Generating a structured table from the data

With the document database, you process the information after it has been inserted. That means that you can process and even change the way the data is extracted while still using and processing the raw data in its original format. Throughout, you have complete flexibility about how the information should be reported, without losing any of the contextual information.

Following this method does, of course, require some differences in the way that you process the raw data.


Using Map/Reduce

There are many different systems for processing big data and preparing it for analytics, especially with document databases. The solutions vary in their approach, from simple querying mechanisms similar to those available with an SQL database, to more complex program-based parsing. The latter is often used where you need to model and understand the underlying data, so that your processing system can distill the core information into a format suitable for summarizing and analysis.

By far the most common of these techniques is Map/Reduce. It works in two stages: a map function to extract the information and a reduce function to simplify and summarize that data.

The map function's role is to take the input information, the documents that you have stored, and simplify it into a format that provides a harmonized output suitable for analysis. For example, using the earlier logging data, you might take the single-sensor temperature documents and the multiple-point temperature documents and output each reading as an individual data point keyed by the date, as shown in Listing 4:

Listing 4. Outputting individual temperature data and multiple point temperature data as a single data point
function (doc, meta) {
  if (doc.temperature && doc.temperature["mid"]) {
    emit(doc.datestring, parseFloat(doc.temperature["mid"]));
    emit(doc.datestring, parseFloat(doc.temperature["top"]));
    emit(doc.datestring, parseFloat(doc.temperature["bot"]));
  } else {
    emit(doc.datestring, parseFloat(doc.temperature));
  }
}

The map shown in Listing 4 is written in JavaScript and designed for use within Couchbase Server, although it would also work in CouchDB, and the basic principles apply within Hadoop as well. The call to emit() generates a 'row' of information, in this case a key and a value. You can see a sample of the raw data output in Listing 5:

Listing 5. Raw data output
{"total_rows":404,"rows":[
{"id":"1334307543","key":"Fri Apr 13 09:59:03 2012","value":22.6132600653245},
{"id":"1334307543","key":"Fri Apr 13 09:59:03 2012","value":25.903221768301},
{"id":"1334307543","key":"Fri Apr 13 09:59:03 2012","value":29.0646016268462},
{"id":"1322793686","key":"Fri Dec  2 02:41:26 2011","value":22.7931975564504},
{"id":"1322793686","key":"Fri Dec  2 02:41:26 2011","value":23.8901498654126},
{"id":"1322793686","key":"Fri Dec  2 02:41:26 2011","value":23.9022843956552},
{"id":"1292554769","key":"Fri Dec 17 02:59:29 2010","value":26.55},
{"id":"1324617141","key":"Fri Dec 23 05:12:21 2011","value":24.43},
{"id":"1296843676","key":"Fri Feb  4 18:21:16 2011","value":23.75},
{"id":"1297446912","key":"Fri Feb 11 17:55:12 2011","value":24.56}
]
}

The id in the output shown in Listing 5 identifies the document that generated the row (from the emit() call). In this case, you can see that the first six rows come from two documents with multiple temperature sensors, because the IDs repeat in groups of three.

The critical point in data mining with Map/Reduce is to ensure that you collect the right information and data fields to construct the output you want. Within Map/Reduce, the format of the map is critical: you output a key and the associated value. The value is used during the reduction phase, and I will get to writing that part effectively in a moment, but picking the right value is critical. When processing text, the value might be the thematic analysis of the string or sentence being examined. When analyzing complex data, you might choose to combine multiple data points; for example, when mining sales information, you might pick the unique user, product, and location combination.
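
As a sketch of what that might look like for hypothetical sales documents (the field names user, product, location, and total are assumptions), the map could emit the combination as the key with the sale amount as the value, ready for the grouping described next:

// Map over hypothetical sales documents: the key is the unique
// user/product/location combination, the value is the sale amount
function (doc, meta) {
  if (doc.user && doc.product && doc.location) {
    emit([doc.user, doc.product, doc.location], parseFloat(doc.total));
  }
}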

The key is important during data mining because it provides the basis for how information is collated. In the example shown in Listing 5, I picked the date as a whole, but you can build more complex structures to allow for more complicated selections. For example, with a date, if you split the date into component parts (year, month, day, hour, minute), then you can group the information according to different rules.

Listing 6 shows a modified version of the map that explodes the date into individual parts:

Listing 6. Modified map that explodes the date into individual parts
function (doc, meta) {
  if (doc.temperature && doc.temperature["mid"]) {
    emit(dateToArray(doc.datestring), parseFloat(doc.temperature["mid"]));
    emit(dateToArray(doc.datestring), parseFloat(doc.temperature["top"]));
    emit(dateToArray(doc.datestring), parseFloat(doc.temperature["bot"]));
  } else {
    emit(dateToArray(doc.datestring), parseFloat(doc.temperature));
  }
}

This generates slightly modified map output with the date as an array (see Listing 7):

Listing 7. Modified map output with the date as an array
{"total_rows":404,"rows":[
{"id":"1291323688","key":[2010,12,2,21,1,28],"value":23.17},
{"id":"1292554769","key":[2010,12,17,2,59,29],"value":26.55},
{"id":"1292896140","key":[2010,12,21,1,49,0],"value":25.79},
{"id":"1293062859","key":[2010,12,23,0,7,39],"value":23.5796487295866},
{"id":"1293062859","key":[2010,12,23,0,7,39],"value":26.7156670181177},
{"id":"1293062859","key":[2010,12,23,0,7,39],"value":29.982973219635},
{"id":"1293403599","key":[2010,12,26,22,46,39],"value":22.2949007587861},
{"id":"1293403599","key":[2010,12,26,22,46,39],"value":24.1374973576972},
{"id":"1293403599","key":[2010,12,26,22,46,39],"value":27.4711695088274},
{"id":"1293417481","key":[2010,12,27,2,38,1],"value":25.8482292176647}
]
}

Now you can combine that with a reduce function to provide summary data across different ranges. Reduce functions take the output of a map() function and summarize that information, according to the selected key structure, into a simpler format. Common examples are sums, averages, or counts. Listing 8 gives an example of a reduce function that calculates the average:

Listing 8. Reduce function that calculates the average
function(keys, values, rereduce) {
    if (!rereduce) {
        // First pass: values are the raw numbers from the map() output;
        // return the average plus the count so it can be re-reduced later
        var length = values.length;
        return [sum(values) / length, length];
    } else {
        // Re-reduce: values are [average, count] pairs from earlier passes;
        // weight each partial average by its count to get the overall average
        var length = sum(values.map(function(v) { return v[1]; }));
        var avg = sum(values.map(function(v) {
            return v[0] * (v[1] / length);
        }));
        return [avg, length];
    }
}

Because of the nature of the reduction system, you have to handle both the original average (computed from the output of the map() function) and a re-reduction (where the output of the first-level reduction is combined with others into the final reduction for the given input range).

The first branch of the function calculates the average of the raw input data (an array of values from the map() function) by dividing the sum total by the count, and returns the count alongside the average. The rereduce branch then combines those partial results into a weighted overall average; for example, partial results of [20, 2] and [26, 4] combine into [24, 6], because (20 × 2 + 26 × 4) / 6 = 24.

When the view is first accessed without any grouping, the entire dataset is processed as a single group, resulting in the average for all of the stored data (see Listing 9):

Listing 9. The dataset is grouped and processed, resulting in the average for all the stored data
{"rows":[
{"key":null,"value":[26.251700506838258,400100]}
]
}

By outputting the date information as an array, you can use the array components as selection criteria for the generated data. For example, if you specify a group level of one, the information will be grouped by the first element of the array, that is, the year (see Listing 10):

Listing 10. Grouping the information by the first element of the array
{"rows":[
{"key":[2010],"value":[26.225817751696518,17484]},
{"key":[2011],"value":[26.252118781247404,199912]},
{"key":[2012],"value":[26.253719707387862,182704]}
]
}

If you specify a group level of 3, you can get the information summarized to individual year/month/day combinations (see Listing 11):

Listing 11. Summarizing the information to individual year/month/day combinations
{"rows":[
{"key":[2010,11,30],"value":[26.23524809151833,505]},
{"key":[2010,12,1],"value":[26.37107941210551,548]},
{"key":[2010,12,2],"value":[26.329862140504616,547]},
{"key":[2010,12,3],"value":[26.31599258504074,548]},
{"key":[2010,12,4],"value":[26.389849136337002,548]},
{"key":[2010,12,5],"value":[26.175710823088224,548]},
{"key":[2010,12,6],"value":[26.21352234443162,548]},
{"key":[2010,12,7],"value":[26.10277260171637,548]},
{"key":[2010,12,8],"value":[26.31207700104686,548]},
{"key":[2010,12,9],"value":[26.207143469079593,548]}
]
}
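
The group level is simply a query parameter on the view request. As a sketch against a CouchDB-style view endpoint (the database, design document, and view names are assumptions; Couchbase Server exposes an equivalent group_level parameter on its view REST API), the year/month/day summary above could be fetched like this:

// Request the average-temperature view reduced to year/month/day (group level 3)
fetch("http://localhost:5984/sensordata/_design/logging/_view/avg_temperature?group_level=3")
  .then((res) => res.json())
  .then((body) => {
    body.rows.forEach(function (row) {
      // row.key is [year, month, day]; row.value is [average, count]
      console.log(row.key.join("-") + " average: " + row.value[0]);
    });
  });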

You can use reduce functions to summarize and identify a variety of different information. For example, a more complex reduce() function can take a text classification of each temperature reading against different value levels (warning, error, and fatal) and merge the counts into a single structure (see Listing 12):

Listing 12. Combining a more complex reduce function with a text representation into a single structure
function(key, values, rereduce)
{
  var response = {"warning" : 0, "error": 0, "fatal" : 0 };

  for (var i = 0; i < values.length; i++)
  {
    if (rereduce)
    {
      // Re-reduce: each value is already a {warning, error, fatal} structure
      response.warning = response.warning + values[i].warning;
      response.error = response.error + values[i].error;
      response.fatal = response.fatal + values[i].fatal;
    }
    else
    {
      // First pass: each value is the text level emitted by the map
      if (values[i] == "warning")
      {
        response.warning++;
      }
      if (values[i] == "error" )
      {
        response.error++;
      }
      if (values[i] == "fatal" )
      {
        response.fatal++;
      }
    }
  }
  return response;
}
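
The map function that feeds this reduce is not shown here; a minimal sketch, with the temperature thresholds (30, 35, and 40 degrees) chosen purely as assumptions for illustration, might classify each reading and emit the level as the value:

// Classify each reading into warning/error/fatal; for multi-sensor documents
// only the mid reading is classified in this sketch
function (doc, meta) {
  var temp = parseFloat((doc.temperature && doc.temperature["mid"]) ?
                        doc.temperature["mid"] : doc.temperature);

  if (temp >= 40) {
    emit(dateToArray(doc.datestring), "fatal");
  } else if (temp >= 35) {
    emit(dateToArray(doc.datestring), "error");
  } else if (temp >= 30) {
    emit(dateToArray(doc.datestring), "warning");
  }
}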

Now you can report the counts of individual instances of these levels across whatever date/time range combination you choose, for example, by month (see Listing 13):

Listing 13. Individual instances of errors counted by month
{"rows":[
{"key":[2010,7], "value":{"warning":4,"error":2,"fatal":0}},
{"key":[2010,8], "value":{"warning":4,"error":3,"fatal":0}},
{"key":[2010,9], "value":{"warning":4,"error":6,"fatal":0}},
{"key":[2010,10],"value":{"warning":7,"error":6,"fatal":0}},
{"key":[2010,11],"value":{"warning":5,"error":8,"fatal":0}},
{"key":[2010,12],"value":{"warning":2,"error":2,"fatal":0}},
{"key":[2011,1], "value":{"warning":5,"error":1,"fatal":0}},
{"key":[2011,2], "value":{"warning":3,"error":5,"fatal":0}},
{"key":[2011,3], "value":{"warning":4,"error":4,"fatal":0}},
{"key":[2011,4], "value":{"warning":3,"error":6,"fatal":0}} 
]
}

These are simplistic examples designed to demonstrate the power and flexibility of map/reduce, but it is easy to see how to cope with different document formats and structures in your source information, and also how to summarize and extract information during that process.


Chaining map/reduce

As you have already seen, Map/Reduce is a practical method for parsing and processing large quantities of data, whether or not your source information and data is stored in a suitable database to begin with.

Within Map/Reduce, though, there are limitations on how much information linked (implicitly or explicitly) across different documents can be combined and reported together. Equally, for very complex information, a single Map/Reduce process may be unable to complete the analysis in a single pass of the map and reduce combination.

Particularly when compared with a traditional RDBMS, the inability to use a JOIN to combine information across multiple tables (including joins of a table to itself) makes certain operations impossible to achieve within the Map/Reduce framework in a single step.

For example, when processing textual information and performing a thematic analysis on the data, you could first process the information and sentences from the raw material, dividing that raw source data up into individual document blocks. That could be achieved through a simple map() that writes the information out as new documents.

A secondary step could then process the extracted information to provide more detailed analysis on the individual fragments and collate and count that information (see Figure 3):

Figure 3. Processing the extracted information to provide more detailed analysis

From a practical perspective, Map/Reduce chaining is a case of running a Map/Reduce function for your first-level analysis and then using the key of each output row (from your map() function) as the document ID for a new document that holds the data structure output for that reduced row.
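
A rough sketch of that step against a CouchDB-style view API might look like the following; the database names, design document, and view name are all assumptions, and the target database is assumed to exist already:

// Read the reduced rows from the first database and write each one back as a
// new document in a second database, keyed by the row key (for example [2010,12])
const source = "http://localhost:5984/sensordata/_design/logging/_view/levels?group_level=2";
const target = "http://localhost:5984/levelsummaries/";

fetch(source)
  .then((res) => res.json())
  .then((body) => Promise.all(body.rows.map(function (row) {
    return fetch(target + encodeURIComponent(JSON.stringify(row.key)), {
      method: "PUT",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(row.value)
    });
  })))
  .then(() => console.log("Chained documents written"));

A second set of map and reduce functions can then be defined against the new database to identify trends across the summarized rows.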

In the earlier logging example, you might have taken the combined warning/error/fatal structure, written that as new data into a new database (or bucket as used in Couchbase), and then run additional processing from that bucket to identify trends or other data (see Figure 4):

Figure 4. Running additional processing to identify trends or other data

Remember that the direct Map/Reduce output is only one element of the data. If you are exporting and combining information from other sources during processing, you can also merge that information into the data stream at this stage.

For example, when looking at sales information, you might have customer information in one database, customer sales in another, and product information in another. By running Map/Reduce on the sales data, exporting to another database, combining that with the product data, and then reducing again before combining with the customer information, it's possible to generate complex data on seemingly disparate sources.


Conclusion

Document databases completely flip the normal data structure and processing rules on their head. No longer do you analyze, construct, and format the information into a tabular format for easy processing. Instead, you store the data in whatever format it arrives in, and then do the processing that analyzes the data, handling the different formats and structures as you go. That process can simplify vast quantities of data down into smaller chunks very easily. The back-end architecture that allows for very large, distributed databases usually enhances, rather than hinders, the process, because you can handle much larger and more complex data by making use of the multiple-node architecture. As this article shows, your primary concern is constructing a suitable document and writing the right Map/Reduce combination to process that data effectively.

Resources


Get products and technologies

  • Hadoop supports some of the NoSQL structure, such as the schema-less format and the ability to use Map/Reduce to process the stored data.
  • InfoSphere Warehouse is a complete toolset for building and analyzing data supporting many of the data mining techniques.
  • WEKA is a Java-based toolkit that supports a variety of different data mining and statistical algorithms.
  • SPSS is a statistics package that includes powerful predictive analysis capabilities.
  • Couchbase Server is a document NoSQL database with Map/Reduce based querying and indexing.
  • Innovate your next open source development project with IBM trial software, available for download or on DVD.
