Document databases in predictive modeling

Predictive analytics relies on collecting and analyzing data from many different sources, collating it, and then processing it through several stages into usable information. This involves recording and storing data in different formats, and it may require translating that information into PMML. Although the sources often include data from traditional RDBMS systems, other solutions offer some advantages. The recent range of document-based NoSQL databases can help collate the information in a structured format while coping with the flexible structure of the individual data points. Many NoSQL environments also provide support for extensive MapReduce-style queries and processing, which makes them ideal for reducing large volumes of data into a summary format. In this article, we'll look at the transfer, exchange, and formatting of information in NoSQL environments.


Martin Brown, VP of Technical Publications, Couchbase

A professional writer for over 15 years, Martin 'MC' Brown is the author of and contributor to over 26 books covering an array of topics, including the recently published Getting Started with CouchDB. His expertise spans myriad development languages and platforms: Perl, Python, Java, JavaScript, Basic, Pascal, Modula-2, C, C++, Rebol, Gawk, Shellscript, Windows, Solaris, Linux, BeOS, Microsoft® WP, Mac OS, and more. He is a former LAMP Technologies Editor for LinuxWorld magazine and is a regular contributor to ServerWatch.com, LinuxPlanet, ComputerWorld, and IBM developerWorks. He draws on a rich and varied background as a founder member of a leading UK ISP, systems manager and IT consultant for an advertising agency and Internet solutions group, technical specialist for an intercontinental ISP network, and database designer and programmer, and as a self-confessed compulsive consumer of computing hardware and software. MC is currently the VP of Technical Publications and Education for Couchbase and is responsible for all published documentation, training program and content, and the Couchbase Techzone.



08 October 2012


Introduction

Today, whether you are aware of it or not, it is difficult to get away from predictive analytics. It is used in a wide variety of data environments, from credit card checks, to stock management, to security and safety monitoring.

At each point in the process, the incoming data describing a situation is compared with a predictive model that includes the data the system needs to make a judgment. For example, when monitoring the sensors for a security system, the system may have to compare the incoming sensor information with a model that defines whether this is a warning or an alarm situation. A cat walking by a sensor may trigger minor activity, such as enabling a video recording, but not the full alarm, which requires the registration of a larger heat source.

The challenge in each of these situations is combining the incoming data with the background information, and then making a judgment. The use of open standards like PMML is often critical, but you will probably want to combine the PMML structure with a past history, or to log the incoming data triggers as part of your entire analytics environment.

For example, with a credit card transaction, predictive analytics is typically used to capture unusual behavior, but you need to have a body of information available that describes what you know about the customer's existing behavior: where they shop, the typical size of their transactions, and the times and dates at which they make them.

Storing and processing this background information and material can be a primary source of problems. Insufficient input and historical data within your model will lead to bad decisions, such as a completely valid transaction being flagged as possible fraud, causing embarrassment to your customer and requiring additional time for you to work with them to resolve the issue.

Moving up a level in terms of the requirements, a secondary issue is the speed with which you can obtain and process the information. Databases are designed to handle the querying of the data they store, but the level of processing time required to get that information, and convert it into a model that you can use within your predictive analytics layer, is also critical. You don't want to keep a customer waiting more than a few seconds while validating a credit card transaction. Generally, the quicker it happens the better. More significantly, you will need to handle many millions of other transactions within the same period.

With an alarm or monitoring system, the difference between an immediate response and one delayed by a few seconds could be the difference between preventing a crime, or halting production, and a much more catastrophic alternative.


Collating data into NoSQL

Collecting and collating valid background information has traditionally involved recording that information into a heavily structured relational database management system (RDBMS) such as IBM DB2. The first step to using such a system for recording the information is creating a suitable data structure, or schema. For example, we can use a model like the one shown in Figure 1 to record credit card transactions.

Figure 1. Recording credit card transactions

Recording data into this model requires that you either have all of the information available, or that you have at least the core information (customer, merchant, or cost) on which to base your later predictions.

But what happens if the level of data in this scenario changes? If you've looked at your credit card statement recently, particularly online, you might notice that there is a significant amount of information available about your transaction, sometimes including the detail of what you bought, what type of product it was, and whether you were there in person or completing the transaction online or over the phone.

Although we could alter the detail of the information stored in our DB2 database by adding additional fields, doing so can be comparatively expensive, particularly when the change must be applied to a table with millions, or even billions, of rows of information.

Consider our alarm sensor: what if the basic trip information (that is, the sensor has detected something) has been updated with a more advanced sensor that can detect the temperature of the seen object? What if it is later updated again to report the approximate size as well as the temperature? Altering the table each time the quality or quantity of information changes is not only difficult and time consuming within a traditional RDBMS, it also increases the complexity of getting that information back out again.

For example, to create the seed and comparison data you might combine a number of different SQL queries, such as:

  • To get the minimum and maximum transaction amounts for the customer: SELECT MIN(amount),MAX(amount) FROM transactions WHERE customerid = XX.

  • To get the country information about the customer, in case they regularly buy from that destination: SELECT DISTINCT country FROM transactions WHERE customerid = XX.

  • To determine if the customer has ever used the merchant before: SELECT COUNT(tid) FROM transactions WHERE customerid = XX AND merchantid = YY.

  • And finally, to get the typical transaction range for the merchant: SELECT MIN(amount),MAX(amount) FROM transactions WHERE merchantid = YY.

The combination of these queries only scratches the surface of the type of information that you may typically want to collect. In reality, you will probably want to take a deeper look at the transactions, maybe even mapping individual transactions, their frequency, and their values, in order to build up your model and arrive at a decision.

These changes not only alter the database itself, but also how the information is extracted, retrieved, and processed during the analysis stage, all of which takes time and adds complexity.

Document databases

There are many different aspects to the NoSQL movement, but one of the most critical is the move away from the traditional schema-based data modeling of the RDBMS to a more flexible schema-less document-based style.

Within a document database, instead of defining a strict schema, which involves creating a table definition and then populating it, information is stored as a document. The document is typically stored either by serializing an object (for example, a Java-based instance of an object class), or by using a recognized data standard such as JavaScript Object Notation (JSON).

The primary benefit of the document model is that the structure does not have to be pre-defined. Let's look at a sample of the credit card transaction, this time as a JSON document (see Listing 1).

Listing 1. Sample of the credit card transaction
{
    "country" : "UK",
    "transaction_country" : "UK",
    "amount" : 23.23,
    "customerid" : "3458983734981294",
    "merchant_country" : "UK",
    "type" : "inperson",
    "merchantid" : "49587",
    "datetime" : "2012-08-23T13:54:00"
}

The structure of this information allows for the construction of a document that contains a large amount of information, much of which is constant and what you would expect to appear in this type of document, but the individual fields are optional. You can enforce the level of requirement at the application layer, rather than at the database layer, allowing the information to be inserted and updated rapidly, even as the content and information expands.
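As a minimal sketch of that application-level enforcement (the list of required fields and the validateTransaction helper name are illustrative assumptions, not part of any particular client library), the check performed before inserting a document might look something like this:

function validateTransaction(doc) {
    // Core fields we always expect, even though the database itself
    // does not enforce them (illustrative list only).
    var required = ["customerid", "merchantid", "datetime"];
    for (var i = 0; i < required.length; i++) {
        if (doc[required[i]] === undefined) {
            throw new Error("Missing required field: " + required[i]);
        }
    }
    // The transaction size may arrive under different field names,
    // depending on the source system.
    if (doc.amount === undefined && doc.value === undefined) {
        throw new Error("Transaction has no amount or value field");
    }
    return true;
}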

To use our example again, if the transaction adds detailed information on what was purchased, it can be included in the document structure (see Listing 2).

Listing 2. Transaction with more detailed information on what was bought
{
    "country" : "UK",
    "merchant_country" : "UK",
    "datetime" : "2012-08-23T13:54:00",
    "amount" : 23.23,
    "transaction_country" : "UK",
    "customerid" : "3458983734981294",
    "type" : "inperson",
    "merchantid" : "49587",
    "items" : [
       {
          "amount" : 10.2,
          "description" : "food"
       },
       {
          "amount" : 13.03,
          "descrition" : "DVD"
       }
    ]
}

Another record, this time from a different merchant, contains a different level of information supplied by a different transaction system for the same customer (see Listing 3).

Listing 3. Record containing different level of information
{ 
    "country" : "US", 
    "description" : "Electrical goods",
    "datetime" : "2012-07-01T01:54:00",
    "transacttype" : "internet",
    "transaction_country" : "US", 
    "value" : 192.48, 
    "tax" : 34.29, 
    "addresssupplied" : "yes", 
    "checkcode" : "474", 
    "customerid" : "3458983734981294", 
    "merchantid" : "12875" 
}

Despite the differences in the input data in each case, and the difference in the level of information supplied by each transaction and system, there are some common elements that you can identify and extract from the database. Better still, you can tag the records, either with a version number (if the data format is a progression over time) or with a tag that identifies the source format.

For example, looking at each document we can still determine the value of each individual transaction, even though the information is in an amount field in one document and in a value field in another.

Within a NoSQL document-based solution, the extraction and identification of the information can happen at the point where the information is retrieved from the database, rather than having to process the information and insert it into the right fields at the time the data is received. The MapReduce model used by NoSQL databases such as Couchbase or Hadoop can easily cope with the differences in structure and format in each case.

Instead, the data is inserted using different document structures (but with common elements), allowing for differences in processing, in the quantity or quality of the data, and even in the field names.
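As a small illustration of the tagging idea mentioned above (the tag name source_format and its value are assumptions made for this example, not a fixed convention), a document can simply carry an extra field identifying the source system or schema version at the point it is inserted:

// Illustrative only: tag each document at insert time so that later
// MapReduce processing knows which field layout to expect.
var transaction = {
    "customerid" : "3458983734981294",
    "merchantid" : "12875",
    "value" : 192.48,
    "transaction_country" : "US",
    "source_format" : "internet-gateway-v2"   // hypothetical schema/source tag
};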

Using the flexible model offered by the document structure provides key benefits, including:

  • Speed of insertion

    Many NoSQL alternatives eschew the indexing and ACID compliance typical of traditional databases to improve speed, particularly when adding information.

  • No schema

    Without a fixed schema, there is no need to construct a complex table or database format, or create a complex structure or query to update it.

  • Extensible

    If your structure changes, for example because your data format or source changes, then you can perform this change on the fly and handle it within the application. With predictive data there are often changes, expansions, and improvements to the source data.

  • Flexible

    Not only can you store the data, and changes made to it, you can also store composite records and information that would be difficult (or require complex queries) to represent within a traditional RDBMS. After building a predictive model, you can store the entire model back into the NoSQL database.

These benefits should not be underestimated; the ability to store whatever you want, without a rigid structure and without having to alter your table format as your information sources and application evolve, is ideal for the ever-changing and expanding nature of predictive analytics.

Scalability

A key difficulty with many traditional RDBMS solutions is how you cope with an ever-growing set of data. There are numerous strategies available to you, including scaling up your hardware, using caching layers like Memcached or IBM solidDB, and using replication and other techniques to distribute the data across multiple machines and allow the data to be updated, and read, from multiple hosts.

While any, or all, of these solutions are completely viable, they come with a substantial overhead at the point of building and deploying your chosen solution. Using SQL, especially if you want to execute queries that require a join of any kind, complicates the process as you must have access to not only the main table, but also all the data in the joined tables as well.

The document model often associated with NoSQL databases eliminates some elements of this process. Given that the data can be self-contained within the document structure, joins are not always required. In addition, the document ID (or key) addressing nature of the system allows data to be more easily distributed across multiple nodes, without having to cope with the problem of keeping other table data coherent and available, or worrying about where it will be accessed from.
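As a purely illustrative sketch of that key-based addressing (the key scheme below is an assumption made for this article, not a requirement of any particular database), the document key can be derived from values the application already knows, so that any node can address a record directly without a join or a lookup table:

// Hypothetical key scheme: combine the customer ID and the transaction
// timestamp so that each transaction document has a unique,
// directly addressable key.
function makeTransactionKey(customerid, datetime) {
    return "txn::" + customerid + "::" + datetime;
}

// makeTransactionKey("3458983734981294", "2012-08-23T13:54:00")
// returns "txn::3458983734981294::2012-08-23T13:54:00"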

Most NoSQL solutions are deliberately designed with such scalability in mind. You can use many of the larger, big data solutions, such as Hadoop, to help collate the core information and create the predictive model that you need when making decisions. Hadoop and others are specifically designed to handle such large volumes of data, and can make use of the MapReduce functionality built into these systems to collect the information from the cluster and create the prediction model that will be used when a transaction is requested.


Processing raw data within your NoSQL environment

One of the major advantages of the NoSQL database is the combination of the flexibility with which you can store the information (as we've already seen), and the ease with which that information can be extracted, processed, and stored back into the database for quick usage the next time around.

Many of the different NoSQL databases support some sort of processing engine that processes, and most often reduces or summarizes, that information. For example, Hadoop is specifically designed to perform MapReduce operations on source data and to convert that information into a simplified format.

With credit card information, a MapReduce function could be created to process the past transactions for a customer and work out the typical transaction size, countries, and other limits into a structure suitable for performing the final analysis. This could take into account all of the parameters you need, in addition to all of the past transactions (which would be a significant quantity of information for a frequent card user). Hadoop is limited in that you cannot always query it directly, at least not in the format most useful for predictive analytics.

Couchbase Server 2.0 also provides MapReduce functionality that can perform the same level of analysis. With Couchbase, the MapReduce information is stored in an index, making it very quick to get the information back out again when you need it. In addition, Couchbase supports incremental MapReduce. This allows you to change (or expand) the source data used to create the MapReduce index and update only the information that changed.

For example, you can create a map function within Couchbase Server to process our sample credit card record amounts using the code in Listing 4 below.

Listing 4. map function in Couchbase Server to process sample credit record
function(doc) {
    // Some source systems store the transaction size in 'value',
    // others in 'amount'; emit whichever exists, keyed by customer ID.
    if (doc.value) {
        emit(doc.customerid, doc.value);
    } else {
        emit(doc.customerid, doc.amount);
    }
}

The emit() function call outputs a row of data consisting of a key (our customer's credit card number) and a value (the size of the transaction). The value could be larger: either an array, or even a hash of data, meaning that we could actually generate a complete predictive model.

The map function is responsible for mapping the input fields to the output data, and so is ideal for handling these different input record formats. The map function here is written in JavaScript, and you can perform complex determinations and calculations within it.
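As a sketch of that idea (this is not the function used for Listing 5, which expects plain numeric values; the compound structure here is only an illustration), a map function could emit a small hash rather than a single number:

// Sketch: emit a compound value per transaction, keyed by customer ID.
// Emitting a hash lets a matching reduce function build a richer
// summary, such as amounts broken down by country, in a single pass.
function(doc) {
    var amount = (doc.value !== undefined) ? doc.value : doc.amount;
    emit(doc.customerid, {
        amount : amount,
        country : doc.transaction_country
    });
}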

The map function output is processed and stored in an index for quick access. The processing only happens once for each document in the database. The reduce function also stores its output in the index, making it easy to construct the data.

The reduce function is used to summarize the information, from the more typical sums and counts to more complex structures, like the highs and lows that we might associate with the credit card predictive model (see Listing 5).

Listing 5. The reduce function
function(key, values, rereduce) {
    // Accumulate a summary for each customer: total spend, number of
    // transactions, and the minimum and maximum transaction amounts.
    var result = {total: 0, count: 0, min: null, max: null};
    for (var i = 0; i < values.length; i++) {
        if (rereduce) {
            // values are partial results from earlier reduce calls
            result.total = result.total + values[i].total;
            result.count = result.count + values[i].count;
            if (result.min === null || values[i].min < result.min) {
                result.min = values[i].min;
            }
            if (result.max === null || values[i].max > result.max) {
                result.max = values[i].max;
            }
        } else {
            // values are the raw amounts emitted by the map function
            result.total = result.total + values[i];
            result.count++;
            if (result.min === null || values[i] < result.min) {
                result.min = values[i];
            }
            if (result.max === null || values[i] > result.max) {
                result.max = values[i];
            }
        }
    }
    return(result);
}

Notice here that the function returns a compound result, made up of a JavaScript object that contains the total, count, min, and max fields. Our map() function has already determined the correct field from which to find that information, so the reduce function only needs to reduce it to a structured result.

We could output the final predictive model data as part of this reduce process, and have a predictive model generated entirely from the database (and stored in the index for fast access). We can then use that model immediately to perform the analysis.
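As a rough sketch of how such a model might then be consumed (the thresholds and the shape of the model object are assumptions for illustration; the model here is simply the reduce output of total, count, min, and max), a decision engine could compare an incoming transaction against the summarized figures:

// Sketch only: score an incoming transaction against the per-customer
// summary produced by the reduce function in Listing 5.
function scoreTransaction(txn, model) {
    var average = model.total / model.count;
    var flags = [];
    if (txn.amount > model.max) {
        flags.push("amount exceeds any previous transaction");
    }
    if (txn.amount > average * 10) {          // arbitrary illustrative threshold
        flags.push("amount is far above the customer's average");
    }
    return flags;   // an empty array means nothing unusual was detected
}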


NoSQL Predictive Data lifecycle

You can see an example of the lifecycle of information that you could use with a NoSQL database in Figure 2.

Figure 2. Lifecycle of information with a NoSQL database

The model is based upon the credit card transaction model we have been looking at throughout this article. The system relies on two main datasets:

  • Transactions

    These are the raw data records of all historical credit card transactions. The prediction system populates the information when a transaction is approved.

  • Predictive model

    By using MapReduce, the transactions are processed into a single predictive model structure. The predictive structure is stored within the NoSQL database as a single record, allowing the prediction system to retrieve and use it when an incoming transaction hits the system.

Further datasets could be added to the system to make use of other information, which could then be combined with the transaction data to help form the predictive model document. For example, you might incorporate knowledge about known trips inferred from the transaction data.

The processing of the raw transaction information creates the predictive model, which is stored in the NoSQL database where the decision engine can use it for the next transaction.
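A sketch of what that stored predictive model record might look like is shown below (the fields simply mirror the reduce output from Listing 5 plus the kind of extra information other datasets could contribute; the values are illustrative):

// Illustrative shape of a per-customer predictive model document.
var customerModel = {
    "customerid" : "3458983734981294",
    "total" : 5231.88,          // total spend across all transactions
    "count" : 214,              // number of transactions
    "min" : 2.50,               // smallest transaction amount
    "max" : 192.48,             // largest transaction amount
    "countries" : ["UK", "US"], // countries seen in past transactions
    "updated" : "2012-08-23T14:00:00"
};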


Modeling input data into documents

We've already seen some examples above of how you can model different pieces of information in your NoSQL database. Depending upon the NoSQL database that you have chosen, the exact format and structure of the information that you want to store can differ.

Many of the most practical NoSQL databases use JSON as a storage format, which offers additional benefits for some systems. For example, within some NoSQL databases the use of JSON enables you to update individual fields within the document, instead of retrieving and updating the entire document when a modification is required.

JSON also has other practical advantages, in that it can easily be read, formatted, and exchanged regardless of the platform or application language. This makes it ideal for exchanging information across a range of different application environments. Finally, you can use JSON as a data interchange format for application objects. Many environments support the translation of an internal object into a JSON structure and back again.
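In JavaScript, for example, that translation is part of the standard JSON API (this is just a reminder, using a cut-down version of the record from Listing 1):

// Round-trip an application object through JSON.
var transaction = { customerid : "3458983734981294", amount : 23.23 };

var asText = JSON.stringify(transaction);   // object to JSON string
var asObject = JSON.parse(asText);          // JSON string back to object

console.log(asText);            // {"customerid":"3458983734981294","amount":23.23}
console.log(asObject.amount);   // 23.23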

There are some common and recommended practices for modeling your data within a document or JSON structure, particularly when storing data for a predictive model:

  • Store the common, easily used information at the top level within the document. The transaction amounts, countries, and other direct information are stored as individual top-level fields. This makes them easy to access and process.
  • Use arrays to store collections of data that you would normally process in their entirety. For example, if your model specifically uses the individual items in a transaction every time, then store that information into an array that can easily be iterated over.
  • Use hashes when you want to perform a quick check for information in a structured format. This is because you can check for the existence of a field within a hash in just a single line. If you store the information in an array, you will have to iterate over the array, which can be time consuming.

Using these basic rules, you can model any (relatively) structured source of data into a JSON or similar document that can be stored into a NoSQL database.
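Putting those three rules together (the flags hash and its field names are illustrative additions to the article's sample transaction, not part of any fixed schema), a document might be modeled like this:

// Illustrative document following the three modeling rules above:
// top-level fields for direct values, an array for items that are
// always processed together, and a hash for quick existence checks.
var transaction = {
    "customerid" : "3458983734981294",       // top level: direct access
    "amount" : 23.23,
    "transaction_country" : "UK",
    "items" : [                              // array: iterated in full
        { "amount" : 10.20, "description" : "food" },
        { "amount" : 13.03, "description" : "DVD" }
    ],
    "flags" : {                              // hash: single-line checks
        "inperson" : true,
        "addresssupplied" : true
    }
};

// Checking a hash field is a single lookup, with no iteration:
if (transaction.flags.addresssupplied) {
    // address was supplied with the transaction
}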


PMML to documents and back again

If your chosen predictive modeling environment uses the Predictive Model Markup Language (PMML) standard, ensure that the documents you generate within the NoSQL MapReduce stages are in a format that is easy to convert to PMML. The best approach is not to create a PMML document structure that can be read back, but to use the ability of a NoSQL database to store documents (that is, the completed predictive model) and use that as the basis for your PMML data.

There are several ways you can do this within the NoSQL processing layer:

  • Create a structure in JSON (or your chosen output format) that matches the PMML format. You can generate the data dictionary portions as an array of different elements within the JSON itself. Converting this format to PMML requires an additional layer of conversion, but also enables you to more easily consume and reprocess the source information without re-generating the PMML structure.
  • Create the PMML alongside the JSON data. This process is longer and more time consuming (and can be more error prone), but has the benefit that the PMML model already exists at the point it is required.
  • Or use a combination of the two. Generate the JSON format of the model, and store that in the NoSQL database. Then generate a PMML model if it's required, and store that into the NoSQL database so that it can be updated for the next time it is required. This solution requires more storage space within your NoSQL datastore, but you can use the TTL or expiry functionality of the NoSQL database if it is offered to automatically delete the PMML model if it hasn't been used for some time.

Of course, any solution that needs to regenerate the model from the input data (as in our credit card example) will imply additional CPU resources.
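As a very rough sketch of the first approach in the list above (this covers only a PMML data dictionary; a complete, valid PMML document also needs a header and a model section appropriate to your model type, and the field list is taken from the illustrative model document earlier):

// Sketch: build a minimal PMML DataDictionary fragment from a list of
// numeric fields in the JSON model. This only illustrates the
// conversion layer; it is not a complete or validated PMML document.
function pmmlDataDictionary(fields) {
    var xml = '<DataDictionary numberOfFields="' + fields.length + '">\n';
    for (var i = 0; i < fields.length; i++) {
        xml += '  <DataField name="' + fields[i] +
               '" optype="continuous" dataType="double"/>\n';
    }
    xml += '</DataDictionary>';
    return xml;
}

// For the illustrative model document:
// pmmlDataDictionary(["total", "count", "min", "max"]);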


Conclusion

NoSQL databases completely remove the need to create a schema for the data that you want to store, which is a huge advantage when you need to consume information from a variety of different sources that ultimately provide a core of the same information. In this article, you've seen how the different structures stored in NoSQL can hold the same information in different ways, how you can use the processing ability of the NoSQL database to convert these differing structures into a standardized format, and how you can then use that to create a predictive model.

