IBM Accelerator for Machine Data Analytics, Part 5: Speeding up analysis of structured data together with unstructured data

Previously in this series, you created a searchable repository of semi-structured and unstructured data — namely, Apache web access logs, WebSphere® logs, Oracle logs, and email data. In this tutorial, you will enrich the repository with structured data exported from a customer database. Specifically, you will search across structured customer information and semi-structured and unstructured logs and emails, and perform analysis using BigSheets to identify which customers who emailed Sample Outdoors Company during the 14 Jul outage were more loyal than others.

Sonali Surange (ssurange@us.ibm.com), Architect, Machine Data Accelerator, IBM

Sonali Surange is an IBM software architect working on IBM's big data products and technologies. She has filed numerous patents, published more than 15 technical papers with IBM developerWorks, and presented at numerous technical conferences. She is a past recipient of the IBM Outstanding Technical Achievement Award and the Women of Color STEM Technical All Star Award, and was recognized as an IBM developerWorks Professional Author in 2012.



28 May 2013


Analyzing structured data with unstructured data

Big data is about all data. Enterprise data, such as customer information and order information, is typically stored in databases and is available in structured formats. Mixing structured enterprise data with semi-structured or log data and social data makes richer analysis possible. Configurations and KPI reports can also exist in structured formats and are critical to the end-to-end analysis of an IT infrastructure.

When it comes to data, an "include-all" approach is necessary for any big data analysis to truly open up all the pathways leading to new insight.

Once a normalized repository of structured, unstructured, and semi-structured information is set up using IBM's Accelerator for Machine Data Analytics, business analysts and data scientists can perform ad-hoc analysis and set up dashboards.


Before you start

About this series

One of the primary advantages and strengths of IBM's Accelerator for Machine Data Analytics is the capability and ease with which the tools can be configured and customized. This "IBM Accelerator for Machine Data Analytics" series is for those who want to get an introduction to the accelerator and further accelerate the analysis of machine data with the idea of getting custom insights.

About this tutorial

Previously in this series, you created a searchable repository of semi-structured and unstructured data — namely, Apache web access logs, WebSphere logs, Oracle logs, and email data. Here, you will enrich the repository with structured data exported from a customer database. Specifically, you will search across structured customer information and semi-structured and unstructured logs and email messages, and perform analysis using BigSheets to identify which customers who emailed the Sample Outdoors Company during the 14 Jul outage were more loyal than others.

Objectives

In this tutorial, you learn how to:

  1. Enrich the repository of normalized data with new structured data.
  2. Enrich the searchable repository of normalized data with new structured data.
  3. Optionally perform ad-hoc searches on structured, semi-structured, and unstructured data.
  4. Perform ad-hoc analysis on structured, semi-structured, and unstructured data using BigSheets.

Prerequisites

System requirements

To run the examples in this tutorial, you need:

  1. BigInsights™ 2.0 installed
  2. IBM Accelerator for Machine Data Analytics installed
  3. A data set for machine data analysis (see Downloads)

The situation at a fictitious company

Sample Outdoors Company data scientists wanted to marry customer information from its enterprise customer database with machine and social data already analyzed. After the 14 Jul incident, the company wanted to put a system in place to help quickly identify loyal customers when such incidents happened in the future. This would enable corrective actions toward improving customer retention by offering deals toward future purchases.


Prepare structured customer data for analysis

The data scientists at Sample Outdoors Company exported customer data from their customer database into a structured comma-separated format, which was made available for analysis.

Perform the following steps to take a peek at the samples provided to represent this data:

  • From the Download section, download data.zip and unzip it.
  • Copy data/structured_batch to a machine on your BigInsights cluster. For this tutorial, you will use the location /opt/ibm/structured_batch.
  • Review the directory structure containing the batches: structured_batch contains a single batch of customer data.
  • That batch, Batch_custInfo, contains customer information such as customer name, email address, and the year each customer joined Sample Outdoors Company.

The customer data in comma-separated format is shown in Listing 1.

Listing 1. Customer data in comma-separated format
customerName,customerEmail,customerSince
John Doe,john.doe@gmail.com,2011
Mary Jane,mary.jane@yahoo.com,2013
Tony Hall,tony.hall@yahoo.com,2005
Ann Cruz,acruz@hotmail.com,2008
Gaby Kruger,gkrug@gkruger.com,2013

Any delimiter-separated file can be processed. Simply change the value for delimiter to match your data. Review metadata.json for this sample in Listing 2.

Listing 2. metadata.json for customer data
{logType:"csv", batchId:"batch_custInfo", dateTimeFormat:"yyyy", delimiter:",", 
   missingDateTimeDefaults:[{"month":"Jan"},{"timezone":"PST"},{"date":"01"}] }
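
To make the roles of delimiter, dateTimeFormat, and missingDateTimeDefaults more concrete, the following Python sketch reads the rows from Listing 1 and fills in the date parts that a year-only value lacks. It is only an illustration under stated assumptions, not the accelerator's actual extraction logic: the parse_customers function, the normalizedDate field, and the way the defaults are combined are all hypothetical.

```python
import csv
import io
from datetime import datetime

# Sample rows copied from Listing 1.
CUSTOMER_CSV = """customerName,customerEmail,customerSince
John Doe,john.doe@gmail.com,2011
Mary Jane,mary.jane@yahoo.com,2013
Tony Hall,tony.hall@yahoo.com,2005
Ann Cruz,acruz@hotmail.com,2008
Gaby Kruger,gkrug@gkruger.com,2013
"""

# Settings mirrored from Listing 2, written as a plain Python dict.
METADATA = {
    "delimiter": ",",
    "dateTimeFormat": "yyyy",
    "missingDateTimeDefaults": [{"month": "Jan"}, {"timezone": "PST"}, {"date": "01"}],
}


def parse_customers(text, metadata):
    """Return one dict per customer with a hypothetical 'normalizedDate' field."""
    # Flatten the list of single-key dicts into one lookup table.
    defaults = {}
    for entry in metadata["missingDateTimeDefaults"]:
        defaults.update(entry)

    reader = csv.DictReader(io.StringIO(text), delimiter=metadata["delimiter"])
    rows = []
    for row in reader:
        # The data carries only a year; month and day come from the defaults.
        stamp = f'{row["customerSince"]} {defaults["month"]} {defaults["date"]}'
        parsed = datetime.strptime(stamp, "%Y %b %d")
        # The timezone is kept as a label here; no offset arithmetic is attempted.
        row["normalizedDate"] = f'{parsed:%Y-%m-%d} {defaults["timezone"]}'
        rows.append(row)
    return rows


customers = parse_customers(CUSTOMER_CSV, METADATA)
for customer in customers:
    print(customer["customerName"], customer["normalizedDate"])
```

To try a different separator in this sketch, change the "delimiter" entry (for example to "|") and use the same character in the sample text.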

Enrich repository of normalized data with structured customer data

In Part 3, you created a normalized repository of log and email data. Next, you will add the customer information data to your existing normalized repository of data containing logs and email messages.

Extract App

When running Extract App with larger data, ensure that all results are exported in CSV form. In /accelerators/MDA/extract_config/extract.config, change the setting for NUM_REC_PARM from Top 2000 to All.

You will use the Import and Extract steps similar to the ones used in Part 3. Following is a review of the steps:

  • Follow step 3 onward from Part 3's "Bring in and extract the data" to run the import-extract chain. By keeping the same output path, you will be appending customer information to the existing repository of normalized logs and email messages.
  • The output location will now have a Batch_customerInfo directory with batch_customerInfo.csv.

You will use this result for further analysis in BigSheets in "Dashboard of active loyal customers."


Enrich repository of searchable data with structured customer data

Optionally, you can see how searches can be performed across all the data. You will add the customer information data to your existing searchable repository of data containing logs and email messages. You will use the Index app, similar to Part 3. Following is a review of the steps:

  • From the BigInsights console, click the Applications tab and select the Tree view icon.
  • Expand the Machine Data Analytics: Search folder and select the Index application.
    Source directory: /GOMDADemo/output/extract_out (the Index app will only index the new batch this time).
    Output path: /GOMDADemo/output/index_out (by keeping the same output path as in Part 3, you will be incrementally appending customer information to the existing repository of searchable logs and email messages).
  • Run the Index app.
  • Follow the steps in Part 3's "Prepare for search, observe automatic discovery of facets."
  • You will notice that new facets from customer information — customerName, customerSince, and customerEmail — were automatically added, as shown below.
Figure 1. Facets for structured customer information along with facets from unstructured data
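
The idea of facets appearing automatically once new fields are indexed can be sketched in a few lines of Python. This is a conceptual illustration only and says nothing about how the Index app is implemented: the records list, the discover_facets function, and the filter_by_facet function are hypothetical, and the sample values are modeled on the customer data in Listing 1.

```python
from collections import defaultdict

# Hypothetical normalized records: two email records and two customer rows.
records = [
    {"logType": "email", "From": "john.doe@gmail.com",
     "To": "customersupport@sampleoutdoors.com"},
    {"logType": "email", "From": "mary.jane@yahoo.com",
     "To": "websupport@sampleoutdoors.com"},
    {"logType": "csv", "customerName": "John Doe",
     "customerEmail": "john.doe@gmail.com", "customerSince": "2011"},
    {"logType": "csv", "customerName": "Mary Jane",
     "customerEmail": "mary.jane@yahoo.com", "customerSince": "2013"},
]


def discover_facets(rows):
    """Treat every field name as a facet and count how often each value occurs."""
    facets = defaultdict(lambda: defaultdict(int))
    for row in rows:
        for field, value in row.items():
            facets[field][value] += 1
    return facets


def filter_by_facet(rows, field, value):
    """Keep only the rows whose facet `field` has exactly `value`."""
    return [row for row in rows if row.get(field) == value]


facets = discover_facets(records)
print(sorted(facets))  # facet names found across all record types

# Clicking a facet value in the search UI corresponds to a filter like this one.
matches = filter_by_facet(records, "customerEmail", "john.doe@gmail.com")
print(matches[0]["customerSince"])
```

Because facets are derived from whatever fields the records carry, adding the customer rows is enough for customerName, customerEmail, and customerSince to show up next to the email facets in this sketch.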

Search for loyal customers

Zoom in on the errors of 14 Jul by following the steps in Part 3's "Search!" section. You will notice that the following customers complained during the 14 Jul outage.

Listing 3. Customers who complained during 14 Jul outage
"2012-07-14 15:58:35.000 GMT" From:john.doe@gmail.com 
   To:customersupport@sampleoutdoors.com logType  :  email 
   Raw Log: Date: Sat, 14 July 2012 08:58:35 -0700 (PST) From: john.doe@gmail.com 
   To: customersupport@sampleoutdoors.com Subject: FW: Cannot purchase Mime-Version: 1.0 
   Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-From: 
   john doe X-To:

"2012-07-14 15:58:52.000 GMT" From:mary.jane@yahoo.com To:websupport@sampleoutdoors.com 
   logType  :  email 
   Raw Log: Date: Sat, 14 July 2012 08:58:52 -0700 (PDT) From: mary.jane@yahoo.com 
   To: websupport@sampleoutdoors.com Subject: Problem with purchases 
   Cc: customersupport@sampleoutdoors.com Mime-Version: 1.0 Content-Type: text/plain; 
   charset=us-ascii Content-Transfer-E

Next, Sample Outdoors Company wanted to identify whether any of these were longtime customers, so the company could offer them more incentives to stay:

  • Remove the range filter by clicking on the x next to Filter by.
  • Expand the customerEmail facet and click john.doe@gmail.com.
  • Using the resulting values in the customerSince filter, you will notice that John Doe has been a customer since 2011.
  • Similarly, you can filter on mary.jane@yahoo.com and notice that she has been a customer since 2013.
Figure 2. Customers who emailed during the 14 Jul outage -- How long were they Sample Outdoors Company customers?

Dashboard of active loyal customers

Sample Outdoors Company considered customers of more than one year to be its more loyal customers and rewarded that loyalty with larger coupon incentives.

To get a big-picture view of the loyal vs. newer customers, the company used BigSheets for further analysis. Follow the steps below:

  • Click on the Sheets tab.
  • Create a workbook for batch_inbox.csv found under /GOMDADemo/output/extract_out. Call it Email.
  • Create a workbook for batch_structured.csv found under /GOMDADemo/output/extract_out. Call it customerInfo.
  • Click Build a new workbook, click Add Sheets and Load, then select Email. Call the sheet emails.
  • To join customer information with emails, click Add Sheets and select Join.
  • Select LeftOuter for join, Emails for the left sheet, and CustomerInfo for the right, as shown in Figure 3.
    Figure 3. Join CustomerInfo with Email
  • Hide all columns except LogDatetimeNormalized (normalized time when emails were sent); customerName, customerEmail, and customerSince (customer information); and From and To (email information).
  • Insert a column right of customerSince and call it Loyalty.
  • Apply the formula as shown in Figure 4.
    Figure 4. Customers for two or more years are considered loyal
  • Call the sheet CustomerInfoWithEmail, save, and exit.
  • Run the workbook.
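
The join and the Loyalty column built in the steps above can be approximated outside BigSheets with a short Python sketch. It is an illustration under stated assumptions rather than the formula from Figure 4, which is not reproduced in the text: the left_outer_join and loyalty functions are hypothetical, the third email row is invented to show an unmatched sender, and the reference year of 2013 (the year this tutorial was published) is an assumption.

```python
from collections import Counter

# Hypothetical email rows; the first two mirror the senders in Listing 3.
emails = [
    {"LogDatetimeNormalized": "2012-07-14 15:58:35.000 GMT",
     "From": "john.doe@gmail.com", "To": "customersupport@sampleoutdoors.com"},
    {"LogDatetimeNormalized": "2012-07-14 15:58:52.000 GMT",
     "From": "mary.jane@yahoo.com", "To": "websupport@sampleoutdoors.com"},
    {"LogDatetimeNormalized": "2012-07-14 16:02:10.000 GMT",
     "From": "unknown.sender@example.com", "To": "websupport@sampleoutdoors.com"},
]

# Customer rows taken from Listing 1.
customer_info = [
    {"customerName": "John Doe", "customerEmail": "john.doe@gmail.com", "customerSince": "2011"},
    {"customerName": "Mary Jane", "customerEmail": "mary.jane@yahoo.com", "customerSince": "2013"},
    {"customerName": "Tony Hall", "customerEmail": "tony.hall@yahoo.com", "customerSince": "2005"},
]


def left_outer_join(left_rows, right_rows, left_key, right_key):
    """Keep every left row; add the right row's columns when the keys match."""
    index = {row[right_key]: row for row in right_rows}
    return [{**left, **index.get(left[left_key], {})} for left in left_rows]


def loyalty(row, reference_year=2013, min_years=2):
    """Label a joined row; reference_year and min_years are assumptions."""
    since = row.get("customerSince")
    if since is None:
        return "unknown"  # the sender had no match in the customer data
    return "loyal" if reference_year - int(since) >= min_years else "newer"


joined = left_outer_join(emails, customer_info, "From", "customerEmail")
for row in joined:
    row["Loyalty"] = loyalty(row)
    print(row["From"], row.get("customerSince"), row["Loyalty"])

# A dashboard widget over this sheet would chart counts such as these.
summary = Counter(row["Loyalty"] for row in joined)
print(dict(summary))
```

A left outer join keeps emails from senders who are not in the customer data, which is why the sketch needs an "unknown" label in addition to "loyal" and "newer".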

Next, build a dashboard:

  • Click on the Dashboard tab.
  • Create a new dashboard and call it CustomerLoyalty.
  • Add a widget and select CustomerInfoWithEmail.
Figure 5. Dashboard showing which customers who emailed customersupport or websupport were more loyal than others

You now have visibility into the big picture of the loyalty of the customers who sent email messages to web support or customer support at Sample Outdoors Company.


Summary

In this tutorial, you've learned how to enrich an existing data repository with structured data exported from a customer database. You searched across structured customer information and semi-structured and unstructured logs and emails, and performed analysis using BigSheets.

At Sample Outdoors Company, structured information was successfully analyzed, along with semi-structured and unstructured information. The company was able to enrich its normalized and searchable repositories with all types of data sources. It continued to add structured order details information and enriched the loyalty rules, taking into consideration the frequency and sizes of orders from the past. It looked at all customers and proactively emailed incentives and coupons to improve customer retention.


Download

Description                     Name        Size
Data files for this tutorial    data.zip    1KB

