IBM Accelerator for Machine Data Analytics, Part 2: Speeding up analysis of new log types

Machine logs from diverse sources are generated in voluminous quantities across an enterprise. IBM® Accelerator for Machine Data Analytics simplifies the implementation work required, so that analysis of structured, semi-structured, or unstructured textual data is accelerated.


Sonali Surange (ssurange@us.ibm.com), Software Architect, IBM

Sonali Surange is a software architect working on IBM's big data products and technologies. She holds numerous patents, has published 15 articles on IBM developerWorks, and has presented at numerous technical conferences. Sonali has received the IBM Outstanding Technical Achievement Award and the Women of Color STEM Technical All Star Award, and was recognized as an IBM developerWorks Professional Author in 2012.



17 January 2013

Also available in Chinese and Japanese

Before you start

About this series

One of the primary advantages and strengths of IBM Accelerator for Machine Data Analytics is the capability and ease with which the tool can be configured and customized. This series of articles and tutorials is for those who want to get an introduction to the accelerator and further accelerate the analysis of machine data with the idea of getting custom insights.

About this tutorial

This tutorial is a step-by-step example of using IBM Accelerator for Machine Data Analytics to analyze a completely new type of data. It lays the groundwork for Part 3, which illustrates how you can plug and play this new log type during indexing and searching.

Objectives

In this tutorial, you will learn how to do the following.

  1. Start your analysis of a new data set by using the out-of-the-box support in the accelerator.
  2. Identify missing fields needed for analysis.
  3. Customize the accelerator to create your own log type for subsequent analysis.

Prerequisites

You should be familiar with BigInsights Text Analytics and AQL (Annotation Query Language). Some familiarity with BigInsights Text Analytics tooling is a plus but not required. Read Part 1: Speeding up machine data analysis of this series to get an overview of the IBM Accelerator for Machine Data Analytics.

System requirements

To run the examples in this tutorial, you need the following.

  1. BigInsights v2.0 installed.
  2. IBM Accelerator for Machine Data Analytics installed.
  3. BigInsights v2.0 Eclipse tooling installed.
  4. A data set for machine data analysis. Refer to the Download section for the link to download the data.

Variety in machine data analysis

In Part 1: Speeding up machine data analysis of this series, you learned how to analyze machine data of known types, such as Apache web access and WebSphere logs. You also learned how to handle types that are lesser known to the accelerator, using the generic type.

As long as the data is textual, time-series data, you can apply these techniques to any machine data for analysis without writing any new code!

Using the generic type, you will be able to extract most of the fields that are commonly found in machine data. Much machine data contains name-value pairs or XML leaf-tag values, and the generic type will extract most of this interesting information.

If, after using these techniques, there are still fields specific to a certain data type that are not extracted, the accelerator provides a way to customize the existing rules or add new ones.
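For intuition, the following is a minimal AQL sketch of a name-value pair rule, written for this article. It is illustrative only and is not the accelerator's shipped code, which handles many more formats and edge cases.

module nvp_sketch;

-- Illustrative only: find simple name:value or name=value pairs.
create view NameValuePair as
extract regex /(\w+)\s*[:=]\s*([^\s,;]+)/
    on D.text
    return group 1 as name
       and group 2 as value
from Document D;

output view NameValuePair;

Run against an email record like those shown later in Listing 1, a rule of this shape would pick up pairs such as charset=us-ascii from the Content-Type header.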

In this tutorial, you will use email data and learn how you can add a new log type to analyze this data, including the following.

  • How to use the Eclipse tooling to customize existing rules or build new ones.
  • How to publish the customized rules via a customized application for production.

The situation at a fictitious Sample Outdoors company

In Part 1: Speeding up machine data analysis of this series, the data scientists at the Sample Outdoors company were able to confirm the issues reported on Saturday, July 14th, using their logs from across the application stack. They also gained insight into the potential root causes of the issue.

Many customers were impacted on Saturday, July 14th, and the customer support center was flooded with emails from complaining customers. The Sample Outdoors company risked adverse publicity and feared losing current and potential customers. One way to remediate the problem was to offer the affected customers coupons with appropriate savings on a future purchase. That Saturday was one of the busiest days of Sample Outdoors' biggest mid-year sale, and a huge number of customers were impacted. Sample Outdoors wanted to prioritize these savings for the specific customers who had contacted the support center via email.

To do this, the Sample Outdoors company needed a consolidated view of all the customer orders that were attempted and the customers who were impacted. The company already had information about the attempted orders available for analysis. It now wanted to add the customer emails, to gather enough information about the customers and the size and details of their orders to offer them appropriate savings in coupons.


Ten features to accelerate machine data analysis for new log types

Take a look at the following overview and highlights of the features of IBM Accelerator for Machine Data Analytics that can be used to analyze your own data type.

  1. To learn how you can prepare your email data for analysis, refer to the Prepare the data for the new log type section.
  2. To use the generic type and learn how to validate the results and identify any missing fields, refer to the Out-of-the-box support section.
  3. To set up your Eclipse environment to work on the Extraction application customization, refer to the Take control! Prepare for customization section.
  4. Get a peek into the Extraction application in the Understand the Extraction application section.
  5. Get a peek into the Eclipse tooling for Text Analytics in the Peek into the tooling section.
  6. To use new rules to extract fields that are specific to emails and test them, refer to the Create your own email log type section.
  7. Review the text analytics rules used for email data in the Understand the code section.
  8. Understand the naming conventions that allow plug and play of this new log type with the rest of the applications in the Understand the wiring under the hood section.
  9. To publish the customized application to the BigInsights cluster, refer to the Publish the customized application section.
  10. To extract the emails using the customized Extraction application and see the results, refer to the New log type in action! section.

Using emails at the Sample Outdoors company

The data scientists at the Sample Outdoors company wanted to use the emails that the customer support center had received, retrieving those from customers who complained during the outage on Saturday, July 14th. They would then use this information to get order size information and customer loyalty data, and email the appropriate savings to these customers.

They collected the emails from customersupport@sampleoutdoors.com and websupport@sampleoutdoors.com to start their analysis using IBM Accelerator for Machine Data Analytics.


Prepare the data for the new log type

A prepared batch of email data from customersupport@sampleoutdoors.com is provided in the Download section.

Perform the following steps.

  1. From the Download section, download code_and_data.zip and unzip it.
  2. You will also find a directory called AQL. Keep it handy. You will use email.aql and extractor_email.aql later in this tutorial.
  3. You will find a directory called input_batches, which contains one batch directory called batch_inbox. The batch contains email data, as shown in Listing 1, representing emails received by the Sample Outdoors company customer support inbox.
    Listing 1. Email data
    Message-ID: <16159836.1075855377439.JavaMail.evans@thyme>
    Date: Sat, 14 July 2012 08:36:42 -0800 (PST)
    From: john.doe@gmail.com
    To: customersupport@sampleoutdoors.com
    Subject: FW: Cannot purchase 
    Mime-Version: 1.0
    Content-Type: text/plain; charset=us-ascii
    Content-Transfer-Encoding: 7bit
    X-From: john doe
    X-To: customersupport
    X-cc: 
    X-bcc: 
    X-Folder: \customersupport_July2012\Notes Folders\Inbox
    X-Origin: customersupport
    X-FileName: customersupport.nsf
                                            
    Hi
    I am still not able to purchase items on Sample Outdoors. I urgently need
    to get these items. 
                                            
    Thanks,
    John
                                            
                                            
    -----Original Message-----
    From: 	Doe, John  
    Sent:	Saturday, July 14, 2012 4:06 PM
    To:	customersupport@Sampleoutdoors.com
    Subject: Cannot purchase
                                            
    Hello,
                                            
    I am having trouble purchasing items on your website. Is there a known issue,
    any estimate on when it will be fixed?
                                            
    Thanks,
    John
    
    Message-ID: <13556517.1075852726971.JavaMail.evans@thyme>
    Date: Sat, 14 July 2012 08:59:02 -0700 (PDT)
    From: mary.jane@yahoo.com
    To: websupport@sampleoutdoors.com
    Subject: Problem with purchases
    Cc: customersupport@sampleoutdoors.com
    Mime-Version: 1.0
    Content-Type: text/plain; charset=us-ascii
    Content-Transfer-Encoding: 7bit
    X-From: mary jane
    X-To: websupport
    X-cc: customersupport
    X-bcc: 
    X-Folder: \websupport_July2012\Notes Folders\Inbox
    X-Origin: websupport
    X-FileName: websupport.nsf
    
    Hi
    I am unable to purchase on your website. Please help!!!
    
    Mary

    batch_inbox also contains metadata.json, as shown in Listing 2.
    Listing 2. metadata.json for the batch representing the inbox
    {preTimestampRegex:"((\\n)|(\\r))+Date:\\s", logType:"generic",
    batchId:"batch_inbox", dateTimeFormat:"EEE, dd MMM yyyy H:mm:ss Z", 
    missingDateTimeDefaults:[] }

    Notice that you will initially use the generic logType, to identify the fields that can be extracted out of the box. The dateTimeFormat value follows Java SimpleDateFormat-style patterns.

  4. Upload the data into HDFS (Hadoop Distributed File System). Figure 1 shows the directory structure in HDFS after the data is uploaded. You can use the BigInsights console to create the directory structure and upload the files into HDFS.
    Figure 1. Email data uploaded to HDFS

    Note: When the data is larger or in several batches, consider using the Import application shipped with IBM Accelerator for Machine Data Analytics.


Out-of-the-box support

Now you will run the Extraction application and validate the results. You can review the "A little preparation before getting started - The Known log types" and "The unknown log types" sections from Part 1: Speeding up machine data analysis of this series for more information on out-of-the-box support for various log types.

Perform the following steps.

  1. Run the Extraction application using the parameters shown in Figure 2. The Source directory is: /GOMDADemo/input_batches, and the output path is: /GOMDADemo/output/extract_out.
    Figure 2. Run the Extraction application on email data with generic logtype

    Note: Always point Source directory to the directory containing the batches, even if there is only a single batch under it. This allows the application to work on one or more batches at a time.

  2. Browse the contents of the output path. Follow the steps shown in Figure 3 to view the CSV result as a sheet. Then, save the workbook as email_generic.
    Figure 3. View output of Extraction as sheet

    Note: The output directory contains a directory named after the batchID of each batch. Within it, a CSV file, also named after the batchID, contains the top 2000 extraction results for that batch, which is the default setting.

    You can change this setting in extract.config to extract all of the results or none of them. The default configuration is installed at /accelerators/MDA/extract_config/extract.config, but you can make your own copies and save them in other preferred locations.

  3. Validate the results of the extraction, including that the primary timestamp and record boundaries are identified correctly and that the timestamp normalization is correct. All of these are driven by the values provided in metadata.json.

    Figure 4 shows the columns in the sheet resulting from batch_inbox.csv.

    Figure 4. Validate output of Extraction

    Note that the charset column is extracted by the generic name-value pair rule. Since there can be numerous name-value pairs in a record, only the first value is exported to the CSV file. In the next article in this series, you will learn how to visualize all of the extracted fields in a search user interface.

    You will see similar output for XML leaf-tag pairs when the data contains XML content. Only the first pair is exported to the CSV file, but you can use the search interface to look at all of the results.
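    For intuition, a bare-bones leaf-tag rule can be sketched in AQL as follows. This is illustrative only and is not the accelerator's shipped code, which is far more thorough.

    module xmltag_sketch;

    -- Illustrative only: extract <tag>value</tag> leaf pairs, using a
    -- backreference (\1) to require a matching closing tag.
    create view XmlLeafTag as
    extract regex /<(\w+)>([^<]*)<\/\1>/
        on D.text
        return group 1 as tag
           and group 2 as value
    from Document D;

    output view XmlLeafTag;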

  4. If there are issues in validation, fix them now! When results are not produced as expected, it is worthwhile to double check the key information that drives those results.

    For fixing incorrectly identified record boundaries or incorrect values in LogDateTime, double check the primary timestamp format, represented as dateTimeFormat in metadata.json. Also check the regular expression preceding the timestamp, where applicable, represented as preTimestampRegex in metadata.json.

    For fixing incorrect values in LogDateTimeNormalized, in addition to the above, double check the missing information in the primary timestamp, represented as missingDateTimeDefaults in metadata.json, where applicable.

    If the expected fields are not seen in the headers of the CSV file representing the result, double check the log type selected, represented as logType in metadata.json.

    Notice that some of the interesting fields in the emails, such as To and From, were not extracted. This information is critical for analyzing emails.

    Next, you will customize the Extraction application so these interesting fields for the email data can be extracted.


Take control! Prepare for customization

Use the source code shipped as a BigInsights project for the Extraction application to provide richer support for the email data.

Perform the following steps.

  1. Open the Eclipse instance where you have installed BigInsights Eclipse tooling. To get more information on installing the BigInsights Eclipse tooling, open the BigInsights console and go to the Console tab. Then under Quick Links, select Enable your Eclipse development environment for BigInsights application development. Follow the instructions to install the Eclipse tooling.
  2. Locate the source project for the Extraction application (MDAExtractApp_eclipse.zip) in your accelerator's installation directory, under the bin folder.
  3. Import the MDAExtractApp_eclipse.zip as an Eclipse project. First, open the BigInsights perspective. From the Window menu, select Open Perspective, then select BigInsights. If it does not show up in the list, it may be listed under Other.
  4. In the Project Explorer, right-click and select Import, then select Existing Projects into Workspace.
  5. In the wizard, choose Select archive file and point to MDAExtractApp_eclipse.zip.
  6. Click Finish.

You have imported the Extraction application source code. Next, you will get a brief overview of the structure of this project.


Understand the Extraction application

For an overview of the Extraction application, refer to the "Extracting information from text" section from Part 1: Speeding up machine data analysis of this series.

Perform the following steps.

  1. The Extraction application has a set of rules for a few known log types, and a generic set of rules for the lesser known log types. Expand the AQL folder to see the files containing these rules and how they are organized. Figure 5 shows how the AQL rules are organized for each of the log types.
    Figure 5. Peek into the Extraction application

    Notice the naming convention used. Each log type has a corresponding extractor_logtype.aql file. Following this convention is mandatory when you build the email log type.

    Figure 5 also shows the extractor_logtype.aql files for the out-of-the-box log types. The extractor_logtype.aql is a top-level module that includes all of the rules for that log type. Typically, all of the rules for the log type are defined in a subdirectory.

  2. Take a look at the AQL rules under the common directory, which represent the generic set of rules. Expand the common directory and review the AQL rules, as shown in Figure 6.
    Figure 6. AQL rules included in the generic type

    Similarly, you can take a look at the rules included for the known types.

  3. The AQL rules are called by a custom JAQL module. To look at the compiled AQL rules included in the JAQL module, expand the src/jaql/custom_modules folder. Figure 7 shows the compiled rules that are exposed via the custom module.
    Figure 7. Compiled AQL rules exposed in the custom module
  4. Notice the resulting naming convention. Each log type has a corresponding extractor_logtype.tam file in the custom_modules/extractor folder. This is helpful to note when you build the email log type. The Extraction application contains several other JAQL scripts and Java UDFs, but they are not of interest when building a new log type.

By changing these AQL files, you can customize any of the existing rules for the known and generic log types, or add new rules to them.


Peek into the tooling

You will now take a quick look at how to use the tooling before you start making changes to the code.

Let's start simple. You will run the generic set of rules on one record of email data and observe the results.

Perform the following steps.

  1. Under the project, create a directory called data.
  2. Add a file called one_email. You can get it from the code_and_data.zip file in the Download section. The file contains one record of email data.
  3. Next, you will run the AQL rules for the generic set, as shown in Figure 8.
    1. Right click the project, and select Run As.
    2. Click Run Configurations.
    3. In the wizard, click Text Analytics.
    4. Click on the Launch New Configuration icon on the top left corner.
    5. Name the configuration as email_generic.
    6. Under modules, select extractor_generic.
    7. For Location of the Input document collection, select MDAExtractApp/data.
    8. Click Run.
    Figure 8. Run configuration to test generic AQL rules on email data
  4. Notice the results as shown in Figure 9.
    Figure 9. Result of generic AQL rules on email data

Note: The AQL rules to identify Date, Time and DateTime identify all occurrences of timestamps in a record. The primary timestamp in particular, represented by LogDateTime, is identified outside of these rules.
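As a rough illustration of what such rules look like, here is a deliberately simplified time-of-day extractor. It is illustrative only; the shipped rules recognize many more date and time formats.

module datetime_sketch;

-- Illustrative only: match bare hh:mm:ss times of day.
create view SimpleTime as
extract regex /\b\d{1,2}:\d{2}:\d{2}\b/
    on D.text
    return group 0 as match
from Document D;

output view SimpleTime;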


Create your own email log type

Follow a simple naming convention for the top-level text analytics module to create the new email type. You have already seen similar naming conventions being used by the out-of-the-box log types. Following this naming convention is important since it enables the new email log type to plug and play with the Extraction application and the rest of the applications in the Accelerator.

In the next article in this series, you will see how this new log type can plug and play during indexing and searching.

First you need to create a log type called email. Then, you need to create the top-level text analytics module called extractor_email.

Perform the following steps.

  1. Under the AQL folder, create a new folder called email.
  2. Download email.aql. You can get it from the code_and_data.zip file in the Download section. This contains the AQL rules for the email log type.
  3. Import the email.aql file into the email folder.
  4. Under the AQL folder, create a new folder called extractor_email.
  5. Download extractor_email.aql from the code_and_data.zip file in the Download section. This contains the top-level module to include the email rules from email.aql.
  6. Import the extractor_email.aql file into the extractor_email folder. The new AQL rules are now in place for you to run on the email data.
  7. Run the MDAExtractApp project again using the Text Analytics configuration, this time selecting extractor_email for the module, as shown in Figure 10.
    Figure 10. Run project using email log type
  8. View the results in Annotation Explorer. You can also double click on the entries in the Annotation Explorer to see the results in context of the email data.
  9. You will notice that the To and From fields from the email data are extracted successfully, as shown in Figure 11.
    Figure 11. See results of running the email log type

Understand the code

Review the code in email.aql, shown in Listing 3. The companion top-level module, extractor_email.aql, simply includes these rules.

Listing 3. email.aql
module email;
                
create view email_base as
select R.match
from Regex(/([A-Za-z.\d;'\-<>]+\@[A-Za-z.\d;'\-<>]+(?=(?=,)\s*[A-Za-z.\@\d';\-<>]+|))/, Document.text) R;

create view toEmail as
select e.match
from email_base e
where MatchesRegex(/To:\s/, LeftContext(e.match, 4));

create view To as
select D.match as span, GetText(D.match) as text, GetString('To') as field_type
from toEmail D;

export view To;

create view fromEmail as
select e.match
from email_base e
where MatchesRegex(/From:\s/, LeftContext(e.match, 6));

create view From as
select D.match as span, GetText(D.match) as text, GetString('From') as field_type
from fromEmail D;

export view From;

Observe the following in the code.

  • A base view email_base is created to find candidate email addresses.
  • A view fromEmail is created to pick out the addresses representing the sender.
  • A view From is created from the view fromEmail. This is the final view to export!
  • A similar view toEmail is created to pick out the addresses representing the receiver.
  • A view called To is created from the view toEmail. This is the final view to export!

Note: The final views that are exported must follow this simple naming convention.

  • The view must contain a field called span, representing the span where the value is found.
  • The view must contain a field called text, representing the text value found.
  • The view must contain a field called field_type, identifying the view; this value becomes the field's column header in the CSV output.
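For example, a hypothetical Subject extractor, which is not part of the shipped email.aql, would follow the same shape:

-- Hypothetical example: extract the Subject header line and expose it
-- with the mandatory span/text/field_type fields.
create view subjectLine as
extract regex /Subject:\s*(.+)/
    on D.text
    return group 1 as match
from Document D;

create view Subject as
select S.match as span, GetText(S.match) as text, GetString('Subject') as field_type
from subjectLine S;

export view Subject;

Because Subject exports span, text, and field_type, it would surface as a Subject column in the CSV output with no other wiring required.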

You will soon get some insights into how these naming conventions are used, but first, you can publish the custom application.


Understand the wiring under the hood

Following the naming conventions for the exported views and the text analytics modules is important.

When the Extraction application sees a value of email in the logType field in metadata.json, it goes through the following process.

  • It looks for a compiled text analytics module named extractor_email in the custom JAQL module. When you build your new AQL, the compiled module is automatically generated in this location. If the application does not find one, it defaults to using extractor_generic.
  • After publishing, deploying, and running the customized application, you will see the CSV file with sample results. The header values in the CSV file represent the field_type field of each view exported from the extractor_email module.
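For reference, the overall shape of such a top-level module is roughly as follows. This is a minimal sketch; the actual extractor_email.aql provided in code_and_data.zip is authoritative.

module extractor_email;

-- A top-level extractor module typically imports the views from the
-- log type's rule module and re-exposes them under their own names.
import view To from module email as EmailTo;
import view From from module email as EmailFrom;

create view To as
select T.span as span, T.text as text, T.field_type as field_type
from EmailTo T;

create view From as
select F.span as span, F.text as text, F.field_type as field_type
from EmailFrom F;

export view To;
export view From;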

In the next article, you will see that each field_type field from each view exported from the extractor_email module is also available for faceted searching.


Publish the customized application

The MDAExtractApp project is now ready to be published to the BigInsights cluster.

Perform the following steps.

  1. Change the application name. This will avoid overwriting the installed Extraction application during deployment. Expand the BIApp folder, and open application.xml. Change the name to ExtractionEmail as shown in Figure 12.
    Figure 12. Change application name

    The application is now ready to be published. Next, point the tooling to the BigInsights cluster.

  2. If you have not already done so, ensure you have set the Text Analytics tooling to use a standard tokenizer as opposed to multilingual.
    1. From Window -> Preferences, expand BigInsights and select Text Analytics.
    2. Select Show advanced tab in Text Analytics project property page as shown in Figure 13.
      Figure 13. Text Analytics settings
    3. Right-click on the project MDAExtractApp and select Properties.
    4. For Text Analytics, select Use the default Standard tokenizer configuration as shown in Figure 14.
      Figure 14. Use standard tokenizer configuration
  3. From the BigInsights servers view, add your BigInsights server by right clicking BigInsights servers, selecting New, and providing information in the wizard as shown in Figure 15.
    Figure 15. Add BigInsights server
  4. Right-click the MDAExtractApp project and select BigInsights Application Publish.
  5. In the wizard, keep the defaults and select Next through the first five pages.
  6. On the page Zip and publish applications, click Add JAQL module.
  7. Select custom_modules in the src/jaql folder.
  8. Select Finish. You will see a message indicating the application was published successfully.

New log type in action!

You will now run the ExtractionEmail application.

Perform the following steps.

  1. First, change the prepared batch to indicate the log type email, as opposed to generic. To do this, you will edit the metadata.json file. From the Files tab of the BigInsights console, click GOMDADemo/input_batches/batch_inbox/metadata.json.
  2. Click the Edit button, change the log type to email, and click Save. The metadata.json should now look as shown in Listing 4.
    Listing 4. metadata.json
    {preTimestampRegex:"((\\n)|(\\r))+Date:\\s", logType:"email",
    batchId:"batch_inbox", dateTimeFormat:"EEE, dd MMM yyyy H:mm:ss Z",
    missingDateTimeDefaults:[] }
  3. Next, deploy the newly published application. From the Applications tab, click Manage, select ExtractionEmail and click on Deploy, as shown in Figure 16.
    Figure 16. Deploy ExtractionEmail App
  4. Execute the ExtractionEmail application, similar to steps shown in Figure 2, except this time providing the source directory as: /GOMDADemo/input_batches, and the output path as: /GOMDADemo/output_email/extract_out.
  5. Click Run.
  6. To view the results, from Application History, click the icon in the Output column. It takes you to /GOMDADemo/output_email/extract_out/batch_inbox.csv. You will see To and From columns containing the To and From email addresses, as shown in Figure 17.
    Figure 17. View result of customized ExtractionEmail application

You have successfully customized the Extraction application to add the ability to extract information from email data.


Conclusion

In this tutorial, you created a completely new log type to support email data. You can also add any of the existing rules to this log type to enrich it further!

At the Sample Outdoors company, the Extraction configuration was changed to export all records to the CSV file, as opposed to the top 2000. Further ad hoc analysis was then performed, combining customer order information with email information, to identify the customers to follow up with for remediation.

Acknowledgements

Thanks to Amit Rai (amitrai4@in.ibm.com) for his technical review, and to all the Machine Data Accelerator team members contributing to this feature. Also thanks to Thomas Friedrich and Robin Noble-Thomas for their help on BigInsights tooling.


Download

Description      Name                 Size
Code sample      code_and_data.zip    ---
