IBM Accelerator for Machine Data Analytics, Part 4: Speeding the up-and-running experience for a variety of data

Machine logs from diverse sources are generated in an enterprise in voluminous quantities. IBM® Accelerator for Machine Data Analytics simplifies the implementation work required, so analysis of semi-structured, unstructured, or structured textual data is accelerated. In this article, the fourth in a series, learn step-by-step how to use the web or Eclipse tooling in IBM InfoSphere® BigInsights™ to more quickly get up and running with IBM Accelerator for Machine Data Analytics.


Sonali Surange (ssurange@us.ibm.com), Architect, Machine Data Accelerator, IBM

Sonali Surange is an IBM software architect working on IBM's big data products and technologies. She has filed numerous patents, published more than 15 technical papers with IBM developerWorks, and presented in numerous technical conferences. She is a past recipient of the IBM Outstanding Technical Achievement Award, Women of Color STEM Technical All Star Award, and was recognized as an IBM developerWorks Professional Author in 2012.



Amit Rai (amitrai4@in.ibm.com), Software Engineer, IBM

Amit Rai is an Advisory Software Engineer on the IBM Big Data Accelerator team. He played a key role in development of the Machine Data Analytics Accelerator. Amit is engaged in proof-of-concept projects with various customers involving InfoSphere BigInsights, InfoSphere Streams, and Data Explorer. He also has worked extensively on many IBM data management and data warehousing solutions.



28 February 2013


Machine data characteristics

In Part 1: Speeding up machine data analysis of this series, you learned that machine data consists of records. In many cases a record consists of a single line; in other cases, several lines together form one record. Machine logs that contain exception stack traces, XML content, or content generated by an application writing multiline records are typical examples. Record boundaries are usually identified by the presence of a primary timestamp. Within a record, some characters sometimes precede the primary timestamp.

Refer to A little preparation before getting started, The known log types and The unknown log types from Part 1: Speeding up machine data analysis for a few such examples.

Identifying and defining these record boundaries correctly is an important first step in performing machine data analysis. Whether the machine data consists of single line or multiple line records, following this process helps identify the primary timestamp which is key to the rest of the analysis.

As the data varies, rules describing record boundaries or primary timestamps can vary slightly or need to be redefined. The task of preparing a variety of log types can be simplified with the help of tooling.


Before you start

About this series

One of the primary advantages and strengths of IBM Accelerator for Machine Data Analytics is the capability and ease with which the tools can be configured and customized. This series of articles and tutorials is for those who want an introduction to the accelerator and want to further accelerate the analysis of machine data with the goal of getting custom insights.

About this tutorial

This tutorial is a step-by-step example showing how to use the IBM InfoSphere BigInsights tooling (web or Eclipse) to speed the up-and-running experience for IBM Accelerator for Machine Data Analytics. You will learn how to easily prepare the data and iteratively test the extraction of the data. This will set the stage for the rest of the analysis. Along the way, you will be introduced to some helper tooling that can be used to speed up the process.

Objectives

In this tutorial:

  1. You will learn how to configure your machine data for analysis. You will be introduced to BigInsights Eclipse helper tooling which you can optionally use.
  2. If you prefer to configure and test your data locally before moving to the BigInsights cluster, you will learn how to use the Eclipse tooling to perform this task.
  3. If you prefer to configure and test directly on your BigInsights cluster, you will learn how to perform this task.

Because a variety of data is used for analysis, use these steps on small amounts of data to prepare it for analysis. Once tested, you can run the analysis on big data with a similar configuration.

Prerequisites

Read Part 1: Speeding up machine data analysis of this series to get an overview of the IBM Accelerator for Machine Data Analytics. Optionally read Part 2: Speeding up analysis of new log types to understand how Eclipse tooling is used to support a new log type and Part 3: Speeding up machine data searching to search known and customized log types from a consolidated searchable repository.

System requirements

To run the examples in this tutorial, you need:

  1. InfoSphere BigInsights 2.0 installed
  2. IBM Accelerator for Machine Data Analytics installed
  3. Optionally, BigInsights 2.0 tools for Eclipse installed
  4. A data set for machine data analysis. Refer to the Download section for the link to download the data.

The situation at a fictitious Sample Outdoors Company

The data scientists at the Sample Outdoors Company were given the mission to evangelize IBM Accelerator for Machine Data Analytics to a great number of new organizations, each with its own log formats. They anticipated having to prepare a variety of logs for analysis, and decided to use BigInsights tooling to speed up the preparation and testing of the data. Once prepared, they will use these configurations for regular ongoing analysis.


Accelerating the up-and-running experience for machine data analysis

In the previous tutorials and articles in this series, you used previously prepared batches of data that were made available for download. In this tutorial, you will prepare a batch of data yourself. Preparing a batch consists of identifying the record boundaries and the primary timestamp, and creating rules to define them. This information is then used to create the metadata for the batch. Finally, you will test the prepared batch.

Here are the steps you'll follow in this article:

  1. Review the process to identify record boundaries.
  2. Use BigInsights Eclipse tooling, if desired, to provide the first rule. It represents the string before the primary timestamp. If you do not need tooling help building a regular expression or using Eclipse tooling is not of interest, proceed to provide the second rule.
  3. Provide the second rule. It represents the primary timestamp.
  4. Put the rules together to form the metadata for this type of log.
  5. If you prefer to test with small data on Eclipse locally before moving to the BigInsights cluster, Test rules on small data locally using Eclipse.
  6. Review Tips on iterative testing and troubleshooting using Eclipse tooling.
  7. If you prefer to test small data on your BigInsights cluster, Test rules on small data using the BigInsights console.
  8. Review Tips on iterative testing and troubleshooting using the BigInsights console.
  9. Understand the wiring under the hood.
  10. Get running on big data.

At the Sample Outdoors company

The data scientists at the Sample Outdoors Company acquired machine data from a front-end application owned by the web tools group, as an exercise in using the tooling. Next, they wanted to prepare the data for analysis.


Identify record boundaries

Record boundaries consist of two parts:

  • The primary timestamp, which should be provided in Java SimpleDateFormat.
  • The string before the primary timestamp, which should be provided in the form of a regular expression.

You will review this process with the help of an example of an Apache web access log.

Steps

  1. Download the data.zip from the Downloads section. Unzip it.
  2. Review the first few lines of the downloaded data/log.txt file.
    Listing 1. Machine data
    9.11.245.205 - - [23/Jul/2012:23:54:24 -0400] "GET 
    /innovation/us/watson/images/arrows/bg-nav-item-arrow-gray.png?1311756764 
    HTTP/1.1" 304 - "-" "Mozilla/4.0 (compatible;)"8308
    8.12.248.208 - advertising [23/Jul/2012:23:57:47 -0400] "GET 
    /innovation/us/watson/fonts/helveticaneue-webfont.eot HTTP/1.1" 200 2355 
    "http://www-03.ibm.com/innovation/us/watson/what-is-watson/a-system-designed-for
    -answers.html" 
    "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; SLCC2; .NET CLR
     2.0.50727; 
    .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.3; Tablet 
    PC 2.0)"2695
  3. Notice that there are two records in the above sample. The primary timestamps for the two records are 23/Jul/2012:23:54:24 -0400 and 23/Jul/2012:23:57:47 -0400 respectively.
  4. Notice that the log records do not start with the primary timestamp. An IP address comes before it, sometimes followed by a user name (advertising in the second record), though the user name is missing most of the time. The strings before the primary timestamps in the records are:
    9.11.245.205 - - [
    8.12.248.208 - advertising [

Next, you will build the two rules representing the record boundaries.


Provide the first rule

The first rule consists of the regular expression for the string before the primary timestamp. You will use the BigInsights Eclipse tooling to help build this regular expression. If you do not need tooling help to build a regular expression, or using Eclipse tooling is not of interest, proceed to Provide the second rule.

Steps

  1. From your Eclipse where BigInsights tooling is installed, click on the Open Regular Expression Generator Wizard icon as shown in Step 1 in Figure 1 below.
  2. In the wizard, click on the Load Samples from file button as shown in Step 2 in Figure 1. Select the downloaded data/sampleforpretimestampregex.txt file. This file contains a list of samples representing the string before the primary timestamp.
  3. Select the radio button that shows this value: (\d{1,2}(\.)\d{2}\d?(\.)\d{2}\d?(\.)(\d{2,3}|\d{1,2})(\d{1,2})?( )(-)( )((advertising)|(-))( )(\[)) as shown in Step 3 of Figure 1.
  4. Click the Next button.
    Figure 1. Generate regular expression wizard, page 1
  5. The default proposed regular expression is shown in Figure 2.
    Figure 2. Default proposed regular expression
  6. Next, you'll improve the regular expression by applying a known rule about IP addresses: each of the four numbers can only have a value between 0 and 255.
    1. In the proposed regular expression, select \d{1,2}, and select the radio button for An Integer number within a certain range.
    2. Provide a minimum value of 0 and maximum value of 255 to represent rules for an IP address, as shown in Step 1 in Figure 3.
    3. Select Apply to replace the first highlighted rule.
    4. Follow this procedure iteratively through all the rules for the digits and apply similar changes for each rule.
    5. You now have a regular expression that represents only valid digits for IP addresses.
    Figure 3. Improve the rules for digits in IP address regular expressions
  7. Lastly, you will generalize the rule so that it matches any valid string in that position, not just the literal "advertising" that happens to occur in the samples.
    1. Select the Next button until the regular expression for the word "advertising" is highlighted, as shown in Step 1 in Figure 4.
    2. Select the radio button for Any of the symbols in this character class, where the character class is Lower case letters.
    3. Provide the values of 0 and 50 for the least and most characters respectively, as shown in Step 2 in Figure 4.
    4. Click the Apply button as shown in step 3 in Figure 4.
    5. The generalized regular expression representing all possible strings will be shown as \p{Ll}{0,50}.
    Figure 4. Generalize the regular expression
  8. Next, you will test the generated regular expression on the full log records.
    1. Click on the Import button as shown in Step 1 in Figure 5.
    2. Select the downloaded data/log.txt.
    3. Notice that the regular expression correctly matches all the strings.
    4. Click Finish.
    Figure 5. Test regular expression on full records
  9. You now have the value for the first rule — the string before the timestamp, to put into metadata.json. Save the contents from the clipboard in a file. You will use it in Put the rules together to create the metadata.

Use tooling to generate regular expressions when building or customizing your AQL rules. In Part 2: Speeding up analysis of new log types, you downloaded pre-built AQL rules developed for the new log type. As an extended exercise, customize or add your log types and use tooling to build your AQL rules!
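
If you want to double-check the finished expression outside the wizard, you can exercise it with plain Java, since the wizard generates Java-compatible regular expression syntax. Below is a minimal sketch; the class and constant names are my own, and the pattern mirrors the expression built in the steps above (with each backslash doubled for Java source):

import java.util.regex.Pattern;

public class PreTimestampCheck {
    // One octet of an IP address (0-255), as generated by the wizard
    private static final String OCTET =
        "((25[0-5])|(2[0-4][0-9])|(1[0-9][0-9])|([1-9][0-9])|([0-9]))";

    public static void main(String[] args) {
        // IP address, "-", an optional lowercase user name, then "["
        Pattern pre = Pattern.compile(
            OCTET + "(\\.)" + OCTET + "(\\.)" + OCTET + "(\\.)" + OCTET
            + "( )(-)( )((\\p{Ll}{0,50})|(-))( )(\\[)");
        String[] samples = { "9.11.245.205 - - [", "8.12.248.208 - advertising [" };
        for (String s : samples) {
            System.out.println(s + " -> " + pre.matcher(s).matches());
        }
    }
}

Both sample strings should print true. Note that this is the single-backslash form of the rule; the doubled-backslash form needed for metadata.json is covered in Put the rules together.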


Provide the second rule

The second rule represents the primary timestamp format.

Steps

  1. Refer to SimpleDateFormat to identify the format for the primary timestamp.
  2. The format to represent 23/Jul/2012:23:54:24 -0400 would hence be: dd/MMM/yyyy:HH:mm:ss Z.

You now have the value for the second rule — the primary timestamp. You are now ready to create the metadata to prepare the batch.
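
Because the rule is interpreted as a Java SimpleDateFormat pattern, you can sanity-check it in a few lines of Java before creating the metadata. A minimal sketch (the class name is my own):

import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

public class TimestampFormatCheck {
    public static void main(String[] args) throws ParseException {
        // dd/MMM/yyyy:HH:mm:ss Z should parse 23/Jul/2012:23:54:24 -0400
        SimpleDateFormat fmt =
            new SimpleDateFormat("dd/MMM/yyyy:HH:mm:ss Z", Locale.ENGLISH);
        Date parsed = fmt.parse("23/Jul/2012:23:54:24 -0400");
        System.out.println(parsed);
    }
}

If the pattern were wrong, parse() would throw a ParseException instead of printing the date.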


Put the rules together

Since you only have one batch for this exercise, you will create the metadata.json file as shown.

For any other log, the process will be the same:

  • Simply change the first and second rules.
  • Pick a log type, or use generic.
  • Add any other metadata, if you need it.

That's it!

Steps

  1. Using your favorite editor, create a file called metadata.json, with content as shown in Listing 2.
  2. Note that before using the first rule generated for the preTimestampRegex field, we made the following changes:
    • Prepended the rule with ((\\n)|(\\r))+ to include the newlines before record boundaries.
    • Escaped each backslash (\) in the regular expression as \\, as required inside a JSON string.
  3. We used the second rule for the dateTimeFormat field.
  4. The logType field used is webaccess.
  5. Additional information about application name was added in the application field.
Listing 2. Contents of metadata.json
{logType:"webaccess", batchId:"batch_tools", 
dateTimeFormat:"dd/MMM/yyyy:HH:mm:ss Z", 
preTimestampRegex:"((\\n)|(\\r))+((25[0-5])|(2[0-4][0-9])|
(1[0-9][0-9])|([1-9][0-9])|([0-9]))(\\.)((25[0-5])|(2[0-4]
[0-9])|(1[0-9][0-9])|([1-9][0-9])|([0-9]))(\\.)((25[0-5])|
(2[0-4][0-9])|(1[0-9][0-9])|([1-9][0-9])|([0-9]))(\\.)
((25[0-5])|(2[0-4][0-9])|(1[0-9][0-9])|([1-9][0-9])|([0-9]))( )
(-)( )(\\p{Ll}{1,100}|(-))( )\\[", patternFieldParam:[], 
missingDateTimeDefaults:[], application:"toolsFrontEnd"}
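
Optionally, you can verify that the escaping in metadata.json round-trips correctly before running the extraction app. The sketch below assumes the org.json library is on the classpath and that your file keeps the preTimestampRegex value on a single line (Listing 2 wraps it only for display); the class name is my own:

import java.nio.file.Files;
import java.nio.file.Paths;
import java.text.SimpleDateFormat;
import java.util.Locale;
import java.util.regex.Pattern;
import org.json.JSONObject;

public class MetadataCheck {
    public static void main(String[] args) throws Exception {
        // JSON parsing turns each doubled backslash back into a single one
        String json = new String(Files.readAllBytes(Paths.get("metadata.json")));
        JSONObject meta = new JSONObject(json);
        Pattern pre = Pattern.compile(meta.getString("preTimestampRegex"));
        SimpleDateFormat fmt = new SimpleDateFormat(
            meta.getString("dateTimeFormat"), Locale.ENGLISH);
        // The expression should match a record prefix preceded by a newline
        System.out.println(pre.matcher("\n9.11.245.205 - - [").matches());
        // ... and the format should parse the sample primary timestamp
        System.out.println(fmt.parse("23/Jul/2012:23:54:24 -0400"));
    }
}

The first line printed should be true, confirming both the escaping and the newline prefix added in step 2.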

Test rules on small data locally using Eclipse

You will test the rules on a small set of data using Eclipse. If you prefer to do this testing with the BigInsights Console web tooling, proceed to Test rules on small data using the BigInsights Console. To test the rules, you will use the Extraction App, which performs record splitting, field extraction, timestamp normalization, and metadata synthesis, and provides results for validation. Once the results are verified, you can use these rules on bigger data on the BigInsights cluster.

Steps

  1. Set up Eclipse and the Extraction App using Take Control! Prepare for customization from Part 2: Speeding up analysis of new log types, if you have not already done so.
  2. If you have not already done so, ensure you have set the Text Analytics tooling to use a standard tokenizer as opposed to multilingual.
    1. From Windows -> Preferences, expand BigInsights and select Text Analytics.
    2. Select Show advanced tab in Text Analytics project property page as shown in Figure 6.
      Figure 6. Text Analytics settings
      shows the above settings selected
    3. Right-click on the project MDAExtractApp and select Properties.
    4. For Text Analytics, select Use the default Standard tokenizer configuration.
      Figure 7. Use standard tokenizer configuration
      Image shows default standard tokenizer selected
  3. Switch off JAQL editor errors, since you will not require that feature in this tutorial. From the Preferences menu, select BigInsights and uncheck the Show JAQL errors checkbox.
  4. Create a preferred directory on your machine to hold the batch of logs. For this exercise, we will use C:/GOMDADemo/tools/input_batches/batch.
  5. Copy log.txt and metadata.json into the batch directory.
  6. Copy the downloaded data/config/extract.config to c:/GOMDADemo/tools.
  7. If you are running your Eclipse instance from a Windows® machine, continue with this step. Otherwise, if you are running Eclipse on a Linux® machine, proceed to step 8.
    Open src/jaql/custom_modules/LAExtract/LAExtract.jaql and comment out the getModuleDirectory function. Add the version shown in Listing 3, which is needed to run from Windows.
    Listing 3. getModuleDirectory to run from Windows
    // Normalize the module directory to a well-formed file:/// URL,
    // which is what JAQL expects when the app runs from Windows.
    getModuleDirectory = fn() (
        if (startsWith(::_moduleDirectory, "file:")) (
            simpleStrReplace(::_moduleDirectory, 'file://', 'file:///')
        ) else (
            "file:" + ::_moduleDirectory
        )
    );
  8. You are now ready to run extraction. Since the extraction application uses JAQL custom modules, you will use JAQL tooling to test the app.
    1. Right-click on MDAExtractApp and select Run As. Select JAQL and click on the New Configuration icon.
    2. In JAQL Script select src\jaql\app\extractOverMultipleBatches.jaql.
    3. Select the Local radio button.
    4. In JAQL Search path, select the Add Default Path button.
    5. In the JAQL Search path, select the Add from project button and select src/jaql/custom_modules, as shown in Figure 8.
      Figure 8. Configuration to run as JAQL
    6. In the Arguments tab, provide values shown in Listing 4.
      Listing 4. Arguments to run JAQL module
      -e "INPUT_LOG_FILES_DIR_PARM=\"c:/GOMDADemo/tools/input_batches 
      \";OUTPUT_BASE_DIR_PARM=\"file:///c:/GOMDADemo/tools/output/jaql/extract_out 
      \";PATTERN_FIELDS_PARM=\"LogDateTime\";NUM_REC_PARM=\"Top 
      2000\";EXTRACT_CONFIG_DIR_PARM=\"c:/GOMDADemo/tools/extract.config\";"
    7. Click the Run button on the dialog.
  9. To see the results, go to the output location c:/GOMDADemo/tools/output/jaql/extract_out and open batch_tools.csv. Verify that the values for the primary timestamp (LogDateTime), the normalized timestamp (LogDateTimeNormalized), and the record (text) are correct. The rest of the fields are specific to the log type selected and the metadata passed in, such as application. Figure 9 shows the result:
    Figure 9. Verify result of sample data extraction

Tips on iterative testing and troubleshooting using Eclipse tooling

Keep these tips handy when you try this with new logs.

  1. Set up Eclipse logging using the instructions in the InfoSphere BigInsights Information Center.
  2. Keep the following settings and change the others as listed in the link above:
    • jaql.root.logger.level=ERROR
    • jaql.status.logger.level=ERROR
  3. All errors will be displayed in the Eclipse console.
  4. To fix any issues with rules, change the values in metadata.json and re-run the extraction app.

Test rules on small data using the BigInsights Console

You can use the BigInsights Console web tooling to test with small data, before running with big data.

Steps

  1. Create a directory /GOMDADemo/tools/input_batches/batch in HDFS.
  2. Upload log.txt into the batch directory.
  3. Upload the metadata.json created in Put the rules together into the batch directory. (If you prefer to script steps 1 through 3, see the sketch after this list.)
  4. Run the extraction application with the following parameters:
    • Source directory - /GOMDADemo/tools/input_batches
    • Output path - /GOMDADemo/tools/output/console/extract_out
    Keep all the other defaults and select Run.
  5. To see the results, go to the output location /GOMDADemo/tools/output/console/extract_out and click on batch_tools.csv.
    • Select the Sheet radio button and select Comma separated file for the reader.
    • Verify that the values for the primary timestamp (LogDateTime), the normalized timestamp (LogDateTimeNormalized), and the record (text) are correct. The rest of the fields are specific to the log type selected and the metadata passed in, such as application. Refer back to Figure 9 for the result.
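
If you prefer to script steps 1 through 3 rather than use the console upload, the standard Hadoop FileSystem API can create the batch directory and copy the files. This is a minimal sketch, assuming the BigInsights Hadoop client libraries and cluster configuration are on the classpath; the class name is my own:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class UploadBatch {
    public static void main(String[] args) throws Exception {
        // The default file system is taken from the Hadoop configuration
        // files found on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path batch = new Path("/GOMDADemo/tools/input_batches/batch");
        fs.mkdirs(batch);                                        // step 1
        fs.copyFromLocalFile(new Path("log.txt"), batch);        // step 2
        fs.copyFromLocalFile(new Path("metadata.json"), batch);  // step 3
        fs.close();
    }
}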

Tips on iterative testing and troubleshooting using BigInsights Console

Keep these tips handy when you try this with new logs:

  • Errors will be logged in the __temp directory under your output path (for this exercise, /GOMDADemo/tools/output/console/extract_out/__temp).
  • To fix any issues with rules, simply change the values in metadata.json and re-run the extraction app.

Understanding the wiring under the hood

The extraction app contains a custom JAQL module. The JAQL module in turn performs record splitting, field extraction using the rules written in AQL, and timestamp normalization, among other things.
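
To make the record-splitting idea concrete, the sketch below shows the general approach in plain Java: every line that begins with a match of the pre-timestamp expression starts a new record, and everything up to the next such line belongs to the current record. This is conceptual only, not the accelerator's implementation, and it expects the single-backslash form of the expression without the newline prefix:

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RecordSplitSketch {
    public static List<String> split(String logText, String preTimestampRegex) {
        // Anchor the boundary expression at the start of a line
        Pattern boundary =
            Pattern.compile("^" + preTimestampRegex, Pattern.MULTILINE);
        Matcher m = boundary.matcher(logText);
        List<String> records = new ArrayList<>();
        int start = -1;
        while (m.find()) {
            if (start >= 0) {
                // Close the current record at the next boundary
                records.add(logText.substring(start, m.start()).trim());
            }
            start = m.start(); // any text before the first boundary is ignored
        }
        if (start >= 0) {
            records.add(logText.substring(start).trim());
        }
        return records;
    }
}

In the real flow, the primary timestamp itself is also part of the boundary decision; the JAQL module and the AQL rules handle that, along with field extraction, for you.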

The BigInsights text analytics tools allow you to test the AQL field extraction rules on one record at a time. You used this in Peek into the tooling in Part 2: Speeding up analysis of new log types.

Running the extraction app from the BigInsights Console web tooling, or from Eclipse using a JAQL configuration, allows you to test the entire flow of the extraction app. The extraction application performs record splitting, field extraction, primary timestamp normalization, and metadata synthesis, which are the building blocks for further analysis.

You are now ready to use the extraction app installed with the Machine Data Accelerator on bigger data on the cluster.


Running on bigger data

If you tested locally in Eclipse, first copy the metadata over to the cluster as described in Test rules on small data using the BigInsights Console. Then, to perform analysis on bigger data, add bigger data rather than small data to your batch and use the extraction, index, and other applications already installed with IBM Accelerator for Machine Data Analytics on your BigInsights cluster. Refer to Part 3: Speeding up machine data searching.

To run analysis regularly as more data of the same type becomes available, create a template with this metadata and use it to run MDAGenerateMeta.sh. To get more information on MDAGenerateMeta.sh refer to A little preparation before getting started in Part 1.


Summary

In this tutorial you've learned how you can use tooling to prepare a web access type of log for analysis. You learned about the choices available to perform the preparation and iterative testing, using Eclipse locally or using the tooling on the BigInsights console.

At the Sample Outdoors Company, along with the out-of-the-box log types, numerous new log types were analyzed using the procedures above to speed up preparation of the data for analysis. In addition, the above tooling was used when they created their own log types as shown in Part 2: Speeding up analysis of new log types.


Acknowledgment

Thanks to Thomas Friedrich and Laura Chiticariu for their technical reviews. Thanks to Robin Noble-Thomas for her help on BigInsights tooling and Marcel Kutsch for his help with JAQL. Thanks to all the Machine Data Accelerator team members who contributed to this feature.


Download

Description: Data files for this tutorial
Name: data.zip
Size: 2.17KB

