IBM Accelerator for Machine Data Analytics, Part 3: Speeding up machine data searching

Machine logs from diverse sources are generated in an enterprise in voluminous quantities. IBM® Accelerator for Machine Data Analytics simplifies the implementation work required to analyze semi-structured, unstructured, or structured textual data, so that analysis is accelerated.


Sonali Surange (ssurange@us.ibm.com), Software Architect, IBM

Sonali Surange is a software architect working on IBM big data products and technologies. She holds numerous patents, has published 15 articles on IBM developerWorks, and has presented at many technical conferences. Sonali has received the IBM Outstanding Technical Achievement Award and the Women of Color STEM Technical All Star Award, and was recognized as an IBM developerWorks Professional Author in 2012.


developerWorks Professional author level

31 January 2013

Also available in Chinese and Japanese

Before you start

About this series

One of the primary strengths of IBM Accelerator for Machine Data Analytics is the ease with which the tool can be configured and customized. This series of articles and tutorials is for those who want an introduction to the accelerator and who want to further accelerate the analysis of machine data to gain custom insights.

About this tutorial

In Part 1 of this series, you looked at some well-known logs and some lesser-known logs. In Part 2 of this series, you created a new log type to analyze a new kind of data. In this tutorial, you will see how the new email log type plugs and plays just like the out-of-the-box and generic types. You will also get a consolidated view of all these logs and the ability to search across them.

If new log types are not of interest, you will learn how the out-of-the-box and generic types can be used for searching.

Objectives

In this tutorial, you will learn how to do the following.

  1. Use the out-of-the-box log types in indexing and searching.
  2. Plug and play customized log types in indexing and searching.
  3. Observe how facets are automatically discovered for out-of-the-box and customized log types.
  4. Configure indexing and searching to match the use case.

You will also learn how to use the Application chains shipped with the accelerator.

Prerequisites

Read Part 1: Speeding up machine data analysis of this series to get an overview of the IBM Accelerator for Machine Data Analytics. Optionally complete Part 2: Speeding up analysis of new log types of this series, if you would like to learn how to customize the accelerator for new log types.

System requirements

To run the examples in this tutorial, you need the following.

  1. BigInsights v2.0 installed.
  2. IBM Accelerator for Machine Data Analytics installed.
  3. A data set for machine data analysis. Refer to the Download section for the link to download the data.

Consolidated viewing and searching across all logs

Machine data comes in all shapes and sizes. Some data follows known structures or formats, while other data has completely custom formats. Some data is semi-structured or unstructured, while other data is structured.

Bringing all types of data together in a consolidated view for searching provides significant benefit in any type of analysis. While some machine logs can provide information about application behavior, combining them with unstructured information such as emails can help provide actionable analysis. Combining this further with structured information from configuration files or reports from external systems leads to a searchable gold mine of information.

In Part 1: Speeding up machine data analysis of this series, you saw the variety in machine data across application layers. In Part 2: Speeding up analysis of new log types of this series, you saw how external information like emails can be easily added for analysis.

In this tutorial, you will bring all this data together in a searchable repository.


The situation at a fictitious Sample Outdoors company

The Sample Outdoors company wanted to get a consolidated view of all their log data. In addition, they wanted to start creating their searchable gold mine of information by adding email data to it. The next mission for the Sample Outdoors company was to bring all of this information together and search through it.


Ten features to search any machine data

Take a look at the following overview and highlights of the features of IBM Accelerator for Machine Data Analytics that can be used to search any machine data.

  1. Bring in and extract the data using the Import-Extract chain.
  2. Create a consolidated searchable repository of all logs.
  3. Add custom log types to the repository and plug and play new log types in searching.
  4. Prepare for search, observe automatic discovery of facets.
  5. Observe the chronological view of events including emails.
  6. Perform the search.
  7. Show only the facets that are meaningful to the use case. Learn how to accomplish this in the Configure the user interface for your use case section.
  8. Add any missing fields from custom log types in the Configure the index to add fields from custom log types section.
  9. Finally, learn how to optimize the index size by creating only the necessary facets in the Optimize! Burn configuration into the index section.
  10. View the resulting Search interface.

At the Sample Outdoors company

The data scientists at the Sample Outdoors company had the following machine data from the customer.

  • CustomerFrontEnd application – An Apache web access-based application
  • CustomerBackend application – An IBM WebSphere server-based application
  • CustomerDbApp application – An Oracle database application

They also had emails sent to customersupport@sampleoutdoors.com and websupport@sampleoutdoors.com.

They wanted to bring all this information together to start building a consolidated searchable repository.


Bring in and extract the data

In this section, you will bring in the machine data from the application stack into the repository. Refer to Part 1: Speeding up machine data analysis for more information on the applications and their logs.

Prepared batches of logs from the application stack are provided in the Download section.

Perform the following steps.

  1. From the Download section, download data_and_config.zip and unzip it.
  2. Copy data/input_batches to a machine on your BigInsights cluster. For this tutorial, you will use the location /opt/ibm/input_batches; you can always change it to another preferred location. (A command-line sketch of this copy follows these steps.) Notice the directory structure containing the batches. The input_batches directory contains the following three batches, which represent the three layers of the application stack.
    • batch_webaccess – Contains logs from the web access layer.
    • batch_was – Contains logs from the WebSphere application layer.
    • batch_oradb – Contains logs from the Oracle database layer.
  3. You will use the Import-Extract application chain to perform the Import and Extraction steps in one shot. Since the Import application uses the Distributed Copy application, first ensure that the Distributed Copy application is deployed.
    • From the BigInsights console, click the Applications tab and select the Manage link.
    • In the edit box, type Distributed. You will notice the Distributed File Copy application is found and listed under Applications.
    • If the application status is NOT_DEPLOYED, then click the Deploy button as shown in Figure 1.
      Figure 1. Deploy Distributed Copy application
  4. You are now ready to use the Import-Extract application chain to run the Import and Extraction applications in one shot. From the BigInsights console, click the Applications tab and select the Tree view icon.
  5. Select the Import-Extract application and provide the following inputs and outputs. You can choose the ftp or sftp protocol for file transfer; the following steps use sftp.
    • Import input path – sftp://<server>/opt/ibm/input_batches.
    • Import output path – /GOMDADemo/search/input_batches.
    • Credentials file – When using ftp, keep the default value NOT_SET.

      When using sftp, create a file containing the contents shown in Listing 1 and save it in HDFS at /user/biadmin/credstore/public/<filename>.

      Listing 1. Credentials store file
      password=your_sftp_userpassword
      username=your_sftp_userid

      Provide the location of this file, /user/biadmin/credstore/public/<filename>, as the Credentials file.

    • Extract output path – /GOMDADemo/output/extract_out.
    • Extract configuration file path – Keep the default value /accelerators/MDA/extract_config/extract.config.
  6. If you have already completed Part 2: Speeding up analysis of new log types of this series, you will have a directory batch_inbox at the location pointed to by the output path - /GOMDADemo/output/extract_out. Running the previous steps will incrementally add the new batches, batch_webaccess, batch_was and batch_oradb to the same location. Figure 2 shows the results of successful completion of the Import Extract chain.
    Figure 2. Run the Import Extract chain
    If you have not completed Part 2: Speeding up analysis of new log types, you will not see the batch_inbox directories in your result. You will be adding that information in the Plug and Play new log types in searching section of this tutorial.
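If you prefer the command line for staging the data, the following sketch shows roughly equivalent commands for step 2 (copying the batches onto the cluster) and for placing the sftp credentials file from Listing 1 into HDFS. The server name, the local download path, and the credentials file name sftp_creds.properties are placeholders for illustration; adjust them to your environment.

# Copy the prepared batches from the unzipped download to a node in the BigInsights cluster.
scp -r data_and_config/data/input_batches biadmin@<server>:/opt/ibm/input_batches

# On the cluster node, confirm that the three batch directories are present.
ls /opt/ibm/input_batches     # expect batch_oradb, batch_was, batch_webaccess

# Create the sftp credentials file (contents from Listing 1) and place it in HDFS.
cat > sftp_creds.properties <<EOF
password=your_sftp_userpassword
username=your_sftp_userid
EOF
hadoop fs -mkdir /user/biadmin/credstore/public
hadoop fs -put sftp_creds.properties /user/biadmin/credstore/public/sftp_creds.properties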

Consolidated searchable repository of all logs

Next, you will index the data into a consolidated searchable repository.

Perform the following steps.

  1. From the BigInsights console, click the Applications tab and select the Tree view icon.
  2. Expand the Machine Data Analytics: Search folder and select the Index application. Provide the following inputs and outputs.
    • Source directory: /GOMDADemo/output/extract_out.
    • Output path: /GOMDADemo/output/index_out.
  3. Browse to the location pointed to by the output path and click batch_list.json. You will notice batch_webaccess, batch_was, and batch_oradb in the list. This confirms that you now have a searchable repository of indexed data that contains log information from the Web Access, WebSphere, and Oracle database application layers. (You can also check this from the command line, as sketched after these steps.)
  4. If you have already completed Part 2: Speeding up analysis of new log types of this series, you will also have batch_inbox in the list. You now have email data in this searchable repository!
  5. If you have not completed Part 2: Speeding up analysis of new log types, you will not see the batch_inbox listed in the result. You will be adding that information in the Plug and Play new log types in searching section of this tutorial.

    Figure 3 shows the results of successful execution of the Index application.

    Figure 3. Run the Index application

    If you already successfully added email data in the previous step, or if working with a new custom type is not of interest for your use case, you can skip the Plug and Play new log types in searching section of this tutorial and jump to the Search! section.
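If you want to verify the repository from the command line rather than from the Files tab, a quick check such as the following, using the paths from the steps above, lists the indexed batches. The exact layout of batch_list.json is whatever the Index application writes; the commands below simply display it.

# List the index output directory and print the batch list.
hadoop fs -ls /GOMDADemo/output/index_out
hadoop fs -cat /GOMDADemo/output/index_out/batch_list.json
# Expect batch_webaccess, batch_was, and batch_oradb in the output
# (plus batch_inbox, if you completed Part 2).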


Plug and play new log types in searching

You can add new data to the repository at any time! In this tutorial, you will use the pre-extracted email data that is provided in the Download section.

Perform the following steps.

  1. In HDFS, click the Create Directory icon under the Files tab to create a directory batch_inbox under /GOMDADemo/output/extract_out.
  2. Copy the contents of the downloaded data_and_config/data/extract_out and data/extract_out/batch_inbox over to /GOMDADemo/output/extract_out and /GOMDADemo/output/extract_out/batch_inbox respectively in HDFS. Now you are ready to index the email data. (A command-line sketch of steps 1 and 2 follows this list.)
  3. From the BigInsights console, click the Applications tab and then click the Tree view icon. Expand the Machine Data Analytics: Search folder and select the Index application. Provide the following inputs and outputs.
    • Source directory - /GOMDADemo/output/extract_out.
    • Output path - /GOMDADemo/output/index_out.

    This will incrementally add the new email batch to the index repository already containing data from the Web Access, WebSphere and Oracle database layers.

  4. Browse to the location pointed to by the output path and click batch_list.json. batch_list.json contains the list of all the batches that have been successfully indexed and are available in the searchable repository.
  5. You will notice batch_inbox along with batch_webaccess, batch_was, batch_oradb in the list. This confirms that you now have a searchable repository of indexed data that contains log information from the Web Access, WebSphere, Oracle database application layers and the email data.
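As an alternative to steps 1 and 2 above, you can stage the pre-extracted email batch from the command line. The sketch below assumes the unzipped data_and_config directory is available on a node of the cluster; adjust the local paths to your environment.

# Create the batch directory in HDFS under the extraction output location.
hadoop fs -mkdir /GOMDADemo/output/extract_out/batch_inbox

# Copy the pre-extracted email data from the download into the new batch directory.
hadoop fs -put data_and_config/data/extract_out/batch_inbox/* /GOMDADemo/output/extract_out/batch_inbox/

# If the download also contains files directly under data/extract_out (outside batch_inbox),
# copy those to /GOMDADemo/output/extract_out as well, per step 2.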

Prepare for search, observe automatic discovery of facets

Copy the index from HDFS to the machine running the console so that it can be available for searching.

Perform the following steps.

  1. Start a command line session to the machine running the console.
  2. Log in as the BigInsights administrator user (the default is biadmin).
  3. Go to the bin directory under your accelerator install location.
  4. Run the command shown in Listing 2.
    Listing 2. Run copyIndex utility
    [user@server bin]$ ./copyIndex.sh -hdfsIndexDir=hdfs://bdvm235.svl.ibm.com:9000/GOMDADemo/output/index_out
    INDEX_DIR = /opt/ibm/accelerators/MDA/mda_indexes
    copying indexes from hdfs.
    Indexes successfully copied to local file system.
    MDA UI can be accessed at for secure install 'http://<hostname>:8080/datasearch/login.jsp'.
    MDA UI can be accessed at for non-secure install 'http://<hostname>:8080/datasearch/html/Search.html'.
  5. You are now ready to search. Open a browser instance and use the appropriate URL as described in the output of the copyIndex.sh utility.
  6. The search interface shows a time graph with a high-level view of the timeframe when all the events occurred. Hovering over each of the bars in the graph shows the number of events in that timeframe. You can shrink the data down to a timeframe of interest by clicking the bar representing that timeframe. You will be doing this in the next step.

    Figure 4 shows the time graph.

    Figure 4. High-level time graph
  7. The facets on the left are discovered based on the extracted fields across all the data. Note the facets From and To from the email log type.

    Figure 5 shows facets across all of the data.

    Figure 5. Facets across all data

    The data scientists at the Sample Outdoors company noticed several facets that they would not use in their use case.

    You will learn how the user interface can be configured for your use case in the Configure the user interface for your use case section of this tutorial.

  8. The search results list events across all of the data. Figure 6 shows the results.
    Figure 6. Results listing all events

    Next, you will experience the shrink down and drill down into the data.


Chronological view of events including emails

You will now zoom in and view events and emails occurring on Sat July 14th.

Perform the following steps.

  1. From the browser instance pointing to the search interface, click the tallest bar on the graph, representing the time frame 15:00 to 15:59 hours, and then drill into the time frame representing 3:58:00 to 3:58:59 P.M., as shown in Figure 7.
    Figure 7. Shrink down to July 14th between 3 and 4 pm
  2. Observe the events in the results, as shown in Figure 8.
    Figure 8. Events including emails

Next, experience how you can combine text search, faceted search, and time-based searches across your data by answering some simple questions about it.

Perform the following steps.

  1. Find out how many emails customersupport@sampleoutdoors.com received, when they received them, and from whom.
    • Undo the time range filter by clicking on the X icon next to Filtered by, as shown in Figure 9.
      Figure 9. Undo time range filter
    • Expand the To facet and click on customersupport@sampleoutdoors.com, as shown in Figure 10 to see how many emails were received, when they were received, and by whom they were received.
      Figure 10. Email information
  2. Find out which events indicate an error.
    • Undo the CodesAndValues filter by clicking the X icon next to Filter by, as shown in Figure 11.
      Figure 11. Undo CodesAndValues filter
    • In the Search text box, type *error*, as shown in Figure 12.
      Figure 12. Which events indicate “error”?
  3. Find out which events belong only to the Web Access or email batches.
    • Change the text search to contain batchId:batch_webaccess or batchId:batch_inbox, as shown in Figure 13.
      Figure 13. Which events belong only to the Web Access or email batches?

      You have just looked at a few simple possibilities. The time range, text and faceted search can be used in any combination to build complex searches.

  4. To find out which emails are sent to customersupport@sampleoutdoors or websupport@sampleoutdoors, you will use text search, as you did in the previous step.

    Before you can exercise this feature for facets from a custom log type such as email, you have to make small changes to the configuration. You will do this in the Configure the index to add fields from custom log types section of this tutorial.


Configure the user interface for your use case

The data scientists at the Sample Outdoors company wanted to configure the user interface for their use case. They wanted to keep the facets that were meaningful and remove the rest. They also wanted to configure the results view so that some of the default facets shown in context were removed, while others were added.

Perform the following steps.

  1. The downloaded index.config can be found at data_and_config/config/index.config.
  2. Replace the index.config in the <accelerator_install_location>/mda_indexes/index directory with the downloaded one. (A command-line sketch of this replacement and the console restart follows these steps.)
  3. The downloaded index.config has the following modifications.
    • The showFacet field has been set to false for several facets. This hides the facets in the user interface.
    • The showInResult field has been set to false for several facets. This removes those facets from the display in the context of the results.
    • The showInResult field has been set to true for some facets. This adds those facets to the display in the context of the results.
    • The createFacet field has been set to false for all the fields where showFacet is set to false. You will use this later in the Optimize! Burn configuration into the index section of the tutorial.
  4. Restart the console as follows.
    • Start a command line session to the machine running the console.
    • Log in as the BigInsights administrator user (the default is biadmin).
    • First run $BIGINSIGHTS_HOME/bin/stop.sh console, then run $BIGINSIGHTS_HOME/bin/start.sh console.
  5. Refresh the browser instance pointing to the user interface.
  6. Observe the changes. Numerous facets that were not applicable to the use case are no longer shown, and only the applicable facets are shown in the context of the results pane.
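The file replacement and console restart described above can also be done from a shell on the console machine. The sketch below assumes the accelerator install location shown earlier by copyIndex.sh (/opt/ibm/accelerators/MDA) and a placeholder path for the unzipped download; substitute your own locations.

# Back up the current index.config and replace it with the downloaded one.
cd /opt/ibm/accelerators/MDA/mda_indexes/index
cp index.config index.config.bak
cp /path/to/data_and_config/config/index.config .

# Restart the BigInsights console so the user interface picks up the new configuration.
$BIGINSIGHTS_HOME/bin/stop.sh console
$BIGINSIGHTS_HOME/bin/start.sh console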

The data scientists at the Sample Outdoors company made the configured search interface available to their teams to use. Once the teams are successfully able to accomplish their tasks, the data scientists will burn the configuration into the index.


Configure the index to add fields from custom log types

Add the fields from the email log type to the index so that you can use these facets in text search.

Perform the following steps.

  1. Locate the downloaded jsonFacetLogAnalysisIndexSchema.xml file, which can be found at data_and_config/config/jsonFacetLogAnalysisIndexSchema.xml.
  2. Replace /accelerators/MDA/index_config/jsonFacetLogAnalysisIndexSchema.xml in HDFS with the downloaded one. You can use the Delete and Upload buttons on the Files tab to perform this replacement, or use the command-line sketch at the end of this section.
  3. Open the downloaded version and notice the email-specific fields that have been added toward the end of the file.

This configuration will be used by the Index job in the next step.
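If you prefer the command line to the console's Delete and Upload buttons in step 2, the replacement can be sketched as follows, assuming the unzipped download is available on a machine with HDFS access.

# Remove the shipped schema file from HDFS and upload the modified one from the download.
hadoop fs -rm /accelerators/MDA/index_config/jsonFacetLogAnalysisIndexSchema.xml
hadoop fs -put data_and_config/config/jsonFacetLogAnalysisIndexSchema.xml /accelerators/MDA/index_config/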


Optimize! Burn configuration into the index

The data scientists at the Sample Outdoors company had configured the user interface to remove certain facets. Their teams had worked with this configured user interface for a few days and were able to accomplish their search needs. The data scientists now wanted to burn this configuration into the index. Removing the facets that are not required for their use case from the index would reduce the size of the resulting index and improve search times.

Perform the following steps.

  1. In HDFS, click the Create Directory icon under the Files tab in the console to create a config directory under /GOMDADemo.
  2. Click the Upload icon under the Files tab in the console to upload the previously downloaded index.config (from the Configure the user interface for your use case section) to /GOMDADemo/config. (You can also do steps 1 and 2 from the command line, as sketched after these steps.)
    • Notice that the configuration also has a field createFacet set to false for all the fields where the showFacet was set to false. The Index application will not create facets for these fields.
  3. Re-run the Index application, making the following changes.
    • Click on the previous Index application execution.
    • Keep the entries for Source directory and Output path the same as before.
    • Point the Configuration file to /GOMDADemo/config/index.config.
    • Select the Re-create Indexes check box to delete the existing index and create a new one.
  4. Re-run the copyIndex.sh utility, this time indicating Y to overwrite the existing index.
  5. Refresh the browser instance to point to the optimized index.
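A command-line sketch of steps 1, 2, and 4, using the same paths as earlier in this tutorial, is shown below. The HDFS namenode host and port are placeholders; use the values for your cluster, as in Listing 2.

# Create the configuration directory in HDFS and upload the modified index.config.
hadoop fs -mkdir /GOMDADemo/config
hadoop fs -put data_and_config/config/index.config /GOMDADemo/config/index.config

# After re-running the Index application, copy the rebuilt index to the console machine.
# Answer Y when prompted to overwrite the existing local index.
cd <accelerator_install_location>/bin
./copyIndex.sh -hdfsIndexDir=hdfs://<namenode_host>:9000/GOMDADemo/output/index_out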

View the resulting Search interface

The data scientists were able to view the To and From values of emails in context with the results pane, as shown in Figure 14.

Figure 14. View fields from custom log types in context with the result

Finally, they were also able to search using the facets from custom log types. They provided the string To:websupport or To:customersupport in the Search text box and viewed the result, as shown in Figure 15.

Figure 15. Search using facets from custom log types

The data scientists at the Sample Outdoors company made the configurations available to be deployed for production.

More groups within the Sample Outdoors company added their machine data to the common searchable repository. Every day, new machine data was added and new insights were gleaned.


Conclusion

With the help of a situation at the fictitious Sample Outdoors company, you looked at how you can use the IBM Accelerator for Machine Data Analytics to create a consolidated repository of various logs and search across them. You saw how customized log types can plug and play with out-of-the-box log types, and how facets are discovered for the logs. Finally, you saw how you can tune the configurations to create an optimal solution for your use case.

Acknowledgements

Thanks to Tom Chen (thchen@ca.ibm.com) and Amit Rai (amitrai4@in.ibm.com) for their technical reviews on this article, and to all the Machine Data Accelerator team members who contributed to this feature.


Download

Description: Code sample
Name: data_and_config.zip
Size: ---

