Before you start
This is a step-by-step guide on how to collect and analyze the logs produced by InfoSphere BigInsights. InfoSphere BigInsights 2.1 expands the data sources for the IBM Accelerator for Machine Data Analytics to support InfoSphere BigInsights and Hadoop logs, so you can integrate and analyze Hadoop logs.
This new log monitoring and analysis function is designed to expedite the troubleshooting of InfoSphere BigInsights applications. In previous releases of InfoSphere BigInsights, the troubleshooting process was tedious and time-consuming. It required manual aggregation of the log files and a manual file search within those log files. In addition, the log records were available from the InfoSphere BigInsights web console only for a finite period of time.
With InfoSphere BigInsights 2.1 and the IBM Accelerator for Machine Data Analytics, all of the Hadoop logs in InfoSphere BigInsights can be aggregated into a single location. The logs can be indexed and accessed through a faceted search in the InfoSphere BigInsights dashboard. Outdated logs are no longer deleted automatically from the index.
This article shows how to use the new InfoSphere BigInsights log monitoring features (referred to here as the log monitoring application), along with the IBM Accelerator for Machine Data Analytics to troubleshoot InfoSphere BigInsights applications. In this tutorial, learn how to:
- Start your collection of InfoSphere BigInsights logs.
- Index the collected logs and analyze them in the InfoSphere BigInsights dashboard.
- Verify the collection of logs in HBase.
You should be familiar with the set of applications available in the IBM Accelerator for Machine Data Analytics. Some familiarity with InfoSphere BigInsights text analytics tools is a plus, but not required. Read Part 1 of this series to get an overview of the IBM Accelerator for Machine Data Analytics (see Resources).
To run the examples in this tutorial, you need to:
- Install InfoSphere BigInsights 2.1
- Install the IBM Accelerator for Machine Data Analytics
The situation at a fictitious company, Sample Outdoors Company
At the Sample Outdoors Company, the IBM Accelerator for Machine Data Analytics has been in production for a few months. The Sample Outdoors Company has many multi-tier applications, and many of those applications are InfoSphere BigInsights applications. With InfoSphere BigInsights Version 2, the Sample Outdoors Company often discovered that logs for failed applications had been deleted long before they were needed for troubleshooting. Even when the log files were available, wading through the various logs to find the root cause of an application failure was tedious.
The Sample Outdoors Company has now upgraded to InfoSphere BigInsights 2.1 and has configured the BigInsights log monitoring application for collecting and analyzing InfoSphere BigInsights Hadoop logs. This application gives them better control over the log retention policies and makes troubleshooting errors much faster, since it points directly to the cause of the error, in most cases.
Using this example from the Sample Outdoors Company, this tutorial gives an overview of the log monitoring application in InfoSphere BigInsights and shows how to configure it to speed up troubleshooting.
Overview of the InfoSphere BigInsights log monitoring application
InfoSphere BigInsights 2.1 includes a toolset that enables the streaming of Hadoop logs in InfoSphere BigInsights to HBase. The IBM Accelerator for Machine Data Analytics analyzes the set of logs and displays the results in the InfoSphere BigInsights dashboard. Figure 1 shows an overview of the workflow for collecting and analyzing the InfoSphere BigInsights Hadoop logs.
Figure 1. Overview of the InfoSphere BigInsights log collection and analysis workflow
As shown, using the built-in monitoring agents, the Hadoop log files are configured to stream to HBase. This tutorial shows you how to start the monitoring agents and enable the streaming of the logs to HBase. It also describes how to verify that the Hadoop logs are being streamed to HBase.
After the Hadoop logs are properly streaming to HBase, the extraction and index applications can be run to analyze the logs collected in HBase. The components of the IBM Accelerator for Machine Data Analytics used for the analysis of the InfoSphere BigInsights logs are shown in the InfoSphere BigInsights Log Analysis box in Figure 1. This analysis happens one batch at a time and these applications can be scheduled for continuous monitoring of the InfoSphere BigInsights Hadoop operation.
After the InfoSphere BigInsights logs have been analyzed, the results of this analysis are displayed on the InfoSphere BigInsights dashboard, as shown in Figure 1. This tutorial shows how to configure the streaming of the InfoSphere BigInsights Hadoop logs to HBase and how to configure the analysis of the InfoSphere BigInsights Hadoop logs through the components of the InfoSphere BigInsights log monitoring application shown in Figure 1.
Collect and analyze InfoSphere BigInsights logs — a step-by-step guide
To collect and analyze InfoSphere BigInsights Hadoop logs:
- Enable and start the InfoSphere BigInsights log monitoring application.
- Run the InfoSphere BigInsights log monitoring application.
- Schedule the analysis of the InfoSphere BigInsights log monitoring application.
- Search through the results of the InfoSphere BigInsights log monitoring application.
- Verify the logs collected in the HBase table.
Step 1. Enable and start the InfoSphere BigInsights log monitoring application
Use the following steps to enable and verify the log streaming of InfoSphere BigInsights logs to HBase:
- Open the $BIGINSIGHTS_HOME/hdm/hadoop-conf-staging/hadoop-env.sh file located on the console node.
- Modify the line containing export HADOOP_STREAMING_LOGS=false and set it to export HADOOP_STREAMING_LOGS=true.
- Run the $BIGINSIGHTS_HOME/bin/syncconf.sh script to ensure that the changes in the hadoop-env.sh file are propagated throughout the InfoSphere BigInsights cluster.
- Restart the cluster by first issuing a stop command.
- Then issue a start command.
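The edit to hadoop-env.sh can also be scripted. The following Python sketch rewrites the HADOOP_STREAMING_LOGS line in the text of the file. The function name is hypothetical, the JAVA_HOME line in the sample is made up, and the sketch assumes that true is the value that enables streaming. It only transforms text, so running syncconf.sh and restarting the cluster remain separate steps.

```python
import re

# Matches a line such as "export HADOOP_STREAMING_LOGS=false".
_FLAG_LINE = re.compile(r"^([ \t]*export[ \t]+HADOOP_STREAMING_LOGS=).*$",
                        re.MULTILINE)


def enable_streaming_logs(env_text: str) -> str:
    """Return the hadoop-env.sh text with HADOOP_STREAMING_LOGS set to true.

    Assumption: the flag sits on its own 'export' line; all other lines
    are returned unchanged.
    """
    new_text, count = _FLAG_LINE.subn(r"\g<1>true", env_text)
    if count == 0:
        raise ValueError("HADOOP_STREAMING_LOGS line not found")
    return new_text


# Hypothetical file content used only for illustration.
sample = "export JAVA_HOME=/opt/java\nexport HADOOP_STREAMING_LOGS=false\n"
print(enable_streaming_logs(sample))
```

Raising an error when the line is absent avoids silently reporting success on a file that was never changed.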
This procedure enables the collection of the InfoSphere BigInsights logs into HBase. These steps are required to start the log collection, but they have to be done only once. After this process, you can turn the log collection on and off without modifying the hadoop-env.sh file. To turn on the log collection, follow these steps:
- From the InfoSphere BigInsights console, go to the Applications tab and select Manage.
- In the search bar, type "LogCollection" to find the log collection application.
- Select and deploy the log collection application, as shown in Figure 2.
Figure 2. Select and deploy the log collection application
After the InfoSphere BigInsights log collection application has been deployed (following the above instructions), it can be turned on and off. To turn on the log collection, first turn on the monitoring for the InfoSphere BigInsights cluster using these steps:
- From the InfoSphere BigInsights console, click the Cluster Status tab.
- Click Monitoring (which should have a red X next to it with the status Unavailable, as shown in Figure 3).
- Click Start on the Monitoring Summary window, as shown in Figure 3.
- Wait until you see the green checkmark next to the Monitoring node with the status Running (instead of the red X as in step 2).
Figure 3. Enabling InfoSphere BigInsights monitoring
After the monitoring is enabled, start the log collection of the InfoSphere BigInsights Hadoop logs. To start the log collection, follow these steps:
- From the InfoSphere BigInsights console, click the Applications tab.
- Click Run, as shown in Figure 4.
- Type LogCollection in the search box to find the log collection application.
- Check the Enable Log Collection box in the log collection application, as shown in Figure 4.
- Run the application (should complete in about 40 seconds).
Figure 4. Starting the log collection
To stop the log collection at any point, simply uncheck the Enable Log Collection check box shown in Figure 4 and run the log collection application. The log collection application is stopped if you stop the monitoring from the Cluster Status tab of the InfoSphere BigInsights console (see Figure 3).
At this point, the set of InfoSphere BigInsights logs should be streaming to HBase. Suppose that the Sample Outdoors Company discovers that the IBM Accelerator for Machine Data Analytics extraction application is failing and wants to see the logs for that application. The Sample Outdoors Company finds a typo in the input field for the extraction application as shown in Figure 5.
Figure 5. Using the extraction application
As Figure 5 shows, the Sample Outdoors Company has forgotten to place the forward slash (/) at the beginning of the source directory in the extraction application. To replicate this issue, run the extraction application with the exact same parameters shown in Figure 5.
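A mistake like this one is easy to catch before an application is submitted. The following Python sketch checks that a DFS path is absolute. The function name is hypothetical, and the check is an illustration only, not part of the extraction application.

```python
def is_absolute_dfs_path(path: str) -> bool:
    """Return True if the DFS path starts with a forward slash."""
    return path.strip().startswith("/")


# The source directory with and without its leading slash.
for candidate in ("/user/biadmin/extract_in", "user/biadmin/extract_in"):
    status = "ok" if is_absolute_dfs_path(candidate) else "missing leading /"
    print(f"{candidate}: {status}")
```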
Step 2. Run the InfoSphere BigInsights log monitoring application
Step 1 enabled the collection of the InfoSphere BigInsights Hadoop logs. In Step 2, learn how to analyze the set of InfoSphere BigInsights logs collected in HBase.
The InfoSphere BigInsights log monitoring application is a BigInsights chained application made up of two separate applications: the log monitoring extraction application and the index application. Both applications are included with the IBM Accelerator for Machine Data Analytics.
The log monitoring extraction application analyzes the InfoSphere BigInsights Hadoop logs by extracting key features from the set of logs collected in the Hadoop HBase table. The index application works in two steps. First, it indexes the extracted data, generates a taxonomy, and puts the data and the taxonomy in a Distributed File System (DFS) folder chosen by the user. Then it copies the index to the local console node machine in the InfoSphere BigInsights cluster. This second step is necessary for the visualization of the logs. For the index application to copy the index from the DFS folder to the local console node, you have to provide a credentials store file for accessing the console node of the InfoSphere BigInsights cluster. A sample credentials store file is located on DFS; it is available only after the installation of the IBM Accelerator for Machine Data Analytics. The following is the content of this sample credentials file.
#BigInsights Credential Store file
#Contains the Console Node Host ID, the login Username/Password for the console node
password=BIpassword
username=BIusername
host=ConsoleNodeHostID
To modify this credentials store file, follow these steps:
- For host, enter the host name of the console node.
- For username, enter the InfoSphere BigInsights system administrator (by default, biadmin), which is a local user on the InfoSphere BigInsights console node machine.
- For password, enter the password of that InfoSphere BigInsights system administrator on the console node machine.
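Before uploading the edited file, it can help to confirm that none of the sample placeholders were left in place. The following Python sketch parses the key=value format of the sample file and reports the fields that still hold their placeholder values. The function names are hypothetical, and the sketch assumes a plain-text (unencrypted) password.

```python
# Placeholder values taken from the sample credentials store file.
PLACEHOLDERS = {
    "password": "BIpassword",
    "username": "BIusername",
    "host": "ConsoleNodeHostID",
}


def parse_credentials(text: str) -> dict:
    """Parse key=value lines, skipping blank lines and '#' comments."""
    creds = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"not a key=value line: {line!r}")
        creds[key.strip()] = value.strip()
    return creds


def unfilled_fields(creds: dict) -> list:
    """Return the fields that are missing or still hold a placeholder."""
    return sorted(key for key, placeholder in PLACEHOLDERS.items()
                  if creds.get(key, placeholder) == placeholder)


sample = """#BigInsights Credential Store file
password=BIpassword
username=biadmin
host=ConsoleNodeHostID
"""
print(unfilled_fields(parse_credentials(sample)))
```

An empty list means all three fields have been edited. It does not prove that the values are correct for your cluster.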
The password in the credentials file can be encrypted, and this file can also be modified using the credentials file generator. After the credentials file is ready, run the InfoSphere BigInsights log monitoring application:
- Go to the Applications tab of the InfoSphere BigInsights web console.
- Type BigInsights log monitoring in the search bar.
- Run the InfoSphere BigInsights log monitoring application according to the parameters shown in Figure 6.
Figure 6. Parameters for the InfoSphere BigInsights log monitoring application
Following are the parameters required:
- In the Extract Output Directory, provide the directory where you would like the extracted output of the log monitoring extraction application to be. Note that if this directory doesn't exist, it will be created.
- In the Days to Ingest field, type the initial number of days of logs you want to examine. The log monitoring extraction application saves a cursor file that keeps track of the time that the previous run took place. In Figure 6, the Days to Ingest parameter is set to one day. If the last InfoSphere BigInsights log monitoring application was run within the last day, this parameter is ignored and the logs will be fetched from the Hadoop table starting from the date given in the cursor file.
- In the Index Output Directory, type the DFS directory that will contain the indexed results. Note that if this directory doesn't exist, it will be created.
- In the Credentials File Path, give the DFS path to the credentials file created at the beginning of this section.
- (Optional) Uncheck the check boxes for the log types you don't want analyzed, as shown in Figure 6. Use this if there are too many logs (or if the logs are too large) to analyze and index within a given timeframe.
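The interaction between the Days to Ingest parameter and the cursor file can be illustrated with a short Python sketch. This models only the behavior described above and is not the application's actual code. The function name and dates are hypothetical, and the fallback to the Days to Ingest window when the cursor is older than that window is an assumption.

```python
from datetime import datetime, timedelta
from typing import Optional


def ingest_start(now: datetime, days_to_ingest: int,
                 cursor: Optional[datetime]) -> datetime:
    """Pick the time from which logs are fetched.

    If the previous run (the cursor) falls inside the Days to Ingest
    window, the parameter is ignored and the cursor is used. Otherwise
    (assumption) the window start is used.
    """
    window_start = now - timedelta(days=days_to_ingest)
    if cursor is not None and cursor >= window_start:
        return cursor
    return window_start


now = datetime(2013, 6, 2, 12, 0)  # hypothetical current time
print(ingest_start(now, 1, datetime(2013, 6, 2, 10, 0)))  # recent cursor
print(ingest_start(now, 1, None))                          # first run
```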
The logs with the most volume are generally the name node logs and data node logs. The next step is to schedule the InfoSphere BigInsights log monitoring application for continuous analysis of the BigInsights logs.
Step 3. Schedule the analysis of the InfoSphere BigInsights logs
The InfoSphere BigInsights log monitoring application can be run under a schedule to analyze the BigInsights logs in batch mode. Follow these steps to schedule the InfoSphere BigInsights log monitoring application:
- Check the Schedule Job check box in the application parameters, as shown in Figure 7.
- Fill in the Start Date field of the InfoSphere BigInsights log monitoring application run.
- Choose the end date by filling in the Until field, as shown in Figure 7. (The end date should be later than the start date).
- Specify the Frequency to run the InfoSphere BigInsights log monitoring application, as shown in Figure 7.
Figure 7. Schedule parameters of the InfoSphere BigInsights log monitoring application
We recommend scheduling consecutive runs of the InfoSphere BigInsights log monitoring application at two-hour intervals to keep the application runs from overlapping.
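To see what a given combination of Start Date, Until, and Frequency amounts to, you can list the run times it implies. The following Python sketch does this for a two-hour interval. It is an illustration only, not the scheduler's implementation. The function name and dates are hypothetical, and treating the Until time as inclusive is an assumption.

```python
from datetime import datetime, timedelta


def run_times(start: datetime, until: datetime, every: timedelta) -> list:
    """List the scheduled run times from start up to and including until."""
    if every <= timedelta(0):
        raise ValueError("the interval must be positive")
    times = []
    current = start
    while current <= until:
        times.append(current)
        current += every
    return times


# Hypothetical schedule: every two hours over a six-hour window.
for t in run_times(datetime(2013, 6, 1, 0, 0),
                   datetime(2013, 6, 1, 6, 0),
                   timedelta(hours=2)):
    print(t)
```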
If the size of the index on the distributed file system (DFS) starts getting too large, you can also periodically delete log records from the index in DFS:
- Go to the Applications tab in the InfoSphere BigInsights console.
- Type Index Management in the search box.
- In the Index Directory field, give the directory of your large index, as shown in Figure 8.
- In the Retention Time field, give the number of hours of log records you want to retain (the application will delete all log records older than the retention time specified).
- Check the Schedule Job check box.
- Provide the Start Date and time to run this application.
- Provide the end date and time by filling in the Until field.
- Provide the Frequency at which to run this application, as shown in Figure 8.
- Run the application.
Figure 8. Parameters of the index management application
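The effect of the Retention Time field can be pictured as a cutoff time: records older than the cutoff are deleted, and everything newer is kept. The following Python sketch models that rule on a list of timestamps. It is a simplified illustration with hypothetical names and dates, not the index management application's code.

```python
from datetime import datetime, timedelta


def records_to_delete(timestamps: list, now: datetime,
                      retention_hours: int) -> list:
    """Return the timestamps that are older than the retention time."""
    if retention_hours < 0:
        raise ValueError("the retention time must not be negative")
    cutoff = now - timedelta(hours=retention_hours)
    return [t for t in timestamps if t < cutoff]


now = datetime(2013, 6, 2, 12, 0)  # hypothetical current time
records = [datetime(2013, 6, 1, 11, 0), datetime(2013, 6, 1, 13, 0)]
print(records_to_delete(records, now, 24))
```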
Step 4. Search through the results of the InfoSphere BigInsights log collection and analysis
After running the InfoSphere BigInsights log monitoring application, the Sample Outdoors Company wants to drill down to the error that caused the crash. Follow these steps to see the results of the InfoSphere BigInsights log monitoring analysis:
- Go to the Dashboard tab of the InfoSphere BigInsights console.
- From the Select Dashboard drop-down menu, select System dashboard.
- Click on the Search tab.
At this point, you should see the results of the InfoSphere BigInsights log collection and analysis in a user interface similar to Figure 9.
Figure 9. IBM Accelerator for Machine Data Analytics search function containing analysis of InfoSphere BigInsights logs
Since all of the Hadoop logs collected and analyzed are in SYSLOG format, the Severity facet can filter all the logs with errors. These can be used for troubleshooting. As we see from Figure 9, there is a single log record with severity "Error." We can drill down on the ERROR facet by clicking on it. The log record containing the error is displayed with the stack trace. From the stack trace, we see the following statement:
java.lang.Exception: Number of exceptions  exceeded threshold , data context: "hdfs://hdtest261.svl.ibm.com:9000user/biadmin/extract_in"
That statement in the stack trace makes it obvious that the / is missing from the input path. Thus, the Sample Outdoors Company is able to resolve the issue that triggered the crash.
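The same diagnosis can be made programmatically. In the data context above, the path runs straight into the port number (9000user), which is the signature of a missing leading slash. The following Python sketch detects that pattern in an hdfs:// URI. The function name and the regular expression are illustrative assumptions, not part of the product.

```python
import re

# hdfs://<host>:<port><rest>; a well-formed <rest> is empty or starts with "/".
_HDFS_URI = re.compile(r"^hdfs://(?P<host>[^/:]+):(?P<port>\d+)(?P<rest>.*)$")


def path_missing_leading_slash(uri: str) -> bool:
    """Return True if the path is glued onto the port without a '/'."""
    match = _HDFS_URI.match(uri)
    if match is None:
        raise ValueError(f"not an hdfs://host:port URI: {uri!r}")
    rest = match.group("rest")
    return rest != "" and not rest.startswith("/")


context = "hdfs://hdtest261.svl.ibm.com:9000user/biadmin/extract_in"
print(path_missing_leading_slash(context))
```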
Step 5. Verify the collection of logs in HBase
To verify that the logs are being streamed to HBase:
- Start an InfoSphere BigInsights application. One way to run a sample InfoSphere BigInsights application is by running $BIGINSIGHTS_HOME/hdm/bin/hdm terasort, which will run the InfoSphere BigInsights test application.
- Repeat the previous step several times to have some logs collected in HBase.
- Open the HBase console by typing $BIGINSIGHTS_HOME/hbase/bin/hbase shell.
- The HBase table being populated is named Hadoop. Report the count of the rows by typing count 'Hadoop', and you should see a message similar to this:
hbase(main):001:0> count 'Hadoop'
76 row(s) in 0.4860 seconds
The message above indicates that the Hadoop HBase table contains 76 rows of data. To see the actual data inside the Hadoop table, follow these steps:
- Change the working directory to the HBase bin directory, $BIGINSIGHTS_HOME/hbase/bin.
- Type the command echo "scan 'Hadoop'" | ./hbase shell > myTtable.txt.
- Check the myTtable.txt file in the $BIGINSIGHTS_HOME/hbase/bin directory.
The myTtable.txt file will contain all the logs that have been streamed to the Hadoop table so far.
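If you want to check the row count from a script instead of reading it by eye, the count message can be parsed. The following Python sketch extracts the number of rows from output of the form shown above. The function name is hypothetical, and the sketch assumes the "N row(s) in X seconds" wording of the message.

```python
import re


def parse_row_count(shell_output: str) -> int:
    """Extract the row count from HBase shell 'count' output."""
    match = re.search(r"(\d+)\s+row\(s\)\s+in\s+[\d.]+\s+seconds", shell_output)
    if match is None:
        raise ValueError("no row count found in the output")
    return int(match.group(1))


output = "hbase(main):001:0> count 'Hadoop'\n76 row(s) in 0.4860 seconds"
rows = parse_row_count(output)
print("logs are streaming" if rows > 0 else "no logs collected yet")
```

Comparing the counts from two runs taken a few minutes apart shows whether new log records are still arriving.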
Accelerate troubleshooting your InfoSphere BigInsights applications
We have shown how to effectively troubleshoot the InfoSphere BigInsights applications by collecting, extracting, indexing, and searching the Hadoop logs generated by the InfoSphere BigInsights applications.
Now the Sample Outdoors Company is able to monitor and easily troubleshoot InfoSphere BigInsights applications because all the Hadoop logs are indexed and available through the InfoSphere BigInsights dashboard. The Sample Outdoors Company is now also able to control the retention time of the analyzed Hadoop logs through the index management application.
Resources
- Read Part 1: Speeding up machine data analysis of this series to get an overview of the IBM Accelerator for Machine Data Analytics (developerWorks, January 2013).
- Read "Understanding InfoSphere BigInsights" to learn more about the InfoSphere BigInsights' architecture and underlying technologies.
- Watch the Big Data: Frequently Asked Questions for IBM InfoSphere BigInsights video to listen to Cindy Saracco discuss some of the frequently asked questions about IBM's Big Data platform and InfoSphere BigInsights.
- Watch Cindy Saracco demonstrate portions of the scenario described in this article in Big Data -- Analyzing Social Media for Watson.
- Read "Exploring your InfoSphere BigInsights cluster and sample applications" to learn more about the InfoSphere BigInsights web console.
- Check out the BigInsights Technical Enablement wiki for links to technical materials, demos, training courses, news items, and more.
- Learn about the IBM Watson research project.
- Take this free course from Big Data University on Hadoop Reporting and Analysis (log-in required). Learn how to build your own Hadoop/big data reports over relevant Hadoop technologies, such as HBase, Hive, etc., and get guidance on how to choose between various reporting techniques: Direct Batch Reports, Live Exploration, and Indirect Batch Analysis.
- Learn the basics of Hadoop with this free Hadoop Fundamentals course from Big Data University (log-in required). Learn about the Hadoop architecture, HDFS, MapReduce, Pig, Hive, JAQL, Flume, and many other related Hadoop technologies. Practice with hands-on labs on a Hadoop cluster using any of these methods: On the Cloud, with the supplied VMWare image, or install locally.
- Explore free courses from Big Data University on topics ranging from Hadoop Fundamentals and Text Analytics Essentials to SQL Access for Hadoop and real-time stream computing.
- Create your own Hadoop cluster on the IBM SmartCloud Enterprise with this free course from Big Data University (log-in required).
- Order a copy of Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data for details on two of IBM's key big data technologies.
- Learn more about Apache Hadoop.
- Learn more about the Apache Hadoop Distributed File System.
- Check out the HadoopDB Project website.
- Read the Hadoop MapReduce tutorial at Apache.org.
- Read "Using MapReduce and load balancing on the cloud" to learn how to implement the Hadoop MapReduce framework in a cloud environment and how to use virtual load balancing to improve the performance of both a single- and multiple-node system.
- For information on installing Hadoop using CDH4, see CDH4 Installation -- Cloudera Support.
- Big Data Glossary, by Pete Warden, O'Reilly Media, ISBN: 1449314597, 2011.
- Hadoop: The Definitive Guide, by Tom White, O'Reilly Media, ISBN: 1449389732, 2010.
- "HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads" explores the feasibility of building a hybrid system that takes the best features from both technologies.
- "SQL/MapReduce: A practical approach to self-describing, polymorphic, and parallelizable user-defined functions" describes the motivation for this new approach to UDFs, as well as the implementation within AsterData Systems' nCluster database.
- Check out "MapReduce and parallel DBMSes: friends or foes?"
- "A Survey of Large Scale Data Management Approaches in Cloud Environments" gives a comprehensive survey of numerous approaches and mechanisms of deploying data-intensive applications in the cloud, which are gaining a lot of momentum in research and industrial communities.
- Refer to the IBM InfoSphere BigInsights Information Center for product documentation.
- Learn more about big data in the developerWorks big data content area. Find technical documentation, how-to articles, education, downloads, product information, and more.
- Find resources to help you get started with InfoSphere BigInsights, IBM's Hadoop-based offering that extends the value of open source Hadoop with features like Big SQL, text analytics, and BigSheets.
- Follow these self-paced tutorials (PDF) to learn how to manage your big data environment, import data for analysis, analyze data with BigSheets, develop your first big data application, develop Big SQL queries to analyze big data, and create an extractor to derive insights from text documents with InfoSphere BigInsights.
- Find resources to help you get started with InfoSphere Streams, IBM's high-performance computing platform that enables user-developed applications to rapidly ingest, analyze, and correlate information as it arrives from thousands of real-time sources.
- Stay current with developerWorks technical events and webcasts.
- Follow developerWorks on Twitter.
Get products and technologies
- Download InfoSphere BigInsights Quick Start Edition, a complimentary, downloadable version of InfoSphere BigInsights. Using Quick Start Edition, you can try out the features that IBM has built to extend the value of open source Hadoop, like Big SQL, text analytics, and BigSheets.
- Get Hadoop 0.20.1, Hadoop MapReduce, and Hadoop HDFS from Apache.org.
- Download InfoSphere BigInsights Quick Start Edition, available as a native software installation or as a VMware image.
- Download InfoSphere Streams, available as a native software installation or as a VMware image.
- Use InfoSphere Streams on IBM SmartCloud Enterprise.
- Build your next development project with IBM trial software, available for download directly from developerWorks.
- Ask questions and get answers in the InfoSphere BigInsights forum.
- Ask questions and get answers in the InfoSphere Streams forum.
- Check out the developerWorks blogs and get involved in the developerWorks community.
- Check out IBM big data and analytics on Facebook.