System log analysis using InfoSphere BigInsights and IBM Accelerator for Machine Data Analytics

How to mine complex system logs for clues to performance issues

When understood, logs are a goldmine for debugging, performance analysis, root-cause analysis, and system health assessment. In this real business case, see how InfoSphere® BigInsights™ and the IBM Accelerator for Machine Data Analytics are used to analyze system logs to help determine root causes of performance issues, and to define an action plan to solve problems and keep the project on track.

Vincent Cailly (vincent.cailly@fr.ibm.com), IBM-certified IT Architect, IBM

Vincent Cailly is an IBM-certified IT architect with 25 years of experience in IT projects. He joined IBM 10 years ago. For the first four years, he worked in the IBM strategic outsourcing division as the lead architect on large, complex, worldwide infrastructure projects for several customers, then moved to IBM Software Group. In his current role, one of his missions is to follow up on IBM software deployments at his customers to verify that they are getting the value they expect from the software.



01 October 2013

Introduction

As systems become more complex, it becomes increasingly difficult, without the right tooling, to quickly assess system health and to troubleshoot problems. This article shows how InfoSphere BigInsights and the IBM Accelerator for Machine Data Analytics can:

  • Increase visibility, making it easier to gauge the health of systems and applications
  • Tremendously accelerate troubleshooting when problems occur

One of my customers has decided to deploy IBM Maximo Enterprise Asset Management (EAM), a global and effective system to monitor and manage the visibility, deployment, performance, reliability, availability, lifespan, and maintenance of assets, worldwide. This is a large and complex project because the solution has to be deployed in about 80 plants across five continents. The deployment is 20 percent complete so far.

Recently, the customer was experiencing severe performance issues with this new system, which is constantly changing because the deployment is still in progress. The IT and IS operational teams were having trouble finding the root causes of the performance issues, and the customer asked for ideas that might accelerate the root-cause analysis and the resolution of these problems.

I suggested a proof-of-concept solution using InfoSphere BigInsights and the IBM Accelerator for Machine Data Analytics to analyze the logs of the system with two objectives:

  • Help the customer resolve the performance issues
  • Demonstrate the value of this IBM solution

When understood, logs are a goldmine for debugging, performance analysis, root-cause analysis, and system health assessment. Even knowing that, both the customer and I were surprised by how much this proof-of-concept solution revealed. We were able to quickly determine root causes of the performance issues and define an action plan to solve the problems and keep the project on track.


Technical environment for the proof-of-concept solution

The proof of concept involves two environments: the application based on Maximo Enterprise Asset Management, and InfoSphere BigInsights running the IBM Accelerator for Machine Data Analytics applications.

Application based on Maximo Enterprise Asset Management

The customer is running two instances of the application: one for North American users and one for European users. All servers for both instances are located in Europe. Each instance of the application consists of the following components:

  • One IBM HTTP Server instance
  • Six IBM WebSphere® Application Server instances to run the user interface, cron tasks, and on-demand reports
  • One IBM WebSphere Application Server instance to run scheduled reports, cron tasks, and the Maximo integration framework used for the integration of the application with an Enterprise Resource Planning (ERP) solution
  • One Oracle database

InfoSphere BigInsights environment

InfoSphere BigInsights was installed on a stand-alone machine (a virtual machine running on a Lenovo ThinkPad W530), and the log files of the application were manually transferred to this virtual machine.

InfoSphere BigInsights Quick Start Edition

InfoSphere BigInsights Quick Start Edition is a complimentary, downloadable version of InfoSphere BigInsights, IBM's Hadoop-based offering. Using Quick Start Edition, you can try out the features that IBM has built to extend the value of open source Hadoop, like Big SQL, text analytics, and BigSheets. Guided learning is available to make your experience as smooth as possible including step-by-step, self-paced tutorials and videos to help you start putting Hadoop to work for you. With no time or data limit, you can experiment on your own time with large amounts of data. Watch the videos, follow the tutorials (PDF), and download BigInsights Quick Start Edition now.

In this InfoSphere BigInsights environment, for each instance of the application we imported the following logs:

  • The IBM HTTP Server access log (one semi-structured text file).
  • The SystemOut and SystemErr logs of all the WebSphere Application Server instances (154 unstructured text files). When the current log file reaches 10MB, it is closed and renamed, and a new log file is created. In the WebSphere configuration, the number of rotated log files to keep is set to 10, so there are 22 log files per application server:
    • SystemOut log files: one current log file plus 10 renamed files
    • SystemErr log files: one current log file plus 10 renamed files

  • The Oracle database alert log (one semi-structured XML file).

In total, there are 312 log files. One can easily imagine the nightmare of having to manually analyze these 312 log files without the right tooling.
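
The file counts above follow directly from the topology and the rotation settings. As a quick check, the arithmetic can be written out as follows (the variable names are illustrative and are not part of any product configuration):

```python
# Worked check of the log-file counts quoted in the text.
# All variable names are illustrative.

APP_SERVERS_PER_INSTANCE = 6 + 1   # six UI servers + one reporting/integration server
ROTATED_FILES_KEPT = 10            # WebSphere setting: number of renamed files to keep
LOG_TYPES = 2                      # SystemOut and SystemErr
INSTANCES = 2                      # North America and Europe

files_per_log_type = 1 + ROTATED_FILES_KEPT              # current file + renamed files
files_per_app_server = LOG_TYPES * files_per_log_type
websphere_files = APP_SERVERS_PER_INSTANCE * files_per_app_server
files_per_instance = 1 + websphere_files + 1             # HTTP access log + WebSphere logs + Oracle alert log
total_files = INSTANCES * files_per_instance

print(files_per_app_server, websphere_files, files_per_instance, total_files)
# prints: 22 154 156 312
```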

InfoSphere BigInsights MDA applications

We ran the following InfoSphere BigInsights MDA applications and features:

  • A Distributed File Copy application to import the logs into the Hadoop file system.
  • An Extract application that uses text analytics to extract information from the batches of log files ingested into InfoSphere BigInsights.
  • An Index application to index the records of all log files. This index is required to use the faceted browsing interface, which quickly finds log entries based on multiple criteria and expedites troubleshooting.
  • The BigSheets feature to produce specific reports and feed dashboards.

Due to limited physical resources (mainly storage) of the virtual machine, we did not run the following InfoSphere BigInsights MDA applications:

  • The Frequent Sequence Analysis application, which examines which patterns of events most commonly occur before an error condition
  • The Significance Analysis application, which examines which specific events are the most likely cause of an error condition

The next section describes how the proof-of-concept solution increases visibility into the health of the system and enables faster troubleshooting. It includes some examples of the outputs provided by InfoSphere BigInsights and the IBM Accelerator for Machine Data Analytics for this particular case.


Increased visibility and faster troubleshooting enabled by this solution

This solution makes it easier to see what's going on inside the interconnected systems and makes it faster to troubleshoot problems by providing these advantages:

  • 360-degree view
  • Faceted browsing
  • Log analysis using dashboards
  • Measure of number of Maximo EAM error messages
  • Analysis of the Maximo BMXAA6720W warning message
  • Measure of number of Oracle error messages

360-degree view

This solution makes it possible to get a 360-degree view of all of the events logged by different components. Logs from IBM HTTP Server, WebSphere Application Server, and Oracle database server have been transformed, aggregated, and indexed to enable an advanced search across the different log files (156 log files per instance of this Maximo EAM application, in this case). This enhanced view tremendously facilitates troubleshooting and determination of root causes.

Figure 1. 360-degree view

Faceted browsing

The faceted browsing interface makes it easier to quickly find log entries based on multiple criteria. Figure 2 shows how easy it is to find log entries in the 156 log files of one instance by using multiple search criteria.

Figure 2. Using faceted browsing to locate log entries

HTTP log analysis using InfoSphere BigInsights dashboards

The InfoSphere BigInsights dashboard makes it easy to publish and share the output of the analysis. It facilitates the communication and the collaboration between the different IT and IS teams (development team, IT operational teams, etc.).

Figure 3 shows a dashboard where we have published the results of the analysis of the HTTP access logs, including:

  • HTTP status codes for all the HTTP requests received by the HTTP server. The status codes allow you to check:
    • The number of errors (HTTP status codes of 400 and above) logged by the HTTP server. This number helps gauge the health of the application.
    • The browser caching efficiency: the ratio of the number of HTTP requests with a 304 status code to the total number of HTTP requests.
  • URL paths causing HTTP status code 404 errors. These errors often result in decreased performance for users, even though the decrease is sometimes invisible.

    Recommendation: To improve server performance, eliminate all 404 errors.

  • The number of HTTP requests per IP address allows you to view any suspect IP addresses sending many more HTTP requests than other IP addresses.
  • The version of the HTTP protocol used for all the HTTP requests.
    Figure 3. Dashboard where we have published the results of the analysis of the HTTP access log
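
To make these indicators concrete, here is a minimal Python sketch that derives them from access log lines. It is not the BigInsights dashboard itself. It assumes the access log uses the Common Log Format (the actual log format configured on the customer's IBM HTTP Server is not stated in this article), and the function name, field names, and sample lines are invented for illustration.

```python
import re
from collections import Counter

# Assumption: Common Log Format, for example
# 10.0.0.1 - - [25/Jun/2013:08:28:32 +0200] "GET /maximo/ui/ HTTP/1.1" 200 5120
CLF = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) \S+'
)

def summarize(lines):
    """Compute the dashboard indicators for an iterable of access log lines."""
    statuses, ips, protocols, not_found = Counter(), Counter(), Counter(), Counter()
    for line in lines:
        m = CLF.match(line)
        if not m:
            continue  # skip lines that do not fit the assumed format
        status = int(m["status"])
        statuses[status] += 1
        ips[m["ip"]] += 1
        protocols[m["protocol"]] += 1
        if status == 404:
            not_found[m["path"]] += 1
    total = sum(statuses.values())
    return {
        "total": total,
        "errors": sum(n for s, n in statuses.items() if s >= 400),
        "ratio_304": statuses[304] / total if total else 0.0,
        "requests_per_ip": ips.most_common(),   # busiest addresses first
        "protocols": dict(protocols),
        "paths_404": not_found.most_common(),
    }

# Invented sample lines, for illustration only.
sample = [
    '10.0.0.1 - - [25/Jun/2013:08:28:32 +0200] "GET /maximo/ui/login HTTP/1.1" 200 5120',
    '10.0.0.1 - - [25/Jun/2013:08:28:33 +0200] "GET /maximo/img/logo.gif HTTP/1.1" 304 -',
    '10.0.0.2 - - [25/Jun/2013:08:28:34 +0200] "GET /maximo/missing.css HTTP/1.0" 404 210',
    '10.0.0.1 - - [25/Jun/2013:08:28:35 +0200] "GET /maximo/ui/ HTTP/1.1" 200 7300',
]
report = summarize(sample)
print(report["ratio_304"], report["errors"], report["paths_404"])
# prints: 0.25 1 [('/maximo/missing.css', 1)]
```

Sorting `requests_per_ip` by count puts unusually busy addresses at the top, and the `protocols` counter shows how many requests still use HTTP/1.0.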

Viewing this dashboard, we can make recommendations based on some preliminary conclusions about the performance problems:

  • Pie charts at the far left of Figure 3: Only 7 percent of the HTML objects are fetched from the user agent cache (HTTP status code 304) for the EU instance, compared to 47 percent for the NA instance.
  • Bar graphs to the right of the pie charts in Figure 3: Some IP addresses exhibit suspect behavior. For example, some IP addresses send many more HTTP requests than a standard user of the application.

    Recommendation: After further investigation, we discovered that these IP addresses were allocated to machines running scripts that monitor the response times of the application. Some advanced power users were trying to gather evidence about response times, but they did not realize that those scripts were degrading the overall performance of the system (in particular, server resource utilization and WAN bandwidth utilization). In addition, those scripts distorted the information about HTML objects fetched from the user agent cache (HTTP status code 304): for some of these scripts, HTML objects were always fetched from the user agent cache. We suggested stopping those scripts.

  • Tables to the right of center in Figure 3: In Europe, some client machines are using V1.0 of the HTTP protocol instead of V1.1. In terms of performance, HTTP 1.0 generally leads to a bad experience because it does not keep connections open by default, so each request typically requires a new connection.

    Recommendation: After additional investigation, we discovered that the HTTP 1.0 requests were sent by obsolete end-user workstations running Microsoft Windows® XP and Microsoft Internet Explorer V6. We recommended either applying the solution proposed by Microsoft Support or implementing a snippet on the application authentication page to test the browser being used. If Microsoft Internet Explorer V6 is detected, we recommended asking the user to switch to another browser, such as Mozilla Firefox V3.5.

  • Tables at the far right of Figure 3: All the URL paths that return HTTP status code 404 are displayed.

    Recommendation: We suggested making the required changes in the application to get rid of all these 404 errors.
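
The browser test recommended above for the authentication page could take many forms. The following is a minimal server-side sketch in Python, under the assumption that the page has access to the request's User-Agent header. The function name is hypothetical and this is not the snippet deployed by the customer. Because the User-Agent header is controlled by the client, the check is only a heuristic.

```python
import re

# Internet Explorer 6 identifies itself with an "MSIE 6.x" token, for example:
# Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)
_MSIE = re.compile(r"MSIE (\d+)\.")

def is_internet_explorer_6(user_agent: str) -> bool:
    """Return True when the User-Agent header looks like Internet Explorer 6."""
    if "Opera" in user_agent:
        return False  # old Opera versions could present an MSIE token
    match = _MSIE.search(user_agent)
    return bool(match) and match.group(1) == "6"

ie6 = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
firefox = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1) Gecko/20090624 Firefox/3.5"
print(is_internet_explorer_6(ie6), is_internet_explorer_6(firefox))
# prints: True False
```

When the function returns True, the authentication page would display the message asking the user to switch to another browser.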

Error messages logged by IBM Maximo software

Another indicator to help assess the health of the application is the number of Maximo error messages logged in the WebSphere Application Server SystemOut log files.

The WebSphere Application Server log extractor that comes with the IBM Accelerator for Machine Data Analytics does not provide immediate information on these Maximo error messages because the format of the records containing them is not always the same.

To get this information we had two options:

  • To develop our own extractor
  • To use the BigSheets feature of InfoSphere BigInsights

We chose the BigSheets feature. Using only standard BigSheets functions (MID, SLICEITEM, PIVOT, FILTER, etc.), it took less than half an hour to produce reports on the number of Maximo error messages logged (see Figure 4 and Figure 5).

Some Maximo error messages showed up frequently in the logs. We suspect either technical issues at the application level or defects in the Maximo software. We recommended opening a PMR (problem management record) with IBM Support to request deeper analysis of the root causes so that the underlying problems can be resolved.
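
For readers without BigSheets at hand, the same kind of count can be sketched in a few lines of Python. This is not the BigSheets workbook used in the proof of concept. It assumes that Maximo message IDs share the shape of BMXAA6720W (a BMX prefix, two letters, four digits, and a severity letter) and that a trailing "E" marks an error. The IDs BMXAA1111E and BMXAA2222E and the sample records are invented for illustration.

```python
import re
from collections import Counter

# Assumption: message IDs look like BMXAA6720W, and a trailing "E" means error.
MESSAGE_ID = re.compile(r"\bBMX[A-Z]{2}\d{4}[EWI]\b")

def count_error_ids(lines):
    """Count how often each error message ID appears in the given log lines."""
    counts = Counter()
    for line in lines:
        for message_id in MESSAGE_ID.findall(line):
            if message_id.endswith("E"):
                counts[message_id] += 1
    return counts

# Invented SystemOut records, for illustration only.
sample = [
    "[6/25/13 8:28:32:140 CEST] 000000ec SystemOut     O 25 Jun 2013 08:28:32:140 [ERROR] BMXAA1111E - invented error text",
    "[6/25/13 8:28:33:201 CEST] 000000ec SystemOut     O 25 Jun 2013 08:28:33:201 [WARN] BMXAA6720W - USER = (UID00195)",
    "[6/25/13 8:28:34:007 CEST] 000000ed SystemOut     O 25 Jun 2013 08:28:34:007 [ERROR] BMXAA1111E - invented error text",
    "[6/25/13 8:28:35:450 CEST] 000000ee SystemOut     O 25 Jun 2013 08:28:35:450 [ERROR] BMXAA2222E - another invented error",
]
error_counts = count_error_ids(sample)
print(error_counts.most_common())
# prints: [('BMXAA1111E', 2), ('BMXAA2222E', 1)]
```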

Figure 4. EU instance
Figure 5. NA instance

Analysis of the Maximo BMXAA6720W messages

The Maximo BMXAA6720W warning message indicates long-running query execution and provides useful information about the performance of the system. With BigInsights, we can easily extract the individual fields of the log record, for example the user, the application, the object, the SQL statement, and the execution time.

WebSphere SystemOut log records containing the Maximo BMXAA6720W warning message look like this:

[6/25/13 8:28:32:140 CEST] 000000ec SystemOut     O 25 Jun 2013 08:28:32:140 [WARN]
BMXAA6720W - USER = (UID00195) SPID = (2082) app (WOTRACK) object (WORKORDER) : 
select * from workorder  where (workorderid = 4568)  (execution took 1317 milliseconds)

As with the Maximo error messages, we used the BigSheets feature of InfoSphere BigInsights to extract the fields of interest from records like the sample above. Then we produced reports highlighting problems using SQL queries (see Figure 6).
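
As an illustration of this kind of field extraction outside BigSheets, the following Python sketch parses the sample record shown above with a regular expression. The pattern is derived only from that single sample, so other BMXAA6720W records may need adjustments. The function and field names are illustrative.

```python
import re

# Pattern derived from the single sample record in this article.
SLOW_QUERY = re.compile(
    r"BMXAA6720W - USER = \((?P<user>[^)]*)\) SPID = \((?P<spid>\d+)\) "
    r"app \((?P<app>[^)]*)\) object \((?P<object>[^)]*)\) : "
    r"(?P<sql>.*?) \(execution took (?P<millis>\d+) milliseconds\)"
)

def parse_slow_query(record: str):
    """Return the fields of a BMXAA6720W record as a dict, or None if no match."""
    # Collapse line breaks and repeated spaces so wrapped records match.
    # Note that this also collapses whitespace inside the SQL text.
    text = " ".join(record.split())
    m = SLOW_QUERY.search(text)
    if m is None:
        return None
    fields = m.groupdict()
    fields["spid"] = int(fields["spid"])
    fields["millis"] = int(fields["millis"])
    return fields

record = """[6/25/13 8:28:32:140 CEST] 000000ec SystemOut     O 25 Jun 2013 08:28:32:140 [WARN]
BMXAA6720W - USER = (UID00195) SPID = (2082) app (WOTRACK) object (WORKORDER) :
select * from workorder  where (workorderid = 4568)  (execution took 1317 milliseconds)"""

fields = parse_slow_query(record)
print(fields["app"], fields["object"], fields["millis"])
# prints: WOTRACK WORKORDER 1317
```

Once every record is parsed this way, sorting or grouping by `millis`, `app`, or `object` gives a first view of which queries are slow and where they originate.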

Further and deeper analysis by a database specialist revealed several issues at the level of the database server:

  • Lack of physical memory on the database server
  • Data model design issues
  • Problems with some indices
  • Oracle database software bugs that are fixed with more recent versions of this software

Figure 6. Reports highlighting problems using SQL queries

Oracle error messages logged in the Oracle alert log

As we did for the Maximo error messages, we used the BigSheets feature to produce reports on the number of Oracle error messages logged in the Oracle alert log (see Figure 7). This is just another indicator that helps assess the health of the system.
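
A comparable count can be sketched in Python by scanning the alert log text for Oracle error codes, which have the form ORA- followed by five digits. This sketch ignores the XML structure of the alert log and only searches the raw text. It is not the BigSheets report shown in Figure 7, and the XML fragment below is a simplified, invented sample.

```python
import re
from collections import Counter

# Oracle error codes have the form ORA-nnnnn.
ORA_CODE = re.compile(r"\bORA-\d{5}\b")

def count_ora_codes(text: str) -> Counter:
    """Count every ORA-nnnnn occurrence in the alert log text."""
    return Counter(ORA_CODE.findall(text))

# Simplified, invented alert log fragment, for illustration only.
sample = """
<msg time='2013-06-25T08:28:32.140+02:00'>
 <txt>ORA-01555 caused by SQL statement below</txt>
</msg>
<msg time='2013-06-25T09:02:11.005+02:00'>
 <txt>Errors in file (path elided): ORA-00600: internal error code</txt>
</msg>
<msg time='2013-06-25T09:40:57.310+02:00'>
 <txt>ORA-01555 caused by SQL statement below</txt>
</msg>
"""
ora_counts = count_ora_codes(sample)
print(ora_counts.most_common())
# prints: [('ORA-01555', 2), ('ORA-00600', 1)]
```

A single incident can write the same code to the alert log several times, so these counts measure log occurrences rather than distinct incidents.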

Figure 7. Oracle alert log

Conclusion

Used together, InfoSphere BigInsights and the IBM Accelerator for Machine Data Analytics are useful in debugging performance problems, specifically in the case of this Maximo EAM application. But this solution can be applied to other situations and systems, as well. We are now working with the customer to deploy the BigInsights solution for operations in the production environment.

We will pilot three business-critical applications, and if the pilot is successful, the solution will be deployed for the 20 most critical applications for this customer. For the pilot and for each application, three use cases will be covered:

  • Publication of dashboards providing information about the health of the system. The dashboards will be:
    • Automated to occur daily
    • Shared across operation teams to facilitate collaboration
    • Set up to enable the customer to act proactively when deviations are observed
  • Validation of major application releases before moving them into production. This includes analyzing the logs of the QA environment to facilitate the decision about whether to move into the production environment.
  • Problem troubleshooting and resolution, using the advanced features of the proof-of-concept solution to accelerate root-cause analysis and problem resolution.

In short, this proof-of-concept solution can be applied in many contexts to increase visibility into the health of interconnected systems and to speed troubleshooting and root-cause analysis.
