Agile data analysis using a KNIME workflow
Make data analysis more agile with workflow engines
Workflow tools make data analysis agile
In the developerWorks article Analyzing data in an agile world, I explain that for data analysis to be agile it has to satisfy three main characteristics:
- Analytic components need to be modular so that they can be used in many different situations.
- Data that is to be processed needs to look similar to a rectangular table so that there are minimal data format incompatibilities between components.
- Individual analytic components need to be able to be assembled to form a complete analysis with a minimum of programming.
Workflow tools such as those used for data mining, bioinformatics, and business analytics meet these requirements. A workflow engine to implement a data analysis offers many advantages over a single piece of analysis software.
Running a workflow is not a single, monolithic operation. Each unit of computation, usually called a node, can be run independently of the full analysis. The only requirement is a source of input data. The input can be real data acquired from a test, or simulated data used to validate the logic of the node. The processed data can be examined in a tabular format before it is passed on to the next node to verify that it has been processed correctly. The same data can also be validated graphically.
The sources of data are independent of the data processing. Reader nodes bring external data into the workflow. After data is read into a node, it appears as a table regardless of its source. Therefore, the analysis part of the workflow is unchanged even if the source of data has changed. This arrangement facilitates reuse and makes a particular workflow more agile because the format of the external data (for example, a file in CSV, JSON, XML, or text format) can change without affecting the underlying analysis logic.
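The separation between reading and analysis can be illustrated outside of KNIME with a minimal Python sketch. It is not how KNIME implements reader nodes. The function names (read_csv_rows, read_json_rows, count_by_status) and the sample data are hypothetical, chosen only to show two readers that produce the same table shape for one unchanged analysis step.

```python
import csv
import io
import json


def read_csv_rows(text):
    """Reader for CSV text: returns a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def read_json_rows(text):
    """Reader for a JSON array of objects: returns the same row shape.

    Values are converted to strings so both readers yield identical tables.
    """
    return [{key: str(value) for key, value in row.items()}
            for row in json.loads(text)]


def count_by_status(rows):
    """Analysis step: depends only on the table, never on the source format."""
    counts = {}
    for row in rows:
        counts[row["status"]] = counts.get(row["status"], 0) + 1
    return counts


# Hypothetical sample data: the same three requests in two formats.
csv_text = "uri,status\n/index.html,200\n/missing,404\n/index.html,200\n"
json_text = (
    '[{"uri": "/index.html", "status": 200},'
    ' {"uri": "/missing", "status": 404},'
    ' {"uri": "/index.html", "status": 200}]'
)

csv_counts = count_by_status(read_csv_rows(csv_text))
json_counts = count_by_status(read_json_rows(json_text))
print(csv_counts == json_counts)
```

Swapping the reader changes nothing in count_by_status, which is the property that lets a workflow keep its analysis nodes when the input format changes.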
Because the ingestion and examination of data are the responsibility of the workflow engine, very large data sets can be processed and the writer of the workflow does not have to manage memory explicitly.
If custom programming is required, it can be isolated to a specific node or nodes. Programming details can be hidden from end users who only need to know the type of analysis being performed by the node. End users do not need to worry about the details of how the analysis is implemented.
After individual workflows are created, they can be encapsulated into a metanode, which looks like a single node and can be embedded in a larger workflow.
Use KNIME for agile analysis
KNIME is an open-source workflow engine and tool that is ideal for this type of data analysis. It installs with over 1000 predefined nodes and is supported by many external analysis toolkits, both commercial and open source. KNIME is available for download on Microsoft® Windows®, Mac®, and Linux®. Tutorials are available to help you learn KNIME, but the basic idea is simple. You drag individual nodes from the node repository and place them on a canvas. To indicate the flow of data, draw an arrow from an output port of one node to an input port of the next node in the flow.
Sample Apache access log analysis
For example, you can use KNIME to analyze the source of resource consumption during the operation of a traditional three-tiered architecture with an Apache HTTP Server® that accepts incoming HTTP requests. These requests are routed to an application server node such as IBM® WebSphere® Application Server or other middle-tier server that can be used to process dynamic requests.
To understand the impact of load on resource consumption you need to correlate the external load based on the HTTP requests issued over time with the server resources that are used by the application server or database tier. To do so, you need to:
- Parse the Apache access log file.
- Normalize the timestamp from the log file to focus on a specific period of time.
- Determine an elapsed time.
- Extract the specific URIs requested during those times.
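To make the first two steps concrete, here is a minimal Python sketch of parsing access log lines and keeping only the rows in a period of interest. It assumes the Apache Common Log Format. The regular expression, the function name parse_line, the sample lines, and the chosen time window are all illustrative assumptions, not part of the sample workflow.

```python
import re
from datetime import datetime, timedelta, timezone

# Common Log Format: host ident user [time] "request" status size
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
)


def parse_line(line):
    """Return one table row (a dict) per log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    row = match.groupdict()
    # Convert the timestamp text into a timezone-aware datetime object.
    row["time"] = datetime.strptime(row["time"], "%d/%b/%Y:%H:%M:%S %z")
    return row


# Hypothetical sample lines; the third is deliberately malformed.
sample_lines = [
    '192.0.2.10 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '192.0.2.11 - - [10/Oct/2000:14:05:00 -0700] "GET /cart HTTP/1.0" 200 512',
    'not a log line',
]

rows = [row for row in map(parse_line, sample_lines) if row is not None]

# Keep only the requests that fall inside the period of interest.
zone = timezone(timedelta(hours=-7))
window_start = datetime(2000, 10, 10, 13, 50, tzinfo=zone)
window_end = datetime(2000, 10, 10, 14, 0, tzinfo=zone)
selected = [row for row in rows if window_start <= row["time"] < window_end]

print(len(rows), len(selected))
```

In the workflow, the Weblog Reader and the row filter nodes play these roles, so none of this code has to be written or maintained by hand.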
Download the sample KNIME workflow to learn how to do this analysis within KNIME and how you can use a similar workflow to extract resource information from other types of log files such as CPU logs, database resource logs, or garbage collection logs to perform this type of correlation.
To import the sample workflow, download it to your file system, select File > Import KNIME workflow... within KNIME, and follow the import wizard.
Look at the imported workflow (also shown in Figure 1) to see that the workflow starts with a Weblog Reader that can be configured to read the format of the Apache access log.
Figure 1. KNIME workflow to process access log
Double-click the node, or right-click and select Configure..., to bring up the configuration dialog shown in Figure 2. From there you can specify the log that you want to analyze, the locale of the text in the file, the date and time format for the timestamp, and the overall format of each line in the log file. The date and time format is defined by the Java SimpleDateFormat. See the JavaDoc for the specific formatting options. The log line format is defined in the Apache documentation and is identical to the format from the Apache configuration.
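As an illustration of what these two format settings look like, the following Python sketch shows the Apache common log line format and a SimpleDateFormat pattern that matches its timestamps. The translation helper to_strptime is hypothetical and handles only the handful of pattern letters listed in TOKENS (it ignores quoted literals and every other SimpleDateFormat feature). It exists only so that the pattern can be tried out in Python; KNIME itself takes the Java pattern directly.

```python
import re
from datetime import datetime, timedelta

# The common log line format as given in the Apache documentation.
COMMON_LOG_FORMAT = '%h %l %u %t "%r" %>s %b'

# A SimpleDateFormat pattern matching timestamps such as 10/Oct/2000:13:55:36 -0700.
JAVA_PATTERN = "dd/MMM/yyyy:HH:mm:ss Z"

# Small, incomplete mapping from SimpleDateFormat letters to strptime directives.
TOKENS = {
    "yyyy": "%Y", "MMM": "%b", "MM": "%m", "dd": "%d",
    "HH": "%H", "mm": "%M", "ss": "%S", "Z": "%z",
}
# Longest tokens first so that MMM is tried before MM.
TOKEN_RE = re.compile("|".join(sorted(TOKENS, key=len, reverse=True)))


def to_strptime(java_pattern):
    """Translate the supported subset of a SimpleDateFormat pattern."""
    return TOKEN_RE.sub(lambda match: TOKENS[match.group(0)], java_pattern)


python_format = to_strptime(JAVA_PATTERN)
parsed = datetime.strptime("10/Oct/2000:13:55:36 -0700", python_format)
print(python_format)
print(parsed.isoformat())
```

Month abbreviations such as Oct are locale dependent, which is why the configuration dialog also asks for the locale of the text in the file.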
Figure 2. Weblog Reader configuration dialog
After you configure the node, the status light underneath the node changes from red to yellow, which indicates that the node is ready to be run but has no data associated with it yet.
The reader node can be run independently of the whole workflow in several ways that are consistent with the Eclipse interface:
- Use the F7 key on the keyboard.
- Click Node > Execute from the menu.
- Click the run button from the button bar at the top of the window.
If it runs successfully, the yellow light turns green and the node has the data from the log associated with it. To see the log data, right-click the node and select Weblog table. This action opens a spreadsheet-like table that contains the parsed contents of the log file. This table is scrollable and can be used to validate that the log file has been interpreted correctly, as shown in Figure 3.
Figure 3. Parsed access log information
Use this approach to proceed from node to node and validate the interpretation of the input data as it is processed and transformed on its way through the workflow. Refer to Figure 1 or the sample workflow you downloaded to see that for the nodes in the first purple rectangle, the initial parsed data gets enhanced with additional information:
- The country that the request originated from, based on the IP address
- The specific request URI, split from the full GET request
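The second enhancement, splitting the URI out of the request line, can be sketched in a few lines of Python. This is only an illustration of the idea. The function name split_request and the sample request are hypothetical, and the country lookup is left out because it depends on an external IP-to-country database.

```python
from urllib.parse import urlsplit


def split_request(request):
    """Split a request line such as 'GET /path?x=1 HTTP/1.1' into columns."""
    parts = request.split()
    if len(parts) != 3:
        # Malformed or empty request lines produce empty columns.
        return {"method": None, "uri": None, "path": None,
                "query": None, "protocol": None}
    method, uri, protocol = parts
    target = urlsplit(uri)
    return {
        "method": method,
        "uri": uri,
        "path": target.path,    # the part usually grouped on in the analysis
        "query": target.query,  # kept separately so paths can be aggregated
        "protocol": protocol,
    }


columns = split_request("GET /search?q=knime HTTP/1.1")
print(columns["uri"], columns["path"])
```

Keeping the path apart from the query string makes it easier to group requests for the same resource when you later correlate them with server load.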
To compare timestamps from several log files, you need to correlate the timings. The clocks on the machines that house each server might be slightly different or might be set to different time zones. These mismatches make direct comparison by timestamp difficult. To resolve this issue, calculate an elapsed time relative to a point in time that is defined as the test start. With this method, you can fine-tune the time base for each log file and correlate the logs not on clock time but on time into the test (elapsed time). Because the interpreted timestamp is stored internally as a date/time object, it is straightforward to determine the difference between two date/time objects using a time difference node.
The configuration dialog for the time difference node is shown in Figure 4. A time difference can be calculated from a column of dates and times either by subtracting from the execution time of the workflow, by subtracting from another column of the same length, or by subtracting a fixed date and time. The granularity of the difference can also be specified. This workflow specifies the date and time of the test start with a granularity of seconds. This action results in a new column with the elapsed time.
Figure 4. Configuration Dialog for time difference node
The next block of nodes, in the second purple rectangle, cleans up the data table so that it has only the specific rows and columns that are of interest in the output. With this way of calculating the time difference, the values in the time diff column must be multiplied by -1 to convert them to the elapsed time.
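The elapsed-time calculation and the sign correction can be sketched in Python as follows. The sketch assumes that the difference is computed as the fixed test start minus each timestamp, which would explain why the raw values need to be multiplied by -1. The test start time, the sample timestamps, and the function name diff_seconds are hypothetical.

```python
from datetime import datetime, timezone

# Hypothetical test start, expressed in UTC.
TEST_START = datetime(2000, 10, 10, 20, 50, 0, tzinfo=timezone.utc)


def diff_seconds(fixed, timestamp):
    """Assumed direction: fixed date minus column value, in whole seconds."""
    return int((fixed - timestamp).total_seconds())


# The same instant as logged by two servers set to different time zones.
web_ts = datetime.strptime("10/Oct/2000:13:55:36 -0700", "%d/%b/%Y:%H:%M:%S %z")
app_ts = datetime.strptime("10/Oct/2000:20:55:36 +0000", "%d/%b/%Y:%H:%M:%S %z")

raw = [diff_seconds(TEST_START, ts) for ts in (web_ts, app_ts)]

# Timestamps after the test start give negative values, so flip the sign.
elapsed = [-1 * value for value in raw]
print(raw, elapsed)
```

Both servers report the same elapsed time even though their clock readings differ by seven hours, which is what makes elapsed time a better key for correlation than the raw timestamp.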
In this example, the last node writes this filtered table to a CSV file where it can be imported into a spreadsheet or plotted using a graphing package. However, you can perform a more complex analysis with the filtered content within KNIME and pass it into another workflow.
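For comparison, the clean-up and CSV export steps look roughly like this when written by hand in Python. The column names, the sample rows, and the helper names select and to_csv are hypothetical. The sketch writes to an in-memory buffer rather than a file so that it stays self-contained.

```python
import csv
import io

# Hypothetical parsed rows, already enhanced with an elapsed-time column.
rows = [
    {"host": "192.0.2.10", "elapsed_s": -20, "uri": "/warmup", "status": "200", "size": "100"},
    {"host": "192.0.2.10", "elapsed_s": 336, "uri": "/index.html", "status": "200", "size": "2326"},
    {"host": "192.0.2.11", "elapsed_s": 900, "uri": "/cart", "status": "500", "size": "512"},
]

KEEP_COLUMNS = ["elapsed_s", "uri", "status"]


def select(table, columns, min_elapsed=0):
    """Row filter (drop requests before the test start) plus column filter."""
    return [
        {column: row[column] for column in columns}
        for row in table
        if row["elapsed_s"] >= min_elapsed
    ]


def to_csv(table, columns):
    """Serialize the filtered table as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(table)
    return buffer.getvalue()


filtered = select(rows, KEEP_COLUMNS)
csv_text = to_csv(filtered, KEEP_COLUMNS)
print(csv_text)
```

The resulting text can be opened in a spreadsheet or passed to a plotting package, in the same way as the file produced by the last node of the workflow.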
Advantages of using a workflow
This type of analysis can be performed programmatically but a workflow offers several advantages:
- The flow of the program is easily understood visually. You do not need to know a programming language or the API of libraries required to provide various functions, such as the country lookup or time difference calculation.
- The weblog reader can be reused for many kinds of logs and can be configured for many date and time and log formats.
- You can validate the processing of the parsed data incrementally.
- You can process other types of log files with a similar method, with small modifications to this workflow.
How to analyze resource logs
Flexibility is the biggest reason a workflow engine such as KNIME can be a critical component to help analyze test data in an agile fashion. To correlate the data from the access.log analysis with CPU data obtained by using one of the SYSSTAT tools such as sar on a Linux platform or from the Windows Performance Monitor, you can use a similar workflow to analyze system metrics in KNIME.
Because these tools can export metrics to a CSV file, the primary change to the workflow described in this article is to replace the Weblog Reader with a CSV Reader. Unlike the Weblog Reader, the CSV Reader does not convert timestamps automatically, so you must explicitly convert the timestamp to a date and time using the String to Date/Time node. The rest of the workflow is essentially the same. The resulting output CSV files can be saved and plotted using an external plotting tool to see the relationship between request type and resource utilization, or plotted within KNIME with a number of existing plotting nodes.
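The following Python sketch mirrors that modified workflow under stated assumptions. The CSV layout (a timestamp column and a cpu_user_pct column), the timestamp format, the test start, and the sample values are all hypothetical and do not reproduce the actual export format of sar or the Windows Performance Monitor. The explicit strptime call plays the role of the String to Date/Time node.

```python
import csv
import io
from collections import Counter
from datetime import datetime, timezone

# Hypothetical test start and CPU export; timestamps are assumed to be UTC.
TEST_START = datetime(2000, 10, 10, 20, 50, 0, tzinfo=timezone.utc)
cpu_csv = (
    "timestamp,cpu_user_pct\n"
    "2000-10-10 20:55:00,35.0\n"
    "2000-10-10 20:56:00,80.5\n"
)


def read_cpu(text):
    """CSV read plus explicit string-to-date conversion and elapsed minute."""
    table = []
    for row in csv.DictReader(io.StringIO(text)):
        stamp = datetime.strptime(row["timestamp"], "%Y-%m-%d %H:%M:%S")
        stamp = stamp.replace(tzinfo=timezone.utc)
        elapsed = int((stamp - TEST_START).total_seconds())
        table.append({"minute": elapsed // 60,
                      "cpu_user_pct": float(row["cpu_user_pct"])})
    return table


# Hypothetical elapsed times (seconds) taken from the access log analysis.
request_elapsed = [305, 336, 361, 370, 399]
requests_per_minute = Counter(seconds // 60 for seconds in request_elapsed)

# Join the two tables on the elapsed minute.
correlated = [
    (cpu["minute"], requests_per_minute[cpu["minute"]], cpu["cpu_user_pct"])
    for cpu in read_cpu(cpu_csv)
]
for minute, requests, cpu_pct in correlated:
    print(minute, requests, cpu_pct)
```

Joining on elapsed minutes rather than on clock time keeps the correlation valid even when the web server and the monitored machine disagree about the time of day.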
The word agile is an adjective that means being able to move quickly and easily and being able to think and understand quickly. As you use agile development methods to write software more quickly and flexibly, you need to be equally agile in your understanding of how this software performs and scales.
Workflow engines such as KNIME liberate test organizations from the need to write and rewrite analysis scripts as the product evolves from iteration to iteration. KNIME frees testers from the requirement to mechanically run data analyses and makes it possible for them to interpret the results of these analyses without the need for deep programming skills.
- Get a quickstart guide from the KNIME site.
- Get details on how to obtain and use IBM WebSphere Application Server.
- Find Apache custom log format documentation.
- Download KNIME desktop
- Download a free trial version of Rational software.