Agile data analysis - Advanced workflows and R integration

Use advanced workflow techniques and the R statistical toolkit

To perform agile data analysis in a production environment, data processing must be integrated with tests that require a minimum of user interaction, so the analysis tools must be configurable programmatically. You also need fast access to graphical reporting, which gives a visual representation of the resulting analysis and helps you determine whether further investigation is required. Learn how to use advanced workflow techniques in KNIME, an Eclipse-based graphical workbench for data analysis and reporting, together with the R analysis package to create production-ready workflows and complex graphical representations of the processed data at any point in the workflow.

Scott Snyder (shsnyder@us.ibm.com), Senior Performance Architect, IBM

Scott has worked to improve the performance of server-side technologies for fifteen years. Using various load-testing and analysis techniques, he unravels the root causes of many performance and scalability problems.



10 June 2014


Introduction

To make optimal use of agile data analysis in a production environment, the processing of data must be integrated with sets of automated tests that require a minimum of user interaction. Text-based inputs and outputs need to be programmatically configurable. Some form of graphical reporting, which gives a visual representation of the resulting analysis, must be available as quickly as possible to determine whether further investigation is required.

Workflows in KNIME, an Eclipse-based graphical workbench for data visualization and reporting, support this type of automation through the use of variables, programming nodes, and meta nodes.

Although an output file in CSV format is useful, a graphical presentation of the data can sometimes aid in interpretation of the data. The latest version of KNIME integrates well with the R analysis package, which makes it possible to create complex graphical representations of the tabular data at any point in the workflow.


Advanced KNIME workflows for production use

As a project grows over time, the analysis infrastructure must be able to adapt to changing requirements and operational objectives. To stay agile, workflows must adapt to changing environments and scale as they grow more complex.

Use variables in KNIME workflows

In the sample Apache access log workflow defined in the previous article of this series, the name of the output file is static, defined within the CSV Writer node. Although the use of a static file name is acceptable in an example, in production use it's more practical for the name of the output file to reflect the name and location of the input file. This naming convention offers several advantages:

  • The name of the input file indicates the source of the original data
  • The meaningful name helps avoid the risk of deleting a previous analysis
  • The name changes automatically when the input log file changes, rather than requiring you to create a unique name for the output CSV file manually each time

Consider using variables to assign the name. In KNIME, variables used within a workflow are called flow variables. As in most programming languages, a variable is a name associated with a value that can be used elsewhere when the program runs.
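Conceptually, a flow variable is just a named value defined by an upstream node and read by downstream nodes at run time. A minimal Java sketch of the idea (the map and the variable name here are illustrative, not the KNIME API):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: KNIME's real flow-variable API differs. This merely
// models the idea of a named value defined upstream and read downstream.
public class FlowVariableSketch {
    public static void main(String[] args) {
        // An upstream node (for example, File Upload) defines a flow variable.
        Map<String, String> flowVars = new HashMap<>();
        flowVars.put("accesslog_filename", "/logs/access.log");

        // A downstream node (for example, Weblog Reader) reads it at run time.
        String inputFile = flowVars.get("accesslog_filename");
        System.out.println(inputFile); // prints /logs/access.log
    }
}
```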


Make the sample workflow more flexible

Enable the node monitor

After you start working with variables in KNIME, it is helpful to enable the node monitor, which displays the value of the variables available to each node. If the node monitor view is not visible in your current KNIME perspective, click View > Node Monitor from the menu bar, as shown in Figure 1. If it does not appear in the menu, click View > Other... and then select it from the KNIME Views presented in the dialog box.

Figure 1. Node Monitor under KNIME Views
List of KNIME views with Node Monitor highlighted

In the Weblog Reader, the node monitor contains only one variable, knime.workspace, which is the directory containing your current workspace. The goal is to create a variable that contains the name and full path to the access.log file being analyzed.

To create a variable that indicates the name and path of the access.log file being analyzed, place a File Upload node on the work area. Instead of the small triangle that represents data output from a node, this node has a small red dot, which indicates that the output is a flow variable, not data. This red dot must be connected to the flow variable input on the Weblog Reader node. To avoid clutter, flow variable input and output ports are not visible by default on the node display. To make them visible, right-click the Weblog Reader node and select Show Flow Variable Ports. Two red dots become visible; they indicate the location of the flow variable ports. Connect the flow variable output port on the File Upload node to the flow variable input of the Weblog Reader node.

In the File Upload configuration dialog box, define a Variable Name such as accesslog_filename and browse for the access.log file that you want to analyze. After you select the access log file you want and save it, look in the node monitor for the File Upload node and you see two new variables, accesslog_filename and accesslog_filename (URL), which is a version of the file name string formatted as a URL for use in nodes that require a URL rather than a standard directory path to locate the file.
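The URL-formatted variant can be reproduced outside KNIME in plain Java; this sketch (the path is illustrative, and this is not KNIME code) shows how a standard directory path becomes the file URL form that URL-based nodes expect:

```java
import java.io.File;

public class FileUrlSketch {
    public static void main(String[] args) {
        // Convert a standard directory path into a file: URL, the form that
        // the accesslog_filename (URL) variable carries.
        String path = "/logs/access.log";
        String url = new File(path).toURI().toString();
        System.out.println(url); // on Unix-like systems: file:/logs/access.log
    }
}
```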

If you select other nodes in the workflow that are ready to run (marked with a yellow traffic-light icon), look in the node monitor to see that these new variables exist for all of them. In general, if you think of a workflow as a flow of information from one node to another, variables are available in nodes downstream from where they originated.

The accesslog_filename variable determines which Apache access.log file the Weblog Reader reads. On the Configure panel of the Weblog Reader, click the Flow Variables tab, which lists every configuration setting available on the Input Options tab that can be set with a flow variable, as shown in Figure 2.

Expand the sources node in the dialog box. The source labeled 0 represents the first source in the dialog box. The drop-down list contains all of the variables available to this node. Select accesslog_filename (URL) as the variable that contains the name of the access.log file.

Figure 2. Weblog Reader: Flow Variables dialog box
Weblog Reader Flow Variables dialog box

The first input source is now set to the file path specified in the File Upload node. You have created a variable that represents the path to the access.log file and specified that it be used in the Weblog Reader, which completes the first half of the process. The other half is to use a modification of this variable to set the name of the output CSV file, by performing the following steps:

  1. Using the method described earlier, make visible the flow variable ports for the Column Filter node connected to the CSV Writer.
  2. Make visible the flow variable ports for the CSV Writer.
  3. Select the Java Edit Variable (simple) node and put it on the work area.
  4. Draw a line that connects the flow variable output of the Column Filter to the flow variable input of the Java Edit Variable (simple) node.
  5. Draw a line that connects the flow variable output of the Java Edit Variable (simple) node to the flow variable input of the CSV Writer.

This process, shown in Figure 3, defines how variable information flows from the Column Filter node through the Java Edit Variable (simple) node to the CSV Writer. Within the Java Edit Variable (simple) node, you use the accesslog_filename variable to create an output file name variable.

The final step is to define how to derive the output CSV file name from the input access.log file.

Figure 3. Workflow with Java Edit Variable node
Workflow with Java Edit Variable node

Open the configuration dialog box of the Java Edit Variable node in your workflow and add the content shown in Listing 1 in the Method Body part of the dialog, as shown in Figure 4.

Listing 1. Content to add to the Method Body
return $${Saccesslog_filename}$$.replace(".log", "_processed.csv");
Figure 4. Java Edit Variable containing code to modify filename
Java Edit Variable contains code to modify filename

To place the name of the flow variable into the method body, double-click the appropriate variable in the Flow Variable List. This action surrounds the variable with special formatting that tells the KNIME runtime to replace the flow variable with its value. The result is stored in a new variable, output_filename, in which the .log extension of the log file is replaced with _processed.csv in the same directory. Run the workflow to confirm that the name of the output CSV file containing the processed data is now derived from the name of the input access.log file.
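The string manipulation that the node performs can be tried outside KNIME. This standalone sketch (the path is illustrative) applies the same replace call as Listing 1:

```java
public class OutputNameSketch {
    public static void main(String[] args) {
        // Derive the output CSV name by swapping the .log extension for
        // _processed.csv; the directory portion of the path is untouched.
        String accesslogFilename = "/logs/access.log";
        String outputFilename = accesslogFilename.replace(".log", "_processed.csv");
        System.out.println(outputFilename); // prints /logs/access_processed.csv

        // Note: String.replace substitutes every occurrence, so a path such
        // as /my.log.dir/access.log would also be altered in the directory part.
    }
}
```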

Use meta nodes in KNIME workflows

To simplify a complex workflow and hide the implementation details that might not be useful to the user of the workflow, use meta nodes. A meta node is a user-defined node that combines a subset of a workflow into a single node.

For example, assume that you want to hide the Java Edit Variable node from the user of this workflow because its purpose is not obvious to a casual user. You can combine the Java Edit Variable node and the CSV Writer node into a single node whose name indicates that it outputs the table to CSV.

Select the Java Edit Variable node and Shift-click the CSV Writer node so that both are selected. Right-click the selected nodes and then select Collapse into Meta Node from the context menu. After it is created, select the meta node and name it Output CSV file. A single node containing the variable and data flow inputs now replaces the Java Edit Variable and CSV Writer nodes. The work performed by this node is indicated by its name, and the Java programming necessary to create the output file name is not visible, as shown in Figure 5. If you need to access these nodes, double-click the meta node to open it; you can edit the individual nodes in this view without pulling them out of the meta node.

Figure 5. Workflow with meta node
Replaces Java Edit Variable node and CSV Writer node

R integration nodes

R project for statistical computing

According to the R Project website, "R is a language and environment for statistical computing and graphics. ... One of R's strengths is the ease with which well-designed publication-quality plots can be produced, including mathematical symbols and formulae where needed. Great care has been taken over the defaults for the minor design choices in graphics, but the user retains full control."

R uses a rectangular data structure called a data frame to hold tabular data. Recent versions of KNIME have R Interactive nodes that make it possible for data to flow from KNIME into an R data frame.

After data is stored in a data frame, the full power of the R language is available: sophisticated analysis tasks can be performed programmatically with high-level language constructs designed specifically for processing and analyzing data.

You can use this R integration to generate publication-quality plots of KNIME data by using an R package called ggplot2.

To use this capability, download R and install it on your system.

As shown in Figure 6, within the KNIME Labs group of nodes there is a set of R (Interactive) nodes that provide a full interactive R environment. In this example workflow, you place an R View (Table) node on the workspace and make a normal KNIME connection from the output of the Column Filter node to the input of this node for each plot you want to generate.

Figure 6. Workflow with output being sent to R View (Table) nodes
Output includes CSV file and R View (Tables)

Although a complete discussion of R and ggplot2 is beyond the scope of this article, the main idea is straightforward. You define a display theme for the plot; define the x and y variables from the node input, which is available in the R workspace as the variable knime.in; and then define the plot type and how the data is to be aggregated.

Figure 7. R View (Table) dialog box in KNIME
R Snippet tab shows R script

The updated workflow that is available as a download generates three plots from the same processed access-log data that is stored in the CSV file. The theme for each plot is almost the same, with small changes to accommodate different plot types. Listings 2, 3, and 4 show the R code for each plot.

Listing 2. Scatter plot of country versus time, color-coded by requested URI
# Using the data from the node input (knime.in), generate a (default) scatter plot.
# Color each plotted point based on URI, apply the graph theme, and display the plot
g1 <- ggplot(knime.in, aes(x = knime.in$"time diff", y = knime.in$"Country"))
g1 <- g1 + geom_point(aes(color = factor(URI))) + graph_theme
g1
Listing 3. Bar chart of request counts for each country based on URI
# Reorder the node input (knime.in) from largest to smallest total count per country, summing the counts from all URIs
# Using the data from the node input (knime.in), generate a bar chart.
# Color each bar based on URI, flip the graph coordinates, apply the graph theme, and display the plot
knime.in = transform(knime.in, Country = reorder(Country, 1 + numeric(dim(knime.in)[1]), FUN = sum))
g1 <- ggplot(knime.in, aes(x = knime.in$"Country"))
g1 <- g1 + geom_bar(aes(fill = URI)) + coord_flip() + graph_theme
g1
Listing 4. Bar chart of all request counts for each URI
# Reorder the node input (knime.in) from largest to smallest total count per URI, summing the counts from all countries
# Using the data from the node input (knime.in), generate a bar chart.
# Color each bar grey, flip the graph coordinates, apply the graph theme, and display the plot
knime.in = transform(knime.in, URI = reorder(URI, 1 + numeric(dim(knime.in)[1]), FUN = sum))
g1 <- ggplot(knime.in, aes(x = knime.in$"URI"))
g1 <- g1 + geom_bar(fill = "grey") + coord_flip() + graph_theme
g1

Conclusion

Agile data analysis in a production environment can greatly enhance understanding of the results of automated tests. The more information that can be extracted from tests with a minimum of user interaction, the more quickly we can respond to critical or error conditions and determine whether further investigation or testing is required.

By automating KNIME workflows and integrating them with the analysis and graphing capabilities of R, we can perform agile analysis that is useful to testers, developers, and domain experts, as well as to others who need to understand the consequences of specific tests, such as marketing and sales professionals.


Downloads

  • Updated workflow (variables, meta nodes, R integration): Process-Apache-access-logs-w_R.zip (53KB)
  • Screen capture, scatter plot of country vs. time: r_article_timeseries.zip (296KB)
  • Screen capture, bar chart of request counts for each country: r_article_country.zip (50KB)
  • Screen capture, bar chart of request counts for each URI: r_article_uri.zip (13KB)

