Agile data analysis

Analyzing data in an agile world

This article describes techniques and tooling that can empower testers and other consumers of measurement data to interpret results in an adaptable way, making data analysis more interactive.

Scott Snyder (shsnyder@us.ibm.com), Senior Performance Architect, IBM

Scott has been working to improve the performance of server-side technologies for fifteen years. Using various load testing and analysis techniques, he has been able to unravel the root causes of many performance and scalability problems.



22 October 2013


The purpose of testing is to identify problems and defects in a product. While some tests are pass/fail, many require significant analysis of measurement data to learn something about the system under test. Agile software development allows for a changing and dynamic feature set to support rapid evaluation of features. Current methods of data analysis, however, rely primarily on static scripts written in compiled or interpreted programming languages (Perl, Python, and so on). Although dynamic languages can greatly facilitate the analysis process because of their rapid development cycle, the individuals working most closely with the tests may not have access to the source code of the analysis software or the skill set to make the necessary changes.

Static analysis in an agile world

Test teams run tests on software during development to identify problems. In a traditional waterfall process, the functional requirements are known in advance and are implemented on a schedule. Agile development allows an application's functions to change over time to meet changing customer requirements. To meet these new dynamic test requirements, there has been an explosion of new test methodologies. Test-driven and behavior-driven development methodologies have emerged to support these short development cycles, and dynamic, declarative configuration tools such as Puppet and Chef are used to quickly provision and configure deployment environments.

But what about investigation of non-functional requirements? Function tests are pass/fail in nature: either the specific function is implemented correctly or it is not.

Many requirements are non-functional in nature, and success or failure is not as simple as passing a specific functional requirement. Load and stress tests, performance testing, and capacity determination are examples of tests that are not binary but require active investigation and analysis to determine whether an application meets the non-functional requirement.

Current methods of data analysis rely primarily on static scripts written in compiled or interpreted programming languages (Perl, Python, Ruby, and so on). For analyses that have to be run many times against a largely static system or slowly changing codebase, an investment in analysis tooling and software can be justified: the tools remain useful for a long time and can help identify the root cause of performance or stability problems, for example. However, for a codebase that is rapidly changing in both content and capability, the time spent developing custom analysis tooling is largely wasted, because the test team is constantly chasing a moving target.

Although the use of dynamic languages can greatly facilitate the analysis process due to their rapid development cycle, the individuals working most closely with the tests may not have access to the source code of the analysis software or the skill set to make the inevitable changes that will be necessary when the application changes.

In this article I describe agile data analysis techniques and tooling that can empower testers and other consumers of measurement data to interpret results in an adaptable way. These tools and techniques make data analysis more interactive and minimize the need for ongoing script rewrites.


Why agile data analysis?

Non-functional tests require a more in-depth understanding of the system being measured, because success or failure depends on criteria specified in terms of the system's operational characteristics. Slow response times, non-linear scaling in a cluster, excessive memory consumption (and the garbage collection it triggers), or excessive network traffic are examples of undesirable behaviors that can be discovered in a running system. Tracing these indicators back to their origin requires a deeper understanding of how the system operates.

We perform data analysis to extract information from the various sources of data available in a running system so that the origins of these undesirable behaviors can be discovered. For an established product with a fixed set of requirements, analysis tools can be written that correlate the multiple sources of data so that system health can be monitored. But during agile development, the program or system being developed can change dramatically from build to build as the codebase responds to changes in functional requirements or user interface. The tool set to interpret all of the sources of data needs to be agile as well.

To quote from Wikipedia:
"A highly composable system provides recombinant components that can be selected and assembled in various combinations to satisfy specific user requirements."

But what does it mean for a set of tools to be agile? The classic example of an agile tool set is the collection of utilities that ships with UNIX and Linux. Each utility performs a narrowly defined task, and the utilities can be chained together, either on the command line or in a shell script, to perform an endless variety of jobs that would be impossible to duplicate if each one required a dedicated application. A user of these utilities does not need to know how each one works internally, only how to combine their functionality to solve a particular problem. Therefore, the first requirement for agile data analysis is composability.
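The UNIX pipeline idea carries over directly to the scripting languages mentioned earlier. Listing 1 is a minimal sketch, not tied to any particular product, of the same pattern in Python: small, single-purpose functions that each consume and produce rows of data and can be chained in any order. The input file, its column names, and the thresholds are hypothetical.

Listing 1. Composable analysis steps chained like a pipeline (sketch)

# Each function consumes an iterable of rows and yields rows, so steps can be
# combined in any order, much like UNIX utilities connected by pipes.
import csv
from typing import Dict, Iterable, Iterator

Row = Dict[str, str]

def read_csv(path: str) -> Iterator[Row]:
    """Read a CSV file and yield one dictionary per row."""
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def only_errors(rows: Iterable[Row]) -> Iterator[Row]:
    """Keep rows whose 'status' field indicates a server error."""
    return (r for r in rows if r.get("status", "").startswith("5"))

def slower_than(rows: Iterable[Row], limit_ms: float) -> Iterator[Row]:
    """Keep rows whose response time exceeds a limit in milliseconds."""
    return (r for r in rows if float(r.get("response_ms", 0)) > limit_ms)

if __name__ == "__main__":
    # Compose the steps like a pipeline: read | filter | filter
    rows = read_csv("load_test_results.csv")   # hypothetical input file
    for row in slower_than(only_errors(rows), 2000):
        print(row["timestamp"], row["url"], row["response_ms"])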


Agile data formats

The UNIX utilities achieve composability by passing text between tools. Microsoft's PowerShell is also composable, but it passes structured PowerShell objects between commands. What data format, then, should be passed between the components of an agile data analysis system?

Data obtained from a running system can originate from many different places and in many formats. Some examples are:

  • Unstructured log files
  • Structured log files
  • System environment data
    • Percentage of CPU that is busy
    • Memory consumed
    • Disk I/O
  • Measured data
    • Spreadsheets
    • HTML-based reports

All of these sources need to be parsed and interpreted before any data analysis can be done. The challenge is to reduce this data, which arrives in a variety of formats, into a form in which comparisons between the various types of data can be made.
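For example, an unstructured log line can be reduced to a row of named fields that lines up with rows from other sources. Listing 2 is a sketch; the log format, field names, and regular expression are hypothetical and would need to be adapted to the actual sources.

Listing 2. Reducing an unstructured log line to named fields (sketch)

# Sketch: reduce a hypothetical unstructured log line to a row of named fields.
import re
from typing import Dict, Optional

# Example line:
# "2013-10-22 14:03:07,421 WARN  RequestHandler - request /search took 2314 ms"
LOG_PATTERN = re.compile(
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+\s+"
    r"(?P<level>\w+)\s+(?P<component>\S+)\s+-\s+"
    r"request (?P<url>\S+) took (?P<elapsed_ms>\d+) ms"
)

def parse_line(line: str) -> Optional[Dict[str, str]]:
    """Return a dictionary of named fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

if __name__ == "__main__":
    sample = "2013-10-22 14:03:07,421 WARN  RequestHandler - request /search took 2314 ms"
    print(parse_line(sample))
    # {'timestamp': '2013-10-22 14:03:07', 'level': 'WARN',
    #  'component': 'RequestHandler', 'url': '/search', 'elapsed_ms': '2314'}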

As an example, suppose a load test is being performed on a server product using a tool that simulates multiple browser users. Several open source and commercial tools can perform such a load measurement. The tool reports the response time for each request, and you want to know when the server response time exceeds a specific limit. Response time can grow for many different reasons: the number of concurrent users, excessive memory consumption that causes garbage-collection delays, network saturation, and so on.

To determine the source of the response time delay, examine data in all of the forms outlined above: from the operating system, from the server under test, and from the load tool.

There have been attempts to standardize data formats, such as XML or JSON (the latter largely for data consumed by the browser). A format like XML is relatively easy to parse, but it is so structured that converting the data in a file into a form another tool can interpret, or compare with non-XML content, can require significant work. The simplest data format with almost universal support is the rectangular table, in which each row contains a collection of attributes associated with a unique data instance. For the load measurement example, each row in the table can represent the relevant data available at a particular point in time during the test. Tabular data is directly exportable to spreadsheets and databases through on-disk formats such as comma-separated values (CSV).
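Listing 3 sketches what that looks like in practice: per-request response times from the load tool and periodic CPU samples from the operating system are merged into a single time-indexed table, and the intervals where the response-time limit is exceeded are flagged. The sketch assumes the pandas library, and the file and column names are invented for illustration.

Listing 3. Correlating load-tool and operating-system data in one table (sketch)

# Sketch: correlate load-tool response times with OS CPU samples by timestamp.
# Assumes pandas is installed and that both inputs are CSV files with a
# 'timestamp' column; all file and column names here are hypothetical.
import pandas as pd

RESPONSE_LIMIT_MS = 2000

# Per-request results exported by the load tool: timestamp, url, response_ms
responses = pd.read_csv("load_test_results.csv", parse_dates=["timestamp"])

# Periodic system samples: timestamp, cpu_busy_pct, mem_used_mb
system = pd.read_csv("system_stats.csv", parse_dates=["timestamp"])

# Aggregate responses into 10-second intervals so they line up with the samples.
per_interval = (
    responses.set_index("timestamp")
    .resample("10s")["response_ms"]
    .agg(["mean", "max", "count"])
)

# Align each interval with the nearest system sample (within 10 seconds).
merged = pd.merge_asof(
    per_interval.sort_index(),
    system.sort_values("timestamp"),
    left_index=True,
    right_on="timestamp",
    direction="nearest",
    tolerance=pd.Timedelta("10s"),
)

# Keep only the intervals where the maximum response time exceeded the limit.
slow = merged[merged["max"] > RESPONSE_LIMIT_MS]
print(slow[["timestamp", "mean", "max", "count", "cpu_busy_pct", "mem_used_mb"]])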


Analysts or programmers?

As software applications get larger and more complex, automated testing becomes a necessity. Even testers who are expert at automating complex scenarios to determine the quality and robustness of a specific software system might have no understanding of the internals of the software they are testing. This black-box testing does not require programming except, perhaps, to define the test scenario.

Non-functional tests can also be driven from an automated test suite, but the results cannot be treated like a black box. If our data analysis is to be agile, interpreting the results of an automated non-functional measurement should require minimal or no programming at all, just a tool to combine and correlate the various sources of data.

Data analytics is a different skill set from programming. It requires a more in-depth understanding of the program's architecture and of the diagnostic information available from the various components of the program and its deployment environment. Depending on the volume and complexity of the data acquired, data mining and statistical analysis techniques may also be necessary to interpret the data and determine the source of a non-functional defect.

To accommodate the possibility that the testers and analysts interpreting an automated measurement of an evolving software application do not have programming skills, our agile data analysis tooling should require minimal programming to be useful.


Summary of requirements for agile data analysis

In summary, to perform flexible data analysis for an agile software project, the tooling used must meet these three main criteria:

  • Functionality must be composable from existing functional components
  • Data exchanged between components must appear to the user as a table
  • Composition of components should require a minimum of programming

Existing analysis tooling, such as data mining and business analytics toolkits, meets these requirements. These toolkits are meant to provide out-of-the-box data mining and analysis capabilities to non-programmers.

Check out IBM SPSS Modeler

IBM SPSS Modeler is a data mining workbench that helps you build predictive models quickly and intuitively, without programming. Use it to dig deep into your data for insightful, profitable answers.

IBM SPSS Modeler has very sophisticated statistical modeling capabilities to analyze large and complex data sets. While certainly capable of performing data analysis for software application testing, it is probably most useful for very large data sets with complex statistical analysis requirements.

Open source products that can be used for more modest analysis tasks are probably more applicable to the type of agile data analysis being described here. One of the more popular open source tools available is called KNIME (see Resources for a link).

KNIME (the Konstanz Information Miner)

According to the KNIME website,
"KNIME [naim] is a user-friendly graphical workbench for the entire analysis process: data access, data transformation, initial investigation, powerful predictive analytics, visualisation and reporting. The open integration platform provides over 1000 modules (nodes), including those of the KNIME community and its extensive partner network."

KNIME is an Eclipse-based framework that was initially developed in 2006 to perform data analysis for the pharmaceutical industry. Since then it has evolved into a general-purpose data analytics, reporting, and integration platform. It is an ideal basis for a data analysis platform for investigating the non-functional capabilities of software developed using an agile methodology. It meets the requirements specified above through workflows that are displayed graphically as a set of nodes linked together by arrows indicating the direction of data flow.

By wiring these nodes together, you can implement a data analysis task with little or no programming. All of the analysis logic resides in the nodes, which consume external sources of data (through "reader" nodes), transform the internalized data, or combine it with other sources of data. At each node, the current state of the manipulated data can be examined in a spreadsheet-like display, so you can validate that the data has been transformed correctly for a particular analysis task.

Figure 1. Sample KNIME desktop
(Screen capture of the KNIME desktop showing a workflow of linked nodes)
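For readers who think in code, Listing 4 is a rough script analogue of the reader, transform, and inspect sequence just described. It is an analogy only, not KNIME's API; KNIME itself requires none of this code, and the node labels, file, and column names are hypothetical.

Listing 4. A script analogue of a reader/transform/inspect workflow (sketch)

# Analogy only: the reader -> transform -> inspect pattern of a KNIME workflow,
# expressed as a short pandas script. File and column names are hypothetical.
import pandas as pd

# "Reader" node: bring an external CSV into a table.
responses = pd.read_csv("load_test_results.csv", parse_dates=["timestamp"])
print(responses.head())   # inspect the data, as you would in a node's table view

# "Row filter" node: keep only the slow requests.
slow = responses[responses["response_ms"] > 2000]
print(slow.head())        # inspect again after the transformation

# "GroupBy" node: summarize slow requests per URL.
summary = slow.groupby("url")["response_ms"].agg(["count", "mean", "max"])
print(summary)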

Conclusion

In the next article in this series, I will show you how to build a simple workflow, import data into it, and execute logic that is commonly used in analyzing time-based data.

Resources

Get products and technologies

  • Download KNIME desktop software from the main KNIME site.
