Benefits of applying clustering algorithms to finding change request patterns

Group change requests according to text similarity

This article describes an approach for analyzing IBM Rational Team Concert change request patterns by applying machine learning techniques, specifically clustering algorithms, to group the change requests according to text similarity. By performing this analysis, software development projects benefit in quality improvement, reuse, process, and team collaboration. The analysis described in this article as an example relies on the Apache Mahout library implementation of the k-means clustering algorithm.

Luis Quintela (lquintela@us.ibm.com), Solution Architect, IBM

Photo of Luis QuintelaLuis Quintela has been a Solution Architect in the ISSR Integration Engineering Team since July 2010, focused on designing integrations and extensions to the IBM Rational software product portfolio. Luis joined Rational Software in 1997. He served as an IT specialist until 2001 and then as the Technical Sales Lead for the Silicon Valley and San Francisco districts through IBM's acquisition of Rational software in 2003. After the acquisition, Luis served as the Rational Services Manager for the San Francisco Bay Area until 2007, when he joined the Community of Practice team, initially in the Configuration and Change Management Community and later, after January 2009, as the Lead for the Software and Systems Delivery Platform Community. As the Community of Practice Lead, Luis managed the Insight Regional Mentor Program and contributed to several other initiatives as part of the IBM Rational Insight launch, such as developing the Learner portal and the proof-of-technology material. After the launch, Luis worked with customers interested in deploying a Rational Insight-based performance measurement solution. Over the last 15 years, Luis has worked in selling and deploying Rational solutions to the worldwide customer base. He received both his B.Sc. in Computer Science and his M.Sc. in Distributed Systems and Parallel Architectures from the Federal University of Rio de Janeiro, in Brazil.



17 April 2012

Also available in Chinese Spanish

Advantages of grouping change requests

Change requests can be easily consolidated according to structured data for which the value domain is well-defined, such as attributes of a given enumeration type. Questions such as how many change requests have been submitted against a category, priority or severity level, and which ones are still in the Open state can be answered with a simple IBM® Rational Team Concert™ query.

Grouping change requests based on unstructured data is not trivial. However, there are advantages that can be gained only by deriving intelligence from this type of data.

This article describes an approach for analyzing Rational Team Concert change request patterns by tokenizing the text attributes (summary, description, and comments) and applying machine learning techniques, specifically clustering algorithms that group change requests by similarity. By performing this type of analysis, software development teams benefit in the following areas:

Quality improvement
If too many change requests are associated with the same general theme, there might be an opportunity for improving the process related to that area that would decrease the number of future issues.
 
Reuse
Change requests that belong to the same grouping might be solved by a similar approach, following the same overall framework or applying the same solution pattern.
 
Finding duplicates
It is more efficient to search for duplicates before submitting a new change request by looking at similar requests.
 
Collaboration patterns
Understanding which team members contribute to solving similar change requests can help substantiate organizational change decisions, refine career goals, and develop or improve skills.

Applying clustering algorithms

Algorithms that are capable of learning by analyzing structured and unstructured data and that make decisions based on this acquired intelligence are called machine learning algorithms. Several successful companies today, such as Google and Netflix, depend on these algorithms for refining search engine results or recommending new items to users by analyzing the similarity of their preferences to other users' preferences. Machine learning techniques and algorithms fall into three main categories:

Recommendation (collaborative filtering)
The capability of picking products or items for a customer by inferring preference, based on previous behavior
 
Clustering
The capability of grouping a large number of items into clusters, based on learned similarity
 
Classification
The capability of placing items into a predefined set of classes through training with a known preclassified data set
 

This article focuses on clustering as an approach for analyzing change request patterns.

Clustering is primarily focused on the concept of similarity between items in a collection, as determined by the distance between items in a multidimensional space. Two items belong to the same cluster if the distance between them is small enough.

Items are represented as vectors, with each attribute as a dimension. By assigning numerical values to each attribute, the item can be mapped to a point in the multidimensional space, in which each value becomes a coordinate in a dimension. The distance between two items can then be calculated by applying a measure such as Euclidean or Tanimoto distance, as specified below:

Considering two vectors:Two vectors with n dimensions

Euclidean distance:Formula for Euclidean distance between 2 vectors

Tanimoto distance:Formula for Tanimoto distance between 2 vectors

There are several clustering algorithms available that can be used for exclusive, overlapping, hierarchical, or probabilistic clustering. For this work, the k-means clustering algorithm was selected for its simplicity and efficiency. The algorithm creates mutually exclusive clusters and takes as input the number of clusters to be generated, the initial centers (vectors) for each cluster, the distance measure, the maximum number of iterations, and the convergence threshold. Figure 1. Clusters of points in a plane illustrates three clusters created from a set of vectors in bi-dimensional space, given an initial set of centers (c1, c2 and c3).

In each iteration, k-means applies the distance measure to select the set of points that are closest to each of the centers and then readjusts the center, given the cluster of points produced in the iteration. The processing ends when either the maximum number of iterations is reached or the distance between centers in contiguous iterations is less than or equal to the convergence threshold.

Figure 1. Clusters of points in a plane
Coordinates grouped in clusters with initial centers

Using Apache Mahout with clustering

Mahout is a top-level Apache Software Foundation project that started as a sub-project under the Lucene umbrella, which provides a Java search library. Apache Mahout is also a Java library. It focuses on machine learning algorithms and provides implementation of algorithms in the three main categories of collaborative filtering, clustering, and classification.

Specifically for clustering, Mahout supports k-means, fuzzy k-means for overlapping clustering, canopy generation for estimation of the number of clusters and location of the approximate center point, and the Dirichlet algorithm for model-based (probabilistic) clustering.

Mahout provides implementations of those algorithms that can be run in a single process or can leverage the Hadoop MapReduce Engine and Distributed File System (HDFS) for scaling horizontally in a cluster of processing nodes. When using HDFS, vectors are saved in a Hadoop Sequence File. The algorithm runs as a set of map and reduce tasks that create adjusted clusters in each iteration. These clusters are also persisted as Sequence Files and fed as input to the subsequent iteration.

The analysis described in this article was done with the k-means algorithm implementation in the Mahout 0.5 release, running in a single node.


Finding patterns

Categorization of change requests by using "order by" clauses in a query engine has limitations, because the grouping is dominated by the first attribute in the clause, with the subsequent attributes contributing only to additional partitioning of the first-level grouping.

The approach outlined here concerns finding these groupings (clusters) by using text attributes such as summary, description, and comments to represent each change request as a multidimensional vector . Subsequently, the Mahout implementation of the k-means algorithm is used, with the Tanimoto measure for vector distance calculation.

The representation of change requests as vectors from the text content can be done by assigning a dimension (vector index) to each word that occurs in the entire set of change requests. The number of occurrences of each specific word (term frequency) in a given change request becomes the word's weight, which is assigned to the corresponding vector index value.

Common occurrences of sequences with two or more words can also be considered part of the vocabulary. In this analysis, bi-grams (combination of two terms) were included in the vector generation process.

Including comments as part of the text analyzed has the benefit of taking user names (comment authors) into account. Some of the clusters generated by the algorithm reveal names as top terms (number of occurrences) that characterize the cluster. This helps identify team members who contribute the most to certain classes of change requests.

The input data was generated by running the Rational Team Concert work item export tool, which creates a .csv file that includes the work item ID, summary, description, and comments in double quotation marks for a set of 500 sample tasks and change requests. The next step in preparing the input data is to create a Hadoop Sequence File with a set of key and value pairs, one pair per change request. The key is the work item ID and the value is the concatenation of the summary, description and comment attributes.

This Sequence File is used as input to tokenize the documents with an analyzer that specializes the White Space tokenizer class from the Lucene library. The custom analyzer includes only tokens with more than three alphabetic characters. The result is also stored as a Sequence File and fed as input to the method that generates the vectors from the term frequency analysis.

The final step is to run the k-means algorithm, which created twenty clusters from the 500 work items, starting with a set of centers selected randomly. You can improve results by departing from a preselected set of themes and manually choosing one representative for each theme as the initial center for the theme clusters.

Table 1 summarizes some of the useful clusters that were identified (the user names that surfaced in the analysis as top terms were omitted).

Table 1. Top terms in a subset of clusters
Most frequent termsNumber of change requests
AppScan, Security, Testing 27
CLM, Upskilling, PSO, Pilot 24
Measurement, Performance, Insight 36
CLM, RTC, RRC, RQM 32
Connections, Communities 23
Systems, Systems Engineering, Process 26
Requirements, RRC, Management 3
Architecture, RTC, Migration 21

Clear patterns emerge from the analysis with a mix of specific domain vocabulary (security, measurement), activities (upskilling, pilot), and product names (AppScan, Rational Team Concert) combined in the same cluster.

Some of the uncovered clusters are not that interesting, but they could be eliminated in future algorithm runs by creating a list of terms that should be filtered out when tokenizing the text.

Table 2. Top terms that skew the analysis in some clusters
Most frequent termsNumber of change requests
Attachment, added, Rational 28
Work, have, can, you 106
Presentation, application, services 11

Future research avenues

There are a few ways to expand on this initial work. The following avenues could improve the quality of the clusters and help find other patterns, such as team collaboration.

Collaboration networks from comments

Given that comments in Rational Team Concert contain the author's user ID, it is possible to build a graph of team communication by looking at the user IDs that add comments in the same change request. The graph nodes would represent users, and there would be a link between two nodes if the corresponding users have entered at least one comment in the same change request. The thickness of the link would indicate in how many common change requests the users have contributed comments.

Clustering upon submission

Change requests could be assigned immediately after submission to predefined clusters that were created by running the clustering algorithm in batch mode on the previous night. That would help in determining potential solutions for the new change request.

Mixing structured and unstructured attributes

The creation of vectors that represent the change requests can take into account a combination of unstructured text attributes and structured attributes that have a finite domain of values. This approach could improve the quality of the uncovered clusters.

Including attachments in the analysis

Apache Tika is a framework for identification of the media type and extraction of the text and metadata. It could be used to extract text from change request attached documents. The text would be tokenized and included in the vector generation process.

Resources

Learn

Get products and technologies

  • Download Rational Team Concert from Jazz.net and try it for free on up to 10 projects for as long as you want (requires registration).
  • Evaluate IBM software in the way that suits you best: Download it for a trial, try it online, use it in a cloud environment, or spend a few hours in the SOA Sandbox learning how to implement service-oriented architecture efficiently.

Discuss

Comments

developerWorks: Sign in

Required fields are indicated with an asterisk (*).


Need an IBM ID?
Forgot your IBM ID?


Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. Information in your profile (your name, country/region, and company name) is displayed to the public and will accompany any content you post, unless you opt to hide your company name. You may update your IBM account at any time.

All information submitted is secure.

Choose your display name



The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

Required fields are indicated with an asterisk (*).

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.

 


All information submitted is secure.

Dig deeper into Rational software on developerWorks


static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Rational
ArticleID=810167
ArticleTitle=Benefits of applying clustering algorithms to finding change request patterns
publish-date=04172012