Advantages of grouping change requests
Change requests can be easily consolidated according to structured data for which the value domain is well-defined, such as attributes of a given enumeration type. Questions such as how many change requests have been submitted against a category, priority or severity level, and which ones are still in the Open state can be answered with a simple IBM® Rational Team Concert™ query.
Grouping change requests based on unstructured data is not trivial. However, there are advantages that can be gained only by deriving intelligence from this type of data.
This article describes an approach for analyzing Rational Team Concert change request patterns by tokenizing the text attributes (summary, description, and comments) and applying machine learning techniques, specifically clustering algorithms that group change requests by similarity. By performing this type of analysis, software development teams benefit in the following areas:
- Quality improvement
- If too many change requests are associated with the same general theme, there might be an opportunity for improving the process related to that area that would decrease the number of future issues.
- Change requests that belong to the same grouping might be solved by a similar approach, following the same overall framework or applying the same solution pattern.
- Finding duplicates
- It is more efficient to search for duplicates before submitting a new change request by looking at similar requests.
- Collaboration patterns
- Understanding which team members contribute to solving similar change requests can help substantiate organizational change decisions, refine career goals, and develop or improve skills.
Applying clustering algorithms
Algorithms that are capable of learning by analyzing structured and unstructured data and that make decisions based on this acquired intelligence are called machine learning algorithms. Several successful companies today, such as Google and Netflix, depend on these algorithms for refining search engine results or recommending new items to users by analyzing the similarity of their preferences to other users' preferences. Machine learning techniques and algorithms fall into three main categories:
- Recommendation (collaborative filtering)
- The capability of picking products or items for a customer by inferring preference, based on previous behavior
- The capability of grouping a large number of items into clusters, based on learned similarity
- The capability of placing items into a predefined set of classes through training with a known preclassified data set
This article focuses on clustering as an approach for analyzing change request patterns.
Clustering is primarily focused on the concept of similarity between items in a collection, as determined by the distance between items in a multidimensional space. Two items belong to the same cluster if the distance between them is small enough.
Items are represented as vectors, with each attribute as a dimension. By assigning numerical values to each attribute, the item can be mapped to a point in the multidimensional space, in which each value becomes a coordinate in a dimension. The distance between two items can then be calculated by applying a measure such as Euclidean or Tanimoto distance, as specified below:
Considering two vectors:
There are several clustering algorithms available that can be used for exclusive, overlapping, hierarchical, or probabilistic clustering. For this work, the k-means clustering algorithm was selected for its simplicity and efficiency. The algorithm creates mutually exclusive clusters and takes as input the number of clusters to be generated, the initial centers (vectors) for each cluster, the distance measure, the maximum number of iterations, and the convergence threshold. Figure 1. Clusters of points in a plane illustrates three clusters created from a set of vectors in bi-dimensional space, given an initial set of centers (c1, c2 and c3).
In each iteration, k-means applies the distance measure to select the set of points that are closest to each of the centers and then readjusts the center, given the cluster of points produced in the iteration. The processing ends when either the maximum number of iterations is reached or the distance between centers in contiguous iterations is less than or equal to the convergence threshold.
Figure 1. Clusters of points in a plane
Using Apache Mahout with clustering
Mahout is a top-level Apache Software Foundation project that started as a sub-project under the Lucene umbrella, which provides a Java search library. Apache Mahout is also a Java library. It focuses on machine learning algorithms and provides implementation of algorithms in the three main categories of collaborative filtering, clustering, and classification.
Specifically for clustering, Mahout supports k-means, fuzzy k-means for overlapping clustering, canopy generation for estimation of the number of clusters and location of the approximate center point, and the Dirichlet algorithm for model-based (probabilistic) clustering.
Mahout provides implementations of those algorithms that can be run in a single process or can leverage the Hadoop MapReduce Engine and Distributed File System (HDFS) for scaling horizontally in a cluster of processing nodes. When using HDFS, vectors are saved in a Hadoop Sequence File. The algorithm runs as a set of map and reduce tasks that create adjusted clusters in each iteration. These clusters are also persisted as Sequence Files and fed as input to the subsequent iteration.
The analysis described in this article was done with the k-means algorithm implementation in the Mahout 0.5 release, running in a single node.
Categorization of change requests by using "order by" clauses in a query engine has limitations, because the grouping is dominated by the first attribute in the clause, with the subsequent attributes contributing only to additional partitioning of the first-level grouping.
The approach outlined here concerns finding these groupings (clusters) by using text attributes such as summary, description, and comments to represent each change request as a multidimensional vector . Subsequently, the Mahout implementation of the k-means algorithm is used, with the Tanimoto measure for vector distance calculation.
The representation of change requests as vectors from the text content can be done by assigning a dimension (vector index) to each word that occurs in the entire set of change requests. The number of occurrences of each specific word (term frequency) in a given change request becomes the word's weight, which is assigned to the corresponding vector index value.
Common occurrences of sequences with two or more words can also be considered part of the vocabulary. In this analysis, bi-grams (combination of two terms) were included in the vector generation process.
Including comments as part of the text analyzed has the benefit of taking user names (comment authors) into account. Some of the clusters generated by the algorithm reveal names as top terms (number of occurrences) that characterize the cluster. This helps identify team members who contribute the most to certain classes of change requests.
The input data was generated by running the Rational Team Concert work item export tool, which creates a .csv file that includes the work item ID, summary, description, and comments in double quotation marks for a set of 500 sample tasks and change requests. The next step in preparing the input data is to create a Hadoop Sequence File with a set of key and value pairs, one pair per change request. The key is the work item ID and the value is the concatenation of the summary, description and comment attributes.
This Sequence File is used as input to tokenize the documents with an analyzer that specializes the White Space tokenizer class from the Lucene library. The custom analyzer includes only tokens with more than three alphabetic characters. The result is also stored as a Sequence File and fed as input to the method that generates the vectors from the term frequency analysis.
The final step is to run the k-means algorithm, which created twenty clusters from the 500 work items, starting with a set of centers selected randomly. You can improve results by departing from a preselected set of themes and manually choosing one representative for each theme as the initial center for the theme clusters.
Table 1 summarizes some of the useful clusters that were identified (the user names that surfaced in the analysis as top terms were omitted).
Table 1. Top terms in a subset of clusters
|Most frequent terms||Number of change requests|
|AppScan, Security, Testing||27|
|CLM, Upskilling, PSO, Pilot||24|
|Measurement, Performance, Insight||36|
|CLM, RTC, RRC, RQM||32|
|Systems, Systems Engineering, Process||26|
|Requirements, RRC, Management||3|
|Architecture, RTC, Migration||21|
Clear patterns emerge from the analysis with a mix of specific domain vocabulary (security, measurement), activities (upskilling, pilot), and product names (AppScan, Rational Team Concert) combined in the same cluster.
Some of the uncovered clusters are not that interesting, but they could be eliminated in future algorithm runs by creating a list of terms that should be filtered out when tokenizing the text.
Table 2. Top terms that skew the analysis in some clusters
|Most frequent terms||Number of change requests|
|Attachment, added, Rational||28|
|Work, have, can, you||106|
|Presentation, application, services||11|
Future research avenues
There are a few ways to expand on this initial work. The following avenues could improve the quality of the clusters and help find other patterns, such as team collaboration.
Collaboration networks from comments
Given that comments in Rational Team Concert contain the author's user ID, it is possible to build a graph of team communication by looking at the user IDs that add comments in the same change request. The graph nodes would represent users, and there would be a link between two nodes if the corresponding users have entered at least one comment in the same change request. The thickness of the link would indicate in how many common change requests the users have contributed comments.
Clustering upon submission
Change requests could be assigned immediately after submission to predefined clusters that were created by running the clustering algorithm in batch mode on the previous night. That would help in determining potential solutions for the new change request.
Mixing structured and unstructured attributes
The creation of vectors that represent the change requests can take into account a combination of unstructured text attributes and structured attributes that have a finite domain of values. This approach could improve the quality of the uncovered clusters.
Including attachments in the analysis
Apache Tika is a framework for identification of the media type and extraction of the text and metadata. It could be used to extract text from change request attached documents. The text would be tokenized and included in the vector generation process.
- Recommended reading:
- Algorithms of the Intelligent Web by Haralambos Marmanis and Dmitry Babenko. Manning Publications, 2009.
- Lucene in Action , Second Edition: Covers Apache Lucene 3.0 by Michael McCandless, Erik Hatcher, and Otis Gospodnetic. Manning Publications, 2010.
- Mahout in Action by Sean Owen, Robin Anil, Ted Dunning, and Ellen Friedman. Manning Publications, 2011.
- Tika in Action by Chris Mattmann and Jukka Zitting. Manning Publications, 2011.
- Hadoop: The Definitive Guide by Tom White. Third edition. O'Reilly Media, 2012.
- Learn more about Rational Team Concert:
- Find Rational Team Concert articles and links to many other resources on IBM developerWorks, and check the product overview page, features and benefits, system requirements, and the user information center.
- Check the Rational Team Concert page on Jazz.net.
- Watch the Using Rational Team Concert in a globally distributed team webcast or a demonstration of the Dashboards and reports, or listen to the podcast about IBM Rational Team Concert and Jazz.
- Visit the Rational software area on developerWorks for technical resources and best practices for Rational Software Delivery Platform products.
- Subscribe to the developerWorks weekly email newsletter, and choose the topics to follow. Stay current with developerWorks technical events and webcasts focused on a variety of IBM products and IT industry topics.
- Improve your skills. Check the Rational training and certification catalog, which includes many types of courses on a wide range of topics. You can take some of them anywhere, any time, and many of the "Getting Started" ones are free.
Get products and technologies
- Download Rational Team Concert from Jazz.net and try it for free on up to 10 projects for as long as you want (requires registration).
- Evaluate IBM software in the way that suits you best: Download it for a trial, try it online, use it in a cloud environment, or spend a few hours in the SOA Sandbox learning how to implement service-oriented architecture efficiently.
- Join the Rational Team Concert discussions or ask questions in the Jazz.net forums.
- Rate or review Rational software. It's quick and easy.
- hare your knowledge and help others who use Rational software by writing a developerWorks article. Find out what makes a good developerWorks article and how to proceed.
- Follow Rational software on Facebook, Twitter (@ibmrational), and YouTube, and add your comments and requests.