Measures of success: RUP and the scientific method

Gary Pollice, Professor of Practice, Worcester Polytechnic Institute
Gary Pollice is a professor of practice at Worcester Polytechnic Institute, in Worcester, MA. He teaches software engineering, design, testing, and other computer science courses, and also directs student projects. Before entering the academic world, he spent more than thirty-five years developing various kinds of software, from business applications to compilers and tools. His last industry job was with IBM Rational software, where he was known as "the RUP Curmudgeon" and was also a member of the original Rational Suite team. He is the primary author of Software Development for Small Teams: A RUP-Centric Approach, published by Addison-Wesley in 2004. He holds a BA in mathematics and an MS in computer science.

Summary:  from The Rational Edge: If your RUP-based projects are successful, how do you know that your team’s use of RUP is the reason for that success? Here, Gary Pollice suggests a method for scientifically measuring several iterative development techniques.

Date:  14 Jul 2006
Level:  Introductory

I've said more than once over the last couple of years that "software engineering" is a misnomer. We actually practice software development.1 Philippe Kruchten and others have said that two things make software different from other engineering disciplines: each software development project is unique, and there are no fundamental laws that apply to software. Does this mean that we should abandon all hope of finding fundamental laws and developing a more engineering-like approach to developing software?

Not at all. Software is still a young discipline, and we have a lot of basic and applied research ahead of us in order to discover its laws. We also need to understand what parts of software development are amenable to more rigorous methods and under what conditions. The scientific method requires that we observe (software-related) phenomena, formulate hypotheses, use those hypotheses to predict future behavior, and validate or disprove our hypotheses.

If you use a process in your software development practice, you are already applying the scientific method, at least indirectly. Let me explain what I mean.

  • First, you observe your organization's or development teams' effectiveness. You see that they are effective, but you think they can be more so. You begin to think that a change in process might help. You observe the software-related phenomena and start to formulate an hypothesis.
  • Next, you decide what process to use. What you are really doing at this stage is basing your process choice on observations and experience from previous projects.
  • You select your process because you want to improve your project's chance of success. Thus, you have effectively formulated an hypothesis about the efficacy of the process.
  • Ideally, you configure the process for the project and team. You are also indirectly predicting the future of your project/process when you choose and configure the process for your project.
  • Now you must collect data and use it to validate your hypothesis. If the hypothesis does not match the observations, you must revise your hypothesis. You must also be able to perform the prediction and validation repeatedly in order to state that the hypothesis has been proved; i.e., repeatable results are essential within the scientific method.

I want to consider this last step in detail -- collecting data to validate the benefits of a process.

Let's assume that you determine, in whatever way is appropriate for your organization, that a project is successful. Naturally, you assume the process you selected is, in part, responsible for the success of the project. You send out a message to your organization about your success, and you urge others to use your process configuration if they have similar projects. You have data -- gathered from the successful project itself -- to back up your claims that the process (and your wisdom in selecting it!) is responsible for this success. Solid reasoning, right?

The next day, you receive a summons from the executive committee to appear before them to justify your claims. You gather your data and give a wonderful slide presentation of the data you collected -- the defect rate per 1,000 lines of code, the team's productivity, and so on -- and how it clearly shows that the project was an unqualified success. You repeat your claim that a contributing factor to the success is that you had the right process.

At this point, the grand executive inquisitor looks at you and asks a question for which you have no answer. "Did your team follow the process?" The only replies that come to your mind are "Of course, they told me they did," and "They must have followed the process, since the project was successful." You realize that neither of these replies sufficiently answers the question. Even though you've done a great job of collecting data and measuring several properties of the project, you didn't collect any data on how well the process was followed.

Yes, the team told you they followed the process, and they honestly believe that they did. But that doesn't mean that they really did follow it, especially given that the process you've chosen was new to the organization at the project's inception. They might have been more careful about the way they built the system, simply because they knew you were going to be collecting data on their performance. This is an example of the Hawthorne Effect.2 So the question remains: How do you know what to measure?

Measuring the Rational Unified Process

When you perform an experiment, you must gather data on the process you use as well as the results of the experiment. If you make a claim about the results, you must provide enough information that the experiment can be reproduced by another researcher or research team. The same is true for software "experiments." Problems arise because we often do not know what data to capture or, after we capture it, how to analyze it. The rest of this article suggests some ways to measure your process, with the goal of determining how much of the process your team actually used. If you can gather this data, you can then determine which techniques are most effective in your environment.

The Rational Unified Process®, or RUP®, contains guidance for all aspects of software development. In order to use it successfully, you must create a process configuration that is appropriate for your project. Most RUP configurations are based upon a set of best practices and other techniques that have been shown to be effective for different types of software projects.3

Our approach is simple. We will select a practice and look at ways to measure how well the team actually adopts the practice. We will use the Goal Question Metric (GQM) technique.4 Our goal is straightforward. We want to know if the team actually uses the process. Next, we decide upon questions we might ask such that the answers will help us determine whether we met the goal or not. Finally, for each question, we need to decide what measurement will provide an answer.

"Can't we simply ask the team if they followed the process?" you might ask. We can, and we should, but the response is more subjective than objective. In fact, people will often either tell you what they think you want to know, or they will tell you what they believe to be true, whether it's true or not. For example, if you ask someone how they turn their bicycle to the left, they'll tell you they simply turn the handlebars to the left and the bike turns. But if they think a little bit more about the problem, they'll tell you that they lean their body to the left as well. And if they very carefully observe themselves turning a bike to the left, they find that the first thing they do is turn the handlebars slightly to the right. This small shift enables them to "fall" to the left and then turn the handlebars to the left to make the turn. In other words, what we need is empirical data -- based upon observations of the team in action or data we can extract by other methods -- that will let us decide if they followed the process and to what degree.

We will look at just three basic areas of RUP: requirements management, iterative development, and testing. There are too many areas and combinations to try to address all of them. Our goal, in this article, is to see how we can select a technique or method and apply measurement to it in order to better understand the effectiveness of our process.

Requirements management

Most RUP configurations are use-case centric. This means that for functional requirements (requirements regarding what the system must actually do), we apply use cases to describe the system's behavior. What are some questions we might ask to determine if the team really uses use cases to describe the system from a functional viewpoint?5 One that comes to my mind is: "Are all features6 that are implemented in the code described in the use cases?" This seems fairly straightforward. If we can identify the features implemented, we can compare them against the features described in the use cases. The difficulty is determining the features that are actually implemented. If we can do this, we can calculate the following use-case effectiveness metric:

$$U = \frac{F_s}{F_i}$$

where Fi is the number of all features implemented and Fs is the number of features that were both implemented and specified. If you multiply U by 100, you have the percentage of implemented features that were actually specified in your use cases.

Typically, you will specify more features in your use cases than you actually implement. You end up removing some as you manage the scope in order to deliver a usable product on time. So Fs must include only those features that you actually implemented from your specification. With this restriction, U ranges from 0 to 1. If you used use cases exclusively to specify your functional requirements, you will obtain a value of 1. As you implement more features that were not specified in your use cases, the value of U will decrease.
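As a minimal sketch of the calculation, assume you have already reduced the use cases and the code to two sets of feature names. The function name and the feature names below are hypothetical, chosen only for illustration; how you identify the features is the hard part and is not shown here.

```python
def use_case_effectiveness(specified, implemented):
    """Return U = Fs / Fi for two collections of feature names."""
    implemented = set(implemented)
    if not implemented:
        raise ValueError("no implemented features to measure")
    # Fs counts only features that were both specified and implemented.
    f_s = len(implemented & set(specified))
    f_i = len(implemented)
    return f_s / f_i


# Hypothetical feature names, for illustration only.
specified = {"login", "search", "checkout", "wishlist"}      # wishlist was cut from scope
implemented = {"login", "search", "checkout", "export_csv"}  # export_csv was never specified

u = use_case_effectiveness(specified, implemented)
print(f"U = {u:.2f} ({u * 100:.0f}% of implemented features were specified)")
```

Note that the feature that was specified but dropped from scope does not lower U; only the unspecified feature does.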

There's just one problem with our metric. How can we obtain the values for Fs and Fi?7 First, dividing use cases into discrete features, Fs, is not that hard. We can examine the use cases and usually agree on the specific features they specify. We might use scenarios as the unit of measure or look at each step that describes system behavior as a feature. But the harder part is to calculate Fi. If you simply test for specified features, you can determine if they are all implemented, but you cannot determine from this whether or not your developers have implemented unspecified features.

Let me suggest a few ways of estimating the value of Fi. An excellent, yet time-consuming, way to determine whether the code implements just what the use cases specify is to perform code inspections on all code that gets checked in to the project. An inspection is somewhat different from a review in that you are looking for specific things. In this case, you try to identify the specific features implemented in the code and then map them to the use cases.

A second way of locating unspecified, yet implemented features is to let your testers perform a type of exploratory testing on the software. Exploratory testing is a technique popularized by Cem Kaner and James Bach.8 Using this technique, the tester does not start with a set of test cases or any other pre-defined testing script. The tester simply begins to "explore" the software to see what it can do. The tester tries to determine what the product allows him to do. You could provide the tester with the use cases and let him explore the product with the use cases as a guide. When the tester tries something that's not in the use cases, he records it as implemented-but-not-specified.

My preferred method of finding the value of Fi involves a hybrid approach:

  • First, you run your acceptance tests that are based strictly on the use cases. These must all pass before going on to the next step.
  • Second, run your acceptance tests under a code coverage tool. The code that is executed represents the code required to implement all of Fs.
  • Finally, examine the code that was not executed during your tests. This code either implements exception conditions or implements unspecified features. Ideally, this is a small amount of code and thus it is easy to identify the unspecified features.
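The last step above can be sketched in a few lines, assuming your coverage tool can export, for each file, the executable line numbers and the line numbers that actually ran. The data layout, file names, and function name here are assumptions for illustration, not the output format of any particular tool.

```python
def unexecuted_lines(executable, executed):
    """Map each file to the sorted executable lines the acceptance tests never ran."""
    report = {}
    for path, lines in executable.items():
        missed = sorted(set(lines) - set(executed.get(path, ())))
        if missed:
            report[path] = missed
    return report


# Hypothetical coverage data: file name -> line numbers.
executable = {
    "orders.py": [1, 2, 3, 4, 5, 6],
    "export.py": [1, 2, 3],
}
executed = {
    "orders.py": [1, 2, 3, 4],  # lines 5 and 6 never ran
    # export.py never ran at all
}

for path, lines in unexecuted_lines(executable, executed).items():
    print(path, lines)
```

The report only tells you where to look. Someone still has to read each unexecuted region and decide whether it is exception handling or an unspecified feature.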

The above metric, U, tells us whether the developers implemented just what was specified. But does a value (significantly) less than 1 mean the developers didn't do a good job of adopting use-case-driven development? Perhaps the use cases changed as the project progressed, but the team did not bother to update the use cases. Perhaps they relied on a more informal way to communicate new scenarios and use cases. This would indicate that the actual process used the written, controlled use cases as a starting point -- not necessarily a bad situation, but different from the planned process.

When you examine the code not executed in your tests, it's a good idea to have a person responsible for requirements determine if the requirements were changed, but not formally updated. In fact, a quick way to see if this might be the case is to look at the version control log for your use-case artifacts. If they were never changed after the initial version, you can be fairly sure that the team did not keep the requirements up to date, or at least did not record the changes in a persistent form in the use cases.

Iterative development

Any tailored RUP-based process, except for the most trivial project, will contain guidance for iterative development where the software is built incrementally in iterations. Typically, the first couple of iterations are devoted to elaborating the key requirements and developing an executable architecture.9 After these iterations, each iteration should produce a working, if incomplete, system.

What questions might we ask to see if our team actually performs iterative development?

One of the characteristics of an iteration is that it is time-boxed. That is, at the beginning of the iteration, you determine its end date. When that date arrives, the iteration is over. You consider only complete features, or whatever you are using to measure progress, as an iteration deliverable. If there is a feature that is almost done, but needs just a little more work, it is not part of the iteration's deliverables, and you do not extend the iteration's end date to accommodate it. So one question might be "Were the iterations actually time-boxed?" That is, was an iteration end date established at the beginning of the iteration and was it the actual end of the iteration?

We can define a metric for the project's iterative adherence with the following equation:

$$I = \frac{I_m}{I_t}$$

where Im is the number of iterations where the end date was modified (after the iteration started) and It is the total number of iterations. If the team followed iterative development strictly, I will have a value of zero. At the other end of the spectrum of possibilities, the team modified all of the end dates after each iteration started. In this case, I will be one. This metric is an easy one to calculate, regardless of how you maintain your project plans. All you need to track are the changes to the dates in your plans.
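A minimal sketch of this calculation follows. It assumes you can recover, for each iteration, the end date that was fixed when the iteration started and the date the iteration really ended; the `Iteration` record and the dates are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Iteration:
    planned_end: date  # end date fixed when the iteration started
    actual_end: date   # date the iteration really ended


def iterative_adherence(iterations):
    """Return I = Im / It; 0.0 means every time box was honored."""
    if not iterations:
        raise ValueError("no iterations recorded")
    # Im: iterations whose end date moved after the iteration started.
    modified = sum(1 for it in iterations if it.actual_end != it.planned_end)
    return modified / len(iterations)


# Hypothetical plan: four iterations, one of which slipped by three days.
plan = [
    Iteration(date(2006, 3, 3), date(2006, 3, 3)),
    Iteration(date(2006, 3, 17), date(2006, 3, 20)),
    Iteration(date(2006, 3, 31), date(2006, 3, 31)),
    Iteration(date(2006, 4, 14), date(2006, 4, 14)),
]
print(iterative_adherence(plan))
```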

Time-boxed iterations are necessary, but not sufficient to prove that a team is practicing iterative development. What else do we need to know? Another question we might ask is: "Was working software produced at the end of each iteration?" You can determine this by counting the number of iterations that produced working software. If you archive the software configuration at the end of each iteration, usually by labeling the configuration in your version control system and rebuilding if necessary, you can verify whether the software works in a very short time.

Once you have the information for each iteration, you can calculate a working software quotient, W, as follows:

$$W = \frac{I_w}{I_t}$$

where Iw is the number of iterations that produced working software and It is the total number of iterations, as before. You may choose to adjust It to be the total number of iterations starting from when the executable architecture is first produced or when the first executable software first appears. This allows the value of W to vary between zero and one, inclusively. A higher value of W indicates a higher adoption of iterative development.
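Here is a small sketch of W, including the optional adjustment of It. It assumes you have one true/false record per iteration, in order, saying whether the archived configuration built and ran; the function name, the `first_executable` parameter, and the sample data are hypothetical.

```python
def working_software_quotient(worked, first_executable=0):
    """Return W = Iw / It.

    worked: one boolean per iteration, in order.
    first_executable: index of the iteration where executable software
    first appears; earlier iterations are excluded from It.
    """
    counted = worked[first_executable:]
    if not counted:
        raise ValueError("no iterations to count")
    # Summing booleans counts the iterations that produced working software.
    return sum(counted) / len(counted)


# Hypothetical project: two elaboration iterations, then four more.
history = [False, False, True, True, False, True]

print(working_software_quotient(history))                      # all six iterations
print(working_software_quotient(history, first_executable=2))  # from the third iteration on
```

Whichever definition of It you choose, apply it the same way across projects, or the values of W will not be comparable.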

Testing practices

Testing is one area in which you can apply metrics during the normal course of performing the activities. You simply keep track of defect discovery and fix rates, test coverage statistics, and so on. You should be able to develop some metrics that will help determine if your actual testing process is the one defined in your overall process.

RUP has a rich set of activities and artifacts that a team can select from. If you practice iterative, incremental development, your testing should mirror development. That is, you should expect to see test cases and test artifacts produced for the first iterations and then grow incrementally as the system grows.

Let's assume that your team's testing process plans indicate that unit tests will be provided by the developers for every feature they develop, and that the testers will develop a set of integration and acceptance tests, based upon the requirements, that would test the working software at each iteration. Once again, you need to find questions whose answers will help determine if your team actually follows the process. We will consider just one question for our example: "Did the developers unit test their code?" This can be easily measured if you set up test automation so that all unit tests can be run from a single command. Better yet, you can set up your configuration management system to execute the tests upon any check-in.

How can you know if all functions are supported with unit tests? That might be fairly difficult to determine. You might demand that every line of code executes during unit tests, but this can be counterproductive in some cases -- that is, it may take more effort to develop the tests than the risk of a failure in the code is actually worth. One thing we know is that there is never enough time to do everything, so we must prioritize activities by the value they return. Whatever the reasons, you may decide that 100% unit-test coverage is not appropriate, but you certainly expect coverage to be close to 100%. One metric that comes to mind here is the code-coverage percentage of the unit tests. We can obtain the data at the end of each iteration or several times during the iteration. What we are looking for is trend data. We want to know whether unit testing was done consistently during the project.
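One simple way to turn the per-iteration coverage numbers into a trend is a least-squares slope. This is only a sketch of one possible analysis, not a method the article prescribes, and the coverage values below are made up for illustration.

```python
def coverage_trend(coverage):
    """Least-squares slope of coverage, in percentage points per iteration."""
    n = len(coverage)
    if n < 2:
        raise ValueError("need at least two iterations to see a trend")
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(coverage) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, coverage))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator


# Hypothetical unit-test coverage percentages at the end of each iteration.
steady = [90, 92, 91, 93, 92, 94]
declining = [90, 85, 80, 75, 70, 65]

print(f"steady:    {coverage_trend(steady):+.2f} points per iteration")
print(f"declining: {coverage_trend(declining):+.2f} points per iteration")
```

A slope near zero with high coverage suggests unit testing kept pace with development; a clearly negative slope suggests it was abandoned along the way.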

Let's say that we have twelve iterations' worth of data. The graph in Figure 1 shows this data for three projects.

Figure 1: Percentage of test coverage for twelve iterations, over three projects

From the data, we can infer that Project 2 (shown by the plummeting lower line in Figure 1) definitely did not follow the process guidelines for unit tests. Projects 1 and 3 seem to have done a good job. We could dig deeper into this if we have daily data and get a better picture if we think it's important. The graph in Figure 2 shows daily data from unit tests for two projects that would appear exactly the same if graphed according to the x and y axes in Figure 1, but actually show significantly different profiles with the more granular coordinates of Figure 2. We only look at four iterations, assuming that we have weekly iterations (consisting of five work days). In this case, Project 2 did not spend time developing tests as they implemented the code, but spent time at the end of the iteration writing tests. If the two projects differed in their success -- for example, if Project 2 had poorer quality than Project 1 -- this might provide an indication that testing as you go contributes to higher quality.

Figure 2: In this case, Project 2's team did not spend time developing tests as they implemented the code, but spent time at the end of the iteration writing tests.
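The difference between the two profiles can also be expressed as a number. The sketch below assumes you record how many unit tests were added on each work day of an iteration, and computes the share written on the final day. The measure, the function name, and the counts are all hypothetical; they are not the data behind Figure 2.

```python
def last_day_share(tests_added_per_day):
    """Fraction of an iteration's new unit tests that were written on its final day."""
    total = sum(tests_added_per_day)
    if total == 0:
        return 0.0  # no tests were written in this iteration
    return tests_added_per_day[-1] / total


# Hypothetical counts of unit tests added per work day, one list per weekly iteration.
project_1 = [[4, 5, 4, 5, 4], [5, 5, 5, 5, 5]]
project_2 = [[0, 0, 1, 0, 20], [0, 0, 0, 0, 25]]

for name, iterations in (("Project 1", project_1), ("Project 2", project_2)):
    shares = [round(last_day_share(days), 2) for days in iterations]
    print(name, shares)
```

With five-day iterations, a team that tests as it goes should land near 0.2; values close to 1 indicate that tests were written in a rush at the end.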

Summary

In this article, I have suggested a very small set of metrics for only a few basic practices. But I think this gives you an idea of how you might decide which measurements are appropriate, so that you can develop reasonable, easy-to-compute metrics. If you only plan on capturing metrics for one or two projects, don't bother with this approach. There just isn't enough information from which you can make any valid inferences. However, if you begin to capture a few key metrics on all of your projects, you will grow a database of valuable information upon which you can apply statistical analyses. You don't need a lot of data to begin analyzing it and formulating theories about what is effective and what isn't. You do, however, need to gather data in an objective, consistent way.

Managing your process and understanding whether it is effective for your organization is similar to performing experiments using the scientific method. Your goal is to be able to select and configure the right set of techniques to support projects effectively. You should identify a small number of metrics for key practices and collect the data necessary to determine if the project team actually follows the practice. As you gain experience and grow your database of measures, you can fine-tune your process based upon empirical observations. It won't be long before you can state with confidence what works and what doesn't work for your project teams.

Notes

1 See my column in the Dec. 2005 issue of The Rational Edge: http://www-128.ibm.com/developerworks/rational/library/dec05/pollice/index.html

2 The Hawthorne Effect refers to the causal relationship between observation and behavior; specifically, that the act of observing can cause the subjects being observed to act differently and perhaps perform at a higher level than they normally would. This was first associated with productivity experiments at the Hawthorne plant of Western Electric.

3 Best practices are only "best" in particular situations. However, the best practices in RUP are certainly reasonable to consider for inclusion in process configurations for most projects.

4 See The Goal Question Metric Approach, by Basili et al. A copy can be found at http://wwwagse.informatik.uni-kl.de/pubs/repository/basili94b/encyclo.gqm.pdf

5 The obvious question of "Did the team use use cases?" doesn't count.

6 We will use feature and function interchangeably in this discussion.

7 There are three terms that we use frequently when we talk about gathering empirical data -- measure, metric, and measurement. I describe them in my Aug. 2004 column: http://www.ibm.com/developerworks/rational/library/content/RationalEdge/aug04/5585.html

8 See Cem Kaner, Jack Falk, and Hung Q. Nguyen, Testing Computer Software, Wiley 1999. Also, What Is Exploratory Testing, by James Bach, is a good introduction to exploratory testing (http://www.stickyminds.com/sitewide.asp?Function=edetail&ObjectType=COL&ObjectId=2255).

9 An executable architecture is a partial implementation of the system, built to demonstrate selected system functions and properties, in particular those satisfying non-functional requirements.

