IBM is in the middle of what may be the world's largest agile transformation, with 35,000 of its own software developers making the move toward agile methods over the past few years. As teams adopt new ways of working, they ask questions: How do we know how agile we are? How do we know how well the adoption of agile or non-agile practices is going, so we can adjust and take corrective actions as needed? How do we actively involve developers in our change effort? How can we effectively share what worked, and what did not, so we can accelerate adoption?
We wanted to give them a simple way to answer those questions. So we considered two common approaches: assessments and metrics. Traditional assessments can help answer the first two questions, but they often backfire when perceived as a "big brother is watching you" approach. Metrics can be powerful, but they are often made too complex to be useful. There must be other ways to answer these questions, we decided.
At IBM, we've been refining traditional assessment and metric techniques since 2002, working with 80 teams. We have found that a brief set of questions answered regularly by a small team, leading to one or two initial improvements, followed by more as the comfort level rises, will set organizations on the right path toward overall team health. We have also found that a team doing its own self-assessment, rather than being assessed by outsiders, is highly effective. It gets team members engaged and sustains their involvement, as opposed to the waning interest that sooner or later plagues most change efforts.
To allow our customers to benefit from an approach that has served IBM well in its agile transformation, we are launching a service called the IBM Rational Self Check for Software Teams™, or "Self Check" for short. It can be used with a broad variety of practices, whether agile or less agile, and it can be used to support a broad variety of methods, including RUP, Scrum, XP, OpenUP, and home-grown varieties.
This service is part of a broader initiative, the Measured Capability Improvement Framework™ (MCIF), which provides a systematic approach to software excellence supported through a variety of service offerings and products. Self Check, however, can be used as a stand-alone offering, either independently of MCIF or as a first step toward fuller MCIF adoption.
A typical assessment is done by people outside the team being assessed, and for this reason it often carries a "big brother is watching you" feeling. By contrast, a retrospective is more relaxed, giving the team time to reflect on how they are working and what results they've achieved, with the goal of improving. With Self Check, we bring together the best of both worlds. Self Check provides some rigor, producing an evaluation of how well the team is doing through structured questions that all members answer, yet the approach resembles a retrospective. The result is a clear understanding of where the team stands on a one-to-ten scale with well-defined parameters, with the team fully involved and buying in to the results, since nobody but the team performs the assessment.
In using both assessments and retrospectives since 2002, we frequently encountered problems that turned well-meaning metrics into monsters. For instance, when a team begins to feel that the metric itself is the goal, rather than team improvement, the effort leads to more activity than positive effect. Behavior aimed at raising or lowering a number usually contributes little to long-term results.
Six years of refining our assessment and retrospective techniques have taken us through a flurry of self-assessments, including the Shodan input metric in 2002 [2], the Extreme Programming Evaluation Framework in 2004 [3, 4], the Rational Unified Process Evaluation Framework in 2005 [5], and scorecards and other tools in 2006. Each has addressed problems observed in previous models.
By combining specific metrics with retrospective techniques, in 2008 we developed the "IBM Rational Self Check for Software Teams," or Self Check [1]. This effort was based on the realization that the selection of metrics is not enough. Learning how to use them without being lured into some common deadly pitfalls (which we'll discuss briefly below) has been equally vital to incorporating the Self Check into team effort on a continuing basis. In fact, use of the Self Check has spread internally among our development teams by word of mouth, because the goals are realistic, team-owned, and achievable.
Open team discussion of how to avoid misusing metrics has been part of our success with the Self Check. Our focus in this paper, therefore, is not on case study data or which metrics to use, but on goals, traps to avoid, and how to make a Self Check work for your team.
Depending on your role in the software development organization, the goals for our approach will vary. For example:

As a team member, you may want:

- to learn and remember best practices;
- to have a say in how the team works;
- to improve the process by learning from the experiences, good or bad, of other teams.

As a manager, you may want:

- to understand where the organization stands in terms of its process improvement effort;
- to see that the organization is continually improving;
- to involve team members in the change effort.

As a coach, you may want:

- to teach teams key practices;
- to have ready-made questions that are customizable to aid the rollout of new ideas;
- to make retrospectives so efficient that it is easy to get teams to do them often.
Whatever your role on the team, the Self Check approach focuses on a lightweight survey that shows how much the team feels they employ given practices. All three roles listed above can discuss trends, averages, and ranges between people to narrow the list of problems, then focus on actions toward improvement. The emphasis is on team ownership. It's not about judgment, but rather a trigger for discussion. Many teams will choose to share their experience using a consistent format (including which practices were used to what extent, context about the project, selected objective metrics, and project outcome). But sharing must remain optional because the team owns its process.
Because it's important that the team feel ownership rather than control being exerted by outside parties, we have avoided the term "assessment" in describing the Self Check. Assessment is not the goal (though it may be an optional byproduct). Rather, "bite-size improvement" is the more practical goal of the Self Check tool, and in our experience has worked better as a descriptor used by the team. Trial and error in our use of metrics has unearthed traps and mistakes we don't want anyone to repeat.
We have observed five pitfalls associated with standard modes of team assessment, described below.
The more data, the better, right? It's tempting, especially for very knowledgeable coaches, to devise a lot of questions or metrics. One instrument we used early on had 100 questions, each with its own set of answer choices, so participants had to read every option. And the questions were followed by the dreaded "if not, explain" free-form field. It took so long to complete that it was done infrequently, if ever, and only by a small number of representatives on each team.
The frequency with which teams will want to use an assessment is inversely proportional to the time it takes. The key is finding the point of diminishing returns for your team. Occasional thorough assessments may have their place. But long stretches between assessments are deadly because problems in the process may go undetected for too long. This risks defects in the product or other problems. Process problems need to be detected and addressed early, and the only way to ensure this is to "take the temperature" frequently.
If team members do not participate, they may forget about the practices and lose the chance to be reminded of them. The data may also be skewed by the rosy glasses of selected leaders, which robs the team of the empowerment needed to feel ownership of their own craft and process.
Many people fear that metrics are being used to micromanage or criticize them. On the surface, a "scorecard" may appear to be a good tool for driving change. But in practice, teams hate filling out data for other people's use. Many view the scorecard as a club to hammer them into some shape determined by management. Besides, the chore of gathering data for someone else can feel demeaning, even counterproductive. Have you ever participated in metrics that use a green-light, yellow-light, red-light metaphor to signal project status as on-track, tentative, or in jeopardy? Typically, the numbers turn green (apparently improve) until the team figures out how to escape such an onerous task. In summary, the evil scorecard yields questionable data and discourages the team from using the technique. This pitfall cost us one year.
Sometimes data is collected and analyzed, but the proposed actions are never implemented. This makes it hard to generate enthusiasm for another round of lessons learned, and has been used as an argument to forego collecting any more data. Lean software development [6] teaches us to focus on complete and frequent small batches. There is no credit for partial work, so actions listed but not completed are waste and lost opportunity for improvement. When people don't see their changes acted on, they are discouraged from doing further retrospectives. This problem can occur with many retrospective techniques, especially when ownership of actions is not clear, or if they are not watched as part of the workload of the team. Lessons forgotten cost us another year.
Some teams want to choose from a buffet of practices at their own pace instead of being forced into a given process. They may have valid reasons for avoiding certain practices, and preferred practices may change over time. For example, an organization may move from emphasizing XP to blending it with Scrum [7] and scalable elements of RUP. That makes it awkward to tie the assessment closely to the methodology. The goal may be a coherent, proven process, but some teams have done better by learning the lessons themselves rather than being forced to adopt more than they can handle all at once. Forcing adoption robs the team of a sense of self-direction and ownership of their process. Agile is always evolving; settling on a rigid group of practices makes the process itself non-agile.
Sometimes case studies and experience reports are produced, but do not consistently assess the usefulness of key techniques. Informal reports offer tantalizing conclusions, but fail to describe the context necessary to understand how the conclusions might potentially apply to other environments. For example, a report by an agile team may say that the iterative process followed was great, but not disclose how much automated unit testing was done. Was code inspected, or was pair programming used? How big was the team? Were they co-located, and did they have government standards to meet? Without some common baseline data, such experience reports are of less value than they could be.
Furthermore, each experience report tends to be written from scratch, without a common format or structure for the report. Compared to a simple "fill-in-the-blanks" form, the unstructured approach takes longer to produce, and it is actually harder to read, since you cannot glance through the report to see what is relevant for your needs. We also found that text-based reports often lack the sort of objective clarity that well-designed graphics can offer, especially when simple metrics are presented. In many cases, graphical representation with minimum text can show compelling results and suggest clear recommendations for improvement.
After wading through the pitfalls the hard way, we've arrived at four key goals for the Self Check offering, which are the basis for the tool's design.
The subjective survey in Self Check serves not only to quickly collect data, but also to remind and teach team members about key practices. Data collection pays for itself through this learning effect. Admittedly, these numbers are subjective, but that's part of their value. Their purpose is to understand the "mind of the team" regarding key practices as early as possible. You can also gather objective metrics later, but asking people how much they "feel" their team heeds a given practice both reminds them of and teaches them about that practice. This takes into account that people sometimes get busy with their code and forget about the practices. Quiet people may get lost in terminology and be afraid to ask. Silent dissenters may not yet have dared voice their doubts. Enthusiastic coaches may think their process advice has been taken to heart more than is really the case.
Teams have struggled to conduct efficient retrospectives. They either run out of time before listing any actions, or list more actions than they will ever be able to implement. The survey questions used in Self Check can narrow discussion to just those areas needing attention. So instead of protracted discussions, teams can quickly move to choosing actions. And they will still gather trend data for future analysis and guidance, as needed. They can rotate the questions to cover different areas in more depth while still keeping the list of questions small. They can revisit core questions to gradually collect trend data on the common practices. And for variety, they can include other retrospective techniques or metrics from authors such as Dean Leffingwell [8], and Esther Derby and Diana Larsen [9].
What did team members begin to see that gradually relaxed their fears of agile? People often stubbornly think they are different, especially in a large company such as IBM, with its large and complex environments. However unique they may feel, they are very receptive to experience from within their own company. Seeing experience reports from within the organization has been a catalyst for turning doubters into believers.
We have produced a variety of levels of detail for our experience reports, ranging from a single PowerPoint slide to a 10-page formal paper. It's been effective for teams to have a choice depending on their appetite for metrics and formality. One goal was to make an experience report fit on a single PowerPoint slide. People can still write ten-page papers if they want, but the one-slide formula has sped up sharing and management support. Showing a sequence of six one-page experience slides to management was a powerful moment.
Our technique is not intended to provide data for people outside the team. Avoiding the temptation to make the Self Check overly easy reinforces team ownership of the data and process. Those outside the team may access the experience report, but ONLY if the team wants to share it. So, while the tools can quickly summarize results, our guidelines allow teams to decide how widely those results are shared.
What we've discovered rolls up under a general principle: limit your objective. Fewer questions, smaller teams, and a focus on one or two actions lead to better results than broad, boil-the-ocean agendas.
It's best to answer fifteen questions or fewer every iteration (every two weeks, for example). Frequent small batches are much better than infrequent large batches of questions. The list of actions that grows out of the discussions is more manageable, which encourages surveys to be conducted often enough to catch problems before they fester. It's hard for enthusiastic coaches and facilitators to limit the list of questions; it took us two years to slash our list of core questions from 30 down to 15. But less is more. The surveys need to be speedy and consumable. Subjective experience with teams shows that 15 questions (perhaps fewer) are the upper limit of people's enthusiastic attention for this technique.
The people doing the work conduct the survey in groups small enough to have a stake in the changes. So a project of 50 people should hold surveys in five groups of ten. Teams often have natural sub-team boundaries based on components, or Scrum teams of seven or so people. That is the right size for a survey discussion. Larger groups feel less ownership, have fewer opportunities to speak, and may stall when trying to identify improvement actions.
You should choose and complete one or two actions before worrying about more, and you should complete or cancel improvement actions before the next survey, where new actions will be identified. Experience has shown that teams fail to implement more than one or two changes to a process, mostly because the work of coding, testing, and engineering requires much of a team's attention, which is appropriate. Establish a culture of success by focusing on one or two actions every two weeks before trying to do more.
While it may be useful to collect the numbers and roll them up at a high level, the most important thing is for teams to feel ownership of the data, process, and actions. The "15 question, 2 action, 2 week" formula can be adjusted by teams as they get proficient. But it serves as a good concrete starting point.
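As a concrete illustration of the "15 question, 2 action, 2 week" formula, the sketch below shows how a team's own tooling might enforce those limits. This is a hypothetical example written for this article, not part of the Self Check offering, and all names in it are invented.

```python
class SelfCheckSession:
    """Illustrative guard for the "15 question, 2 action" limits.

    Hypothetical sketch only; not the actual Self Check tooling.
    """
    MAX_QUESTIONS = 15
    MAX_OPEN_ACTIONS = 2

    def __init__(self, questions, carried_over_actions=()):
        if len(questions) > self.MAX_QUESTIONS:
            raise ValueError("Keep the survey to 15 questions or fewer")
        # Actions should be completed or cancelled before the next
        # survey, so carrying any over is worth flagging loudly.
        if carried_over_actions:
            raise ValueError("Close out old actions before surveying again")
        self.questions = list(questions)
        self.actions = []

    def add_action(self, action):
        # Focus on one or two actions per two-week cycle.
        if len(self.actions) >= self.MAX_OPEN_ACTIONS:
            raise ValueError("Focus on one or two actions at a time")
        self.actions.append(action)
```

A team could adjust the two constants as it gets proficient, exactly as the formula above suggests; the point of the guard is simply to make the "less is more" discipline hard to forget.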
Self Check provides visual output in the form of two graphs that give teams immediate results on 1) which practices they observed, to what extent, and the variation among the team; and 2) an overall view of team strength regarding a given software development practice. Samples of these two graphs are shown in Figures 1 and 2 below.
Figure 1: The first graph shows which practices the team observed, to what extent, and the variation among the team. Error bars show the mean plus or minus one standard deviation, indicating diversity of opinion on the team.
Figure 2: The second graph shows a deep-dive view of team strength regarding a given software development practice.
Teams may choose to show their survey data to characterize their progress on their agile journey. There are no right or wrong numbers, since the purpose of the graph is to trigger discussion. To make the survey quick to complete, all questions are worded so that answers range from 1 to 10, with 10 being the best response. The numbers represent a percentage: e.g., "what % of the time does your team do X?" For any question, a score of 1 means you think your team never uses the practice (or less than 10% of the time); 5 means a great job half the time, or a so-so job all the time; and 10 means your team always does the practice and there's little room for improvement. But any number from 1 to 10 is a valid score, allowing for subtle differences between team members. We suggest that your organization adopt a core set of practices and allow variations from this core set. Each team can then add questions of its own, but should keep the list to 15 or fewer.
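To show the arithmetic behind the charts, here is a small sketch that aggregates 1-10 survey scores per practice and computes the mean and standard deviation that drive the error bars. The practice names and scores are invented for illustration; this is not the actual Self Check implementation.

```python
from statistics import mean, stdev

# Hypothetical survey responses: each team member scores a practice
# from 1 (never done) to 10 (always done, little room to improve).
# All names and numbers here are invented for illustration.
responses = {
    "Iterative development":  [8, 7, 9, 3, 4, 8],
    "Automated unit testing": [3, 2, 4, 3, 2, 3],
    "Whole team":             [7, 8, 7, 8, 7, 7],
}

def summarize(scores):
    """Return (mean, standard deviation) for one practice's scores."""
    return mean(scores), stdev(scores)

# A low mean suggests a weak practice; a wide spread suggests the
# diverse opinions that the error bars make visible. Either one is
# a good trigger for team discussion.
for practice, scores in responses.items():
    avg, spread = summarize(scores)
    flags = []
    if avg < 5:
        flags.append("weak")
    if spread > 2:
        flags.append("diverse opinions")
    print(f"{practice}: mean={avg:.1f} stdev={spread:.1f} {', '.join(flags)}")
```

Run on the invented data above, "Iterative development" would be flagged for its wide spread and "Automated unit testing" for its low mean, mirroring the discussion triggers described for the two teams below.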
To illustrate Self Check in action, we've provided data for two teams working on different agile projects within IBM. We will see that these projects are at different stages on their journey from "not so agile" to "agile." This is one of the advantages of ongoing use of Self Check: it allows you to track your own journey. You will also see that at least one of these projects is quite large -- i.e., if you understand agile, you understand that it scales!
Team one is a large team of roughly 235 team members distributed over several sites on different continents. They develop in Java, and the team is new to agile. The project is 24 months long, and they have 8-week long "iterations."
The practices adopted by the team are shown in the Team 1 Self Check Big Picture diagram, Figure 3, below. Given the results, the team chooses to discuss Automated Unit Tests and Iterative development to see if there are any potential opportunities for improvement that should be translated to one or two concrete actions.
Figure 3: Team 1 Self Check big picture
As they discuss iterative development, they take a closer look at the detailed chart for the iterative practice, shown in Figure 4 below. This chart lets the team view details of a selected practice so they can see whether it reflects a balanced implementation. When we ask teams if they are iterative, they say yes. But if we look at the individual pillars of iterative development, there is often something missing. In this case, the team runs more or less time-boxed iterations, but they do not deliver working software, and hence they cannot get any meaningful feedback on what they did in the iteration. This is a typical pattern, frequently caused by an overloaded iteration: the team tries to do too much, combined with "waterfall-like" iterations in which they first analyze, then implement, and then test. In recommended iterative development, by contrast, analysis, implementation, and testing are primarily concurrent activities. The result of this "overloaded waterfall" approach is a lot of half-baked production code at the end of an iteration, but nothing that really works.
This graph may not tell them things they weren't partly aware of, but the uneven graphical depiction will trigger discussion among team members, encouraging the team to take actions to improve the way they work. This is the positive effect we have seen Self Check have on team after team. In this case, the team decided that the iterations were severely overbooked, and they resolved to do a much better job of prioritizing what is within the scope of an iteration.
Figure 4: Team 1 Self Check iterative detail
Team Two is a 30-person team that is distributed and develops in Java. They are running a 6-month project with 2-week long iterations. Compared with Team One, this team has come a little further in their journey towards agile development, but they still have a lot of room for improvement.
Their "big picture" diagram is shown in Figure 5. As we can see, they seem to have a weak spot in Automated Unit Testing. The diagram shows not only the averages, but also the diversity of opinion among the team, as indicated by the thin black error bars. These show the mean plus or minus one standard deviation, so most of the team's scores fall within the black bars, but an outlier doesn't confuse the picture. We can see that there seem to be very different opinions about the practice of iterative development, which is a trigger for further discussion.
Figure 5: Team 2 Self Check big picture
When we look at their use of the iterative practice, in Figure 6, we find they have some weak spots, but nothing unusual. As the team discusses the practice, valuable information becomes clear. The diverse opinions result from developers giving high scores to how well they are doing iterative development, but testers giving low scores. The testers seem to be overworked, and they have a lower opinion of the team's success with the iterative technique. The team needs to continue the discussion to determine whether they favor 1) improving the Automated Unit Testing practice so developers deliver higher-quality code; 2) changing the resource situation so there are more test resources; or 3) changing how testing is done, for example by driving more automation. After some discussion, the team settles on two actions: focus on developer testing to reduce the burden on the test team, and improve their ability to estimate (using velocity) so they have a more realistic scope for what goes into the iterations.
Figure 6: Team 2 Self Check iterative detail
The Self Check offering can be used solely as a "reflection" tool. But teams can also enrich the information gathered by adding context and outcome data (see Table 1), which is especially useful when producing experience reports.
No two process coaches will agree on the best set of questions or metrics, which is why Self Check gives you the power to customize. We do suggest a lightweight set of common denominators so experience reports will give a complete picture, and we suggest that you add as many or as few as you want from that starting point. In addition, comments from team members gathered during the survey have been quite useful, but only if they lead to focused action.
Table 1: Self Check metrics
|Context||Team size, project size (# people), geography (co-located, distributed, or multi-shore), audit requirements (CMMI, 21 CFR part 11, ISO, etc.), iteration length, release length, and technology (Java, C, etc.; is it a web GUI, a device driver, database middleware?).|
|What we did||Practice survey averages and deviations. Include common practices, and your custom ones. Add several objective metrics such as % unit test code coverage, asserts / user story, time between builds, and % feedback used.|
|Outcome||Cycle Time (concept to install or customer value), Quality (6 and 12 months after release), Customer satisfaction. You may also want to measure productivity and flexibility.|
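If a team wanted to capture Table 1's fields in a machine-readable, fill-in-the-blanks form, a record along the following lines would do. The field names are our invention for this sketch, not a format shipped with Self Check.

```python
from dataclasses import dataclass, field

@dataclass
class ExperienceReport:
    """Fill-in-the-blanks record mirroring Table 1.

    Hypothetical sketch; field names are invented for this article.
    """
    # Context
    team_size: int
    geography: str              # "co-located", "distributed", "multi-shore"
    audit_requirements: list    # e.g. ["CMMI", "ISO"]
    iteration_length_weeks: int
    technology: str
    # What we did: practice name -> (survey mean, standard deviation)
    practice_scores: dict = field(default_factory=dict)
    objective_metrics: dict = field(default_factory=dict)
    # Outcome
    cycle_time_days: int = 0
    customer_satisfaction: str = ""

# Example record, loosely modeled on Team Two described above.
report = ExperienceReport(
    team_size=30, geography="distributed",
    audit_requirements=[], iteration_length_weeks=2,
    technology="Java",
    practice_scores={"Iterative development": (6.5, 2.4)},
    objective_metrics={"unit test coverage %": 62},
)
```

Because every report shares the same fields, a reader can glance through a stack of them and compare context, practices, and outcomes, which is exactly the benefit of the fill-in-the-blanks format argued for earlier.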
Experience reports can range in length from a single slide to a 10-page formal paper, such as the XP:EF study by Sato, Bassi, and Bravo [10]. Both formats effectively inform your colleagues not just whether you adopted a practice, but how much of what flavor of each practice you did. This allows you, over time, to capture the journey you are on as you evolve from, for example, a less agile to a more agile development approach.
Self Check is designed to return power to the engineers on the team by allowing them to discuss their own data and improve their process without the usual sense that they're being "assessed" by an external party with its own agenda. Over time, teams may wish to incorporate objective metrics into Self Check as a means of measuring practice adoption, although this may require two or more environments to be up and running simultaneously.
Using the Self Check while remembering the pitfalls has worked well for our teams. Of course, in any adoption of assessment or metrics, teams should avoid the five deadly perils of "retrosessments" described in this paper; we have driven them out of successive versions of the Self Check offering. Whether you use Self Check or another method for your reflections and assessments, avoiding the five pitfalls can turn your metrics or lessons-learned programs from something teams dread into something they embrace and spread among their peer teams.
Using these metrics as part of your retrospectives does not guarantee project success. Teams can have apparently great numbers, yet still fail. These techniques are not a safeguard against failure, but rather a tool for teams to make frequent small steps toward improvement.
We believe the lessons we've learned from our six years of pain will help speed your improvement efforts, and will save you from the pitfalls we've discovered.
Ted Rivera, Kim Caputo, Kay Johnson, Alma Erika Torres Mendoza, Victoria Thio, Inge Buecker, Brent Hodges, James Jones, Scott Will, and Millard Ellingsworth of IBM were key in improving the framework through use and feedback. Laurie Williams of NC State University was also helpful in working with IBM's agile teams since 2002.
1. William Krebs and Per Kroll, "Using Evaluation Frameworks for Quick Reflections," AgileJournal.com, February 2008.
2. William Krebs, "Turning the Knobs: A Coaching Pattern for XP Through Agile Metrics," in Extreme Programming and Agile Methods - XP/Agile Universe 2002, LNCS 2418, Springer, pp. 60-69.
3. Laurie Williams, William Krebs, Lucas Layman, and Annie I. Antón, "Toward a Framework for Evaluating Extreme Programming," Proceedings of the 8th International Conference on Empirical Assessment in Software Engineering (EASE 2004), Edinburgh, Scotland, pp. 11-20.
4. Laurie Williams, Lucas Layman, and William Krebs, "Extreme Programming Evaluation Framework for Object-Oriented Languages - Version 1.4," North Carolina State University Department of Computer Science TR-2004-18.
5. William Krebs, Chih-Wei Ho, Laurie Williams, and Lucas Layman, "Rational Unified Process Evaluation Framework Version 1.0," North Carolina State University Department of Computer Science TR-2005-46.
6. Mary and Tom Poppendieck, Implementing Lean Software Development, Addison-Wesley, 2007.
7. Ken Schwaber, Agile Project Management with Scrum, Microsoft Press, 2004.
8. Dean Leffingwell, Scaling Software Agility, Addison-Wesley, 2007, p. 314.
9. Esther Derby and Diana Larsen, Agile Retrospectives, Pragmatic Bookshelf, 2006.
10. Danilo Sato, Dairton Bassi, Mariana Bravo, Alfredo Goldman, and Fabio Kon, "Experiences Tracking Agile Projects: an Empirical Survey," Journal of the Brazilian Computer Society. http://www.dtsato.com/resources/default/jbcs-ese-2007.pdf
Per Kroll is Chief Architect for IBM Rational Expertise Development and Innovation, an organization leveraging communities, portals, methods, training and services assets to enable customers and partners in software and systems development. Per is also the project leader on the Eclipse Process Framework Project, an open source project centered on software practices, and he has been one of the driving forces behind RUP over the last 10 years. Per has twenty years of software development experience in supply chain management, telecom, communications, and software product development. He co-authored The Rational Unified Process Made Easy - A Practitioner's Guide, with Philippe Kruchten, and Agility and Discipline Made Easy - Practices from OpenUP and RUP, with Bruce MacIsaac. A frequent speaker at conferences, Per has authored numerous articles on software engineering.
William Krebs is the Senior Software Transformation Consultant for IBM Rational software. Since 1982 he has worked as a developer, performance engineer, and process consultant at five locations within IBM. He has practiced and studied Agile and Lean development since 2001, using XP and developing RUP content. He has presented at XP/Agile Universe and IBM Research, and is the co-chair of the first IBM Academy of Technology conference on Agile. A member of IEEE, the Agile Alliance, and Agile Carolinas, he currently works on deploying best development practices corporate-wide.