Measure & Improve Quality
QA1-Skip 110000APC2 Tags:  quality gqm measures mcif process metrics process_improvement 1,669 Views
Today I had the privilege and pleasure of attending a VoiCE discussion on Rational Method Composer (RMC) which included some discussion on the IBM Measured Capability Improvement Framework (MCIF). Dr Chris Sibbald presented, then opened the forum for discussion. What happens in VoiCE stays in VoiCE, so I won't touch on those items. Instead, let me share two separate (but related) requirements for successful process improvement: metrics and measures.
Here is a quick, public summary of MCIF:
- Establish business and operational objectives
- Prioritize practices and define roadmap
- Accelerate adoption with tools and services
- Report, analyze and act on results
From a process engineer's perspective, business and operational objectives are a given whether derived by MCIF or some other method. We certainly escalate suggestions of benefits to be derived from exploiting an opportunity for change through adopting a set of one or more practices, but we are typically tasked with attacking the current shortcomings apparent to executives. For this posting, let's agree we are given objectives.
Our task is then to find a set of practices which will move the teams toward achieving those objectives. Reqt # 1. We cannot select practices without metrics.
In particular, I strongly assert that metrics which represent the current state and which are expected to indicate the value (or loss) from process change must be expressed before any attempt is made at practice selection. How else am I to compare available practices and select from among them? We might implement any without baselining*, but would afterwards be unable to tell 'different' from 'better.'
How do we get from objectives to metrics? I once attended an ITMPI presentation by Dr Victor Basili (of my alma mater) on the Goal Question Metric approach which has been applied to study the value of (waterfall) software process improvements.
Let's now assume that a successful GQM or other technique provides some metrics with which to select and assess process changes. MCIF would then prioritize these and define a roadmap for their adoption.
Metrics are necessary but insufficient for achieving process improvement. Reqt # 2. Metrics must be decomposed to measures, and those measures (as well as their relationship to other components of metrics) must be communicated to the development team. Please note: This does not imply quotas for measures.
The distinction is important because members of a software development team have direct control over product and process measures but may have no ability to control (or even to view) metrics. An analyst may not know the average cost to deliver test results per use case point but can directly affect the minutes required to outline the scenario currently under development. Process engineers need to provide the team with measures which can be viewed, tracked, and controlled by the development team members. Importantly, fluctuations in those measures for special cause can be identified immediately by the team and communicated to the process engineer.
Metric? Measure? What's the diff? Gary Pollice provided better definitions, but here are mine:
- A measure communicates a value relative only to a scale. (millisecond, defect)
- Descriptive statistics are a special form of metric which relate a set of measurements to itself to predict expected values for such measurements. (Longest running transaction, Typical defect severity)
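As a small illustration of the two parenthetical examples above, here is a sketch in Python. The sample values and variable names are invented for illustration only; each raw value is a measure (a millisecond count, a defect severity), and the descriptive statistics are derived from the set itself.

```python
from statistics import mode

# Hypothetical measurements: each value is a measure on a single scale.
transaction_ms = [120, 95, 310, 140, 101]
defect_severities = ["minor", "major", "minor", "cosmetic", "minor", "critical"]

# Descriptive statistics relate the set of measurements to itself.
longest_running_ms = max(transaction_ms)      # "Longest running transaction"
typical_severity = mode(defect_severities)    # "Typical defect severity"

print(longest_running_ms)  # 310
print(typical_severity)    # minor
```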
*Baselining is a key part of Shewhart's PDCA/PDSA cycle.
Here's an idea:
When outlining a use case to develop the initial use case model during the Inception phase of Unified Process (RUP, OpenUP, EssUP, Your UP), add two NFR (non-functional requirement) statements to each use case:
“Single execution of the basic flow should/must not exceed N minutes/seconds/milliseconds.”
“Up to N executions of the basic flow should/must occur per hour/month/day.”
For example, an operator should be able to perform the basic flow scenario in 20 seconds, and this scenario may be executed by the operations team as many as 500 times in an hour.
This is hardly the level of specificity required to design a performance test, but should be at least adequate for the Architects and DBAs to make decisions about the candidate architecture's ability to meet those criteria.
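To show how little is needed for an early architectural sanity check, here is a minimal sketch in Python. The class and method names are hypothetical, and the derived figures are simple arithmetic on the example numbers (20 seconds, 500 executions per hour), assuming executions are spread across the hour; it is not a performance model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BasicFlowNfr:
    """The two NFR statements attached to one use case (hypothetical structure)."""
    use_case: str
    max_seconds_per_execution: float
    max_executions_per_hour: int

    def busy_seconds_per_hour(self) -> float:
        # Worst case: every execution takes the full allowed time.
        return self.max_seconds_per_execution * self.max_executions_per_hour

    def implied_concurrency(self) -> float:
        # Average number of executions in flight at once (rate x duration),
        # assuming arrivals are spread evenly over the hour.
        return self.busy_seconds_per_hour() / 3600.0


nfr = BasicFlowNfr("Operator basic flow", 20.0, 500)
print(nfr.busy_seconds_per_hour())          # 10000.0
print(round(nfr.implied_concurrency(), 2))  # 2.78
```

Even this rough figure (roughly three executions in flight at peak) gives the Architects and DBAs something concrete to weigh against the candidate architecture.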
One of the challenges we face is that RUP's Test discipline is singular, but User Acceptance, QA, and the Performance test groups are separate and distinct. Generally, in my experience with several companies, the last test group to be engaged is a technical services team who are concerned with how the new app may take resources from existing production processes. We want to conduct performance testing during Elaboration, but we have no way to execute most of the scenarios during Elaboration, so a promising approach is to tune for the most significant performance requirements. The suggestion above is to help quickly, briefly, identify those scenarios.
Please consider adopting this convention when drafting your own use case model. Or ... suggest a better way!
Today we have university degrees and voluntary certification programs. But people absent both continue to build and deliver bad software. As a result, we suffer as a society. Information applications have global implications yet private discretion. If customers require a license to use the software, why don't developers require a license to build it? Perhaps we need laws which require a license to practice software engineering.
Professional engineers (and sometimes practical engineers) bear personal legal responsibility for their actions. Despite a few notable cases, software engineers routinely err without consequence. Perhaps the threat of lost property and liberty would evoke greater professionalism. Perhaps a death penalty (dissolution) for egregious offenders who hide behind a corporate veil? What do you think?
What should be the minimum level of qualification? Do we need apprenticeship programs and academic practica for licensing?
Wouldn't we be better off if only software 'signed' by a licensed engineer were allowed (by operating systems) to execute?
Perhaps certification need only apply to closed-source solutions, for open source risks can be viewed in advance? We have no emperor like Frederick II to generate the edict and designate Salerno, so which governing body should take on the role of licensing software engineers?
Hot in Orlando, but a very cool session speaking with IBM about possible product directions for performance measurement. Anyone who speaks more than one language (like English and Java) has seen semantics get in the way of communication ... it just happens. Today's session got past those stumblers and aided true dialogue. Suffice to say that Rational thinks, and Rational listens. This year's conference looks as if it will be very enlightening.
We played a game based on the description of Zhong Zhi provided here. We had three teams. Two finished quickly; the third took considerably longer. This game is simple: no setup, no long discussion afterwards. It teaches rather quickly that if we each go our own way, work will be duplicated and some work missed, but if we communicate and collaborate, we can produce as a team what no one can as an individual. We scheduled it at the beginning of the workshop to get people moving and talking.
Okay, let's assume we estimate work items using a brief Fibonacci series (0, 1, 2, 3, 5, 8, 13). Why use this expanding interval series? Uncertainty with respect to size varies directly and geometrically with size. In other words, we are pretty good at estimating small things and less precise about estimating large things.
So we enter a sprint (iteration) with work items sized 1, 1, 3, 2, 1, 8, 8, 5, 2. We expected to complete all of these, which sum to 31 points (or ideal hours or whatever).
As the sprint proceeds, we burn down these items: 5, 8, 3, 2, 1, 8, 1 and then run out of time; those sum to 28. We had two additional items (sized 1 and 2) which were planned but not completed. We worked on the size 2 item but didn't finish testing it. We did not start work on the size 1 item.
So what was our velocity during that sprint? Some might argue that the team should get partial credit for working on the size 2 item which was not fully tested. Most current practitioners (and I) would say anything not tested is not completed. Only the completed work counts, and those items add up to 28, so that was our velocity. I think 28 is too exact.
According to our series, an eight could have represented any value between 6.5 and 10.5, a five any value between 4 and 6.5, .... And the range for any value overlaps its predecessor and successor (ranges are not mutually exclusive). So the metrologist on my shoulder argues that a set of ranged values cannot be summed to a point value. The actual velocity of our team was something between 22.5 and 37. I would personally call it "between 25 and 35."
So what? Isn't this overly complicating something which is really simple? My point (pardon the pun) is that we are dealing with imprecision in original sizes, so let's keep that imprecision in any statistic derived from those sizes.
Here's the practical aspect:
- If we are tracking velocity manually, use 28 but call it "about 30" and don't lock our team into trying to fit exactly 28 points into the next sprint.
- If we are using an automated system to track velocity, let the computer provide a ranged value for velocity and ask the team to find a place inside that range when planning the next sprint.
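A ranged velocity of this kind could be computed along the following lines. This is only a sketch: the function names are hypothetical, and it assumes each size covers the interval between the midpoints to its neighbours in the series. That assumption reproduces the ranges quoted above for a five (4 to 6.5) and an eight (6.5 to 10.5) and the upper bound of 37, but it yields a lower bound of 22 rather than 22.5, since the ranges for the smaller sizes are not spelled out in the text.

```python
SERIES = (0, 1, 2, 3, 5, 8, 13)


def size_range(size, series=SERIES):
    """Return (low, high) for one estimate, using midpoints to its neighbours.

    Assumption: the range of a size runs from halfway to the previous value
    to halfway to the next value in the series.
    """
    i = series.index(size)
    if i == 0 or i == len(series) - 1:
        raise ValueError(f"size {size} has no neighbour on both sides")
    low = (series[i - 1] + size) / 2
    high = (size + series[i + 1]) / 2
    return low, high


def velocity_range(completed_sizes):
    """Sum the per-item ranges into a ranged velocity (low, high)."""
    ranges = [size_range(s) for s in completed_sizes]
    return sum(low for low, _ in ranges), sum(high for _, high in ranges)


completed = [5, 8, 3, 2, 1, 8, 1]   # items burned down during the sprint
print(sum(completed))               # 28, the point value
print(velocity_range(completed))    # (22.0, 37.0), the ranged value
```

The team would then pick a planning target somewhere inside that range rather than treating 28 as exact.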
Requirements traceability is vital for supporting consulting deliverables, but what value does it add if the development is internal and the software is accepted (passes acceptance testing)? Traceability, which assumes that development is built from requirements, seems to be a method for explaining mistakes rather than preventing them. Wouldn't test driven development provide quality assurance with less overhead?
Some things about defect management are obvious, some subtle, some constant, some context-dependent. One fact remains constant: defects exist within a larger environment, therefore the management of defects should be integrated within that environment. Let's review some simple truths:
A test fails because an actual result was not expected, so...
ideally, the failure is a symptom of a software (AUT) fault
or it may be a symptom of a test fault
or it may represent a fault in both the test and the AUT
which would imply a fault in the specification
which both testers and developers (and tech writers) 'assumed' they understood
or ... least likely but possible, there was a glitch in the test environment.
So the first tasks following failure are to document the observation and to attempt fault isolation. (Attempting to reproduce the symptom is part of fault isolation.) Severity can be assessed immediately.
Defects become part of the backlog, and so are prioritized amid everything else; let us not fall into the trap of drifting from our backlog of work items to focus on defects. Let us capture the Severity of the defect without letting that severity fully determine the priority of the defect. This applies whether you create a triply distinct capture as suggested by Testing Computer Software or follow a business impact-sensitive severity classification later advocated by Kaner and written into IEEE 1044. Especially (but not exclusively) as the product nears delivery, it may make more sense to work minor or even cosmetic severity defects on Must-have functionality than to repair Critical issues in Would-like functions.
Formal definitions of Failure and Fault are at this link, although I no longer agree with writing down great detail of every specification (for software) and believe successful agile methods provide empirical proof of lack of need for such detail. For an excellent formal example of noted ambiguity, look at this definition of Error. There is a formal definition of glitch as a brief unintended voltage on a transmission wire, but I'm using the informal meaning here.
AUT stands for "Application Under Test," which is part of the SUT or "System Under Test." Thanks to HASI/GEADGE testing for teaching, and Ron Damer for reinforcing, the simple, obvious idea that "system" testing is always more than software.
Once upon a time, perhaps a bygone time, there lived a set of norms which formal authors and speakers fit about their expressions. That was apparently long ago.
Today I read with some delight and more incredulity from Software Language Engineering: Creating Domain Specific Languages (purchased --30% off-- at RSC 2009, Thanks digitalguru!). The work itself seems well thought and adequately researched, but the swamp of contemporary idiomatic muck one need wade through in the first chapter is unnerving. Certainly by the third chapter things have settled down to a professional tone.
Were this the only datum, I would ascribe it to idiosyncrasy, but instead too readily recall a presentation at RSC where the presenter spoke glibly of her education, her feelings, and her latest project experience ... rather than the topic of the session: features and use of a new IBM tool. And the choice of terms was what I may expect of some Hollywood actor who dropped out of high school, not the person who eventually demonstrated proficiency with both the tool and its usage.
These are not idiots; they only created the impression of cluelessness through poor discipline in their indirect beginnings. Both are from locales not my own; this is The South, where we place more emphasis on the relationship than the product or service, but we seldom try to hide our intelligence behind meaningless banter unless distrustful of our hearers. ... Perhaps that is why I felt somewhat offended at being greeted by idioms vice ideas. Trust me, share what you think; be direct; "it's okay."
There were several very good questions asked in Orlando. Two have been sticking in my mind as very valuable to DREw; with gratitude to the attendees, I'll paraphrase:
Could Production defects be given more weight than test? and What about the concept that defects found later in the SDLC are more costly to fix?
The technique for reweighting defect severities at differing SDLC stages (including Production) would be rather simple and could be automated, even though the long-held maxim that defects found later are more costly is being challenged by more agile SDLCs which embrace change. As mentioned before in other fora, my goal is to find/have/share a method for associating defects with reasonable, supported, dollar values. These questions are steps along that journey.
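To suggest how simple such reweighting could be, here is a minimal sketch in Python. Every name and number in it is a placeholder assumption: the severity scores, the stage names, and the stage multipliers would all have to come from the organization's own data, and none of them is drawn from DREw or any standard.

```python
# Hypothetical base scores per severity class (placeholders, not a standard).
SEVERITY_SCORE = {"cosmetic": 1, "minor": 2, "major": 5, "critical": 10}

# Hypothetical multipliers per SDLC stage where the defect was found.
# Setting every multiplier to 1.0 expresses the view that later is not costlier.
STAGE_WEIGHT = {
    "unit test": 1.0,
    "system test": 1.5,
    "acceptance test": 2.0,
    "production": 4.0,
}


def weighted_severity(severity, stage):
    """Scale a defect's severity score by the stage in which it was found."""
    return SEVERITY_SCORE[severity] * STAGE_WEIGHT[stage]


defects = [
    ("minor", "production"),
    ("critical", "unit test"),
    ("major", "system test"),
]

for severity, stage in defects:
    print(severity, stage, weighted_severity(severity, stage))

total = sum(weighted_severity(sev, stage) for sev, stage in defects)
print(total)  # 25.5
```

With these placeholder weights, a minor defect found in Production (8.0) counts for nearly as much as a critical one caught in unit test (10.0), which is the kind of effect the first question asks about.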
A longer step in that direction is to leverage the idea of iteratively delivering business value. If the business value of a usage scenario were a basis for its delivery prioritization, and if Walker Royce’s description (in EXEC02 at RSC2009) of business value as a function of ROI, ROA, and Product Revenue Profile holds, then we are close to formulating a definition for defect value in terms of business dollars (as a function of degree for deliverable business value). Perfect? Not in any sense, but defined, consistent, and supported by the prior decisions of our business stakeholders themselves.
Join in my reverie: Imagine a day when the business representative gives us a use case, the business leadership assigns a dollar value to the basic flow for that use case and a relative value to each scenario, the software delivery team produces and tests the highest priority scenario, finding one defect (of any severity). The benefit of assigning dollar values would lie not in deciding which defects to attack (for the emphasis should be on delivering value, not eliminating defects) but in deciding which defects to study for defect prevention in the next iteration/evolution. We could measure, for each evolution of a product, the change (we would hope, “improvement”) in delivered versus delayed value, perhaps developing a new metric for software which approximates the meaning (not method) of Taguchi’s loss function. When we add common existing methods for tracking project costs, we could offer true insight into practical questions. Imagine being able to quantify, in dollars, how much a practice, say pair programming, improved delivery of business value relative to its cost over a prior method!