About this blog:
This blog focuses on software quality in general, and IBM Collaboration Solutions offerings in particular. The author is an IBM employee, but expresses his observations and opinions here as an individual. The purpose of the blog is to nurture a conversation with our customers and partners about continuous improvement of our software-based offerings. ~FTC.
Whenever a high severity incident, such as a service outage, occurs in a cloud environment, repairing the system and bringing services back online is the immediate priority. But we also need to identify and eliminate the root cause of the problem. The root cause is the reason the problem was injected into the system, and it is not to be confused with the immediate failure cause. Because the first priority is always to return services to normal, we first chase and correct the immediate cause of failure. In other words, we take a 'repair' action. However, that will often not prevent recurrence. For that we need to take a 'corrective' action as well: we have to dig deeper and understand why the system entered the problem state. A cause is a root cause when eliminating it eliminates injection of the problem. That's what separates root causes from all other causes. By understanding and eliminating root causes through corrective actions, we can eliminate entire classes of defects or problems, rather than simply fixing the one defect or problem we discovered. But root causes are also more costly to eliminate, especially when they require a change in human behavior, such as failure to follow written instructions. Monitoring and correcting system administrator behavior takes time. That's why, in the on-premises world, we tend to do more causal analysis, i.e. identifying clusters of similar problems/defects and targeting actions to reduce their occurrence, rather than root cause analysis (RCA), i.e. determining the ultimate cause of each individual defect, which is time consuming. That balance between the more affordable causal analysis and the more effective, but costly, root cause analysis shifts toward RCA in the Cloud services space. To meet and exceed SLA targets, we simply cannot allow the same root cause to hit the availability number twice.
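To make the SLA stakes concrete, it helps to translate an availability target into its downtime budget: the minutes per year a service is allowed to be down. The targets below are illustrative examples, not contractual SLA figures; this is a quick sketch of the arithmetic:

```python
# Downtime budget per year implied by an availability target.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

def downtime_budget_minutes(availability: float) -> float:
    """Minutes of allowed downtime per year at a given availability level."""
    return MINUTES_PER_YEAR * (1 - availability)

# Illustrative targets only -- not any specific vendor's SLA.
for target in (0.999, 0.9995, 0.9999):
    print(f"{target:.2%} availability -> "
          f"{downtime_budget_minutes(target):.0f} minutes of downtime/year")
```

At "three nines" the whole year's budget is under nine hours, which is why a single recurring root cause can consume it; a repeated outage literally spends the same budget twice.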
Once an incident has occurred, it is relatively more likely to recur - because the triggering condition now exists - unless the root cause is eliminated. Thus, it is imperative to go after elimination of the root causes of any adverse incidents observed, whether or not they caused an outage. The first instinct of many teams, once they understand 'what' went wrong, is to add a test case to the pre-release test suites used to qualify new releases. But defect removal as a strategy is almost always inferior to defect prevention, and certainly more costly. By broadening our understanding from 'what' went wrong to also see 'why' it went wrong, we can take a corrective action that eliminates all the potential future problems sharing the same root cause, the same 'why'. A recent out-of-memory condition I worked on provides an example. Adding tests and throttling workloads to the troubled component might solve an immediate problem, but we need to go upstream in the development process and understand why this out-of-memory condition was not prevented by coding better memory management in the first place. By doing so, we can prevent similar out-of-memory issues in all components across our solution. Root cause analysis views the development process as a software manufacturing engine, and when it turns out a defective product, there must be a flaw in the engine to be corrected. Maniacally identifying and correcting these flaws pays off by tuning our engine to become flawless, efficient, and effective. And in the cloud, that is paramount.
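As an illustration of preventing a whole class of out-of-memory problems by design, rather than testing each component for them after the fact, consider bounding in-memory structures at the data structure level. This is a hypothetical sketch (the class and its names are invented for this post, not code from any product):

```python
from collections import OrderedDict

class BoundedCache:
    """A cache with a hard size cap and least-recently-used eviction.

    Growth is bounded by construction, so no workload can drive this
    structure toward an out-of-memory condition. Hypothetical example.
    """

    def __init__(self, max_entries: int = 1024):
        self.max_entries = max_entries
        self._data = OrderedDict()

    def put(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)  # refresh recency on update
        self._data[key] = value
        while len(self._data) > self.max_entries:
            self._data.popitem(last=False)  # evict least recently used

    def get(self, key, default=None):
        if key in self._data:
            self._data.move_to_end(key)  # mark as recently used
            return self._data[key]
        return default
```

The corrective action here is a coding standard ("unbounded growth is a defect in itself"), which prevents the problem everywhere, rather than a test that catches one instance of it.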
PS: To sort the blog and display just the Cloud Difference series, click on the “cloud_difference” tag below the title of any post in the series.
The prior post stated that "the cloud is a cost play", but that's not really all the cloud is about. The ultra-fast provisioning enabled by virtualization technology can be leveraged to transform the way users are able to work. For example, where most on-premises collaboration environments are fundamentally about internal collaboration, cloud services providers can - if they want - implement ways of allowing guest accounts. The LotusLive, or IBM SmartCloud, service does this. You can invite your suppliers, partners, or customers to collaborate with you without having to pay for an incremental license for their accounts. That's a pretty powerful expansion of how you collaborate. It can transform your collaboration in ways your on-premises environment usually doesn't. Think about event planning, or an ad hoc analysis, or an acquisition, where you might need to share information with collaboration partners outside your company (conference center staff, lawyers, business partners, etc.) for a limited period of time. You don't typically give them access to your internal collaboration environment. But in a properly prepared cloud environment, it is easy to include them and control what they can and cannot access. Internally, we need to ensure special scrutiny is applied to such differentiating features, which our Sales teams are likely to lead with. In all fairness, there are drawbacks too. For example, we're limited in our ability to migrate subscribers' pre-existing Domino applications into a public cloud, because of the multi-tenant nature of the system. But that's why we offer both hybrid and private (single tenant) clouds as well. What can your cloud provider do for you? The options are many.
Although we don't necessarily own the source code for every layer in the stack, we do control which products are used in the stack. Stack management gives us both the opportunity and the responsibility to be smart about our choices for optimum subscriber benefit. For example, our solution involves multiple web server instances, so to limit complexity it makes sense to standardize on one particular web server product for all of them. There may be schools of thought arguing that each web server instance is used slightly differently, and that the choice of web server software should consequently differ to achieve the most efficient solution possible. In my view, stability and cost are both top priorities, and both are served better by standardizing on one common web server. Reducing complexity leads to higher quality.
This applies not just to the choice of web server software, but more broadly to usage patterns as well. On-premises we often tell customers all the ways they CAN do things, while in the Cloud we may want to focus on the optimal way they SHOULD do things. That's because the cloud is a cost play, and the more variation we support in usage patterns, the higher the cost. That's not to say we need to whittle our options down to one single usage model for all, but we do need to strike a balance that is different from on-premises software. I don't pretend that we'll always know which usage model best serves each customer; we still need to offer our users choice. What I am saying is that we need to take the enormous variability of the on-premises world and help our cloud customers condense it into a reasonably limited set of usage models, because that is what enables the cost savings. Just imagine the cost if we each drove our own custom-designed and custom-built car. For cost reasons, the market settles on a reasonably limited set of models. As long as we share the associated cost savings with our customers, we'll be fine. But if we overstep and drive condensation meant only to boost our own margins, without passing savings along to the customer, we will stumble in a competitive landscape. The focus must remain on providing a compelling service at an attractive price point.
All that effort for traditional software products over the years to configure test environments that emulate on-premises customer environments, to prioritize test scenarios, to gauge which platforms customers use the most, and to understand their usage patterns - all of that becomes much simpler in the Cloud space. We own the production environment and know exactly how it is built. There is only one production environment architecture, even if it is duplicated across multiple data centers. We can bring test environments as close as we want to the production environment in terms of topology, configurations, settings, data population, workloads and usage patterns. There will always be attempts to cut corners and save expense, but net/net, we know EXACTLY what the production environment looks like, and we can even know how it is being used; we just have to peruse monitoring results to find out. Test environment parity is important to ensure our testing is representative of production environment behavior. In our test environments, we continuously act to keep parity as close as possible. We have updated load balancers, anti-virus software, memory configurations and more, to keep our test environments in sync with how the production environment evolves. We have created a process by which the test environment owners are notified in advance, when changes to the production environment are being planned. Simply telling them when changes are made is not sufficient, as it may take time to plan similar changes for the test environments. There might be hardware or software licenses to acquire, or there might be schedule conflicts with ongoing test efforts to resolve before we can execute the update. That is why early notification is necessary. This is simply common sense, but also a great advantage for test engineers, who gain more insight into usage patterns in the Cloud environment than they are used to from the on-premises world.
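The parity-keeping process described above also lends itself to simple automation: exporting key settings from both environments and diffing them surfaces drift before it can skew test results. A minimal sketch of such a check, with setting names invented purely for illustration:

```python
def config_drift(prod: dict, test: dict) -> dict:
    """Report settings that differ between production and test.

    Returns {setting: (prod_value, test_value)}; a setting present in
    only one environment shows None for the other.
    """
    all_keys = set(prod) | set(test)
    return {k: (prod.get(k), test.get(k))
            for k in all_keys
            if prod.get(k) != test.get(k)}

# Hypothetical settings for illustration only.
prod = {"lb_version": "3.2", "av_signatures": "2011-10-01", "jvm_heap_mb": 4096}
test = {"lb_version": "3.1", "av_signatures": "2011-10-01", "jvm_heap_mb": 4096}
print(config_drift(prod, test))  # flags the load balancer version mismatch
```

Run on a schedule, a report like this turns "keep the environments in sync" from a good intention into a checkable condition.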
Point releases concentrate risk, and in the cloud, that's not smart. The point release mindset no longer applies when you operate in the cloud. In the on-premises world, we bundle literally thousands of code changes into a single point release. The aggregated value of the many code changes compensates for the effort an on-premises customer needs to expend to update servers, clients, directories, etc. in their global environment. In the cloud, there is little to no effort required of the customer to upgrade to the new release, so there is no particular reason to concentrate all the code changes in a point release. Instead, we can spread the risk by deploying in much smaller increments, one component at a time. Architectural separation of the solution components is pivotal to building an environment in which we can update one component without requiring a simultaneous update of another. Since our LotusLive (IBM SmartCloud) system in large part hosts code that was originally developed for on-premises products, where architectural separation of components was a lesser concern, we are now making a series of changes to further separate components. In that process, we are also reducing code complexity; a great benefit to go along with the risk reduction. This all leads to more frequent, and smaller, releases. We now see our collaboration services deploy so-called Tune-Ups roughly monthly. They can contain both fixes and new function. Whenever new function is included, subscribers need to be enabled with information about the new functionality and how to use it. That's a good reason to group the changes by component, refreshing one component at a time, rather than scattering updates across all the components. The stream of updates still has to be consumable to the subscribers, who will often want to update their help desks for each change. A random stream of changes would be confusing to both help desks and end users.
Groups of changes around a particular component or theme are much more consumable. This 'grouping' strategy requires a reasonably rapid release cycle, so that no single component waits too long to bring its updates to the production environment.
A great documentary from my colleague Luis Suarez, fearlessly dumping his e-mail inbox and converting to living social. He makes the point that we will see e-mail gradually transition from a content repository to again being a messaging and notification system. I’d venture that it will evolve even further. E-mail’s grip on my work life stems from the fact that I am held accountable for reading and responding to communications there, whereas in the social tools, engagement has so far been driven primarily by where I expect to find value, rather than by who might be requesting an action or a response from me. In my crystal ball, I see a convergence of the request driven and value driven patterns, or push and pull if you wish, in the social tools. We’re building Activity Streams and the like, which will blend both forms, even if we also build filters to offer different views of the Stream.
The challenge is to not re-invent e-mail in a way that carries the same burdens we deal with today; overload and parsing through unnecessary (to me) content. Instead, we need to deliver a social mail and collaboration experience that lets us focus on creating the most value. And – to position myself in that vision – we need to do so with a compelling level of quality, reliability and ease of use.
Enablement materials help a licensee's or subscriber's end users, help desk and administrators understand how to leverage the capabilities at their fingertips. In the on-premises world, those materials ship with a new release, or are posted to the web shortly after. Yeah, I sometimes battle people's interpretation of 'shortly'. But you know what I mean. Enablement materials are generally launched with the (on-premises) product, which is natural because minor changes in a release can happen very late in the release project cycle, and we want to keep the enablement materials in sync with the code. However, in the enterprise Cloud space we rely on the subscribing company's internal help desk to handle end user calls. That's not in and of itself too different from what we do on-premises; the real difference is that the customer's help desk receives calls from end users the first day a new feature has gone Live in our hosted production environment. In the on-premises world, there is time to enable the customer's help desk between the day a new release ships and the day the customer upgrades to it. That window shrinks to zero in the cloud world. So in order to serve their end users well, we need to enable customers' help desks on new releases BEFORE they actually go live. For that reason, we generally provide "What's New" documentation in multiple languages to our subscriber companies several weeks in advance of the Go Live date, giving them the opportunity to ensure their help desk is ready to answer calls on the new release from day 1. The earlier delivery of the enablement materials imposes a freeze on release content; once the enablement materials have been shared with customers, we have to keep release content from changing.
Another aspect of pre-release enablement is our User Acceptance Test environment, which gives select administrators from subscribing companies access to exercise pre-release code and get familiar with new function before it is launched into the production service. The word "select" indicates that we're not pushing administrators to leverage the environment. We're working with those who express the desire to prepare for the new release in greater detail. The administrators who do are often from large enterprises with large numbers of users and an internal help desk that needs to be enabled in advance of the Go Live date.
Or more to the point, they are forcefully upgraded to it on day 1, so minimizing defect deferrals is really important. Well, isn't that always important, you might say. In general yes, of course it is, but the Cloud and On-Premises businesses differ. A mode of cooperation has evolved in the on-premises business, where - for better or worse - releases numbered .0 [dot zero] are often not rolled out in enterprise production environments. It's not that we or other software vendors don't stand behind, or thoroughly test, our .0 releases. But many of our enterprise customers want to get their hands on the new functionality in the latest release, so they can set it up in a trial environment internally and start becoming familiar with it. They inherently, though rarely explicitly, accept that while we obviously run the full regression test suites against .0 releases, usage patterns for the new functionality may not be fully known before the software ships, and so test coverage may leave gaps, especially in areas of unforeseen configurations and usage. Early use in trial environments allows identification of problematic configurations, which in turn allows us to harden the code by fixing issues in the .1 [dot one] maintenance release. Some would say this should happen via beta testing prior to release, and it does to some extent, but the full trial coverage doesn't happen until we put the .0 release out. Customers using the software in a standard configuration with traditional usage patterns are normally fine with the .0 release. It is at the more [IT] innovative enterprises, who push the envelope with unique configurations and usage patterns that leverage the newest functionality, that we find the most .0 defects. As a quality engineer, I strive constantly to improve the development and test processes to reduce defect rates, but I'm not blind to the symbiosis with innovative enterprises in play here for On-Premises environments.
In the Cloud space, the issue of defects in a first release of new function takes on a different level of importance, for at least two major reasons. First, all users upgrade on the same day. It's not just the innovative bleeding-edge customers, willing to encounter and resolve issues in order to engage with new function early, who adopt the release. It is everybody. Including lots and lots of users with lesser software skills than the bleeding-edge experimenters. It really requires a different mindset. Average users need rock solid reliability to do their day job. They perhaps care less about new functionality, but they care much, much more deeply about reliability than the experimenters do. Many Cloud providers, including us, have a legacy background in offering on-premises software products, and for those, this is a difference. We need to take to heart that the old balance doesn't apply any more. Release criteria need to be tighter. Defect deferrals fewer. Test coverage wider.
Some Cloud vendors allow multiple releases to be in production simultaneously, but that is not the case for our offerings (LotusLive, SmartCloud).
PS: The blond character above is 'Fletcher', whom I have recruited to illustrate several of the Cloud differences in this intended series. Fletcher is my avatar in an internal comic strip used occasionally in our corner of IBM. I am grateful to my creative colleague, Jennifer Kelley, for coming up with Fletcher.
Delivering a highly available service is way different from producing a customer-installable product. The rightful expectation of the Software-as-a-Service (SaaS), or Cloud, subscriber is that the service is available whenever they need it. A good analog is the dial tone on a land line phone. It's just there when you pick up the phone. And if it's not, your first instinct is to check the cords and make sure the phone is plugged in. In developed countries at least, the absence of a dial tone rarely causes a first assumption that the service is down, but rather that you yourself are somehow at fault, e.g. for not plugging in. That same reliability is expected of Cloud systems. But no Cloud vendor is yet as mature as the PBX systems switching our phone lines. All Cloud vendors still have occasional outages. They're short-lived, but still annoying when they happen. And we're all working to eliminate them through root cause analysis, corrective action, and other means. Most of us come from a background of writing on-premises software, since SaaS is still a young and emerging segment. In some cases, that means there are habits we need to unlearn, because they don't work well in the SaaS space. And overall, it is useful to discuss not just how to develop and deliver (SaaS) Cloud services, but specifically how it differs from our on-premises experience. I plan to share a series of brief observations illustrating the differences between developing & delivering on-premises software, and developing & delivering corresponding cloud services. I will tag each one with 'cloud_difference' for easy collection with the URL:
Congratulations to our newly elected President & CEO, effective January 1st 2012, Ginni Rometty. She understands better than almost anyone the need to constantly reinvent ourselves and our company. To take risks. To grow. To learn. To matter. To contribute. To lead.
Thank you to our outgoing CEO, Sam Palmisano, for an amazing 10 year run at the helm of Big Blue. Having positioned us well in terms of both company performance and succession, I know you won’t mind me saying this, and in fact, I suspect you’ll agree: The best is yet to come!
Continuing in the vein of prior posts with the ‘better’ tag, I want to describe quality improvements made in recent releases. Notes/Domino 8.5.3 has just been released, and you can read about new features in the announcement. There’s plenty to like. The embedded Symphony version has been updated to 3.0, and the embedded Sametime to 8.5.1, both key advances. There are enhancements to XPages and the Domino Designer, and much more. But quality is not just about new features. It’s about all features working well.
The overall quality objectives for the 8.5.3 maintenance release were to significantly reduce the outstanding defect backlog, to improve integration with companion products like Connections, Sametime and Symphony, and to expand test coverage and test automation. The team has delivered on all of those objectives. All major components (Domino, Notes, Designer, Traveler) reduced their deferred defect backlogs by considerable amounts, some by more than half. The vast majority of those defects had not been reported by customers; they were found in house. Removing them eliminates the risk that our customers will run into them. Reducing internal defect backlogs is always an objective for a modification release (a.k.a. maintenance release), but release 8.5.3 achieved reductions greater than is typical for a maintenance release.
Security is a high priority for any release from IBM. In Notes/Domino 8.5.3 we moved systematically forward with further detailing of our threat model and the adoption of Rational AppScan Enterprise Edition for testing of the full attack surface across the Notes client. Similar efforts were done for Traveler and for iNotes. (Domino did this work previously.) All the components had security testing in the past; what's changing is that we're adding Rational AppScan testing across all of our portfolio. And, of course, we resolve all security defects before releasing.
The Domino team also focused on memory-related improvements in release 8.5.3, delivering new NSD macros, and an administrator capability to track and drop 'bad' IMAP sessions, which can cause server crashes. A key improvement is a substantial reduction in the use of shared 16-bit handles, which will reduce the kind of conflicts that can cause hangs or crashes. The aggregate result is an even more stable solution. For the Domino Configuration Tuner (DCT), we continue to deliver additional rules to help you ensure your environment is optimized. If you use DCT, be sure to download new rules regularly. We add new rules at least quarterly, and sometimes monthly. For iNotes, we continue to focus on achieving full parity between the Notes and iNotes client experiences, delivering important improvements to sorting by subject, to auto-processing of calendar entries, and the option to not expand personal groups when sending.
Less visible to our customers is the continued progress on test automation. The more of our standard test scenarios are automated, the more time our engineers can devote to specialized, exploratory testing around new features. Some critical areas have doubled the number of automated tests this year, freeing up engineers to expand coverage, all part of our continuous improvement effort.
Release 8.5.3 is the next global deployment candidate for IBM’s own internal environment of nearly 400,000 users around the globe. Prior to release, the IBM CIO’s Office deployed it to over 4,000 IBM employees, and our Services Division deployed a pre-release build for over 14,000 users. That means over 18,000 people were using it daily before we declared it ready to ship. The CIO servers are primarily AIX and zLinux servers. Although the majority run the client on Windows, there are a few hundred running on Linux and Mac platforms as well.
In summary, there’s a lot to like about Notes/Domino 8.5.3. I’ve described a few highlights of the quality effort here, but of course the proof is in the pudding, or more accurately, in the released software. Enjoy the new release. As always, feedback is welcome.
I’d like to share another LotusLive customer testimonial with you, this one from Colleagues in Care, a non-profit organization of healthcare providers, who have worked in Haiti for over ten years. I had the privilege of meeting today with Drs Kenerson & Hanson, who appear in the video, to discuss how they collaborate in the cloud. LotusLive has a unique guest model, allowing subscribers to invite external guests, which is perfect for an organization that relies on large numbers of volunteers, many of whom collaborate for relatively short periods of time. Naturally, we’re looking at ways to further enhance this particular aspect of LotusLive.
It is fascinating how a collaboration process we leverage every day, and at some level take for granted, can make such a significant contribution when applied to a very real need in a non-profit organization leveraging knowledge from thought leaders around the world. Take a look at the amazing work of Colleagues in Care.
Back on April 25th 2011, I started my Quality Collaboration blog on Lotus Greenhouse. Due to new authentication requirements implemented in late September, which require a Greenhouse ID in order to view blog content there, visits to the blog dropped dramatically. As a consequence, I have relocated the blog to the developerWorks site, where you are reading this post. If you were a reader of the blog in its prior location, please update your bookmarks and feeds to the new URLs. Since the blog on Greenhouse was still relatively young, I moved all the previously posted content from the Greenhouse blog into the new developerWorks blog, so everything is available and searchable in one place. All the posts below have been copied over from the Greenhouse blog. All future posts will be added here on developerWorks, not on Greenhouse. The blog itself, regardless of location, is still referred to as the Quality Collaboration blog.
Looking forward to continuing the conversation on software quality in the new location. ~Flemming
My colleague Jon Mell in his 'Social Collaboration' blog discusses five myths of social software. I'd like to share a story supporting his debunking of Myth #1, "It's all about Facebook", as well as Myth #2, "It's all about Generation Y". One of my many hats is being an IBM "Lab Advocate" for several accounts. Lab Advocates help customers and partners leverage our portfolio of offerings. The relationship is often supported by a non-disclosure agreement allowing the Lab Advocate to disclose future plans, helping the customer or partner position themselves to take advantage of new products and releases. One of my accounts is a large Partner covering multiple countries and partnering with multiple hardware and software vendors. We had a competitive situation there last year, in which Lotus Connections and a competing product vied to become their social software solution of choice. The existing environment already had many components from the competing vendor within it, so proposing Lotus Connections was raising eyebrows. It was viewed as a new species. But an internal group had conducted a Proof of Concept (PoC) exercise with release 2.5. They were preparing their report to the executive leadership team when I first met them to help position Connections. Like many "lab rats", I know much more about our own product than I do about the competing product. I reached out to the Connections team, but they were in a critical customer meeting introducing the beta release of version 3.0 at exactly the only time I could schedule the Partner to go over this. So I clearly needed to come up to speed very quickly on how to competitively position Connections versus this particular competing product.
That's when "Connections helped Connections".
Within our own internal (w3) deployment of Connections, I was able to quickly find the Product Management community for Lotus, join it, and locate a competitive comparison between Connections and the competing product, which in turn enabled me to make substantial arguments to the account in favor of our product. This would have taken much, much longer without Connections, and I would not have had my arguments in time. Thankfully, our Product Management team had decided to share their insight by posting an excellent write-up comparing the two products. The same outcome would flat-out not have been possible in the competing product, because they don't have the concept of joining a community at will. They require a system administrator to grant access to each 'site', which is inherently less social, and which takes more time. In addition, search in the competing product is done one site at a time, so if you don't know where to look in their solution, you're up the creek without a paddle. Connections is so much smarter, and so much more social, because it is built as a social software solution from the ground up, rather than being an existing solution re-purposed to become more 'social'.
Because of my intervention and the plans I shared around the upcoming release 3.0, which we eventually launched in November 2010, the Partner decided to continue their release 2.5 PoC into a release 3.0 beta deployment. Did I mention the vast majority of the Partner's software driven revenue is from non-IBM products today? That certainly causes a challenge for us, but it also makes it extra exciting to grow our relationship based on excellent products like Connections, and based on the efforts of passionate colleagues willing to share their insight.