About this blog:
This blog focuses on software quality in general, and IBM Collaboration Solutions offerings in particular. The author is an IBM employee, but expresses his observations and opinions as an individual here. The purpose of the blog is to nurture a conversation with our customers and partners about continuous improvement of our software-based offerings. ~FTC.
As a quality engineer, I find it important to explore select customer success stories, so we can work to replicate the associated success factors across all deployments. In that vein, I'd like to offer a list of IBM Collaboration Solutions customer success stories for you to enjoy. I'm not including any analysis here; I'm just sharing a collection of testimonials.
These testimonials demonstrate part of the value our software brings to our clients across our portfolio. Enjoy every one. This list actually comes from an internal blog post I wrote last year, so most of the linked videos are about a year old, but I like that the collection covers most of our key on-premises products. I'll naturally continue to share additional, more recent success stories along the way, as I already have with, for example, Signature Mortgage and Colleagues in Care.
Hoping we don't tempt fate with our timing, on Friday the 13th of this month we quietly turned on an automated fault analysis capability for Notes System Diagnostics (NSD) files uploaded to our Technical Support file repository, ECUREP (Enhanced CUstomer REPository). In other words, any time an entitled customer uploads an NSD file, our systems will automatically, without delay, perform an analysis to determine what type of incident is reflected (crash, hang, out-of-memory condition, user-killed processes, etc.) and, for a crash, whether the crash stack contained in the NSD file matches any known problems in our database. In cases where a customer encounters a problem already seen and solved elsewhere, the system will be able to point to the known defect and the associated technote. In cases where the crash stack contained in the uploaded NSD file does not match any known problems, no result is returned, but per standard Support process, a new defect is opened to track the further analysis. The Support Engineer, along with Development, may manually apply other internal tools, such as MemCheck or Laza, to analyze the incident.

Fault Analyzer has shipped with the Domino product since version 7.0 to process data captured with the Automated Data Collection (ADC) feature. Local analysis at the customer site can determine general disposition, or incident type, but local analysis won't match the crash stack against our in-house database of known issues. That database comprises all NSD submissions to the ECUREP system, plus similar data captured in IBM's internal worldwide environment of over 400,000 employees.
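To make the flow concrete, here is a minimal, purely illustrative sketch in Python of the two steps described above: classifying the incident type from the NSD content, and matching the top of the crash stack against a known-issue database. Every name, signature, and defect identifier below is hypothetical; the real Fault Analyzer logic is internal to IBM.

```python
# Purely illustrative sketch; frame names, defect IDs, and heuristics
# are invented and do not reflect the real Fault Analyzer internals.

KNOWN_ISSUES = {
    # top crash-stack frames -> (hypothetical defect, technote reference)
    ("OSFault", "SemWait", "UpdateCollection"):
        ("SPR XXXX123", "technote 4012345"),
}

def classify_incident(nsd_text: str) -> str:
    """Rough disposition of the incident type from the NSD content."""
    text = nsd_text.lower()
    if "fatal thread" in text:
        return "crash"
    if "deadlock" in text:
        return "hang"
    if "out of memory" in text:
        return "out-of-memory"
    return "unknown"

def match_known_issue(crash_stack: list[str], depth: int = 3):
    """Compare the top `depth` frames against known crash signatures.

    Returns (defect, technote) on a match; None means no known match,
    in which case a new defect would be opened per standard process.
    """
    return KNOWN_ISSUES.get(tuple(crash_stack[:depth]))

if __name__ == "__main__":
    stack = ["OSFault", "SemWait", "UpdateCollection", "NSFDbOpen"]
    print(match_known_issue(stack))  # ('SPR XXXX123', 'technote 4012345')
```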
The new automated support analysis leverages the same Fault Analyzer tool available in Domino itself, and runs against our latest database of known issues. It can handle compressed archive files in zip, tar, tar.gz, tar.bz2, ar, jar, dump, and cpio formats, up to 225 MB in size. There is no logical limitation determining the 225 MB cutoff; it's a cautionary, self-imposed limit we have set to avoid slowing down related processes. Once we get a sense of how the analysis system operates, we may alter the limit. The system offers several key advantages. From a customer perspective, a first possible answer is returned much faster in cases where the crash stack signature is known. From a vendor perspective, it provides our engineers a quick first analysis of the diagnostic data. The system 'stamps' the information into the Problem Management Record (PMR) visible to the entitled customer via the Service Request tool on the Web. This helps keep all information related to the customer's issue in one central thread visible to both the customer and the support representative.
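As a hedged sketch of that intake gate (the helper names and whitelist representation are my assumptions, not the actual ECUREP code), the checks amount to little more than a format whitelist plus the size cap:

```python
import os
import tempfile

SUPPORTED_SUFFIXES = (".zip", ".tar", ".tar.gz", ".tar.bz2",
                      ".ar", ".jar", ".dump", ".cpio")
MAX_BYTES = 225 * 1024 * 1024  # the self-imposed cap; may be tuned later

def accept_upload(path: str) -> bool:
    """Queue an uploaded archive for analysis only if it passes both gates."""
    if not path.lower().endswith(SUPPORTED_SUFFIXES):
        return False  # unsupported archive format
    return os.path.getsize(path) <= MAX_BYTES  # enforce the 225 MB limit

if __name__ == "__main__":
    with tempfile.NamedTemporaryFile(suffix=".zip") as f:
        f.write(b"placeholder bytes, not a real archive")
        f.flush()
        print(accept_upload(f.name))  # True: supported suffix, under the cap
```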
Given that we have just launched this automated use of the Fault Analyzer in our support process, we fully expect that we will have opportunities to tune and improve the process as we learn from initial submissions and analyses. A key design concern has been, and continues to be, minimizing false positives. Returning an incorrect defect match could waste time for both our customer and our support representative, so to start with we have set match criteria that we believe are specific enough to minimize false positives, and we continue to review and tune the algorithms. As we learn from the initial submissions, we will look for ways to refine the match criteria to allow more submissions to find a match, but only in cases where we can assure ourselves the identification can be done with sufficient accuracy. Experience from the first couple of weeks shows that less than half the NSD submissions find a match. However, finding matches for 100% of submissions is not our success criterion for automated fault analysis. New problems, e.g., from interactions with newly released third-party components, will obviously have no matches the first time they are submitted by any customer. If we were able to find matches for all submissions, it would mean that all problems were known. And that would mean either that we were terribly lagging in delivering maintenance releases, or that our customers were terribly back-level in applying maintenance. So to improve fault match identification, we're not focused on achieving matches in a specific percentage of cases, but rather on identifying those additional circumstances (crash stack specifics) that allow us to positively match additional known issues, and extending our logic to cover those circumstances as well.
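A toy illustration of that tuning trade-off (the frames and depths are invented): requiring more of the crash stack to match lowers the false positive risk, but also lowers the overall match rate.

```python
def signature(stack, depth):
    """A match signature is simply the top `depth` frames of the stack."""
    return tuple(stack[:depth])

known = {("A", "B", "C", "D")}       # one known 4-frame crash signature

candidate = ["A", "B", "C", "X"]     # same top frames, diverges deeper down

for depth in (2, 3, 4):
    truncated = {sig[:depth] for sig in known}
    print(f"depth={depth}: match={signature(candidate, depth) in truncated}")

# depth=2 and depth=3 report a match (a potential false positive);
# depth=4 correctly reports no match for this distinct crash.
```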
I hope you agree that with the new automated Fault Analysis, we have taken yet another step toward providing more efficient support to our customer base. Crashes, hangs, and resource exhaustion should be rare events, but when they do occur, rapid problem identification is key to minimizing business impact.
This 11-minute video summarizes the key message of Daniel Pink's "Drive: The Surprising Truth About What Motivates Us". It is an insightful and interesting explanation of the motivation behind the open source software movement. And, oh, by the way, I wish I could draw like that...
Autonomy, Mastery, and Purpose describe some, but not all, of the ingredients in a successful breakthrough, whether in product features or in quality improvement. And since software development is a team sport, there is a whole extra dimension of interesting motivation issues to manage. Those aside, enjoy this great video.
Back on April 25th, 2011, I started my Quality Collaboration blog on Lotus Greenhouse. In late September, new authentication requirements were implemented there, requiring a Greenhouse ID to view blog content, and visits to the blog dropped dramatically. As a consequence, I have relocated the blog to the developerWorks site, where you are reading this post. If you were a reader of the blog in its prior location, please update your bookmarks and feeds to reflect the new URLs. Since the blog on Greenhouse was still relatively young, I moved all previously posted content from the Greenhouse blog into the new developerWorks blog, so everything is available and searchable in one place. All the posts below have been copied over from the Greenhouse blog, and all future posts will be added here on developerWorks, not on Greenhouse. The blog itself, regardless of location, is still the Quality Collaboration blog.
Looking forward to continuing the conversation on software quality in the new location. ~Flemming
Quality is everyone's job. None of us is as strong as all of us. You've heard the catch phrases. In quality software engineering, they are true. One of the purposes of this blog is to open the conversation about quality to all customers and partners, regardless of which generation they belong to and how much experience they have with our software. I once hired an intern who discovered a key flaw in a major test harness less than a week after she started, and this was a harness used by an experienced team for years. A fresh set of eyes can sometimes question things that the experienced eye has learned to accept uncritically. Everybody is invited to share insight and ideas via comments in this blog. Here is an interesting article in Forbes magazine by an IBMer explaining the imperative and challenges of networking across generations in the enterprise: http://www.forbes.com/2010/07/14/networking-social-media-employees-leadership-managing-ibm.html. His overview of differences in approach by three generations is illustrative, though obviously not every individual fits their generation's description. It's particularly important in the field of quality that we don't let generational differences in approach stand in the way. We approach software differently. To me, it's a perfectly normal, rational act to open a 'manual' to figure out how a particular piece of software works. Most millennials would never dream of doing that. So I need feedback from all generations to ensure we offer a compelling user or administrator experience for each generation.
Because they are web-facing, SaaS systems are more likely targets of hackers. The more complex the solution being offered, the more ports are likely to be open, and the more risks exist. We mitigate security risks in a variety of ways, ranging from security code reviews to vulnerability scanning and more. We have to design and scan against common vulnerabilities like cross-site scripting, cross-site request forgery, SQL injection, and insecure direct object references, and also against many less common ones. We naturally use our own Rational AppScan tool for vulnerability scanning, but also other approaches and tools. For obvious reasons, I can't share a full list. Almost all of these techniques apply equally to on-premises offerings. SaaS differs from on-premises environments in that the vast majority of the user traffic traverses the internet, not just a company intranet.

And as owners of the production environment, we're responsible for operational security. A top priority in managing a SaaS environment is to keep up to date with security and vulnerability patches. We, just like our customers with on-premises offerings, must set up the necessary processes to keep abreast of available security patches from vendors. We also own the responsibility to run penetration testing against the production environment. Here we need to distinguish between destructive and non-destructive testing. It's fine to define a self-owned account in the production environment and attempt to gain unauthorized access to it, but it's not acceptable if that, or any other penetration testing, interrupts service for the subscribers. Interruptive or destructive testing must naturally be done against a test environment built to mimic the production environment as closely as possible. Our teams also do functional security testing that goes beyond vulnerability testing and penetration testing, because functional failures in the security and privacy functionality, if they existed, could lead to very significant security risk exposures.

A key aspect of security for Cloud solutions is that the security framework must rely more heavily on server-side data security to prevent unauthorized data access, because the client side is usually a browser, over which we have rather limited control. When new browser versions are released, we rarely have a choice of whether or when to support them. Users expect to be able to use the latest version of Firefox, Safari, or whatever browser they prefer the same day it is released. Given the importance of security, which derives from the fact that security incidents can cause loss of trust by customers, we have to assume security flaws will exist in new browser releases, at least until patched. Our server-side security has to handle the task and ensure proper security without relying on browser-side functionality. Security testing is a top priority for every release of LotusLive (SmartCloud).
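To make the server-side point concrete, here is a minimal sketch of object-level authorization enforced on the server rather than trusted to the browser. The data, names, and error type are all hypothetical illustrations (this is not LotusLive code) of defending against an insecure direct object reference:

```python
# Hypothetical illustration only: a server-side ownership check so that a
# client-supplied document id can never expose another user's data.

DOCUMENTS = {
    "doc-1001": {"owner": "alice", "body": "Q3 report"},
    "doc-1002": {"owner": "bob", "body": "Draft contract"},
}

class AccessDenied(Exception):
    pass

def fetch_document(doc_id: str, authenticated_user: str) -> dict:
    """Return a document only if the authenticated user owns it."""
    doc = DOCUMENTS.get(doc_id)
    if doc is None or doc["owner"] != authenticated_user:
        # Same error for "missing" and "forbidden" avoids leaking which ids exist.
        raise AccessDenied(f"{authenticated_user} may not read {doc_id}")
    return doc

if __name__ == "__main__":
    print(fetch_document("doc-1001", "alice")["body"])  # allowed
    try:
        fetch_document("doc-1002", "alice")             # IDOR attempt
    except AccessDenied as exc:
        print("blocked:", exc)
```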
To be precise, the browsers currently supported officially by LotusLive are Internet Explorer, Mozilla Firefox, and Apple Safari; see the system requirements here. Other browsers, like Chrome, Wildfire, and Opera, work for many interactions with LotusLive, but are not officially supported.
The Quality of Service (QoS) depends not just on the technical quality of code and documentation, but also on deployment architecture and instructions, health monitoring, operational procedures, integration into hybrid scenarios, and downtime minimization through redundancy, virtualization, and disaster recovery. In the on-premises world, there is limited incentive for the development team to optimize deployment time. Not that upgrade duration doesn't matter, but other priorities dominate trade-off decisions. In the Cloud world, we look to minimize the service interruptions required to update the environment. Through clustering, service delivery runtime environments, and so on, we can ensure system availability through many upgrades, but some changes still require a planned outage, such as back-end database schema changes.

A highly available (HA) system has redundancy built in, so in the event of a failure the redundant part will take over. When it comes to performing system maintenance and updates, such an HA system needs to be maintainable in a continuously available (CA) fashion. HA is automated, while CA still requires human practice to ensure the correct steps are executed in the correct sequence during updates. The ease of deploying an update, and the time it takes, is what I refer to as deployability, and it has renewed importance in the cloud. Both the technical complexity of the update and the skills of the deployment team matter. A team that has gone through the same deployment several times can execute it more quickly than a team doing the deployment for the first time, and is less likely to execute deployment steps in an incorrect sequence or to miss a check. For that reason, I like to see the production deployment team participate in earlier deployments into the customer acceptance test environment, and into the staging environment, ahead of the actual Go Live date in the production environment. Interestingly, engaging the Web Delivery Operations (production) team in updating pre-release environments has the added side benefit of exposing any differences between production and test environments, which we want to eliminate as discussed in Cloud Difference #4: Test emulates the production environment.

Automation of complete, virtualized service deployments is key to minimizing the opportunities for human error during updates. In addition, we require interim builds to be delivered and deployed into test environments using the same techniques that will be used in the production deployment. And although deployment problems are rare, we require a back-out option that will allow us to quickly fall back to the last known good configuration and release in case a deployment runs into trouble. An alternative option is to deploy on separate hardware and cut traffic over from the load balancers once the new instance is up and running. This can in many cases eliminate the concern over deployment time and back-out options, but suitable hardware is not always available because of the associated cost. So a well-tested, well-rehearsed, automated, and well-executed deployment remains important.
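A minimal sketch of that cut-over alternative (the class and function names are placeholders for illustration, not our actual Web Delivery Operations tooling): keep the last known good instance running, verify the new instance, switch the load balancer, and retain an instant back-out path.

```python
class LoadBalancer:
    """Stand-in for the load balancer that fronts the service."""

    def __init__(self, active: str):
        self.active = active

    def cutover(self, target: str) -> None:
        print(f"routing traffic: {self.active} -> {target}")
        self.active = target

def healthy(instance: str) -> bool:
    # Placeholder health check; a real one would probe service endpoints.
    return True

def blue_green_deploy(lb: LoadBalancer, new_instance: str) -> None:
    previous = lb.active                  # last known good keeps running
    if not healthy(new_instance):
        raise RuntimeError(f"{new_instance} failed pre-cutover checks")
    lb.cutover(new_instance)
    if not healthy(new_instance):         # post-cutover verification
        lb.cutover(previous)              # instant back-out path

if __name__ == "__main__":
    lb = LoadBalancer(active="green-v1")
    blue_green_deploy(lb, "blue-v2")
```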
PS: To sort the blog and display just the ‘Cloud Difference’ series, click on the “cloud_difference” tag below the title of any post in the series.
Virtualization has become the norm over the past decade, but multi-tenancy is not the norm in typical on-premises environments. Instead, it is associated with cloud computing. Both are critical because they drive cost advantages and enable the provider to offer a more competitive subscription rate. We have experience hosting single-tenant systems, under monikers like strategic outsourcing, managed operations, and managed service delivery. There is just no comparison. Most customers come to the cloud to save IT cost: in terms of avoiding the larger up-front license cost, in terms of paying only for what you use in the case of so-called metered services, and in terms of a lower overall total cost. And those savings are driven by multi-tenancy and virtualization. Period. Not all systems are inherently architected to be multi-tenant systems, but the overall cloud solution must be. The BlackBerry Enterprise Server (BES) is an example in our LotusLive (SmartCloud) environment. BES does not currently have a multi-tenant architecture. To offer a cost-competitive BES service to our customers wanting to receive mail on their BlackBerries, we've implemented a multi-tenancy architecture on our side, connecting into BES without requiring changes to the BES source code. If cost is the primary objective, there is no substitute for multi-tenancy; it is essential to cost reduction. Needless to say, architecture, design, coding, and testing all have to emphasize prevention of cross-over visibility between tenants. Since multi-tenancy is basically new and rarely implemented in on-premises solutions, entire suites of test cases must be added for cloud solutions to verify the complete separation of tenants. Both design and test need to plan carefully around the multi-tenant architecture.
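As a hedged sketch of what tenant separation looks like at the data access layer (the schema and data are invented, not the LotusLive implementation): every query is scoped to the authenticated tenant, so cross-tenant visibility is impossible by construction.

```python
import sqlite3

# Invented single-table schema for illustration; every row carries a tenant id.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mail (tenant_id TEXT, subject TEXT)")
conn.executemany("INSERT INTO mail VALUES (?, ?)",
                 [("acme", "Q3 numbers"), ("globex", "Merger draft")])

def inbox(tenant_id: str) -> list[str]:
    """Parameterized AND tenant-scoped: callers see only their own rows."""
    rows = conn.execute(
        "SELECT subject FROM mail WHERE tenant_id = ?", (tenant_id,))
    return [subject for (subject,) in rows]

print(inbox("acme"))  # ['Q3 numbers'], with no cross-tenant visibility
```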
Experience shows that customers coming to the cloud, even though most come for cost reasons, are looking to customize the solution as much as they used to with on-premises software. Facebook, MyYahoo, and other Web 2.0 applications have set end-user expectations for some level of control and customization of the user interface. Building in customization options, including basics like themes and skins, can help make a Cloud offering more compelling, but it has to be designed carefully. While we want to be smart and offer general customization options, we also want to be very careful about the incremental cost of supporting custom layout and especially custom functionality. It drives up cost, and with the cloud being a cost play in the first place, even small increments in cost erode our ability to offer a competitive subscription rate. That's why we have to remain vigilant about cost control in all the choices we make. The cost advantage of cloud offerings stems in large part from multi-tenancy, and the closer each tenant can align with a common design, the more we can drive down cost to offer more attractive subscription rates. It's important to stay in tune with both existing and prospective subscribers here to ensure we strike the right balance between customization and cost control, as also discussed in Cloud Difference #5: Provider controls the Stack. Avoiding customization cost may sound straightforward, but while we have focused on improving Total Cost of Ownership (TCO) for our on-premises software for years, it is new for most teams to focus specifically on the incremental delivery cost associated with customization. Developing detailed cost metrics around delivery operations is a new focus in the cloud, requiring new channels of cooperation between Web Delivery Operations, Development, and Finance, especially to put in place cost models that allow development teams to evaluate and compare the implications of design alternatives even before the coding work is done.
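As a toy example of the kind of cost model meant here (the formula and all figures are invented for illustration): comparing the per-tenant monthly delivery cost of the common design against a customized layout makes the trade-off explicit before any code is written.

```python
def monthly_cost_per_tenant(shared_infra: float, tenants: int,
                            custom_support_hours: float,
                            hourly_rate: float) -> float:
    """Invented model: shared infrastructure amortized across tenants,
    plus tenant-specific support effort for any custom work."""
    return shared_infra / tenants + custom_support_hours * hourly_rate

common_design = monthly_cost_per_tenant(100_000, 5_000, 0.0, 90.0)
custom_layout = monthly_cost_per_tenant(100_000, 5_000, 1.5, 90.0)
print(f"common: ${common_design:.2f}/tenant, custom: ${custom_layout:.2f}/tenant")
```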
PS: To sort the blog and display just the Cloud Difference series, click on the “cloud_difference” tag below the title of any post in the series.