About this blog:
This blog focuses on software quality in general, and IBM Collaboration Solutions offerings in particular. The author is an IBM employee, but expresses his observations and opinions as an individual here. The purpose of the blog is to nurture a conversation with our customers and partners about continuous improvement of our software based offerings. ~FTC.
Virtualization has become the norm over the past decade, but multi-tenancy is not the norm in typical on-premises environments. Instead, it is associated with cloud computing. Both are critical because they drive cost advantages and enable the provider to offer a more competitive subscription rate. We have experience hosting single-tenant systems, under monikers like strategic outsourcing, managed operations, and managed service delivery. There is just no comparison. Most customers come to the cloud to save IT cost, in terms of avoiding the larger up-front license cost, in terms of paying only for what you use in the case of so-called metered services, and in terms of a lower overall total cost. And those savings are driven by multi-tenancy and virtualization. Period. Not all systems are inherently architected to be multi-tenant systems, but the overall cloud solution must be. The Blackberry Enterprise Server (BES) is an example in our LotusLive (SmartCloud) environment. BES does not currently have a multi-tenant architecture. To offer a cost competitive BES service to our customers wanting to receive mail on their Blackberries, we've implemented a multi-tenancy architecture on our side connecting into BES without changes needed to the BES source code. If cost is the primary objective, there is no substitute for multi-tenancy; it is essential to cost reduction. Needless to say, architecture, design, coding and testing all have to emphasize prevention of cross-over visibility between tenants. Since multi-tenancy is basically new and rarely implemented in on-premises solutions, there are entire suites of test cases to be added for cloud solutions to verify the complete separation of tenants. Both design and test need to carefully plan around the multi-tenant architecture.
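To give a flavor of what a tenant separation test case can look like, here is a minimal sketch in Python. It is not our actual test suite: the `TenantStore` class, the tenant names, and the record keys are all hypothetical, and a real service would be exercised through its public interfaces rather than an in-memory dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class TenantStore:
    """Hypothetical in-memory store that partitions records by tenant."""
    _records: dict = field(default_factory=dict)

    def put(self, tenant_id: str, key: str, value: str) -> None:
        # Each tenant gets its own dictionary of records.
        self._records.setdefault(tenant_id, {})[key] = value

    def get(self, tenant_id: str, key: str):
        # Look up only inside the caller's own tenant partition.
        return self._records.get(tenant_id, {}).get(key)

    def keys(self, tenant_id: str):
        return sorted(self._records.get(tenant_id, {}))


def check_no_cross_tenant_visibility(store, tenant_a: str, tenant_b: str) -> bool:
    """Return True if neither tenant can read or list the other's data."""
    store.put(tenant_a, "mail/1", "for A only")
    store.put(tenant_b, "mail/2", "for B only")
    a_sees_b = store.get(tenant_a, "mail/2") is not None
    b_sees_a = store.get(tenant_b, "mail/1") is not None
    listing_leak = ("mail/2" in store.keys(tenant_a)
                    or "mail/1" in store.keys(tenant_b))
    return not (a_sees_b or b_sees_a or listing_leak)


if __name__ == "__main__":
    print(check_no_cross_tenant_visibility(TenantStore(), "acme", "globex"))
```

The point of the sketch is that the check covers both direct reads and listings, since a leak through either path would break the separation of tenants.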
Quality has multiple dimensions, but ease-of-use is undeniably a big part of how users subjectively evaluate software they work with. This comparison of IBM Connections and Microsoft SharePoint gives an in-depth illustration of how our teams have worked to make IBM Connections easy and intuitive to use. Collaborating, sharing documents, or becoming a social business are all topics of the day, but as this video demonstrates, to ensure you choose the optimal solution, you have to go beyond the buzz words and look at how well a solution aligns with desired work patterns and enables productivity. Social tagging is a key aspect of IBM Connections, which helps me find relevant material, helps save my own time, and helps prevent me from having to interrupt colleagues with requests. If your team is anything like ours, you have an increasing amount of unstructured information to analyze and drive value from. Without social tagging, and without a single search capability spanning all the content, you couldn't even dream of accomplishing a comprehensive analysis. Forget the buzz words. Witness the power of a well thought out solution that aligns with your needs.
I serendipitously ran into unexpected behavior of my embedded Sametime client on a machine running Notes 8.5.1. I locked out the screen by pressing F5 as I was getting up to go to lunch. After pressing F5, my Notes client will not let me open or send mails, nor will it let me access or write anything in any of the already open Sametime chats. That's expected behavior. But before leaving my desk, I noticed that my manager opened a chat window with a question. Not wanting to let her wait until after lunch, I instinctively typed my answer in the chat window and sent it back successfully. But the client was still 'screen locked', so this was not expected behavior to me.
Now, the Sametime preferences include a section on Auto-Status changes, which can determine what happens in response to changes in your calendar, Notes client, or operating system. I don't have the setting for 'Locking Lotus Notes' selected. Why? Because it forces the status to Do Not Disturb (DND). What I want, when I'm Away from the workstation, is really an Away status, so buddies can leave me messages in new chat windows, which I'll see when I return. With DND, no new chats can be initiated. So I don't check 'Locking Lotus Notes'; instead, I set my Sametime status to Away and lock my Notes screen with F5. That's just my personal preference. You can see my Auto-Status preference settings below.
The observation above took place yesterday. This morning, I started working from home, then screen locked Notes and put the laptop in standby mode, drove to the office, and woke up the laptop to continue. My Notes client is still screen locked with the empty password dialog box showing. I replaced my notes \\notes\data\workspace recently to eliminate an issue caused by a non-released plug-in I was working with. This replacement naturally causes me to lose Eclipse-based settings I had saved, like the geographic locations Sametime associates with each wireless access point I use. So when my laptop automatically attaches to the wireless network in the office, Sametime pops up a dialog for me to enter my 'new' geographic location. I filled it out and applied it successfully. All while the Notes client was still screen locked. Again, that doesn't strike me as 'expected' behavior, but our Sametime security architect confirmed for me that this is 'Working as Designed'. Sametime itself, meaning the standalone Connect client, doesn't have the lock-out concept. You're either contactable (logged on as Available or Away) or not contactable (DND or not online). The intersection with the Notes client is such that the above behavior is what results.
My observations lead me to two suggestions I'd like to hear your take on:
1. Should there be a choice between Away and DND, when you select the 'Locking Lotus Notes' option?
2. In the absence of the Away option, when the client is locked and 'Locking Lotus Notes' is not checked, should responses be prevented not just in existing chats, but in new chats as well, and in dialogs like the one for geographic location, until the password is re-entered?
An important aspect of guiding quality improvement is to ensure we focus on the things that matter to our clients. This topic arose out of my own observations, and out of my own subjective expectations for how I would like to see the product work. So I need your help to find out what our clients think. Is this just a pet peeve of mine, or is this a change we should pursue? How important is this issue to you? Please share your thoughts in comments.
Whenever a high severity incident, such as a service outage, occurs in a cloud environment, repairing the system and bringing services back online is the immediate priority. But we also need to identify and eliminate the root cause of the problem. The root cause is the reason the problem was injected into the system. This is not to be confused with the immediate failure cause. Because the first priority is always to return services to normal, we first chase and correct the immediate cause of failure. In other words, we take a 'repair' action. However, that will often not prevent recurrence. For that we need to take a ‘corrective’ action as well. We have to get deeper and understand why the system entered the problem state. A cause is a root cause, when elimination of the cause eliminates injection of the problem. That's what separates root causes from all other causes. By understanding and eliminating root causes through corrective actions, we can eliminate entire classes of defects or problems, rather than simply fixing the one defect or problem we discovered. But root causes are also more costly to eliminate, especially when they require a change of human behaviors, such as failure to follow written instructions. Monitoring and correcting human system administrator behavior takes time. That's why in the on-premises world, we tend to do more causal analysis, i.e. identifying clusters of similar problems/defects and targeting actions to reduce their occurrence, rather than doing RCA, i.e. determining the ultimate cause of each individual defect, which is time consuming. That balance between the more affordable causal analysis, and the more effective, but costly, root cause analysis, shifts toward RCA in the Cloud services space. To meet and exceed SLA targets, we simply cannot allow the same root cause to hit the availability number twice.
Once an incident has occurred, it is relatively more likely to recur - because the triggering condition now exists - unless the root cause is eliminated. Thus, it is imperative to go after elimination of root causes of any adverse incidents observed, whether or not they caused an outage. The first instinct of many teams once they understand 'what' went wrong is to add a test case to the pre-release test case suites used to qualify new releases. But defect removal as a strategy is almost always inferior to defect prevention, and certainly more costly. By broadening our understanding from 'what' went wrong to also see 'why' it went wrong, we can take a corrective action that eliminates all the potential future problems sharing the same root cause, the same 'why'. A recent out-of-memory condition I worked with provides an example. Adding tests and throttling workloads to the troubled component might solve an immediate problem, but we need to go upstream in the development process and understand why this out-of-memory condition was not prevented by coding better memory management in the first place. By so doing, we can prevent similar out-of-memory issues in all components across our solution. Root cause analysis views the development process as a software manufacturing engine, and when it turns out a defective product, there must be a flaw in the engine to be corrected. Maniacally identifying and correcting these flaws pays off by tuning our engine to become flawless, efficient, and effective. And in the cloud, that is paramount.
PS: To sort the blog and display just the Cloud Difference series, click on the “cloud_difference” tag below the title of any post in the series.
Rapid delivery of enhancements and fixes is even more critical in the cloud. We can't carry a significant backlog of technical debt. The cost of switching providers is smaller in the cloud than it is on-premises, and that means subscribers are quicker to switch. I want to be careful how that comes across. I'm not saying it's ok to carry a large backlog of technical debt for on-premises software. All I'm saying is that the consequences of doing so materialize more immediately in the cloud. It's important to drive down technical debt in all its many aspects: warnings from automated tools doing code scans, unit test code coverage, defects, enhancement requests, complexity reduction, and refactoring. For that reason, the IBM quality program has established a Technical Debt Governance model, which teams are now beginning to adopt in order to strengthen the focus on minimizing technical debt. Sonar is often the tool of choice to create a dashboard detailing technical debt, whether integrated with the IDE (Eclipse, or similar) or integrated with the build systems to produce post build analysis. Allowing developers to see the implications of their code, whether it raises or lowers the overall technical debt, before they even check it in to the source code management system is a powerful approach that motivates cleaner coding, more complete testing, and more complete responses to customer needs. In the cloud, it is also important to understand the range of usage models subscribers want, and to design the software to accommodate them all simultaneously, since it's a multi-tenancy system. Technical debt includes the continuous adjustments needed to better accommodate the preferred usage patterns as they evolve.
News tip: Our education team has just released a free, self-paced course on SmartCloud Notes in a hybrid environment. (The course link was originally posted in the blog Apr 26th 2011. I have updated it on May 21st 2013 after a reader notified me it was broken. The change is a result of the rebranding from LotusLive to SmartCloud).
A hybrid environment allows integration between your on-premises Domino systems and the cloud. Replicating your Domino Directory to the cloud provides for a seamless integration between environments. So rather than replacing existing Domino infrastructure with cloud based offerings, you can leverage the cloud based offerings as an extension of your existing on-premises environment. Your Domino administrators continue to administer on-premises Domino servers and applications, while IBM administers and maintains the SmartCloud Notes mail servers in the cloud.
The effort of moving a particular subscriber company's existing data and users into a hosted cloud solution is referred to as 'onboarding'. It's essentially just the cloud version of migration from one solution to another. But there is one crucial difference in the fact that data are crossing from one provider to another, typically from the subscribing company's prior on-premises solution to the SaaS vendor's cloud solution. Two very important considerations result. First, the volume of data to be migrated may well be so large that a transfer via the internet would take unacceptably long. We have other options, such as shippable physical storage devices, available for that reason. Second, because the data come from another environment, the Cloud vendor can't assume they have been subject to the same level of virus and malware scanning required in the data center environment. For that reason, we undertake scanning of subscriber data as part of the onboarding process. This is clearly necessary, as we have in fact found instances of 'unwanted' content in subscriber data submitted for onboarding. It is key to leverage scan engines updated with the latest virus protection information. With these and other safeguards not described here, we can offer a quick, proven onboarding process.
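A quick back-of-the-envelope calculation shows why transfer over the internet can take unacceptably long. The sketch below is only an illustration: the data volume, link speed, and the 80% efficiency factor are assumed numbers, not figures from any actual onboarding.

```python
def transfer_days(data_terabytes: float, link_mbps: float,
                  efficiency: float = 0.8) -> float:
    """Estimate the days needed to move data over a network link.

    efficiency is an assumed fraction of the nominal link speed that is
    actually usable once protocol overhead and contention are included.
    """
    bits = data_terabytes * 1e12 * 8            # decimal terabytes to bits
    effective_bps = link_mbps * 1e6 * efficiency
    seconds = bits / effective_bps
    return seconds / 86400                      # 86,400 seconds per day


if __name__ == "__main__":
    # Assumed example: 10 TB of mail data over a 100 Mbps link.
    print(f"{transfer_days(10, 100):.1f} days")  # prints "11.6 days"
```

Under those assumptions, the transfer ties up the link for well over a week, which is the kind of result that makes shippable physical storage devices worth considering.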
As a quality engineer, it's important to explore select customer success stories, so we can work to replicate the associated success factors across all deployments. In that vein, I'd like to offer a list of IBM Collaboration Solutions customer success stories for you to enjoy. I’m not including any analysis here; just sharing a collection of testimonials.
These testimonials demonstrate part of the value our software brings to our clients across our portfolio. Enjoy every one. This list actually comes from an internal blog post I wrote last year, so most of the linked videos are about a year old, but I like the collection covering most of our key on-premises products. I’ll naturally continue to share additional, more recent success stories along the way, as I already have with, for example, Signature Mortgage and Colleagues in Care.
All that effort for traditional software products over the years to configure test environments that emulate on-premises customer environments, to prioritize test scenarios, to gauge what platforms customers use the most, and to understand their usage patterns; all of that becomes so much simpler in the Cloud space. We own the production environment and know exactly how it is built. There is only one production environment architecture, even if it is duplicated across multiple data centers. We can bring Test environments as close as we want to the production environment in terms of topology, configurations, settings, data population, workloads and usage patterns. There will always be attempts to cut corners and save expense, but net/net, we know EXACTLY what the production environment looks like, and we can even know how it is being used. We just have to peruse monitoring results to find out. Test environment parity is important to ensure our testing is representative of production environment behavior. In our test environments, we continuously act to keep parity as close as possible. We have updated load balancers, anti-virus software, memory configurations and more, to keep our test environments in sync with how the production environment evolves. We have created a process by which the test environment owners are notified in advance, when changes to the production environment are being planned. Simply telling them when changes are made is not sufficient, as it may take time to plan similar changes for the test environments. There might be hardware, or software licenses, to acquire, or there might be schedule conflicts with ongoing test efforts to resolve before we can execute the update. That is why early notification is necessary. This is simply common sense, but also a great advantage for test engineers, who can have more insight into usage patterns in the Cloud environment, than they are used to from the on-premises world.
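Keeping parity is easier when the differences are made visible. As a minimal sketch, assuming each environment can be described by a simple dictionary of settings (the setting names and values below are invented for illustration), a parity check can be as small as this:

```python
def parity_gaps(production: dict, test: dict) -> dict:
    """Return settings whose values differ between production and test.

    Each gap maps the setting name to a (production, test) pair; a setting
    that is missing from one environment shows up as None on that side.
    """
    gaps = {}
    for key in sorted(set(production) | set(test)):
        prod_value = production.get(key)
        test_value = test.get(key)
        if prod_value != test_value:
            gaps[key] = (prod_value, test_value)
    return gaps


if __name__ == "__main__":
    # Hypothetical environment descriptions.
    production = {"load_balancer": "v2", "antivirus": "2013-05", "memory_gb": 64}
    test = {"load_balancer": "v1", "antivirus": "2013-05"}
    for name, (prod_value, test_value) in parity_gaps(production, test).items():
        print(f"{name}: production={prod_value} test={test_value}")
```

Run against planned production changes rather than completed ones, a report like this gives test environment owners the lead time discussed above.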
A key aspect of any software user experience is the response time from submission of a request until the result is returned, usually in the form of an updated User Interface (UI). In the lab, where we develop new releases, we measure response times for a variety of transactions, such as opening a community, downloading a file, or sending an e-mail, under a variety of workloads. These are system response times. For development purposes and comparison between releases, it makes sense to eliminate all other parts of the total end user response time. But what matters to the subscriber is the end user response time, and that includes not just the system response time, but also the time needed to communicate across the network between the system and the user. For a cloud service, because that communication takes place across the internet, response times are less predictable. Depending on the nature of the protocols being used in any given transaction, and the network latency between the user and the system, that added piece of response time may vary with end user location. Furthermore, network latency is not a constant; it's not merely a function of the end user's physical and logical location on the network. It also depends on the network load at the time you measure latency, and on any content caching in place. So end user response times depend on at least the type of transaction being submitted, on the route taken across the internet, on network workloads at the time, on caching conditions, and on system workload at the time. The cloud service provider has no control over several of these. As a result, it is not possible to offer a map of end user response times based on location. Some cloud service providers offer a speed test tool, which will determine the latency between the user and the service, but the results can only be taken as general guidance.
To offer a realistic impression of system response times in a pre-sales situation, the provider needs to offer time limited trial accounts, which the prospective subscriber can exercise in ways (timing, location, etc) that emulate the intended use. And for development purposes, we need to test the service from locations representative of various subscribers' conditions. Some of these locations can be simulated with systems that add latency to network communications. The advantage of simulated latency is the ability to control it, so we can compare successive releases, when they work under identical latencies. As discussed in Cloud Difference #19: Monitoring is Central, providers need to "see what users see". That thought surely extends to response times, but due to the nature of the internet and the protocols that route traffic across it, two identical transactions submitted in rapid succession can have different response times because they were routed along different paths to reach the data center. For that reason, end user response times are generally statistical measures, either average response times, or 90th percentile response times, or similar. We measure from select strategic points around the globe, which gives us a best case estimate for response times from those locations. But that still doesn’t include what is often referred to as the "last mile": the network segments between the subscriber and the nearest low latency segment of the internet. That last mile may consist of both low bandwidth internet segments and segments within the subscribing company's own intranet between the end user and their company's web facing proxy servers or gateway. We can simulate a generic 'last mile', but every company has different network characteristics.
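To illustrate why a percentile is often more telling than an average, here is a small sketch with made-up response time samples in milliseconds. The nearest-rank percentile used here is just one common definition, chosen for simplicity; it is not a statement of how any particular monitoring tool computes it.

```python
from statistics import mean


def percentile(samples, pct: int):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100   # integer ceiling of pct% of n
    return ordered[max(rank, 1) - 1]


if __name__ == "__main__":
    # Hypothetical response times (ms) for one transaction type,
    # including a couple of slow outliers from unlucky routes.
    samples_ms = [800, 900, 1000, 1100, 1200, 1300, 1400, 1500, 2500, 6000]
    print("average:", mean(samples_ms), "ms")                  # 1770 ms
    print("90th percentile:", percentile(samples_ms, 90), "ms")  # 2500 ms
```

In this invented data set, eight of ten requests complete in 1.5 seconds or less, yet one in ten takes 2.5 seconds or more, which the average alone would not reveal.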
So we address response times at three different locations: (i) at the system itself, (ii) across the network including a set of strategically placed edge caching servers, and (iii) at end user locations within subscribing companies. For the first two locations, we can and do take action to improve response times, by optimizing our services code, and by optimizing caching design and server network. However, in the third, "last mile" scenario, we typically do not have access to the subscriber's internal network, so it's the subscriber, rather than the provider, who must take action in cases where significant latencies in the last mile affect end user response times. We have seen examples of subscribing companies' internal environments intercepting packets and interfering with our services and response times. They are all configuration issues that can be worked out, but it explains why the provider usually cannot make a blanket statement as to what end user response times will be. Detailed configurations (security scanning, proxy configuration, bandwidth, etc) in the subscribing company's network affect end user response times.
IBM announced Sametime 8.5.2 Interim Feature Release 1 this week. Much buzz has circulated about the new features already, and you can read about them in the announcements. But in the vein of my "Why better" postings, including the prior post about Why Sametime 8.5.2 is better last May, I want to briefly share some of the quality improvements we have worked on for this Interim Feature Release, or IFR 1, as well. Every software release from IBM containing new function must also identify and achieve specific quality improvements. As an interim feature release, the aggregate underlying development effort is smaller than a full feature release, which means the quality improvements are also fewer than for a full feature release, but we were able to take some good steps anyway. The quality focus in this release was on the serviceability attributes, which are the abilities to diagnose and correct any problems. We focused on providing more helpful and more meaningful log and error messages in three specific areas: (i) the install experience, (ii) the NAT ICE SDK, and (iii) Meetings.
Within the install, we improved the log and error messages related to validating server connections for other servers in the deployment, such as DB2 and LDAP servers. In addition to improving the validation itself, we also surfaced to the user what is being validated, and what the status of the validation is. Moreover, we reviewed the not yet externalized error messages and externalized them where it made sense to give the user more information. This is an ongoing effort that will continue to improve end user messages in future releases.
For the network address translation (NAT) interactive connectivity establishment (ICE) software development kit (SDK), used for integrating awareness into other applications for example, we enabled full detail prints of IceSession and MediaSession failure traces, as well as TURN - or Traversal Using Relays around NAT - server details when the IceSession is created. We also made a number of improvements to the Logger output from the C++ ICE SDK.
For Meetings, we focused on improving the log messages associated with the AppShare protocol and updated many key messages.
A simultaneous announcement of Sametime Unified Telephony (SUT) 8.5.2 IFR 1 was also made this week. With the new SUT release, we now support virtualization of the SUT server. And we completed a Telephony Control Server (TCS) configuration tool, which can dramatically lower the time needed to configure your solution. We are already receiving very good feedback on this tool. We also coded an automatic restart mechanism for the event of a TURN server crash. Internally, we had also set targets for further expansion and coverage by our automated test suites, especially in the Audio Visual (AV) and Sametime Unified Telephony (SUT) functionality areas, and those targets were also met.
For a discussion of these releases, I recommend listening to Episode 79 of the This Week in Lotus podcast, entitled Why Sametime 8.5.2 IFR 1 definitely ain't no Turkey! Enjoy the new releases taking Video chat and Unified Communications to new heights. To filter the blog and show just the 'why better' entries, click the "better" tag in the line just below the blog entry title.
A great documentary from my colleague Luis Suarez, fearlessly dumping his e-mail inbox and converting to living social. He makes the point that we will see e-mail gradually transition from a content repository to again being a messaging and notification system. I’d venture that it will evolve even further. E-mail’s grip on my work life stems from the fact that I am held accountable for reading and responding to communications there, whereas in the social tools, engagement has so far been driven primarily by where I expect to find value, rather than by who might be requesting an action or a response from me. In my crystal ball, I see a convergence of the request driven and value driven patterns, or push and pull if you wish, in the social tools. We’re building Activity Streams and the like, which will blend both forms, even if we also build filters to offer different views of the Stream.
The challenge is to not re-invent e-mail in a way that carries the same burdens we deal with today; overload and parsing through unnecessary (to me) content. Instead, we need to deliver a social mail and collaboration experience that lets us focus on creating the most value. And – to position myself in that vision – we need to do so with a compelling level of quality, reliability and ease of use.
I would like to share a customer testimonial regarding LotusLive. We have received some nice press coverage in the past year for a very large LotusLive deal, the largest ever. But as this testimonial shows, LotusLive can add value no matter the size of the subscriber's organization.
What I find so interesting in this testimonial is the transformational power of the solution. This is literally a game changer for the subscriber, catapulting their services into the competitive leading edge. This is what we do best. Help customers apply technology to solve business problems. And win.
When the media ask whether cloud computing is ready for prime time, the key topics are resilience and security. Back-up and restore capabilities play an important role for resilience. The ability to recover from adverse events, whether natural disasters, sabotage, disk failures, or other, needs to be broader and more granular than it typically is for on-premises customers. The reasons for this include the concentration of a high number of users being served from one data center and the multi-tenant nature of the system. This means we need the ability to restore not just the whole data center, but also individual companies, individual servers, or individual users, depending on what parts of the system were affected by the disaster. If a limited disaster has rendered an individual company or user 'corrupted', we don't want to have to do a system level restore affecting all users and/or all companies in order to recover those who were corrupted. Rather, we want to be able to perform a restore operation for only the affected companies, servers, or users. Per tenant back-up and restore capability is similarly an important idea unique to multi-tenant cloud environments, though not generally implemented and automated.
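The idea of restoring only what was affected can be sketched in a few lines. This is purely illustrative: the nested dictionaries standing in for live data and backups, and the company and user names, are hypothetical, and a real restore would of course operate on storage systems rather than in-memory structures.

```python
def restore_selected(live: dict, backup: dict, company: str, users=None) -> dict:
    """Return a copy of live data in which one company, or only selected
    users within it, is restored from backup. Other tenants are untouched."""
    restored = {name: dict(records) for name, records in live.items()}
    source = backup.get(company, {})
    targets = list(source) if users is None else users
    for user in targets:
        if user in source:
            restored.setdefault(company, {})[user] = source[user]
    return restored


if __name__ == "__main__":
    live = {"acme": {"ann": "corrupted", "bob": "current"},
            "globex": {"cy": "current"}}
    backup = {"acme": {"ann": "last-good", "bob": "last-good"},
              "globex": {"cy": "last-good"}}
    # Restore a single user; her colleague and the other tenant keep
    # their current data.
    print(restore_selected(live, backup, "acme", users=["ann"]))
```

A system level restore would have rolled every user in every company back to the last backup; the narrower scope limits the impact to those who were actually corrupted.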
Delivering a highly available service is way different from producing a customer installable product. The rightful expectation of the Software-as-a-Service (SaaS), or Cloud, subscriber is that the service is available whenever they need it. A good analog is the dial tone in a land line phone. It's just there when you pick up the phone. And if it's not, your first instinct is to check the cords and make sure the phone is plugged in. In developed countries at least, the absence of a dial tone rarely causes a first assumption that the service is down, but rather that you yourself are at fault somehow, e.g. for not plugging in. That same reliability is expected of Cloud systems. But no Cloud vendors are yet as mature as the PBX systems switching our phone lines. All Cloud vendors have occasional outages still. They're short-lived, but still annoying when they happen. And we're all working to eliminate them through root cause analysis, corrective action, and other means. Most of us come from a background of writing on-premises software, since SaaS is still a young and emerging segment. In some cases, that means there are habits we need to unlearn because they don't work well in the SaaS space. And overall, it is useful to discuss, not just how to develop and deliver (SaaS) Cloud services, but specifically how it differs from our on-premises experience. I plan to share a series of brief observations illustrating the differences between developing & delivering on-premises software, and developing & delivering corresponding cloud services. I will tag each one with 'cloud_difference' for easy collection with the URL: