About this blog:
This blog focuses on software quality in general, and IBM Collaboration Solutions offerings in particular. The author is an IBM employee, but expresses his observations and opinions as an individual here. The purpose of the blog is to nurture a conversation with our customers and partners about continuous improvement of our software based offerings. ~FTC.
Users are not all local to your time zone. They may be spread across the globe. Not only that, but - even if they were all local - different subscribers have different usage profiles. Many office workers use their systems primarily 8-5. And software developers all night it seems :-). Retailers often use systems until 9 or 10 pm. Entertainment industry users may need their system well into the night. And realty and mortgage companies tend to use systems intensely on the weekend, when home buyers are most active. We have customers in all of those categories. This continuous system workload is another effect of both multi-tenancy and the global dispersion of the user base. There is never a good time for a maintenance window, not even on the weekend, if ‘maintenance window’ means an outage of the services. We deal with that - as do all cloud service providers - by minimizing the need for maintenance outages, so we can accomplish most updates without an outage, and only need to take the system off-line briefly in case updates need to be made to a database schema. (PS: For larger enterprises, this challenge applies in their on-premises environments as well. Only smaller enterprises operating in one or a few contiguous time zones have a workload variation with slow periods suited for a maintenance outage. In the cloud, there are no slow periods).
Or more to the point, they are forcefully upgraded to it on day 1, so minimizing defect deferrals is really important. Well, isn't that always important, you might say. In general yes, of course it is, but the Cloud and On-Premises businesses differ. A mode of cooperation has evolved in the on-premises business, where - for better or worse - releases numbered .0 [dot zero] are often not rolled out in enterprise production environments. It's not that we or other software vendors don't stand behind, or thoroughly test, our .0 releases. But many of our enterprise customers want to get their hands on the new functionality in the latest release, so they can set it up in a trial environment internally and start becoming familiar with it. They inherently, though rarely explicitly, accept that while we obviously run the full regression test suites against .0 releases, usage patterns for the new functionality may not be fully known yet before the software ships, and so test coverage may leave gaps, especially in areas of unforeseen configurations and usage. Early use in trial environments allows identification of problematic configurations, which in turn allows us to harden the code by fixing issues in the .1 [dot one] maintenance release. Some would say this should happen via beta testing prior to release, and it does to some extent, but the full trial coverage doesn't happen until we put the .0 release out. Customers using the software in a standard configuration with traditional usage patterns are normally fine with the .0 release. The more [IT] innovative enterprises, who push the envelope with unique configurations and usage patterns that leverage the newest functionality, are where we find the most .0 defects. As a quality engineer, I strive constantly to improve the development and test processes to reduce defect rates, but I'm not blind to the symbiosis with innovative enterprises in play here for On-Premises environments.
In the Cloud space, the issue of defects in a first release of new function takes on a different level of importance, for at least two major reasons. First, all users upgrade on the same day. It's not just the innovative bleeding edge customers, who are willing to encounter and resolve issues to engage with new function early, who adopt the release. It is everybody. Including lots and lots of users with less software expertise than the bleeding edge experimenters. It really requires a different mindset. Average users need rock solid reliability to do their day job. They perhaps care less about new functionality, but they care much, much more deeply about reliability than the experimenters do. Many Cloud providers, including us, have a legacy background in offering on-premises software products, and for those, this is a difference. We need to take to heart that the old balance doesn't apply any more. Release criteria need to be tighter. Defect deferrals fewer. Test coverage wider.
Some Cloud vendors allow multiple releases to be in production simultaneously, but that is not the case for our offerings (LotusLive, SmartCloud).
PS: The blond character above is 'Fletcher', whom I have recruited to illustrate several of the Cloud differences in this intended series. Fletcher is my avatar in an internal comic strip used occasionally in our corner of IBM. I am grateful to my creative colleague, Jennifer Kelley, for coming up with Fletcher.
I’d like to share another LotusLive customer testimonial with you, this one from Colleagues in Care, a non-profit organization of healthcare providers, who have worked in Haiti for over ten years. I had the privilege of meeting today with Drs Kenerson & Hanson, who appear in the video, to discuss how they collaborate in the cloud. LotusLive has a unique guest model, allowing subscribers to invite external guests, which is perfect for an organization that relies on large numbers of volunteers, many of whom collaborate for relatively short periods of time. Naturally, we’re looking at ways to further enhance this particular aspect of LotusLive.
It is fascinating how a collaboration process we leverage every day, and at some level take for granted, can make such a significant contribution when applied to a very real need in a non-profit organization leveraging knowledge from thought leaders around the world. Take a look at the amazing work of Colleagues in Care.
Nice succinct statement from colleague Anna Dreyzin on why social business is important to her, all packed inside 30 seconds. She talks about listening, engaging and helping, adding value and building reputation, managing feedback, extending reach, increasing engagement and growing advocacy. I agree with them all, and I would also highlight Searching. The ability to share and search knowledge across a large team is highly valuable. For an example, see my prior post on How Connections helped Connections.
A quick pointer to an interesting discussion started by my colleague, Fernando Salazar, in his blog on the confluence of UCC and Social Business: Unified Communications & Social Business, Part I: Apples & Oranges, or Salad Supreme? As we integrate UCC into our social business solutions, what are the success factors we need to prioritize? Social collaboration has multiple dimensions we need to think thru as software vendors.
The major dimension that comes to my mind is the distinction between ‘push’, or ‘command’, driven collaboration versus what I call ‘value’ driven collaboration. When I go to social collaboration systems, I go because I expect to find and leverage value there. Primarily information I need. Nobody is telling me to go there. If I don’t access a particular community, activity or forum for months, nobody is holding me accountable for being a no-show. The value is in my results. But every business also needs a ‘command’ channel for certain types of communications. My manager holds me accountable for being up to date with my e-mail because that’s where ‘command’ communications happen today. As we think of integrating collaboration, we have to be careful to allow appropriate separation, or filtering, of these types of collaboration. The last thing I want is an overcrowded message stream resembling an overcrowded e-mail inbox. I need filtering that makes it easy and intuitive to separate the ‘command’ and ‘value’ driven forms of collaboration.
The UCC/Social relationship is another interesting dimension, which focuses on whether you need the answers instantly or not, and whether you know who to ask. As much as technology allows you to ask a group of people the same question, it would clearly be too interruptive if we all sent out multi-person polls every time we needed an answer. When it comes to the value driven information exploration work, I often go to a social collaboration system without knowing who the author is of the information I seek. [See my blog post entitled “How Connections helped Connections” for an example]. Yet, when I find the information, I may want to contact the author for additional perspective. UCC is more acceptable (less intrusive) when used for 1:1 communication. It’s also great for many:many collaboration, but that would be for meetings, etc. So the synergy between UCC and Social technologies bridges that spectrum, with Social focused on the many players and UCC focused on fewer players. I may use the social software to search a great many authors & docs & tags, and then use UCC software to gather context, chat with the author, or have a synchronous meeting with the team using the information.
No doubt we need the explicit communication facilitated by UCC. But as we integrate UCC into the social collaboration models, one of the keys is to pay attention to the different modes of collaboration (1:1, many:many, information exploration, information dissemination, decision making, etc) and integrate the right technology for the right task in the right place; not just offer ubiquitous presence awareness, or every capability in every place, but offer the right capability in the right place. This is challenging because the social software usage models are not always well defined. Vendors write their software to be configurable and adaptable to appeal to the widest possible set of enterprises, yet often fail to offer more prescriptive guidance to their customers in best practices and best usage models. Which means UCC software has to be very flexible, allowing for efficient integration into different usage models. System administrators need to be enabled to configure which integration points to surface, and which ones to keep dormant, based on their preferences and the trade-offs they’re willing to make between functionality and intrusiveness.
What do you see as critical success factors for integrating UCC into Social Business solutions? Please submit answers via Fernando’s blog.
Hoping we don't tempt fate with our timing, on Friday the 13th of this month, we quietly turned on an automated fault analysis capability for Notes System Diagnostics (NSD) files uploaded to our Technical Support file repository, called ECUREP (for Enhanced CUstomer REPository). In other words, any time an entitled customer uploads an NSD file, our systems will automatically - without delay - perform an analysis to determine what type of incident is reflected (crash, hang, out-of-memory condition, user-killed processes, etc), and for a crash, whether the crash stack contained in the NSD file matches any known problems in our database. In cases where a customer encounters a problem already seen & solved elsewhere, the system will be able to point to the known defect and the associated technote. In cases where the crash stack contained in the uploaded NSD file does not match any known problems, no result is returned, but per standard Support process, a new defect is opened to track the further analysis. The Support Engineer, along with Development, may manually apply other internal tools, like MemCheck or Laza, to analyze the incident. Fault Analyzer has shipped with the Domino product since version 7.0 to process data captured with the Automated Data Collection (ADC) feature. Local analysis at the customer site can determine general disposition, or incident type, but local analysis won’t match the crash stack against our in-house database of known issues. That database comprises all NSD submissions to the ECUREP system, plus similar data captured in IBM’s internal worldwide environment with over 400,000 employees.
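The triage flow described above can be sketched roughly as follows. This is purely an illustrative model, not the actual Fault Analyzer internals: the function names, the toy known-issues table, and the idea of hashing the top symbolic frames into a signature are all my assumptions for the sake of the sketch.

```python
import hashlib

# Toy known-issues database: signature -> (defect id, technote).
# In the real system this corresponds to the in-house database of
# known crash stacks; the entries here are made up.
KNOWN_ISSUES = {
    "sig-a1b2c3d4": ("SPR ABCD123", "technote 1500000"),
}

def stack_signature(frames):
    """Reduce a crash stack to a stable signature by hashing the
    top few symbolic frames with their offsets stripped."""
    top = [f.split("+")[0] for f in frames[:5]]
    digest = hashlib.sha1("|".join(top).encode()).hexdigest()[:8]
    return "sig-" + digest

def triage(incident_type, frames):
    """Classify an incident; for crashes, look for a known match.
    Non-crash dispositions are routed to a support engineer."""
    if incident_type != "crash":
        return ("no-match", f"{incident_type}: route to support engineer")
    sig = stack_signature(frames)
    if sig in KNOWN_ISSUES:
        defect, technote = KNOWN_ISSUES[sig]
        return ("match", f"known issue {defect}, see {technote}")
    return ("no-match", "new crash stack: open defect for analysis")
```

The key design point the sketch captures is that only a positive signature match returns an answer to the customer; everything else falls through to the standard Support process.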
The new automated support analysis leverages the same Fault Analyzer tool available in Domino itself, and runs against our latest database of known issues. It can handle compressed archive files in zip, tar, tar.gz, tar.bz2, ar, jar, dump and cpio formats up to 225 MB in size. There is no logical limitation determining the 225 MB cutoff; it's a cautionary, self-imposed limit we have set to avoid slowing down related processes. Once we get a sense of how the analysis system operates, we may alter the limit. The system offers several key advantages. From a customer perspective, a first possible answer is returned much faster in cases where the crash stack signature is known. From a vendor perspective, it provides our engineers a quick first analysis of the diagnostic data. The system 'stamps' the information into the Problem Management Record (PMR) visible to the entitled customer via the Service Request tool on the Web. This helps keep all information related to the customer's issue in one central thread visible to both the customer and the support representative.
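A pre-check gate mirroring the accepted formats and the self-imposed 225 MB cap might look like the sketch below. The function name and exact check are mine, not the actual ECUREP implementation; the format list and the cap come straight from the text above.

```python
# Archive formats the automated analysis accepts, per the list above.
ACCEPTED_SUFFIXES = (".zip", ".tar", ".tar.gz", ".tar.bz2",
                     ".ar", ".jar", ".dump", ".cpio")

# Cautionary, self-imposed cap of 225 MB; subject to tuning later.
MAX_BYTES = 225 * 1024 * 1024

def accept_upload(filename, size_bytes):
    """Return True if the archive is a supported format and under the cap."""
    if not filename.lower().endswith(ACCEPTED_SUFFIXES):
        return False
    return size_bytes <= MAX_BYTES
```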
Given that we have just launched this automated use of the Fault Analyzer in our support process, we fully expect that we will have opportunities to tune and improve the process as we learn from initial submissions and analyses. A key design concern has been – and continues to be - minimizing false positives. Returning an incorrect defect match could potentially waste time for both our customer and our support representative, so to start with we have set match criteria that we believe are specific enough to minimize false positives, but we continue to review and tune the algorithms. As we learn from the initial submissions, we will look for ways to refine the match criteria to allow more submissions to find a match, but only in cases where we can assure ourselves the identification can be done with sufficient accuracy. Experience from the first couple of weeks shows that less than half the NSD submissions find a match. However, finding matches for 100% of the submissions is not our success criterion for automated fault analysis. New problems, e.g. from interaction with newly released 3rd party components, will obviously have no matches the first time they are submitted by any customer. If we were able to find matches for all submissions, it would mean that all problems were known. And that would mean either that we were terribly lagging in delivering maintenance releases, or that our customers were terribly back level in applying the maintenance. So to improve fault match identification, we're not focused on achieving matches in a specific percentage of cases, but rather on identifying those additional circumstances (crash stack specifics) that allow us to positively match with additional known issues and extend our logic to cover those circumstances as well.
I hope you agree that with the new automated Fault Analysis, we have taken yet one more step to provide more efficient support to our customer base. Crashes, hangs and resource exhaustion should be rare events, but when they do occur, rapid problem identification is key to minimizing business impact.
Why do social tools matter in external customer communications? Because, as Andy McAfee says in this interview, “we can hear with much greater fidelity, the voice of the customer”. Right on.
As someone always interested in the customer view of quality, I have worked with surveys for years, discerning trends in quantitative survey results. We always knew whether we were getting ‘better’ or ‘worse’. Did that 3.8 score from last month grow to 3.9, or did it perhaps decline to 3.6? We knew exactly. But did we know what to do in order to improve the score? Not very clearly from simple, numeric scores.
That’s where social tools enable a much richer interaction offering technical details of customer configurations and usage patterns, preferences, new requirements, and reasons for them. One way to engage with us via social tools is to join the IBM Collaboration Solutions Community on Lotus Greenhouse. We have a range of Facebook pages, Twitter handles, blogs and other social tool engagement, but the Community is the most powerful social connection in my opinion, because it lets you interact directly with our engineers and with fellow administrators or users in other enterprises to discuss common interests, share documents and more. Social tools have a central place in engaging your customer base, because the fidelity of the customer’s voice is so much better in those tools than in old school blind surveys.
As a quality engineer, it's important to explore select customer success stories, so we can work to replicate the associated success factors across all deployments. In that vein, I'd like to offer a list of IBM Collaboration Solutions customer success stories for you to enjoy. I’m not including any analysis here; just sharing a collection of testimonials.
These testimonials demonstrate part of the value our software brings to our clients across our portfolio. Enjoy every one. This list actually comes from an internal blog post I wrote last year, so most of the linked videos are about a year old, but I like the collection covering most of our key on-premises products. I’ll naturally continue to share additional, more recent success stories along the way, as I already have with, for example, Signature Mortgage and Colleagues in Care.
Following up on my recent cloud difference series, I wanted to share a pointer to a good blog post by Dustin Amrhein: It’s a bottom up world. Your cloud service needs to be callable with easy to use, well documented APIs. You need to cater well to developers, who are key influencers and often decision makers, for prospective subscribers’ cloud adoption. Right on.
I've decided to wrap the series on cloud differences for now. My next set of thoughts was in the direction of the project management intricacies associated with transitioning an enterprise into a global cloud environment, but I can always return to that some time down the road. For now, I want to remember that my topic is software quality and collaboration, not only in the cloud, but on premises as well. We have exciting stuff going on around social software, mobile devices, unified communications, and exceptional web experiences as well. The cloud difference series was not an attempt to communicate deep technical substance; it was an outline of some of the many things we think through as we build a successful portfolio of cloud offerings. A skate across the surface, if you will. I created this wordle based on the blog content. In case you’re not familiar with wordles, know that relative font sizes represent relative frequencies of occurrence of each word in the source text.
I hope you'll agree we have a compelling set of offerings for cloud based collaboration. I know it's quite competitive. And, as this series has hopefully demonstrated to you, our team is committed to continuous improvement that will continue to position our services as market leading. Feedback on the series welcome :-) Now on to other topics next week.
Ok, I’m not trying to become a KLM evangelist, but you gotta admit these guys & gals are serious about driving value from social media. Love it! “Choose your seat based on fellow passengers’ Facebook profiles”. Every flight is a networking opportunity :-) If you could choose your seat neighbor for the next flight, who would you rub shoulders with? Provided they allow public access to their profile, of course….
When the media ask whether cloud computing is ready for prime time, the key topics are resilience and security. Back-up and restore capabilities play an important role for resilience. The ability to recover from adverse events, whether natural disasters, sabotage, disk failures, or other, needs to be broader and more granular than it typically is for on-premises customers. The reasons for this include the concentration of a high number of users being served from one data center and the multi-tenant nature of the system. This means we need the ability to restore not just the whole data center, but also individual companies, individual servers, or individual users, depending on what parts of the system were affected by the disaster. If a limited disaster has rendered an individual company or user 'corrupted', we don't want to have to do a system level restore affecting all users and/or all companies in order to recover those who were corrupted. Rather, we want to be able to perform a restore operation for only the affected companies, servers, or users. Per-tenant back-up and restore capability is similarly an important idea unique to multi-tenant cloud environments, though not generally implemented and automated.
It's in the nature of a multi-tenant system that the various system logs will contain information pertaining to users from different subscribing companies. This sets up a potential conflict: a subscriber may need a log capture, e.g. for a data subpoena, for troubleshooting, or otherwise, yet for confidentiality reasons we can't disclose information about other tenants. We need to be prepared to quickly scrub a log of information about other tenants' transactions and deliver a subset of the log entries showing just those entries that are relevant to the customer requesting the log. That subset of the log is what we generally refer to as a journal.
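The scrubbing step can be sketched minimally as below. The field layout (a comma-separated timestamp, tenant, message record) and the function name are assumptions for illustration; a real multi-tenant log would have a richer schema, but the principle of filtering down to a single tenant's entries is the same.

```python
import csv
import io

def journal_for_tenant(log_text, tenant_id):
    """Produce a single-tenant 'journal' from a shared log.
    Assumes each log line has the form: timestamp,tenant,message."""
    out = io.StringIO()
    writer = csv.writer(out)
    for row in csv.reader(io.StringIO(log_text)):
        # Keep only entries tagged with the requesting tenant.
        if len(row) >= 2 and row[1] == tenant_id:
            writer.writerow(row)
    return out.getvalue()
```

For example, given a log interleaving entries from tenants "acme" and "globex", `journal_for_tenant(log, "acme")` returns only the "acme" lines, so nothing about other tenants' transactions is disclosed.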
We also need to specify the retention time for logs. That's a decision on-premises customers make themselves, but for a Cloud system we need to set that value, and set it so everybody is satisfied, and compliance requirements are met. There is naturally only one log retention period, and it applies to all subscribers.
Finally, we carefully manage access to production system logs. In the interest of our subscribers, only production system administrators can access these logs. Developers needing to extract log information to troubleshoot on behalf of a customer – even though their work is customer requested - need to go through an exception process, demonstrate their legitimate need for the information, and have an extract provided to them. This is not vastly different from on-premises environments, where log access is also controlled. The key difference in the cloud is that the server logs contain information generated by multiple tenants, and we need a repeatable mechanism to filter logs and provide single-tenant extracts.