Defining processes in your Notes/Domino environment

This article explains how you can develop good processes that help ensure reliability and availability in your Lotus Notes and Domino environment -- without turning you into a Borg!

Matt Broomhall, Lotus Workplace Technical Evangelist, IBM

Matt Broomhall started working at IBM back when Notes 3.0 first shipped. He has held positions in the Microelectronics Division and the IBM CIO office, where he was responsible for deploying software tools that allow people and teams to collaborate more effectively. Matt's team deployed Sametime Connect throughout IBM. Matt has been team lead for the Solution test team within Lotus Engineering Test and has recently joined the Lotus ISV Enablement team. Matt skis in the winter and jogs when there is no snow. Traveling to Disney World with his family and enjoying his home theater figure prominently among his other leisure activities.



04 October 2004

When people hear the word "process," more often than not what comes to mind is bureaucracy, red tape, delays, and (for Star Trek fans) visions of the Borg coming to take away your free will. Processes get in the way and slow things down, right? Well, not always. Processes can also be about survival and about avoiding anarchy and excessive cost. This is especially true with information technology.

We are not advocating process for its own sake, as the Borg would. (Put 20 humans in a room, and you will always get at least 20 different points of view. Put 20 Borg in a room, and you have a very dull party.) Instead, the goal of this article is to promote common sense processes that smooth the steps for application development, improve overall execution speed, and reduce IT costs. IT infrastructure is the backbone of your business -- indeed, a broken infrastructure can quickly result in a broken bottom line. Upgrading components of a company's business information systems can, therefore, be a slow, tedious, and nerve-wracking process for organizations that do not have fully formed management and controls surrounding their IT investments.

In this article, we discuss a number of processes, including application development and deployment, quality assurance, performance and sizing, and root cause analysis (RCA). The processes described aren't merely academic exercises; they're based on real-world processes used within IBM. So there is a clear and very successful precedent for the value of each process as part of a global information technology operation. This article assumes no previous experience with Lotus Notes and Domino or other IBM/Lotus products, although familiarity with software development and administration will help give context to our examples and suggestions.

Application development and deployment

Development and deployment are the necessary first steps for all applications that reside within your business information systems framework. How applications are developed and deployed will have a profound impact on later processes, such as upgrading and migration.

One of the greatest strengths of Lotus Notes/Domino has always been the ability to very quickly develop and deploy an application. However, this strength can quickly become a liability if your company does not have processes in place to help manage and control development, deployment, and maintenance of your applications. Without these processes, you can lose track of which applications have been deployed within your environment, what they're doing, who's responsible for them, what to do if they break, and so on. This can quickly lead to "application anarchy" -- one of the biggest nightmares an IT organization can face.

Standards

Most firms have, at minimum, some basic rules surrounding their IT infrastructure to account for growth and security. What is not always in place is a method for protecting the infrastructure from its own users. To do that, you must establish application development standards. These standards should include (but are not limited to):

  • Application look and feel
    Consistency reduces the user learning curve and cost to use. A consistent look and feel for all applications ensures that more consistent usage paradigms are maintained, lowering the cost to learn, use, and maintain an application within the company's IT framework.
  • Security
    This protects your data assets from being compromised. Keeping data security in mind while developing an application not only helps to protect data from prying eyes, but also serves to avoid data corruption that could result from applications that are poorly architected in regard to security. A properly architected application includes safeguards for both data protection and data integrity.
  • Agent control
    This ensures that applications share server resources equitably. Controlling agent activity helps to guarantee no single application can consume more than its fair share of server resources.
  • Replication control
    This balances data distribution needs with server and network resources. Limiting replication schedules helps to balance the needs of data distribution and server infrastructure resources. This ensures all applications on a shared server get a fair time slice for their replication requirements.
  • Performance
    This limits use of resource-hungry constructs, such as time-sensitive views (for example, views whose formulas depend on the current date or time).

Application development environment

Putting standards in place is only a start. Development teams also need an environment in which to develop and test their applications. To ensure a smooth transition to production, this environment should closely model the target production environment and be subject to the same operational rules -- with one key exception: It should grant development-level access to the resident applications for the application development teams. This not only helps to ensure that an application is successfully developed to fit within the target infrastructure, but also increases the chances the application will perform acceptably.

In addition, the application's "incubation period" can be used to alert the IT team about the application, when it's coming, and the purpose it's intended to serve. This helps keep the application on the planning "radar screen" for eventual production deployment. Thus when the IT team becomes involved, they already have some familiarity with what the application is designed to do. This in turn allows IT to be in a better position to assist the development team as the application undergoes quality assurance and integration verification. This is part of the deployment process and helps to validate the application for sizing and charge back. This also helps to identify the application for future regression testing, which naturally goes along with infrastructure upgrades.

Quality assurance and integration verification

In general, quality assurance ensures that the application is compliant with existing standards designed to protect the operation of the environment, and integration verification validates the application's ability to run within the confines of the standard shared environment (in other words, the new application plays nicely with existing ones). Having a production-like development environment helps to speed applications through this process on the way to production. Because the goal of application quality assurance and integration verification is to find the "gotchas" in an application before it gets rolled out to your company's production environment, finding problems may send an application back to development for further work. But even if this happens, in the long run, the data and consumers of the data are better off.


Separation of duties

While not specifically a process, separation of duties helps to guarantee that the team developing business applications is separate from the team operating the environment in which the applications reside. This helps to mitigate potential security and operational problems and to ensure that all application deployments and changes are thoroughly tested for operational soundness before the administrators of the environment assume responsibility for them and move the changes into production. These boundaries, therefore, also help to protect the data and the infrastructure from inadvertent harm.


Charge back

In a quid pro quo society, payment for services is standard practice. Applying this to the IT operation gives you a very important means to control application anarchy and helps ensure that capacity is always available. The simplest way to implement this is to charge the organization that owns the application for the use of the corporate environment, a process known as charge back.

Effective charge back processes take into account factors such as the number of databases required by the application, the on-disk size of the databases, usage volume, back-up and retention needs, and interconnections to other systems and data. This basically equates to renting capacity on your shared infrastructure, which in turn must be supported and maintained. A downstream by-product of this practice is that organizations carefully manage what they use and sunset tools they no longer need. This helps limit application anarchy and uncontrolled IT spending.

Charge back also helps to guarantee that proper funding is available to make sure resources are always in place to meet the needs of the "renters."
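
To make the factors listed above concrete, here is a minimal sketch of how a monthly charge might be computed from them. The rates, field names, and the formula itself are hypothetical placeholders chosen for illustration; they are not IBM's internal rates or method, and a real model would be tuned to your own cost structure.

```python
from dataclasses import dataclass

# All rates are hypothetical placeholders, not actual IBM figures.
RATE_PER_DATABASE = 5.00       # flat monthly fee per database
RATE_PER_GB = 2.00             # per GB of on-disk size
RATE_PER_1000_TXN = 0.50       # per 1,000 monthly transactions (usage volume)
RATE_PER_GB_BACKUP = 0.25      # per GB for each retained backup copy
RATE_PER_INTERCONNECT = 10.00  # per connection to another system or data source


@dataclass
class ApplicationUsage:
    """Usage figures for one application, reported by its owning organization."""
    owner: str
    databases: int
    disk_gb: float
    monthly_transactions: int
    backup_copies_retained: int
    interconnections: int


def monthly_charge(usage: ApplicationUsage) -> float:
    """Return the monthly charge for renting shared infrastructure capacity."""
    charge = (
        usage.databases * RATE_PER_DATABASE
        + usage.disk_gb * RATE_PER_GB
        + usage.monthly_transactions / 1000 * RATE_PER_1000_TXN
        + usage.disk_gb * usage.backup_copies_retained * RATE_PER_GB_BACKUP
        + usage.interconnections * RATE_PER_INTERCONNECT
    )
    return round(charge, 2)


if __name__ == "__main__":
    sales_tracker = ApplicationUsage(
        owner="Sales",
        databases=3,
        disk_gb=10.0,
        monthly_transactions=20000,
        backup_copies_retained=4,
        interconnections=1,
    )
    print(sales_tracker.owner, monthly_charge(sales_tracker))
```

Because every term scales with something the owning organization controls, retiring an unused database or trimming retention lowers the bill directly, which is the behavior charge back is meant to encourage.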


Regression testing

Humans do not acquire knowledge and technology by assimilating entire civilizations as the Borg do. Instead, much of what we discover and learn is through the time-honored trial and error process. This means we need to test the products we create -- not only when an application is first created, but also when it is modified and updated. When a new version of an application is released, it should be tested to ensure the new features and bug fixes that have been added have not broken any existing functionality. Similarly, when a new application, platform, or other software component is introduced into an environment, the existing applications within that environment should be tested to ensure that they continue to run normally. This process is known as regression testing or simply regressing, and it should be an essential part of all your application development and deployment efforts.

Regressing an application with a new release of the application server helps to ensure the availability of the application to the end users it serves. With Lotus Domino, for example, this testing may entail using both the Domino 7 NSF system and Domino 7 DB2 for mail, applications, and the extended products. This allows key applications to be thoroughly regressed prior to an upgrade to the new release.

Another related and very sound practice is for the IT organization to provide a suite of standard application templates to cover similar needs within a business. For example, if your company has standard templates for departmental activities, department and organizational budgeting, meeting management, and project management, the regression needs to be done only once for each application type as opposed to the possibly hundreds of homegrown varieties of each. Cutting down the number of regression tests also reduces the cycle time to upgrade and greatly reduces cost.
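
The saving from standard templates is easy to see if you group an application inventory by template. The sketch below assumes a hypothetical inventory of (application name, template name) pairs, with `None` marking a homegrown application; it is an illustration of the counting argument, not a Domino API.

```python
from collections import defaultdict


def regression_plan(inventory):
    """Group deployed applications by the standard template they are built on.

    Each key in the result represents one regression pass. Applications
    without a standard template (template is None) each need their own pass.
    """
    by_template = defaultdict(list)
    for app_name, template in inventory:
        key = template if template is not None else f"custom:{app_name}"
        by_template[key].append(app_name)
    return dict(by_template)


if __name__ == "__main__":
    # Hypothetical inventory: four applications, two standard templates.
    inventory = [
        ("Team A Budget", "budget"),
        ("Team B Budget", "budget"),
        ("Ops Meetings", "meeting"),
        ("Legacy Tracker", None),
    ]
    plan = regression_plan(inventory)
    print(f"{len(inventory)} applications, {len(plan)} regression passes")
```

In this made-up inventory, four applications need only three regression passes; with hundreds of applications sharing a handful of templates, the reduction is far larger.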


Clustering and load balancing

Many companies have infrastructure components that they consider business-critical. This means that if the environment were compromised, the productivity of the workforce would be compromised. For many of these companies, email tops the list as do many applications directly tied to revenue generation. Increasingly this includes collaboration tools such as instant messaging -- for example, Lotus Instant Messaging (Sametime) which is used throughout IBM. When tools become critical to the organization, their availability takes on greater urgency.

Clustering

Clustering servers to host these important applications is an excellent way to maximize availability in the event a server suddenly ceases to operate -- the unlikely but possible result of an application or mail server upgrade. When an application (including an email application) is clustered, there are two identical copies of the application and data available, provided the data is stored within the application (which is the case with many Domino applications and Domino mail). You can also cluster the data repository (if it is separate from the application) to ensure total availability if one component fails.

LDAP-based directory servers (a core IT service in an environment requiring more than one server to meet volume requirements) are good candidates for clustering. To nobody's surprise, IBM uses IBM Directory Server (now IBM Tivoli Directory Server) and has multiple instances clustered using the IBM WebSphere Edge server for load balancing and failover. (We talk more about load balancing later in this section.)

Domino mail can be effectively clustered. However, the model is slightly different because an end user must be tied directly to a specific mail server or cluster of mail servers. Availability can be maintained with a cluster pair using standard features offered by the Notes client and Domino server combination without the use of the WebSphere Edge server. This ensures continued end user access to email if a cluster partner fails.

For IBM's internal implementation of Lotus Instant Messaging, the model is more like the LDAP server core service in that each Community server is a duplicate of the others with no requirement for an end user to be tied to a specific server. Instead, the home server is the cluster name. In this case, the instant messaging environment consumes directory and authentication resources from the shared LDAP directory cluster described previously, and buddy and privacy lists are also stored on a separate relational database infrastructure (DB2 in IBM's case). This federation of services, when taken together with clustering options, helps to maintain availability of all services to the maximum extent possible. This can be critical when upgrading infrastructure with high availability requirements.

For more information about IBM's Lotus Instant Messaging (Sametime) deployment, see the articles, "The hitchhiker's guide to Sametime deployment at IBM" and "Life in the fast lane: IBM moves to Sametime 3."

Back-end relational storage can also be configured for enhanced availability. One option is to use HACMP (High Availability Cluster Multiprocessing) to cluster the DB2 instances. This means data storage is no longer a single point of failure.

In the case of Lotus Instant Messaging and Domino mail and application servers, a high availability environment also offers protection against a failed upgrade. You can introduce a new software release and still offer high availability by upgrading a single cluster mate and leaving the stable release in place on the others while the new release is being validated.

Load balancing

Designing an infrastructure for resiliency is one key to helping you ensure greater availability. One way to do this is through load balancing. Load balancing across clustered systems can provide maximum availability of each key service. One approach (and there are many) is to start by segmenting the IT infrastructure by the service provided. This separates the services and targets them for separate and specific infrastructure resources. This limits the impact of one service on another, and also offers a way to cluster similar services for availability and ultimately resiliency. The implementation of services differs, so the clustering approach will likely also differ.

Load balancer technology, such as the WebSphere Edge server, can also be a single point of failure if the solution is not architected well. WebSphere Edge server can be clustered so that a second server can essentially act as a "hot" spare if the primary fails. This means that the user load can continue to be delivered to the target infrastructure.
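
The basic idea behind load balancing with failover can be sketched in a few lines. The example below is a simple round-robin selector that skips servers marked as down; the class, method names, and server names are all hypothetical, and this is not how WebSphere Edge server or Domino clustering is implemented. It only illustrates why users keep being served when one cluster member fails.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hand out cluster members in turn, skipping any marked as down."""

    def __init__(self, servers):
        self._servers = list(servers)
        self._healthy = set(self._servers)
        self._ring = cycle(self._servers)

    def mark_down(self, server):
        """Record that a server has failed (or is being upgraded)."""
        self._healthy.discard(server)

    def mark_up(self, server):
        """Return a known server to service."""
        if server in self._servers:
            self._healthy.add(server)

    def next_server(self):
        """Return the next healthy server, or raise if none remain."""
        if not self._healthy:
            raise RuntimeError("no healthy servers available")
        # At most one full trip around the ring is needed to find one.
        for _ in range(len(self._servers)):
            candidate = next(self._ring)
            if candidate in self._healthy:
                return candidate
        raise RuntimeError("no healthy servers available")


if __name__ == "__main__":
    # Hypothetical server names.
    balancer = RoundRobinBalancer(["ldap1", "ldap2", "ldap3"])
    balancer.mark_down("ldap2")  # for example, while ldap2 is being upgraded
    for _ in range(4):
        print(balancer.next_server())
```

Note that the balancer object itself is the single point of failure in this sketch, which is exactly why the load balancer needs its own "hot" spare.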


Upgrade protection

There are some additional precautions to be taken during an upgrade or migration. For example, introducing a new Domino version into the mix means new versions of key files, such as the Domino Directory and templates, as well as updated versions of the on-disk structure (ODS).

Regression testing offers a way to validate compatibility with earlier versions. However, in a full production IT environment, additional precautions are prudent. Preventing replication or movement of the new files and templates from the newly upgraded server outward to other systems in the infrastructure avoids impacting system stability until the coexistence of the new files with older servers can be validated. Instituting structured practices in this regard will pay dividends to organizations in the form of reduced downtime should incompatibilities occur, especially in a customized environment.


Performance and sizing

It can easily be argued that the Borg care little for end user satisfaction. However, we humans are a bit more demanding in that department. The success of any application depends partly on how well the application performs from the end user perspective. This not only includes end user response time, but also the number of users the application can support compared to the projected end user volume.

Accurate sizing is essential to make sure the application can perform as designed and that sufficient infrastructure is in place to support the application under full load once it's moved into production. Supporting standards for performance will prevent the application from entering production until the design meets or exceeds reasonable performance metrics. This will not only result in happier end users, but also help reduce the possibility of a post-deployment emergency related to performance.
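
As a rough illustration of the sizing arithmetic, the sketch below estimates how many servers an application needs from its projected user volume. The function, its parameters, and the default headroom are assumptions made for this example; real per-server capacity figures should come from your own performance testing in a production-like environment.

```python
import math


def servers_required(projected_users, concurrent_fraction, users_per_server,
                     headroom=0.25, cluster_spares=1):
    """Estimate the number of servers needed to carry the projected load.

    projected_users     -- total registered users expected in production
    concurrent_fraction -- share of users active at peak (0.0 to 1.0)
    users_per_server    -- measured concurrent users one server can support
    headroom            -- capacity held in reserve on each server (0.0 to 1.0)
    cluster_spares      -- extra servers so that a failure does not cut capacity
    """
    peak_users = projected_users * concurrent_fraction
    usable_capacity = users_per_server * (1 - headroom)
    return math.ceil(peak_users / usable_capacity) + cluster_spares


if __name__ == "__main__":
    # Hypothetical figures: 10,000 users, half active at peak,
    # 1,000 concurrent users per server with 25% held in reserve.
    print(servers_required(10000, 0.5, 1000))
```

Comparing this estimate against the infrastructure actually in place gives a simple gate: if the design cannot meet the projected volume, it is not ready for production.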


Root cause analysis (RCA)

Let's face it -- bad things happen to good software. The only way to eliminate all possible problems in an IT infrastructure is to never change anything and to never let anybody use it once it is working. (Think of the old adage, "The only people who do nothing wrong are people who do nothing.") Because this is unrealistic in practice, IT teams must be prepared for the worst.

When problems do happen, the tendency is to simply fix them and move on. However, in some cases, this may be a matter of treating the symptom, but not the disease. What is often missing is a process to figure out why the problem happened to begin with, so measures can be taken to avoid future occurrences. Discovering the root cause of a problem is a proactive measure to protect future availability. The process to do this is called (naturally enough) a root cause analysis, often referred to as simply an RCA.

An RCA is a structured process. It requires an investment in both technical skills and time to uncover all possible events that could have triggered a failure. This may involve everything from analyzing logs from network routers all the way up to the logs of the application server infrastructure, then developing and applying preventative measures based on this data.

An RCA can include (but is not limited to) the following steps:

  1. Establish criteria to launch an RCA.
  2. Review the reported incident against the RCA criteria.
  3. Assign an RCA coordinator. This may require a company-defined role and training.
  4. Document and describe the triggering incident.
  5. Define the RCA effort for the incident. This includes identifying participants who can fix the problem and prevent future occurrences.
  6. Define the incident priority based on impact to the business and users.
  7. Schedule the RCA meeting and activities.
  8. Research the incident to determine probable cause(s) prior to the RCA meeting. This research involves analyzing all appropriate system logs, change requests, problem reports and complaints, test plans, and defect reports for impacted products.
  9. Model the systems involved and/or impacted.
  10. Convene the RCA meeting.
  11. Brainstorm causes and solution(s).
  12. Draft a report describing the problem in detail and recommending corrective action.
  13. Report findings and recommendations to the appropriate individuals.
  14. Track action items, communications, and status throughout the entire RCA process until the issue is resolved.
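
Steps 1, 2, and 6 above lend themselves to being written down as explicit rules. Here is a minimal sketch, assuming hypothetical incident fields and thresholds; your own launch criteria and priority levels will depend on what your business considers critical.

```python
from dataclasses import dataclass

# Hypothetical thresholds; set these to match your own RCA criteria.
MIN_USERS_AFFECTED = 100
MIN_OUTAGE_MINUTES = 30


@dataclass
class Incident:
    description: str
    users_affected: int
    outage_minutes: int
    business_critical: bool   # affects a service the business depends on
    repeat_occurrence: bool   # the same problem has been reported before


def warrants_rca(incident: Incident) -> bool:
    """Step 2: review the reported incident against the RCA criteria."""
    widespread = (incident.users_affected >= MIN_USERS_AFFECTED
                  and incident.outage_minutes >= MIN_OUTAGE_MINUTES)
    return incident.business_critical or incident.repeat_occurrence or widespread


def priority(incident: Incident) -> str:
    """Step 6: rank the incident by impact to the business and users."""
    if incident.business_critical and incident.users_affected >= MIN_USERS_AFFECTED:
        return "high"
    if incident.business_critical or incident.repeat_occurrence:
        return "medium"
    return "low"


if __name__ == "__main__":
    outage = Incident(
        description="Mail server unavailable after upgrade",
        users_affected=2500,
        outage_minutes=45,
        business_critical=True,
        repeat_occurrence=False,
    )
    if warrants_rca(outage):
        print("Launch RCA, priority:", priority(outage))
```

Writing the criteria down this way keeps the decision to launch an RCA consistent from one incident to the next, rather than depending on who happens to be on call.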

For more about root cause analysis, see the articles, "Applying the Fishbone diagram and Pareto principle to Domino, Part 1" and "Applying the Fishbone diagram and Pareto principle to Domino, Part 2."


Conclusion

While the Borg rarely exhibited anything approaching common sense, they did have a process for pretty much everything. For us mere members of the human race, the logical approach to process adoption can help make the job of running a complex IT environment easier. These processes, therefore, can also help manage costs, reduce downtime following failures, and improve end user satisfaction. Processes can also help you reduce the difficulty and cycle time for an upgrade, while systems configured for high availability can offer continued availability during upgrades to new releases. And just to drive that point home, remember that the Borg lost the final showdown -- which just goes to show that you can go overboard!
