The Support Authority: Why testing in production is a common and costly technical malpractice

Do you know what affects the stability of your enterprise IT infrastructure? This article discusses a common characteristic that the IBM® WebSphere® Application Server SWAT team has observed while assisting clients with complex situations: either they do not have a separate test system, or the test system they do have is substantially different from their production system. If your environment shares this characteristic, you need to be aware of the destabilizing nature of this "malpractice," and you need a plan for addressing the situation to improve stability. This content is part of the IBM WebSphere Developer Technical Journal.

Dr. Mahesh Rathi, WebSphere Application Server SWAT Team, IBM

Dr. Mahesh Rathi has been involved with the WebSphere Application Server product since its inception. He led the security development team before joining the L2 Support team, and joined the SWAT team in 2005. He thoroughly enjoys working with demanding customers on hot issues, and thrives in high-pressure situations. He received his PhD in Computer Sciences from Purdue University and taught Software Engineering at Wichita State University before joining IBM.



02 February 2011


In each column, The Support Authority discusses resources, tools, and other elements of IBM® Technical Support that are available for WebSphere® products, plus techniques and new ideas that can further enhance your IBM support experience.

This just in...

As always, we begin with some new items of interest for the WebSphere community at large:

  • Are you ready for Impact 2011? Join us for Impact 2011, April 10-15, 2011 in Las Vegas, Nevada, at The Venetian and The Palazzo Hotels. Register before February 18 and receive an Early Bird discount. And check out these top 5 reasons to attend Impact (PDF, 115KB), the one conference where business and IT leaders can explore together how to achieve greater business agility.
  • Earlier this year, the IBM Support Portal was named one of the Top Ten Support Sites of 2010 by the Association of Support Professionals. Have you tried the IBM Support Portal yet? All IBM software products are now included, and all software product support pages have been replaced by IBM Support Portal. See the Support Authority's Introduction to the new IBM Support Portal for details.
  • Check out the IBM Conferences & Events page for a list of upcoming conferences.
  • Learn, share, and network at the IBM Electronic Support Community blog on developerWorks.
  • Check out the new Global WebSphere Community at websphereusergroup.org. Customize the content on your personalized GWC page and connect to other "WebSpherians" with the same interests.
  • Several exciting webcasts are planned in February at the WebSphere Technical Exchange. Check the site for details and become a fan on Facebook!
  • IBM Support Assistant 4.1.2 is now available. IBM Support Assistant 4.1.2 delivers several defect fixes and a new version of its quick data collection tool, ISA Lite. These new features are now available in ISA Lite:
    • Once your inventory is collected in ISA Lite, you can easily view the inventory in a browser.
    • ISA Lite uses Ant 1.8, leveraging the latest technology available.
    • You can more easily view the menu options because each menu option running in console mode has a number associated with it.
    • You can pause the processing of a response file in different scenarios:
      • Pause the response processing for a defined period (for example, to collect trace).
      • Pause the response processing until a console response is received (for example, to enable a problem to be recreated, which is a step in a lot of plug-in scripts).
    • ISA Lite now has a more extensible format, using name-value pairs for the response file, which enable:
      • Users to add comments inside the response file.
      • Response files to contain name-value pairs that cover all the relevant questions.
      • Users to easily edit response files by hand for customizing for different systems.
    • ISA Lite version information is automatically written to the console and log at startup; the same information can be found under the Help menu, and by passing -version to the start scripts.
    • You can run ISA Lite with the -help option to view information on how to use the tool.
    • You can select an alternate file transfer option when the selected file transfer fails.
    • A visual indicator shows if the collection has ended successfully or failed (the progress bar is green when collection is completed with no errors, or red when it fails).
    • No files are written to the ISA Lite installation directory if ISA Lite is run with the -useHome option.
    • Added support for Windows 7 and Linux RedHat 64-bit.
    • Added inventory collection support for Solaris.

Continue to monitor the various support-related Web sites, as well as this column, for news about other tools as we encounter them.

And now, on to our main topic...


New year, new approach

Since The Support Authority began, this column has presented several of the major initiatives and tools that have been (and continue to be) developed within the IBM WebSphere Support community. In the new year, we’ll be taking a slightly different approach. In addition to articles on utilities and resources that can help enhance your support and self-help experience, we will also present several articles rooted in our experience in customer-facing roles. We begin here with one of the most common “worst practices” that we encounter: the insufficient test environment.


Infrastructure malpractice

The IT industry places a good deal of focus on establishing design patterns and practices that help manage the complexity associated with enterprise-scale projects. IBM’s clients benefit by following such industry best practices. The emphasis on the positive, however, means that users are often not exposed to the details of what makes IT infrastructures unstable until they experience a business-level outage firsthand.

"Web site down" can be a nightmare for any online business. Besides any potential revenue loss, an outage can affect a vendor’s integrity and possibly open the door for competitors. Many technology companies rank stability as the most important factor when they choose products, even though many do not have a clear understanding of the actual cost of a product outage.

Worst practices, or in some cases "malpractices," if you will, are exactly those factors that make an IT infrastructure unstable and lead to business-level outages. No two business-level outages are ever identical, but they often share common factors. Here is a list of some of the most common and costly of those factors:

  • Test environment differs from production.
  • Communication breakdown.
  • Blind to application state.
  • Changes put directly into production.
  • Insufficient capacity or scalability plan.
  • Incomplete or insufficient product knowledge.
  • Incomplete or outdated architecture plan.
  • Vague production traffic profile.
  • Inadequate load or stress testing.
  • Incomplete record of changes.
  • No migration plan.

Organizations commit such malpractices for essentially the same reasons: the presumption that following the best practice is too costly or unnecessary, and that cutting corners will save time and money. In reality, an organization might realize short-term savings, but the inherent risks can lead to business-level outages and long-term losses that far exceed any short-term savings.

Most IBM WebSphere Application Server users have some type of test environment. However, many of these environments are not technically identical to the production system because of cost. We have also seen cases where the lack of a separate test environment caused a product outage. During our visits to client sites over the past several years, we have recorded occurrences of malpractices, such as those listed above, whenever they directly caused or exacerbated a business-level outage. Our analysis has shown that the most frequent factor contributing to a highly critical situation experienced by a WebSphere Application Server user was that either the test system was not technically identical to the production system, or a test system did not exist at all. This was the case in over 35% of all the situations we visited, and the trend is upward.

The remainder of this article looks at this particular malpractice, and why this situation can become a big deal.


When your test environment differs from production

This situation includes:

  • The test environment is too small, or it is overcommitted and not available when it is needed.
  • The test hardware, network, or software levels differ from the production environment.
  • The test z/OS® LPAR shares a machine or network with the production system and is not isolated from it.
  • The configuration settings for the test systems are different from settings for the production systems.
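
A difference in configuration settings between test and production can be detected mechanically once the settings of each environment are exported as name-value pairs. The following is a minimal sketch of such a comparison, not a WebSphere tool: the function name `diff_settings` and all setting names and values are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple

MISSING = "<missing>"  # placeholder for a setting that exists on one side only


def diff_settings(test: Dict[str, str], prod: Dict[str, str]) -> List[Tuple[str, str, str]]:
    """Return (name, test_value, prod_value) for every setting that differs.

    A setting present in only one environment is reported with MISSING
    on the other side. Results are sorted by setting name.
    """
    drift = []
    for name in sorted(set(test) | set(prod)):
        test_value = test.get(name, MISSING)
        prod_value = prod.get(name, MISSING)
        if test_value != prod_value:
            drift.append((name, test_value, prod_value))
    return drift


# Hypothetical exported settings; real names depend on your environment.
test_env = {"jvm.maxHeapMB": "512", "sdk.level": "6.0.9", "threadpool.max": "50"}
prod_env = {
    "jvm.maxHeapMB": "2048",
    "sdk.level": "6.0.9",
    "threadpool.max": "50",
    "session.timeoutMin": "30",
}

for name, test_value, prod_value in diff_settings(test_env, prod_env):
    print(f"{name}: test={test_value} prod={prod_value}")
```

With the sample data, the heap size and the session timeout are reported as drift, while the settings that match are not listed.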

Let’s look at a sample case study to better understand why this is a malpractice.

A major bank discovered that its WebSphere Application Server for z/OS environment experienced frequent product outages during tax season, which disrupted online tax business for its customers and, in turn, harmed the bank’s relationship with them. The IBM WebSphere SWAT Team investigated and found many factors that contributed to this situation, primarily configuration errors.

One mistake in particular was that the bank had created two application servers on a single installation of WebSphere Application Server base: one was the test server and the other was the production server. Because both application servers ran on the same base installation, they shared logs, their ports had to be configured to avoid conflicts, and, most importantly, because they ran on the same binary codebase, any upgrade to the Software Development Kit (SDK) disrupted both application servers. While frequent updates to the test system are necessary, the repeated disruption to the production system was intolerable.

Recognizing this problem, the SWAT Team urged the bank to move the test and production application servers to different logical partitions (LPARs) so that they would not affect each other. Any problem determination needed after the separation would also be considerably easier.

Some developers fail (or choose not) to create a separate test system for seemingly practical or economic reasons. It might be convenient to have the test and production systems together, or it might look wasteful to dedicate a system to testing. You must understand that such convenience comes at a price, and a production system outage can be very costly.


Possible solution and preventive care

Simply put, you should have a separate test system that is identical to your production system. You can conduct load and stress tests on the separate test system before moving the application to production, where you can then expect it to work without disruption. A separate test system has many advantages. It:

  • Prevents unintended production disruption from test activities.
  • Provides a platform for performing functionality and integrity tests before performing major upgrades.
  • Provides an environment for duplicating production problems and testing fixes.

Approaches for creating a test environment vary in cost, planning, and risk. Having the ability to test the application environment -- not just the application -- is a necessity. To truly benefit from a separate test environment, it is imperative that you:

  • Understand that a test system is a must, not a luxury.
  • Do not use a production system for test purposes.
  • Physically separate the test system from the production system and, ideally, have two different groups of operators manage these systems.
  • Clearly label the test and production systems with a different set of security identities.
  • Install identical software on the test and production systems. Similar hardware for both systems is preferable.
  • Upgrade WebSphere Application Server and the applications themselves on the test system first before upgrading the production system.

From a practical standpoint, the type of test environment you have will depend on the availability of resources. Starting with the most desirable, variations include:

  • Maintaining an exact duplicate of the production environment.
  • Maintaining a scaled-down environment with load generators that duplicate the expected load.
  • Maintaining a model or simulation of the environment.
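
For the scaled-down variation, the load generators need a target rate. One simple way to pick it, sketched below, is to scale the production peak by the ratio of test nodes to production nodes. This assumes that capacity grows linearly with the number of identically sized nodes, which is a simplification that real systems often violate. The function name `scaled_load` and the figures are hypothetical.

```python
import math


def scaled_load(prod_peak_rps: float, test_nodes: int, prod_nodes: int) -> int:
    """Requests per second to drive at a scaled-down test environment.

    Assumes (as a simplification) that capacity scales linearly with the
    number of identically sized nodes. The result is rounded up so the
    test never under-represents the proportional production load.
    """
    if test_nodes <= 0 or prod_nodes <= 0:
        raise ValueError("node counts must be positive")
    if prod_peak_rps < 0:
        raise ValueError("peak load cannot be negative")
    return math.ceil(prod_peak_rps * test_nodes / prod_nodes)


# Hypothetical figures: production peaks at 1200 requests/s on 8 nodes,
# and the test environment has 2 nodes of the same size.
print(scaled_load(1200, test_nodes=2, prod_nodes=8))
```

Shared resources such as databases, networks, and session replication rarely scale in proportion to node count, so a result from this kind of calculation is best treated as a starting point to be validated against production measurements.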

In all cases, you should implement and test all changes in your test environment before you consider adding them to your production environment.
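
The test-first rule can also be enforced by whatever process or tooling applies changes. The following minimal sketch shows the idea: a change can be promoted to production only after it has been recorded as tested. The class `ChangeRecord` and the change identifiers are hypothetical and are not part of any WebSphere tooling.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class ChangeRecord:
    """Hypothetical record of changes and where they have been applied."""

    passed_in_test: Set[str] = field(default_factory=set)
    applied_to_production: List[str] = field(default_factory=list)

    def mark_tested(self, change_id: str) -> None:
        """Record that a change was implemented and tested in the test environment."""
        self.passed_in_test.add(change_id)

    def promote(self, change_id: str) -> None:
        """Apply a change to production only if it was tested first."""
        if change_id not in self.passed_in_test:
            raise PermissionError(
                f"{change_id} has not been tested in the test environment"
            )
        self.applied_to_production.append(change_id)


record = ChangeRecord()
record.mark_tested("sdk-upgrade")
record.promote("sdk-upgrade")  # allowed: tested first

try:
    record.promote("untested-hotfix")  # rejected: never tested
except PermissionError as err:
    print(f"Blocked: {err}")
```

Because every promotion is appended to `applied_to_production`, the same structure doubles as an ordered record of what was changed in production.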


Conclusion

Excessive cost cutting can sometimes cost a business more than it saves. This article explained how having a separate test environment that is equivalent to the production environment is fundamental to any successful information technology project. In a rapidly moving world, it is difficult enough to follow a meticulous process and keep a record of changes, both large and small. Maintaining separate, purpose-built environments not only makes it easier to keep track of such changes, it also greatly reduces the risk of implementing and using them in production.
