Tactics and tradeoffs in a large shared topology

Facts and advice for infrastructure architects and administrators

The challenges of managing a large scale topology are best addressed through planning, proactive actions, and strategic decisions, as opposed to attempting to deploy and manage in a reactive manner. This article will help you identify some of the tactics, tradeoffs, and critical issues that stand between your infrastructure and large scale success. This content is part of the IBM WebSphere Developer Technical Journal.

Alexandre Polozoff (polozoff@us.ibm.com), Senior Certified IT Specialist, IBM

Alexandre Polozoff is a Software Services for WebSphere consultant engaged in the development of performance practices and techniques for high-volume and large-scale installations. His expertise includes third-party tool evaluations and best practices for performing post-mortem analysis. Alexandre is also involved in open technology standards, such as SNMP, TMN, and CMIP. You can reach Alexandre at polozoff@us.ibm.com.



Peter Van Sickel (pvs@us.ibm.com), Consulting I/T Specialist, IBM

Peter Van Sickel is a Consulting I/T Specialist with IBM Software Services for WebSphere. Mr. Van Sickel has over 15 years of experience with distributed systems software development. He got his start in distributed system software with DCE and later the Encina OLTP monitor from Transarc Corporation, which was acquired by IBM. For the past 10 years he has focused on Java EE and WebSphere Application Server based systems, including WebSphere Process Server. Mr. Van Sickel holds a Masters degree in Industrial Administration from Carnegie Mellon University, a Masters degree in Computer Engineering from Stanford University, and a B.S. in Electrical Engineering from Pennsylvania State University. Mr. Van Sickel is one of the authors of WebSphere Application Server Step by Step (http://www.amazon.com/WebSphere-Application-Server-Step-Step/dp/1583470611).



Martin Lansche, Consulting IT Specialist, IBM

Martin Lansche is a Consulting IT Specialist with IBM Software Services for WebSphere. Mr. Lansche worked in development for 14 years in such disparate areas as VM System Programming and C/C++ compiler tools. Martin has worked for the last 8 years in ISSW, initially performing C++ services and more recently focusing on custom WebSphere security solutions, including integration with Active Directory environments using Kerberos and SPNEGO. His other areas of specialization include WebSphere performance tuning, troubleshooting, and general WebSphere systems administration. He holds a B.Sc. in Mathematics and Computer Science from York University.



Keys Botzum, Senior Technical Staff Member, IBM

Keys Botzum is a Senior Technical Staff Member with IBM Software Services for WebSphere. Mr. Botzum has over 10 years of experience in large scale distributed system design and additionally specializes in security. Mr. Botzum has worked with a variety of distributed technologies, including Sun RPC, DCE, CORBA, AFS, and DFS. Recently, he has been focusing on J2EE and related technologies. He holds a Masters degree in Computer Science from Stanford University and a B.S. in Applied Mathematics/Computer Science from Carnegie Mellon University. Mr. Botzum has published numerous papers on WebSphere and WebSphere security. Additional articles and presentations by Keys Botzum can be found at http://www.keysbotzum.com, as well as on IBM developerWorks WebSphere. He is also co-author of IBM WebSphere: Deployment and Advanced Configuration.



Tom Alcott, Senior Technical Staff Member, IBM

Tom Alcott is Senior Technical Staff Member (STSM) for IBM in the United States. He has been a member of the Worldwide WebSphere Technical Sales Support team since its inception in 1998. In this role, he spends most of his time trying to stay one page ahead of customers in the manual. Before he started working with WebSphere, he was a systems engineer for IBM's Transarc Lab supporting TXSeries. His background includes over 20 years of application design and development on both mainframe-based and distributed systems. He has written and presented extensively on a number of WebSphere run time issues.



07 November 2007

Introduction

Managing business critical systems of any size requires dedication and formal processes. Simply put, it's no small task. Managing large-scale business critical systems of hundreds or thousands of applications is even more difficult. Managing and successfully operating an environment at that scale requires tremendous skill and discipline, plus significant preparatory work and solid decision making on many disparate topics. Large topologies are typically unique in that the scale of problems can be magnified due to a ripple effect through the various tiers and layers. Ad hoc approaches inevitably fail and result in business discontinuity (such as system outages).

To help you prepare for, build, and maintain a large scale topology, this article identifies and discusses the most critical issues that must be addressed when creating and managing large scale shared environments. Our experience shows that the most successful organizations proactively address these issues, and organizations that ignore them often have serious problems. This article, therefore, is based on our experiences with the latter: real problems that occurred as a result of failure to address these issues.

IBM® WebSphere® Extended Deployment can help relieve application placement issues, once the relevant split points have been determined. See Resources for more information.

Although it is not possible to address every aspect and scenario, the information and recommendations presented here will hopefully help you formulate the necessary decision and split points for the environment you might be planning.


Sharing

One of the objectives of a large topology is to share as many resources as possible. Shared resources might include Web servers, application servers, operating systems, physical servers, networks, and so on. Sharing resources has the potential to reduce overall operating costs, and in some ways simplifies administrative tasks. However, sharing also has disadvantages with respect to non-functional requirements that might be vastly different between the applications sharing the resources. Higher volume, business critical applications might have Service Level Agreements (SLAs) that demand dedicated hardware and software environments. Lower volume, less critical applications are typically better candidates for sharing resources. There is no clear-cut way to determine whether an application is a candidate for a shared resource environment until it is decided whether the application is critical enough to require dedicated hardware. This is a business and operational decision specific to each enterprise.

Organizations can also decide to share environments within discrete and individual business units. There can be political problems if multiple business units with different priorities and objectives try to occupy the same environment, but at some point, someone with authority must make the (popular or unpopular) operational decisions that will affect all applications in a particular shared environment. Therefore, it is not unusual to see environments that are split along organizational boundaries, and with different configurations than their siblings.

Other barriers to sharing include:

  • The difficulty of testing all applications in a shared environment when a particular maintenance release (such as a fixpack, OS level patch, shared library upgrade, and so on) has to be deployed. Most enterprises do not have unlimited testing resources, so applications that are sensitive to underlying updates tend to be poor candidates to share environments. (If the enterprise is flexible and able to quickly remedy application problems related to maintenance level changes, then sharing is more plausible.)

  • The existence of applications that have remained on older versions of WebSphere Application Server and need to be migrated to a more current version. Eventually, the level of WebSphere Application Server (or some other infrastructure component) in use becomes an out-of-support product. Migration from a very back level version of WebSphere Application Server can take a great deal of time. While the migration is underway, these applications cannot share infrastructure with more current applications. As applications get "stuck" on various older levels of infrastructure components, less sharing exists, and unique back level environments remain in use in the enterprise.


Consistency

Consistency in large topologies makes things simpler; once changes are introduced, the environment becomes more complex. The reality is that no large topology can remain homogeneous across the entire enterprise: there will be differences between these environments. For the management of these environments to be successful, the enterprise has to understand the responsibilities of the teams involved and the rigor necessary to execute these processes.

The other challenge large topologies must contend with is that there are easily thousands of variables that can be changed in the environment. From the network layer on through each tier of the environment, there are a multitude of configuration items that can impact the environment.

Change control

No environment, large or small, can survive without strictly adhering to change control procedures. The larger the environment, the more impact change control has on the success or failure of that environment. When an outage occurs in production (or in any other environment), the administrators can efficiently extract information from the change control tooling about what changes occurred in the environment. Organizations that do not have vigorous change control end up suffering prolonged outages as the administrators try to determine what variables changed and when.

Change control must be conducted using tooling specific to the change control process. This specifically rules out the sole use of spreadsheets and text documents, except where used to provide additional information to the base changes. Changes should start with a work order that is submitted and approved by some authority. Each change is tracked through the work order identifying who made the change and when.

Configuration management

In many cases, administrators might not touch a particular environment for several weeks at a time, so it is not reasonable to expect them to remember every nuance of every individual environment. Some form of record keeping about each environment is necessary, and this is where configuration management tools come into play. Typically integrated with the change management tools, the configuration management repositories hold the key information administrators need about their environments' configurations (see Resources).


Staffing

In any organization or enterprise with a significant IT infrastructure, there are a number of required roles to be staffed. In large scale environments, it becomes even more critical that the right people are in these jobs. Some roles require a dedicated person; in other cases, one person can fulfill more than one role. It is usually a strategic mistake to leave roles unfilled or to have one person filling too many roles. A list of these critical roles is below. Of course, this will not be a comprehensive list for every organization, since most will have additional roles that are also important, such as database administrators, network administrators, and so on. The roles listed here are those that are universally critical to the success of a large scale topology -- and most often overlooked.

Most of the individual roles will typically be assigned to a single person, although it is possible in larger environments that teams might be assigned to support these responsibilities. Other roles are identified as teams below to emphasize that more than one person typically performs the tasks associated with those roles, even in smaller environments.

  • Enterprise Architect

    The Enterprise Architect has an understanding of both application development and system infrastructure, and can ensure that the decisions made at the intersection of these disciplines are effective, well thought out, and properly implemented. The Enterprise Architect:

    • Should have both an infrastructure and application background, with an understanding of the implications of topology decisions (for example, the pitfalls of sharing a deployment manager between test and production environments, one cell vs. many cells, and so on).
    • Should understand the implications of security, but is not necessarily a security architect.
    • Can perform analysis of application architectural decisions and changes (for example, the impact that switching to a distributed cache has on the infrastructure).
    • Can make application framework decisions (for example, buy/implement, what functions, and so on).
    • Can create application coding standards documentation, including coding best practices for performance.
    • Can conduct reviews and audits of application architecture, design, documentation, and so on.
    • Can perform infrastructure reviews and audits.

    In addition to these tasks, which one might expect from an Enterprise Architect, one key but often overlooked activity is the need to review change control requests. Any change that is not routine should be reviewed by an Enterprise Architect. Architectural changes with often undesirable implications have a tendency to be made "when no one is looking." It is critical that changes be reviewed to prevent this.

  • Infrastructure Architect

    This person is a technical leader on the service provider side of the IT organization, and is capable of building an infrastructure based on requirements and constraints provided by the business. The Infrastructure Architect works under the overall leadership and guidance of the Enterprise Architect to ensure that the infrastructure meets a given application's requirements, and also needs to work closely with the Security Architect to ensure that the operational environment is suitably hardened from a security perspective.

  • Application Architect

    This role focuses on the application (or suite of applications) and requires extensive development experience. The Application Architect understands the trade-offs of many design and implementation decisions, defines coding standards, decides which open source projects will be used, and makes many other strategic development decisions.

  • Security Architect

    This role needs to be filled by someone with a deep security background who understands security best practices in development and can perform code audits from a security perspective, as well as understands security infrastructure best practices and how to mitigate threats. This person is responsible for creating overall security policy and standards, and works closely with the Enterprise Architect to ensure that policy is enforced.

  • Build Team

    Software development ends in a build process: a disciplined and automated process that builds the application from source code. A typical build process extracts the source code from the software repository, builds the application using automated tools (such as ANT scripts), and then tests the built application to ensure proper operation of basic functionality before the build is promoted to the formal, non-development testing environments. The deployment scripts used to deploy the application are also tested as part of the build testing process. Large organizations with many applications find it useful to centralize the build process using a team that becomes expert in builds. By centralizing the process, uniform processes and procedures are developed and reused across all applications. This not only improves build consistency, but enforcing commonality also tends to reduce deployment issues.

  • Performance Lead

    Performance testing is a demanding discipline with its own unique set of practices. The leader of the performance team:

    • Creates performance test standards.
    • Participates in design and code reviews with a performance perspective.
    • Leads the performance testing process, working with the Test Manager.
    • Evaluates and recommends tools for performance testing and monitoring.
  • Test Manager or Test Coordinator

    This person manages the overall testing effort, which includes defining and working with the infrastructure team to build the appropriate set of test environments (of the appropriate scale) to support all testing functions, from development through staging for production. This role also involves:

    • Developing test plans and use cases.
    • Working with test developers to implement the plan and execute the use cases.
    • Recording historical data, test results, application monitoring tool data, and so on.
    • Planning test environment usage, working with the Enterprise Architect.
  • WebSphere Administrator

    This role is frequently understaffed, particularly once more test environments are put in place and formal processes are introduced around the development and code promotion process. WebSphere Administrators must:

    • Perform administrative tasks per Enterprise Architect or Infrastructure Architect direction.
    • Although the WebSphere Administrator does not make infrastructure decisions, he or she works with and contributes to the efforts of the Enterprise and Infrastructure Architects.
    • Be capable of developing deployment scripts to minimize the use of the WebSphere administrative console in production, making the actual deployment faster, more accurate, and repeatable.
    • Have basic host operating system (Linux®, AIX®, Solaris™, and so on) administration skills.
  • Troubleshooter

    Although not typically an officially designated role, there will be a need for a senior technician with the advanced skills needed to lead any problem solving efforts that will be necessary when issues are uncovered during testing and (hopefully, rarely) in production. This person must:

    • Have a background in both application and infrastructure.
    • Be able to use application monitoring tools (such as IBM Tivoli® Composite Application Manager) for diagnosis and root cause analysis. (Organizations that do not use application monitoring tools tend to resort to OS level commands, such as kill -3, which have a severe negative impact on the end user experience. In addition, application monitoring tools tend to provide finer grained data than OS level tools, such as CPU utilization by process, and can record historical data for use in a post mortem analysis. While not in the purview of the Troubleshooter, the data from application monitoring tools can also be used for capacity planning.)
    • Be able to work with the Test Coordinator to collect vital metrics for analysis.
    • Be able to work with the Enterprise Architect on problems that occur in the environment, and provide the necessary feedback to spur changes in either the code or infrastructure.
    • Be able to mentor and train other technical staff on troubleshooting skills. Since troubleshooting is more art than science, the best way to become good at it is to learn from someone through experience. Thus, mentoring is a critical part of the Troubleshooter's role.
  • Test Team

    This group must be able to develop automated test scripts for load testing and for functional testing (these might be different scripts), execute the tests, and take direction from the Test Coordinator. All of the test scripts are driven by business use cases for the application. Automation of both functional and load test scripts is specified here because the cost of manually testing a multitude of applications with every change quickly becomes unaffordable. Automation is the only viable approach for frequent application testing.


Terminology

The first problem you might encounter when attempting to develop an infrastructure environment for a large topology involves basic terminology. Every organization that is embarking on a large topology deployment should take sufficient time to define the components, pieces, and parts that will be deployed to make sure that all involved have the same overall understanding. To that end, the next few paragraphs look at some common terms that will be used throughout the remainder of this article. Since every enterprise is different, each might have its own definitions of these common terms. Therefore, these terms are introduced here as questions to trigger the discussion that will lead your organization to define these terms for itself. Regardless of the specifics, it is critical that agreement and a common understanding of these terms be reached with everyone involved.

  • What is an application?

    While this might be a seemingly innocuous question, it has severe implications for the final deployment. To the Enterprise Architect, an application could be something defined at a high level, such as finance, research, online processing, human resources, and so on. However, within each of these high level descriptions are applications (or a suite of applications) that perform discrete functions which together actually make up the "application." It is important, therefore, to be clear on what your organization defines and refers to as an application.

  • What is a service? Is a service an application?

    A service is a component of Service Oriented Architecture (SOA). Services provide interfaces that applications can reuse in order to execute some function remotely. For your environment, you need to know:

    • Are services considered applications or are they part of an application?
    • Is a new version of a service a new application?
  • What are modules and libraries?

    To muddy the waters, a discrete application (in this context, meaning a specific application EAR file that is deployed to the application server) is composed of various application "modules" and packaged "libraries." The modules might be written by in-house developers, or they might be third party or open source software libraries. These components are a significant factor in packaging and deployment.

  • What are versions?

    Software has different versions. The operating system on different machines in an environment can be at different versions. WebSphere Application Server has different versions. Components of an application, including modules and libraries, can be at different versions. You must carefully document version dependencies between supporting software and components (for example, application X requires WebSphere Application Server V6.1 Fixpack 5 running on AIX v5.3 release maintenance 6, and so on). Does a new version of an application mean that it is a new application? Will multiple versions of the same application run concurrently, or does a new version supersede an older version?

  • What is a node and a server?

    To some, a "node" is a physical server. To others familiar with LPAR technologies, a "node" is an instance of an OS, and several LPARs can reside on a single physical frame. While not usually a major stumbling block, it is better to have hardware definitions clearly delineated so when someone says, "We're going to need 100 servers," everyone will know that that means four frames with 25 LPARs on each physical frame, and avoid the accidental purchase of 100 physical frames. You can laugh, but this has happened.

    Similarly, it is good to get into the habit of using complete, explicit terms when discussing "application servers" or "physical servers." In lay terms, a "server" is a physical piece of hardware, but to a WebSphere Application Server administrator, it represents a JVM. Simple, explicit clarity can eliminate much potential confusion.

  • What is considered sensitive data?

    Some organizations have regulatory definitions that explicitly define items considered "sensitive data," such as social security number. Other items, such as personal information that is easily determined through public records (like mother's maiden name, address, phone number, and so on) are not so easily classified. Classifying data elements is an important task in determining whether an application is accessing data that your organization (or industry) considers sensitive.


Architectural decisions

Every decision made by an organization (related to application architecture, infrastructure architecture, administration, packaging, deployment, and so on) must be clearly documented, and those documents should be archived and be readily available to all team members responsible for any part of the operation's success. There is plenty of technology available today, from document management systems to wikis, that make this relatively straightforward and easy to achieve. The primary reason for documenting the system architecture, and the decisions that were made in arriving at that architecture, is to provide history and context for the ongoing maintenance and evolution of the environment. The biggest cost of any infrastructure is ongoing maintenance. Since people naturally move on to new projects and positions, it is expected that the people who maintain the environment will be different from its originators. Those responsible for ongoing maintenance will have an easier job if there is a good record of how the architecture is designed and why a particular environment is as it is. Follow-on enhancements to the environment are more likely to go smoothly if those performing the enhancements understand the environment.

Any architectural decision document should include at least the following information:

  • Problem statement

    Every problem statement should clearly define the problem or issue facing the enterprise. A good problem statement leaves room for multiple alternative solutions. The problem statement should include a question that captures the essence of the problem, for example: "Should we use SSL behind the Web servers?"

  • Assumptions and implications

    Any problem statement has certain assumptions and implications, and these should be stated so that anyone reading the decision document has a clear understanding of the environment at the time of the decision. Assumptions and implications can change over the years as enterprises mature and evolve.

  • Business requirements

    Business requirements related to the problem, if any, should clearly delineate topics such as Service Level Agreements and non-functional requirements. The business must be able to define these requirements, otherwise the decisions made will be flawed. In the previous example, the decision to use SSL is likely driven by business security requirements, and thus they should be referenced here.

  • Possible solutions

    Every problem will have multiple possible solutions, and each solution should be documented, at least in summary. This would include changes to the application code, operational changes, hardware alternatives, infrastructural architectural alternatives, and so on. If all possible solutions are not investigated, you can easily shortchange yourself and possibly use a more expensive solution than was required. The flip side, of course, is that you must avoid "analysis paralysis" and actually produce decisions. Therefore, while all possible solutions should be given serious consideration, it is not necessary to examine each one at the lowest level of detail -- this is, after all, an architectural document.

    Finally, and most importantly, documenting all considered possible solutions will result in a history for future readers to understand what was considered (and, by absence, what was not). When business requirements or technology changes, this history will help position the original decision and help guide new decisions.

  • Chosen solution

    From all the possible solutions, a decision is ultimately made on the one that will be implemented. This section should explain how a solution does or does not meet the business requirements.

  • Justification

    The chosen solution must be justified by showing how the solution best meets the needs of the business. This section must clearly explain how the chosen solution meets the business requirements, and why it is the best solution.

  • Review period

    Each solution should include a mandatory review period. Business requirements can change, as can application functionality, architecture, non-functional requirements, Service Level Agreements, and so on. Scheduling periodic reviews throughout the application's lifecycle is important. Review periods can be either time-based, or triggered by infrastructure changes or milestones. For example, an obvious time to review a Java™ EE architectural decision is when the application server is upgraded to a newer major release that might include new functions that could change the basis for previous decisions. At the very least, changes can be identified as they occur, and new architectural documents can be produced. (Changes to architectural documents imply change control of the documentation.)


Security considerations

Security must be a primary criterion in every architectural decision. Failure to consider security can lead to ill-considered decisions from which there is no turning back; retrofitting security after the fact can vary from difficult to nearly impossible. Unfortunately, security is one topic that is often dismissed or dealt with as an afterthought. Application isolation is an aspect of security that relates to a shared infrastructure.

Trust is an important factor in any relationship, be it human or technical. If an application is to cohabitate in the same security realm or on the same hardware as another application, then there has to be some level of trust between the two. If at any point trust cannot be established, then the appropriate separation should be applied.

Application code can come from a variety of sources, whether it be developed in-house, from a third party vendor, and so on. Regardless, for two applications to reside in the same trust domain, they must be trusted. WebSphere Application Server does not provide for secure application isolation within a cell, so if two applications are to share the same cell, they must completely trust each other. While there are expensive and difficult steps that can be taken administratively to partially isolate applications within a cell, they are not necessarily complete. (See Resources.)

In many cases, an application's trustworthiness can only be determined through an exhaustive code review; this means an exhaustive code review of each version of the application deployed into the environment. Since studies have shown that code reviews reduce costs by improving quality, this should not be considered a burden. If it is, however, then either the applications must be placed in separate cells or application isolation must be discarded as a requirement.

You might argue that Java 2 security is available to protect applications from each other, but this is not the case. First, there is considerable additional administrative expense associated with deploying applications to environments where Java 2 security is enabled. Configuring policy files for an application to run with Java 2 security requires detailed knowledge of the application and considerable knowledge of the runtime environment, which is not available in many organizations. Any policy file definition mistakes or short cuts leave open vulnerabilities that cannot be easily identified. Managing these files will become a significant burden.

A sensible compromise might be to share cells for most applications, but place the most sensitive applications in their own high security cells to which you have applied additional "expensive" requirements, such as mandatory code inspections and the use of Java 2 security.


Performance considerations

Performance and the end user experience are ultimately among the major driving factors for any enterprise. If the end user experience is at all negative in terms of response time and accessibility, it could mean losing a sale to a competitor's better performing Web site, or losing money if users turn to the more expensive-to-run call center instead of the online application.

High volume applications are also unique in that they typically consume large amounts of CPU or memory resources. Scenarios with high volume applications are generally better suited for segregation from the rest of the environment, as they simply do not coexist well with other applications because of their resource consumption requirements. Typically, high volume applications provide a core business function. It is imperative, therefore, that they use whatever resources are available and so should be segregated.

Low volume applications that might be used less frequently or on a scheduled basis may or may not be good candidates for segregation. It really depends on the application's value to the business and resource requirements. For example, reporting applications tend to be large consumers of resources. Even though a reporting application may be used infrequently (such as during month end financial reporting), it could require segregation because of its high resource consumption. In contrast, there might be some internal applications without heavy resource requirements that are not business critical which might be good candidates for cohabitation with other similar applications.

A team of people, including operations and line of business representatives, needs to make these subjective decisions on application placement. The additional cost of hardware and operational support will factor into each decision.

Solutions

Applications must be thoroughly tested for performance beginning early in the development cycle. Typically this means that initial performance testing, likely with "stubbed code," should start before the project is 1/3 complete. To be successful, all use cases will have been identified, the appropriate scripts will have been written to exercise those use cases at the expected production load, and the testing environment will have been built with enough resources to mirror the production environment, though likely on a smaller scale. The reality is that performance testing is often given minimal attention and not considered until very late in the process, which is a big misstep.
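To make the idea of scripted, repeatable load tests concrete, here is a minimal sketch of a load driver written in Python. The URL, user count, and request count are hypothetical assumptions, not values from this article, and in practice a dedicated load testing tool would generate and report the production-level load described above.

    # Minimal load driver sketch (Python): drives concurrent requests against one
    # use case URL and reports response times. The URL, thread count, and request
    # count are illustrative assumptions.
    import time
    import urllib.request
    from concurrent.futures import ThreadPoolExecutor

    TARGET_URL = "http://test-env.example.com/app/checkout"   # hypothetical use case
    CONCURRENT_USERS = 25
    REQUESTS_PER_USER = 40

    def run_user(user_id):
        timings = []
        for _ in range(REQUESTS_PER_USER):
            start = time.time()
            with urllib.request.urlopen(TARGET_URL) as resp:
                resp.read()                      # consume the response fully
            timings.append(time.time() - start)
        return timings

    with ThreadPoolExecutor(max_workers=CONCURRENT_USERS) as pool:
        results = list(pool.map(run_user, range(CONCURRENT_USERS)))

    all_times = sorted(t for user in results for t in user)
    print("requests: %d" % len(all_times))
    print("average response time: %.3f s" % (sum(all_times) / len(all_times)))
    print("90th percentile: %.3f s" % all_times[int(len(all_times) * 0.9)])

Scripts like this are driven by the business use cases identified by the test planners, and the same scripts can be reused against each test environment as the application is promoted toward production.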

The best performance environment is one that replicates production both in the middle tier WebSphere Application Server, and in the backend (such as databases, mainframes, and so on). While this might not be feasible for every organization, every attempt should be made to create a performance testing environment that is representative of reality. Testing in production is a well documented anti-pattern and is, unfortunately, a common way to debug problems; debugging in production is usually very expensive, creates a fragile environment, and often results in a negative end user experience.

Beyond the environment, testing also requires test teams, coordinators, planners, and, where possible, dedicated hardware and networks. The team that identifies the use cases generally provides the input to the test team. The test team then takes the use cases, builds the test scripts, and executes the test plans developed by the test planner. System administrators provide the associated administration of the test environments to ensure that they mimic production as closely as possible.

See Resources for information on performance testing.


Versioning

Versioning occurs for every component in the enterprise, and is one of the more challenging things to manage. In some cases, versioning is planned; for example, a new version of an application that is expected to be released in three months. In other cases, versioning is unplanned and sometimes performed in an emergency; for example, when applying a fix to the OS or the application server. A strong versioning process results in a production environment with integrity.

While version creep will occur, every effort should be made to maintain as much consistency between environments as possible. The fewer the differences, the less environment-specific knowledge the administrators must keep track of to sufficiently manage all of the environments in a large topology. However, to do their job, administrators need to get information about what has or has not been deployed or changed in each environment in the first place -- which brings us to the next topic: change control.

Change control

Change control and the processes that surround change control help an enterprise successfully manage a large topology environment. A systematic change control process that starts with an open ticket and moves through individual approvals and work orders in a packaged application solution is the best way to manage change. Attempting to save money by using some combination of spreadsheets and a non-integrated workflow notification mechanism (for example, e-mail) does not provide the requisite discipline.

  • Backup strategy

    Backup is a critical change control function. Before any changes are made, it is vital that a backup of the environment be created and then tested in a restore operation somewhere to verify that the backup is not somehow corrupt. Few situations are as distressing as having a physical server crash and discovering that your backup is useless and the machine must be rebuilt using bare metal procedures.

  • Inventory

    A prerequisite to any change control process is an up to date inventory of hardware and software. An accurate IT inventory can help assess risk when problems occur and when updates or patches need to be applied. The inventory should cover the entire IT infrastructure including:

    • Production systems
    • IP addresses
    • Patch status
    • Patch level
    • Vulnerabilities
    • Physical location
    • Custodian
    • Function

Hardware versioning

Ideally, all hardware in a large topology environment is built from a master image. As fixes need to be applied to the hardware, the master image is also updated. This way, new hardware is introduced into the environment only as a duplicate copy of the other machines.

Software versioning

Software versioning has many facets. No matter how software versioning will impact the environment, software versions should be applied in a cycle through the test environments. A patch is applied first to the development test environments (or perhaps to a system administration machine used for change testing before changes are introduced to development). Once tested and deemed stable, the change can move to the system integration test environment, then to the performance test environment, and then through the QA and staging environments prior to production. Testing, of course, occurs at each stage before continuing on. The process can take from several days to weeks, depending on the scope of the changes and whether any problems were encountered at any testing stage.

Security fixes should be given top priority; these should be considered "hot fixes." When a vendor issues a security fix alert, not only do administrators learn of the vulnerability, but those who try to exploit such holes might be learning about the problem as well -- if they didn't know already.

  • Operating system

    Operating systems need periodic updates. OS level fixes can be planned and should exist as part of the annualized project plan, at least once a quarter.

  • WebSphere Application Server

    WebSphere Application Server is the equivalent of an operating system in that it provides functionality to the application. Much like an OS, there are periodic updates that should be deployed on a regular basis throughout the year. Of course, it is not feasible to deploy each latest update as it becomes available, but you can perform and complete the update cycle at least every 6 months, preferably once per quarter. However, sometimes hot fixes must be applied off schedule because of some unique interaction between the application and the application server.

  • IBM stack products

    Several IBM products install on top of WebSphere Application Server, such as WebSphere Portal, WebSphere Process Server, WebSphere Enterprise Service Bus, and ITCAM for WebSphere, among others. IBM stack products should be managed the same as any other third party vendor application.

    Keep in mind that IBM stack products tend to have rigid version requirements on WebSphere Application Server; one product might require V6.0.2 fixpack 11, and another may require fixpack 21. This can be difficult to manage if the stack products are living in the same shared environment. Likewise, stack product major version updates (such as from V6.0 to V6.1) can entail significant change at both the stack product and WebSphere Application Server levels. Segregating the stack products into their own separate cells is probably a good step to take, but sharing physical resources is usually fine unless there are high volume requirements involved.

  • Application

    No matter how "application" is defined in your environment, two types of application updates can occur: scheduled version updates and unscheduled hot fixes. Being prepared and scheduling known release candidates in the project plan involves communication between the development and operations teams. Also, WebSphere Extended Deployment provides functionality for helping manage application versions in a production environment (see Resources).

  • Service versioning

    Reusable services that are deployed and made available to applications will continuously change. When changes occur, you must consider the changes to a reusable service as well as the impact of those changes to every application that uses the service. These changes can be:

    • Behavioral change: A business decision must be made to implement behavioral changes. If the client of this service is of high importance to the enterprise, then changes within the service merit functional retesting of the client.
    • Introduction of a new interface: A new interface that does not introduce a change in behavior to an existing interface typically will not merit thorough testing by clients. Of course, if the performance characteristics of the service change due to the new interface, then a performance re-test by the high importance clients is warranted.
    • Deprecation or removal of an old interface: This is a very difficult thing to achieve, especially in environments where a service has been made available within the enterprise, and the service provider is not actually aware of all of its clients.

Stack component licensing

The licensing conditions of some stack components can be a strong factor in considering application co-location. Having applications use shared licensed drivers might save significant cost compared to each application purchasing and installing its own copy of these licensed "applications." Such "applications" range from stack products to simple runtime libraries used by the enterprise's applications and services.


Stability and bad applications

Stability is the holy grail of any production environment. "Bad" applications can wreak havoc in any environment, but can be particularly problematic in large shared topologies. Therefore, any applications that have some negative (bad) effect on available resources should be segregated from the general population. High volume applications can be bad from the perspective of their high CPU and memory requirements.

Performance testing applications is the right way to identify bad applications before they get to production. The other way is to deploy applications into production without testing, which, of course, is problematic because by the time a bad application is identified, it will already have caused production problems.

Before you can manage a large topology environment and provide stability, you must be able to peer into the application server and monitor how the application is running. Every production and performance test environment must have some application monitoring strategy and should use monitoring tools, such as ITCAM for WebSphere. See Resources for more on application monitoring.
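Where a full monitoring product is not yet in place, administrators can at least avoid OS level commands such as kill -3 (mentioned earlier) by requesting thread dumps through wsadmin. The Jython sketch below is a minimal illustration, not a monitoring strategy; the node and server names are hypothetical assumptions, and it relies on the dumpThreads operation of the server's JVM MBean being available in your WebSphere Application Server release.

    # wsadmin Jython sketch: request a javacore (thread dump) from a running
    # application server without resorting to kill -3 at the OS level.
    # Node and server names ("appNode01", "AppServer01") are illustrative assumptions.
    # Run with: wsadmin -lang jython -f dumpThreads.py

    serverJvm = AdminControl.queryNames(
        'type=JVM,node=appNode01,process=AppServer01,*')

    if serverJvm:
        AdminControl.invoke(serverJvm, 'dumpThreads')   # writes a javacore on the server
        print 'Thread dump requested for AppServer01'
    else:
        print 'JVM MBean not found; check node and server names'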


Packaging and deployment

Packaging the application EAR

Although not typically thought of when considering effects on shared environments, packaging can cause issues in some environments if an application is packaged or deployed in an odd manner. The common practice for packaging an application is for all the code components, modules, and libraries to be packaged in the same freestanding EAR file. An application packaged in this way is easier to deploy and manage. A suite of applications would be deployed as multiple EAR files. (You can run into service versioning issues if changes occur to the interface, but that is beyond the scope of this article.)

Multiple deployment units

One common problem when applications are not deployed as freestanding EAR files is that the administrator now has what we will term "multiple deployment units," which have to be managed as a single component. Multiple deployment units can be difficult for an administrator because they typically rely on shared libraries (libraries in the shared classloader), and the applications deployed under those libraries might require different versions of them. This approach starts to encounter issues when a new version of an application requires a new version of a library that is not backward compatible: every other application deployed underneath the same set of libraries must also be updated, or else the environment must be split once again. Administratively, multiple deployment units are difficult to manage and might have issues when a new library version is introduced. In addition, while it is possible to perform "hot replacement" of applications packaged within a single EAR file, such replacement of multiple deployment units is much more difficult to achieve.
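One common mitigation, sketched below as a wsadmin Jython example rather than a prescription, is to define each library version as its own named shared library and bind an application to the specific version it needs through a per-application library reference, instead of placing every version on a single shared classloader. The library name, class path, and application name are hypothetical assumptions.

    # wsadmin Jython sketch: define a version-specific shared library and bind an
    # application to it, so two applications in the same cell can use different,
    # incompatible library versions.
    # Names and paths ("CommonUtils_v1", "/opt/shared/...", "OrdersApp") are
    # illustrative assumptions.

    cell = AdminConfig.getid('/Cell:/')

    # One named shared library per library version.
    libV1 = AdminConfig.create('Library', cell, [
        ['name', 'CommonUtils_v1'],
        ['classPath', '/opt/shared/commonutils/1.0/commonutils.jar']])

    # Reference the desired version from a specific application's classloader.
    deployment = AdminConfig.getid('/Deployment:OrdersApp/')
    deployedObject = AdminConfig.showAttribute(deployment, 'deployedObject')
    classloader = AdminConfig.showAttribute(deployedObject, 'classloader')
    AdminConfig.create('LibraryRef', classloader, [['libraryName', 'CommonUtils_v1']])

    AdminConfig.save()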

Scripting

Applications are developed to provide automated and repeatable processes for business requirements. Scripting provides the same automation and repeatability for application and environment administration. Scripting is mandatory in a large scale topology; even small environments benefit from scripting simply due to the reproducible actions scripts provide. Standard change control mechanisms should be applied to the scripts. Scripts can also be tested in development, system integration, performance, QA, and staging before they ever touch a production environment.
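As a simple illustration of what such a script might look like, the wsadmin Jython sketch below installs or updates an application on a cluster and saves the configuration. The application name, EAR path, and cluster name are hypothetical assumptions, and a production deployment script would add error handling, node synchronization, and environment-specific options.

    # wsadmin Jython sketch: scripted, repeatable application deployment.
    # Application name, EAR path, and cluster name are illustrative assumptions.
    # Run with: wsadmin -lang jython -f deployApp.py

    appName = 'OrdersApp'
    earFile = '/builds/OrdersApp/OrdersApp-1.4.2.ear'
    cluster = 'AppCluster'

    # Update in place if the application is already installed; otherwise install it.
    if appName in AdminApp.list().splitlines():
        AdminApp.update(appName, 'app', ['-operation', 'update',
                                         '-contents', earFile])
    else:
        AdminApp.install(earFile, ['-appname', appName,
                                   '-cluster', cluster,
                                   '-usedefaultbindings'])

    AdminConfig.save()
    print 'Deployment of %s complete' % appName

Because the same script runs unchanged against development, system integration, performance, QA, staging, and production, each deployment exercises the script itself before it ever touches production.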

Resource scoping

One of the recent changes to the JNDI namespace is the ability to scope resource definitions at various levels in the environment (server, cluster, node, or cell). Scoping everything at the cell level might seem easy, since a resource only needs to be defined once and every application server can see it. However, in a large scale environment this leads to unnecessary resource definitions in the namespace of many application servers that do not actually use the resource: so-called namespace "pollution." This may lead to namespace collisions, but more likely it just makes JNDI problem determination and resolution more difficult. Your resources should be defined at the cluster scope to avoid namespace pollution and keep resource definitions visible only in the servers where those resources are actually used.
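For illustration, the following wsadmin Jython sketch creates a JDBC provider and data source at cluster scope rather than cell scope, so the definitions are visible only to members of that cluster. The cluster, provider, and data source names and the driver class path are hypothetical assumptions.

    # wsadmin Jython sketch: define a JDBC provider and data source at cluster
    # scope so only members of that cluster see them in JNDI.
    # Names and paths ("AppCluster", "OracleJdbcProvider", "OrdersDS", the JAR
    # location, and the driver class) are illustrative assumptions.

    clusterId = AdminConfig.getid('/ServerCluster:AppCluster/')

    # Create the provider under the cluster scope object.
    provider = AdminConfig.create('JDBCProvider', clusterId, [
        ['name', 'OracleJdbcProvider'],
        ['implementationClassName', 'oracle.jdbc.pool.OracleConnectionPoolDataSource'],
        ['classpath', '/opt/oracle/ojdbc14.jar']])

    # Create a data source under that provider; it inherits the cluster visibility.
    ds = AdminConfig.create('DataSource', provider, [
        ['name', 'OrdersDS'],
        ['jndiName', 'jdbc/OrdersDS']])

    AdminConfig.save()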


Conclusion

The management of large topology environments involves a considerable number of options and corresponding decision points. Organizations that are either new to WebSphere Application Server or that are planning large topologies can use this article to guide their future strategy and planning. The best way to tackle the challenges of growing large topologies in an enterprise is by taking the time to document all the decision points and the tradeoffs that are acceptable. Those already managing large topologies might recognize some of the tradeoffs and tactics presented here. Hopefully, the information in this article will help validate that the right decisions were made for those environments. Others might find that some of this information presents a different way of addressing the same fundamental problem. The challenges of managing any size topology (and especially large topologies) are best addressed through planning, proactive actions, and decisions, as opposed to attempting to deploy and manage in a reactive manner. We hope the material provided here will enable you to proceed in the former manner, rather than the latter.


Acknowledgements

The authors would like to thank Paul Ilechko and Steve Linn for their assistance.

Resources
