Systems software maintenance is similar to "Painting the Forth Bridge": on reaching the end of one round of updates, it is time to start the next.
In a world where software updates are released almost daily and mean time between failures (MTBF) is literally a waiting game before an item of hardware breaks, the endless cycle of planning and fixing looks very much like it is here to stay.
Generally, when it comes to technology, lightning performance and around-the-clock service have become an expectation, without exception.
Waiting for the response from a mouse-click or for a script or job to finish is frustrating, and system unavailability is intolerable. What this means for any business is a necessity to set sail for Uptime Utopia, to strive for seamless, instant and endless availability.
Outages, regardless of the cause, have consequences. The ideal of "100% uptime" is eroded, and the cost to the business can be measured both in monetary terms and in damage to reputation.
Many businesses are dependent on their IT systems to the point where extended outages have dire commercial consequences. The woe-betide word of downtime has the uncanny ability to make months and months of uptime and reliability appear relatively insignificant.
What immediately comes to mind when there is an outage is that something has broken; a system has failed, an application has crashed, and so on. In reality, planned outages are responsible for a fair proportion of downtime and as such, all intentional disruption should be minimized as far as possible.
Numerous articles outline the importance of system maintenance and the perils of not addressing it. A very good example was written by Anthony English.
What is sometimes not obvious is that we do not simply manage AIX or the Virtual I/O Server (VIOS); we manage parts of the service, and there are a number of other factors to consider.
This article expands on the other factors and components to help define an enterprise maintenance strategy for AIX on Power Systems™ servers.
The key objectives of the strategy are:
- To minimize planned disruptions to the service
- To increase system availability through careful planning
Effective enterprise maintenance
Defining the service
Enterprise-class environments on IBM Power servers consist of multiple components. A typical example is as follows:
Figure 1: Example of an enterprise-class environment
Regardless of whether the functionality of the application is hosting a company website or it is processing a critical overnight batch, every component is required in order to deliver the outcome.
In the same way that a car is not of much use without an engine, and an engine without a car is hardly convenient, if AIX is up but the application is down, functionality is affected.
Every component has some form of maintenance associated with it, whether it is related to software, hardware, or both. It is common for large organizations to have multiple teams looking after specific components, and from experience, it is not unusual for each team to have its own strategy for maintenance.
The result is that throughout the course of the year, service can be affected by multiple planned outages as a consequence of having multiple maintenance schedules.
Minimizing the number of planned outages increases service availability. Achieving this requires every team to consider the service as an entity as opposed to the specific component that their team manages.
Developing an enterprise maintenance strategy for AIX on Power that addresses the key objectives requires an understanding of the maintenance plans and schedules relating to every component. The outcome is to combine or rearrange activities so that a schedule of limited outages can be developed, in accordance with specific business needs.
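As an illustration of combining activities, the following is a minimal Python sketch, not a definitive tool. It assumes each team's disruptive work can be described by a component name and a planned month, and that the business has agreed a small set of outage window months. The `Activity` class, the function names, and the example months are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Activity:
    component: str    # illustrative label, e.g. "AIX" or "Database"
    month: int        # month (1-12) in which the owning team planned the work
    disruptive: bool  # True if the work needs an outage to the service


def outages_in_isolation(activities):
    """Count outages when every team schedules its disruptive work separately."""
    return sum(1 for a in activities if a.disruptive)


def combine_into_windows(activities, window_months):
    """Move each disruptive activity to the nearest agreed window month.

    On a tie, the earlier window is chosen because the months are sorted.
    """
    ordered = sorted(window_months)
    windows = defaultdict(list)
    for a in activities:
        if a.disruptive:
            nearest = min(ordered, key=lambda m: abs(m - a.month))
            windows[nearest].append(a.component)
    return dict(windows)


# Hypothetical schedules from four teams, plus one non-disruptive review.
plans = [
    Activity("Firmware", 2, True),
    Activity("Application", 5, True),
    Activity("AIX", 4, True),
    Activity("Database", 9, True),
    Activity("Microcode review", 6, False),
]

combined = combine_into_windows(plans, window_months=[4, 9])
print(outages_in_isolation(plans), "outages planned in isolation")
print(len(combined), "combined outage windows:", combined)
```

With these example inputs, four separately planned outages collapse into two windows. A real schedule would also need to respect technical dependencies between components, which this sketch ignores.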
A plain example
The following relates to a situation which was experienced when working with a very large IBM Power environment.
The environment consisted of multiple Power servers, each of which ran multiple AIX logical partitions (LPARs) for a variety of business units.
A planned outage that affected a Power server would impact multiple business units and therefore, outages were very difficult to arrange.
Each business unit was represented by an application owner who was responsible for service availability. One of the application owners represented the largest and most important applications for the company, and during one particular meeting, it was highlighted that recent outages included the following disruptions:
- IBM Power Systems™ firmware upgrade: Full Power Systems stop/start: Outage for the LPAR and service.
- Application release: Outage for the application and service
- AIX update: Reboot required: Outage for the LPAR and service
- Database update: Outage for the application and service
The application was in constant use around the world and any outage, planned or unplanned, had a significant impact.
This prompted other application owners to evaluate outages affecting their services and the outcome was similar: Multiple planned outages, each relating to a different component.
In each case, the application owners were asked at short notice for downtime, which was difficult for them to arrange with the business community.
The beginning of a combined strategy
It was not too long before the IT Service Manager arranged a meeting with the technical teams to discuss planned outages for the remainder of the year.
As the meeting progressed, it became clear that service-affecting work was being scheduled by almost all of the teams and, in most cases, in complete isolation from one another. An objective was set: Devise a calendar-based schedule in which updates and maintenance for the core components of the service were planned.
Communication was built into the schedule to request outages to the service far in advance. This eliminated the frustration associated with arranging outages at short notice.
The schedule incorporated flexibility to allow businesses to choose a suitable time within a four-month window for planned outages.
An annual system-wide outage was built in, reserved for disruptive tasks such as a Power firmware upgrade or hardware replacements that were not hot-swappable (such as processor replacement, and so on). The outage would only be used if required, but it was essential to cater for this type of maintenance, should the need arise.
Where possible, PowerHA failovers were used to minimize downtime before a maintenance window: clustered LPARs were 'failed-over' to the failover node so that services experienced a shorter outage related to the failover, as opposed to the longer outage associated with the maintenance. From IBM POWER6®, IBM PowerVM® Live Partition Mobility (LPM) became an additional tool for critical applications to maintain uptime, for moving LPARs away from the Power server on which maintenance was scheduled.
After finalizing the strategy, it became possible to offer the businesses the option to combine multiple components into a single planned outage per year, with the understanding that a second outage might become necessary if there were technical reasons to apply fixes.
The enterprise maintenance strategy for AIX on Power was defined within the attached spreadsheet:
The spreadsheet is organized from the top row, starting at January and working across to December.
Column A contains the components that collectively formed the service, with a schedule of activities to occur throughout the year. For the sake of minimizing the number of components, other software (such as PowerHA) was included in AIX Image Packaging and Delivering AIX Package.
The maintenance strategy included important aspects of managing service availability in an enterprise environment: The cycle of communication, evaluation of fixes, the testing and packaging of updates, and finally, deployment.
Known disruptive outages were defined as follows:
- Power Systems outage: The February outage was the only planned disruptive window, reserved for essential tasks that require an outage to the Power server, for example, disruptive firmware upgrades (although concurrent updates were performed in preference). It was also the time to correct hardware failures that had occurred throughout the year and were not hot-swappable (processor failure, and so on).
- AIX update: Definite Technology Level update and a possible service pack update, with the option to combine with any other outage.
- Database application updates: Option to combine with any other outage.
- Application updates: Option to combine with any other outage.
From an AIX and VIOS perspective, the strategy was to apply one update per year, with the possibility of a second (service pack or important fix) update mid-year. The mid-year outage was planned for in advance, but would only be used if it was considered necessary for technical reasons.
Aligning the maintenance activities of various teams meant that testing periods were extended, providing a more resilient base for the application.
Publishing the enterprise maintenance strategy to the business community provided benefits too. The individual businesses could plan their application testing and release to coincide with any other planned outages.
After implementing the enterprise maintenance strategy, it became clear that the infrastructure was being managed seamlessly from an enterprise perspective as opposed to an isolated component-based view.
The process of defining a maintenance strategy
The benefits of having an enterprise maintenance strategy include improved service availability by combining maintenance tasks, increased reliability through regular maintenance, and positively raising the profile of the technical teams responsible for managing the environment.
This section focuses on the processes and topics that assist with defining an enterprise maintenance strategy for AIX on Power.
Define a maintenance strategy owner
A calendar-based schedule is likely to result in one or more activities taking place every month. Defining a maintenance strategy owner (who is supported by management and the business community) ensures that the schedule remains on track.
The maintenance strategy owner is responsible for the following activities:
- Ensures that outages are planned with the businesses in advance.
- Provides communication to the business one month in advance to confirm impact of planned outages.
- Liaises with the technical teams to ensure that the testing and packaging phases are completed on schedule.
- Tracks the progress of updates for individual services and LPARs to ensure that every LPAR is being updated on time.
- Reviews the enterprise strategy annually so that it remains relevant and achievable.
Plan for realistic scenarios
When devising a strategy, consider business needs in relation to IBM AIX release strategy.
IBM releases one AIX Technology Level per year, approximately four service packs per year, and AIX Technology Levels receive fixes for three years.
Refer to AIX release and service delivery strategy for further information.
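To make the three-year fix period concrete when planning, the arithmetic can be sketched as follows. This is a minimal Python sketch that assumes the period runs for exactly three calendar years from a Technology Level's release date. The function names and the release date are hypothetical; the authoritative end dates are those that IBM publishes.

```python
from datetime import date

# From the text: AIX Technology Levels receive fixes for three years.
FIX_SUPPORT_YEARS = 3


def fix_support_end(release: date, years: int = FIX_SUPPORT_YEARS) -> date:
    """Return the date on which the assumed fix period ends."""
    try:
        return release.replace(year=release.year + years)
    except ValueError:
        # A 29 February release date has no counterpart in a non-leap year,
        # so fall back to 28 February.
        return release.replace(year=release.year + years, day=28)


def still_receives_fixes(release: date, today: date) -> bool:
    """True while `today` is before the end of the assumed fix period."""
    return today < fix_support_end(release)


# Hypothetical release date, for illustration only.
tl_release = date(2013, 11, 1)
print("Fixes assumed until:", fix_support_end(tl_release))
print("Covered on 2016-06-01:", still_receives_fixes(tl_release, date(2016, 6, 1)))
```

A check of this kind can flag LPARs whose Technology Level will leave the fix period before the next planned outage window.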
As a guide:
- Plan for at least one outage per year and recognize that additional outages are sometimes unavoidable.
Remain realistic: Emergency critical fixes that affect any one of the key components (AIX, database, application, and firmware) may need to be applied to the environment at short notice.
- Advising of "more uptime" is easier than requesting "more downtime".
This is important! If it is decided that the mid-year scheduled outage is no longer necessary, this can be communicated to the business in advance, according to the schedule.
- Recognize planned outages as an important part of providing stability.
Under certain circumstances, it is necessary to arrange downtime in order to carry out planned disruptive software maintenance and to replace hardware that has failed unexpectedly. As the shared infrastructure becomes more complex, the need for a manageable maintenance strategy increases.
Communication is essential
- Define an outage window for known service disruption and communicate in advance.
The attached spreadsheet indicates that planned outages were arranged 6 months in advance, with a four-month window to schedule an outage. For example, communication was sent to the individual businesses in June for them to choose outages from anywhere between April and July the following year.
Some businesses preferred for everything to be updated at once (one planned outage per year) and other business units stipulated the requirement for each component to be updated individually, during separate outages. The choice was left with the individual business, and they understood the increased complexities associated with problem analysis when changing multiple components at the same time.
Advance planning can also assist with scheduling technical resources to carry out maintenance. Trying to find technical resources on a Friday afternoon to perform upgrades on Saturday is not advance planning!
Actively following the maintenance strategy keeps the technical environment consistent (everything on the same software level) and up-to-date (supported by vendors).
- Define an outage window for potential service disruption and communicate in advance.
The spreadsheet shows that a window for mid-year service pack updates was defined to apply unexpected fixes.
One month before the mid-year service pack update, the need for an outage was evaluated. If neither a service pack nor a critical fix was required, the businesses were notified, one month in advance, that no outage was necessary.
Again, it is easier to advise the business of "more uptime" than request "more downtime".
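The one-month review described above lends itself to a simple calendar calculation. The following minimal Python sketch assumes the mid-year window is identified by its start date and that the review takes place in the preceding calendar month. The function names, the message wording, and the example date are hypothetical.

```python
from datetime import date


def months_before(d: date, months: int) -> date:
    """Return the first day of the month that lies `months` before d's month."""
    index = d.year * 12 + (d.month - 1) - months
    return date(index // 12, index % 12 + 1, 1)


def mid_year_decision(window_start: date, fix_required: bool):
    """Return the review month and the message to send to the businesses."""
    review = months_before(window_start, 1)
    if fix_required:
        message = "Mid-year outage confirmed"
    else:
        message = "No mid-year outage required"
    return review, message


# Hypothetical mid-year window starting in July.
review_month, note = mid_year_decision(date(2015, 7, 1), fix_required=False)
print(review_month, note)
```

Working in whole months keeps the calculation aligned with a calendar-based schedule, where activities are planned per month rather than per day.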
Testing and packaging of enterprise components
- Packaging VIOS, AIX, and database components
The maintenance strategy should include time for extended and thorough testing. Fine-tuning, rehearsing, and automating the update process on test LPARs reduces technical issues, minimizes downtime, and improves confidence in the solution.
- Host bus adapter (HBA) and Ethernet microcode:
Although frequently overlooked, adding adapter microcode to the maintenance schedule ensures a regular review against vendor interoperability matrices, for supportability and important fixes.
- Implementation - AIX, database, and firmware:
Implementation windows are defined to achieve the following tasks:
- Implement necessary updates with an option for individual businesses to combine components to reduce the number of planned outages, or arrange separate outages for each component.
- Introduce a tightly-managed environment. After a maintenance strategy is defined and followed, the technical environment as a whole becomes tightly managed by ensuring that all services are at the defined supported level. Managing fewer software levels leads to reduced testing and development time, as there will be fewer combinations to test.
- Realize that there will be exceptions:
It is inevitable that one or more LPARs will not be updated, for a variety of reasons: for example, the application code restricts the LPAR to a specific database or AIX level, or the business is unable to commit to prearranged outages because of unforeseen commercial reasons. These LPARs should be managed as exceptions, which can be reported on with a view to updating them as soon as possible.
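Reporting on exceptions can be as simple as comparing each LPAR's level with the defined target. The following minimal Python sketch assumes that levels are recorded as hyphen-separated numeric strings in the style reported by `oslevel -s`, and that the inventory has already been collected into a dictionary. The LPAR names, levels, and function names are hypothetical.

```python
def parse_level(level: str):
    """Split a level such as '7100-03-04-1441' into a tuple of integers."""
    return tuple(int(part) for part in level.split("-"))


def find_exceptions(inventory, target):
    """Return the LPARs whose recorded level is below the target level."""
    wanted = parse_level(target)
    return {
        lpar: level
        for lpar, level in sorted(inventory.items())
        if parse_level(level) < wanted
    }


# Hypothetical inventory of LPAR name -> recorded level.
inventory = {
    "lpar_a": "7100-03-04-1441",
    "lpar_b": "7100-02-05-1415",
    "lpar_c": "6100-09-04-1441",
}

for lpar, level in find_exceptions(inventory, "7100-03-04-1441").items():
    print(f"{lpar} is at {level} and should be tracked as an exception")
```

Comparing integer tuples rather than raw strings avoids ordering mistakes between fields. The reason for each exception still needs to be recorded separately.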
IBM continually improves the reliability of services running on IBM Power servers by frequently releasing software and firmware updates for components including the Hardware Management Console (HMC), VIOS, AIX, PowerHA, Power server, adapter microcode, and so on. Some updates enable features, others correct known issues.
Regularly reviewing available software updates is an important consideration for an effective enterprise maintenance strategy, and this is simplified by taking advantage of IBM My Notifications subscription service.
The service has multiple configuration options that allow for receiving updates on topics relevant to a specific environment, at a frequency that is right for you.
For example, by using the subscription service, it is possible to set up daily or weekly email alerts for new system firmware on specific Power server types (Power 770, Power 780, and so on) as well as email alerts for specific AIX and VIOS versions. The emails are delivered to your inbox when a new update becomes available.
When IBM releases software or firmware, it is usually associated with a 'severity' which is an indication of the importance of the update, along with a description of the problem.
Notable entries are as follows:
| Severity | Meaning | Recommendation |
|---|---|---|
| HIPER | High Impact/PERvasive | Should be installed as soon as possible. |
| SPE | SPEcial Attention | Should be installed at earliest convenience. Fixes for low-potential, high-impact problems. |
| ATT | ATTention | Should be installed at earliest convenience. Fixes for low-potential, low-to-medium impact problems. |
If required, refer to the full list of definitions.
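When several updates arrive together, the severity can be used to order them for evaluation. The following minimal Python sketch assumes each notification has been reduced to an identifier and a severity code. HIPER is ranked first because the definition above says "as soon as possible"; ranking SPE ahead of ATT is an assumption based on their stated impact. The fix identifiers are hypothetical.

```python
# Lower rank means more urgent. Only the three severities listed above are
# ranked; anything else sorts last so that it is reviewed manually.
SEVERITY_RANK = {"HIPER": 0, "SPE": 1, "ATT": 2}
UNKNOWN_RANK = len(SEVERITY_RANK)


def triage(fixes):
    """Sort (fix_id, severity) pairs so that the most urgent come first.

    The sort is stable, so fixes of equal severity keep their input order.
    """
    return sorted(
        fixes,
        key=lambda fix: SEVERITY_RANK.get(fix[1].upper(), UNKNOWN_RANK),
    )


# Hypothetical notifications, for illustration only.
pending = [
    ("FIX-A", "ATT"),
    ("FIX-B", "HIPER"),
    ("FIX-C", "SPE"),
    ("FIX-D", "OTHER"),
]

for fix_id, severity in triage(pending):
    print(severity, fix_id)
```

The ordering is only a starting point: the problem description supplied with each fix still determines whether it applies to a given environment.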
The information that IBM provides when releasing software updates allows you to evaluate the risk to your environment and to be aware of the updates to include during scheduled maintenance.
Refer to the My Notifications website.
It is widely acknowledged that system maintenance is an integral part of managing an environment, for supportability and stability reasons.
With increased reliance on IT systems and growing demands for availability, the number and frequency of outages can be minimized by introducing an enterprise maintenance strategy that encompasses the components of a service.
Reviewing the latest software and firmware updates is an important part of the enterprise maintenance strategy. IBM simplifies this process through the My Notifications subscription service, which provides an easy way to receive the latest information that is relevant to your environment.
Outages can be avoided by using IBM PowerVM® Live Partition Mobility to relocate running services to an alternative Power server. Planned downtime can be minimized by high availability solutions so that clustered LPARs can be moved during a short planned window ahead of scheduled maintenance. These factors can be incorporated into the enterprise maintenance strategy that is relevant to your environment.
- IBM My Portal: Enables you to stay up-to-date with fixes by configuring a subscription service, manage service requests, and more.
- AIX release and service delivery strategy: Provides information relating to the release strategy of AIX.
- AIX Service Strategy Details and Best Practices: Provides a list of helpful resources and information relating to the components of AIX, such as technology levels, service packs, and so on.
- Fix Level Recommendation Tool: Enables you to access fixes and updates from IBM for any component, including adapter microcode, system firmware, AIX and VIOS. For AIX, there is a Compare Report option. A file that contains installed file sets is uploaded using the link and the report advises on recommended updates and fixes which are not currently applied.
- System Software Maps: Provides information on software levels in relation to AIX, IBM i, and PowerVM Virtual I/O Server.
- Power Code Matrix: Provides at-a-glance information relating to System firmware and HMC code.