Since February 2014, IBM Workload Automation has been live on IBM Service Engage (www.ibmserviceengage.com). It is a SaaS (Software as a Service) offering based on IBM Tivoli Workload Scheduler (TWS) and managed directly by the TWS development team.
The offering runs on SoftLayer and, to keep operations costs low, the environment is almost completely automated, using TWS itself of course.
Working daily on TWS development, we have automation in our DNA, and when we planned our infrastructure we started from automation. As part of the architecture, in addition to the TWS installations used by our customers, we have an additional TWS instance that manages the environment: creating and deleting customer subscriptions, creating and deleting VMs for new environments, adding and removing users. But recently we found that we were missing an important scenario...
In the last months, with several security alerts such as Shellshock, Heartbleed and Poodle, we had to reboot our machines several times after installing OS patches, on average at least once per month.
Right now we have about 30 machines to reboot and, to ensure High Availability and avoid interrupting the service during the shutdown, the process is not easy: it includes the failover of DB2, the TWS master, and so on. Even though we were already using predefined procedures, each reboot cost at least one person-day and, because of the many steps, it was also prone to manual errors. We quickly realized that the reboot was an urgent scenario to automate.
We started from the run book created for manual reboots and identified the 19 jobs needed to complete the reboot of a single TWS environment.
To start the reboot process, we added a button to the internal web interface we had created to manage the infrastructure. The button calls a REST API running on the TWS master WAS which, using the official TWS APIs, submits the job stream that reboots the environment. At submission it passes the specific values of the selected environment, e.g. the IP addresses of the machines that are part of the environment.
High Availability architecture
High availability of the TWS instances installed in SoftLayer is achieved by leveraging the high availability of the different components. For each TWS installation we have two different sets of virtual machines, which we named Primary and Secondary. In general, the TWS master and the DB2 instance configured as primary run on the Primary machine, while the TWS backup master and the DB2 instance configured as standby run on Secondary. The Dynamic Workload Console is installed in a cluster, and the IBM HTTP Server performs load balancing between the two consoles installed on Primary and Secondary.
Description of the flow and jobs
In a high availability configuration, machines have to be rebooted one at a time while the other continues to work. Only the machine that is not currently serving the workload can be rebooted.
The process we implemented starts with the reboot of Secondary, the machine that plays the role of backup at the beginning. When that reboot is completed, the roles of the machines are switched and, after all the work has moved to Secondary, Primary can be rebooted. Once Primary is rebooted, the components on the machines are switched again to recreate the initial configuration.
At a high level, the reboot process is composed of three logical steps that are repeated twice, once for each machine to reboot:
The machine is prepared for the reboot, that is, all the applications are shut down gracefully.
The machine is rebooted.
The machine just rebooted is still playing the role of backup. After the reboot completes, it has to take the role of primary system and manage all the incoming activities.
For each phase we wrote scripts performing all the needed actions in an automated way and we created one job stream with the correct dependencies to perform the actions in the correct sequence on the different systems.
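As a rough illustration of the flow just described, the sketch below lists the three logical steps for each machine in the order they would run. The phase labels and the `reboot_plan` function are hypothetical names chosen for this example; they are not TWS job names or part of our scripts.

```python
from typing import List, Tuple

# Illustrative labels for the three logical steps (not real TWS job names).
PHASES = ("stop applications", "reboot", "take over primary role")


def reboot_plan(
    machines: Tuple[str, ...] = ("Secondary", "Primary"),
) -> List[Tuple[str, str]]:
    """Return the ordered (machine, phase) steps.

    The backup machine goes first, so the machine still serving the
    workload is never touched until the roles have been switched.
    """
    return [(machine, phase) for machine in machines for phase in PHASES]


for machine, phase in reboot_plan():
    print(f"{machine}: {phase}")
```

The last step on Primary is what brings the environment back to its initial configuration.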
Let's analyze in detail the sequence of the jobs in the REBOOT job stream.
The first job is a preliminary step, executed before starting the procedure.
Primary and Secondary are just names; what matters is where the TWS flow is currently running, because the roles of the two machines may have been switched for any reason. We therefore check that the machines we call Primary are really playing the role of primary systems.
Before a reboot it is good practice to stop all the running applications, in our case TWS, DWC, DB2 and IHS. Note that the procedure to stop DB2 in high availability requires deactivating and stopping HADR.
Reboot: REBOOT_WAIT_SECONDARY_DOWN, REBOOT_RESTART_SECONDARY and REBOOT_WAIT_SECONDARY_UP
Now that all applications have stopped successfully, we can reboot our machines by launching the REBOOT_RESTART_SECONDARY job, which simply runs the reboot command.
But how do we detect when the reboot is completed and the machines are up again?
The first thing that comes to mind is “ping!”. We wrote a WaitUpDown.sh script that returns true if the state of the machine is the expected one. We launch it twice: once to determine that the machines are down and once to check that they are up again. In this way we wait until the machine has gone down and then come back up.
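The idea behind the wait script can be sketched in a few lines of Python. This is not the WaitUpDown.sh script itself, only a minimal sketch of the same polling logic. The function names, the timeout and interval values, and the Linux-style ping flags are all assumptions made for the example.

```python
import subprocess
import time
from typing import Callable


def ping_once(host: str) -> bool:
    """Return True if one ICMP echo gets a reply (Linux ping flags assumed)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def wait_for_state(
    host: str,
    expect_up: bool,
    probe: Callable[[str], bool] = ping_once,
    timeout: float = 600.0,   # hypothetical default, in seconds
    interval: float = 5.0,    # hypothetical pause between probes
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll the host until it is in the expected state or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        if probe(host) == expect_up:
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)


# Typical use, mirroring the two launches described above (address is a placeholder):
#   wait_for_state("10.0.0.12", expect_up=False)   # wait until the machine is down
#   wait_for_state("10.0.0.12", expect_up=True)    # wait until it is up again
```

Passing the probe as a parameter keeps the polling loop independent of ping, so a different check (for example a TCP connection to the SSH port) could be swapped in without touching the loop.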
The structure of the job stream for this section is as follows.
The REBOOT_WAIT_SECONDARY_DOWN and REBOOT_RESTART_SECONDARY jobs run in parallel but, to be sure that the wait job starts before the reboot, it is submitted with a higher priority.
Once both jobs are completed, the REBOOT_WAIT_SECONDARY_UP job starts and waits until the machine is up again.
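The dependencies among these three jobs can be modeled outside TWS with a small dependency graph. The sketch below uses Python's standard graphlib module, not the TWS scheduler. The priority numbers are hypothetical and only express that the wait job should be picked up before the restart job.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# job -> jobs it has to wait for (taken from the description above)
DEPS: Dict[str, Set[str]] = {
    "REBOOT_WAIT_SECONDARY_DOWN": set(),
    "REBOOT_RESTART_SECONDARY": set(),
    "REBOOT_WAIT_SECONDARY_UP": {
        "REBOOT_WAIT_SECONDARY_DOWN",
        "REBOOT_RESTART_SECONDARY",
    },
}

# Hypothetical priority values: a higher number is started first.
PRIORITY: Dict[str, int] = {
    "REBOOT_WAIT_SECONDARY_DOWN": 20,
    "REBOOT_RESTART_SECONDARY": 10,
}


def run_order(deps: Dict[str, Set[str]], priority: Dict[str, int]) -> List[List[str]]:
    """Group jobs into waves; jobs in one wave may run in parallel."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    waves: List[List[str]] = []
    while sorter.is_active():
        # Within a wave, start the higher-priority job first.
        ready = sorted(sorter.get_ready(), key=lambda job: -priority.get(job, 0))
        waves.append(ready)
        sorter.done(*ready)
    return waves


for wave in run_order(DEPS, PRIORITY):
    print(" | ".join(wave))
```

The first wave contains the two parallel jobs, with the wait job listed first. The second wave contains only REBOOT_WAIT_SECONDARY_UP.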
Roles switch: REBOOT_AFTER_SECONDARY_RESTART
Once the reboot has been completed we need to move all the activities from Primary to the Secondary machine. The first step is to restart the database high availability and move the primary role to the database installed on the Secondary.
Roles switch: REBOOT_SWITCH_MGR_TO_BKM
Now that the database on Secondary is ready, we can run the switchmanager command from the master to make the backup master the new master. In addition, the definitions of master and backup master have to be changed in the database using composer.
Roles switch: REBOOT_START_BROKER_BKM
To continue scheduling on the agents, the broker on the Secondary machine has to be started.
After completing all these steps, the Primary and Secondary machines have completely switched roles: Primary now hosts a backup master and its DB2 is in standby.
The remaining jobs are exactly the same but executed on the other machine.
With a day of work we were able to automate a process that was costing us one or more person-days each month, recovering the cost in a few weeks. This confirms that automation can save operations costs, and that the sooner you automate, the sooner you start saving.
Franco Mossotto (@FMossottoTWA) - IBM Workload Automation SaaS Architect
Enrica Alberti - IBM Workload Automation SaaS DevOps Lead