Earlier today I received an email from a customer reporting that their large POWER7-based machines were on firmware levels 720_064 to 720_090, and that they were reluctant to take the outage to upgrade. They were asking for fine details of the newer firmware levels and what advantages these would bring to "justify the outage to their user departments".
To be blunt this is a horror story:
I lie awake at night in a cold sweat about stories like this. The customer has the whole plan for running computers upside down. The question should be "can I risk not having the latest firmware?" and user departments should expect either:
- High Availability switch overs (few minutes outage) or
- Live Partition Mobility (zero minutes outage) or
- Regular planned outage (an hour).
If the computer stops and it has firmware that is 2 years out of date ... it is the computer department's fault!
This is particularly true if the firmware dates back to the initial release: that level would not even have CHARM enabled, as that arrives roughly 3 to 6 months after GA.
IBM does not issue new System Firmware levels for the fun of it! It costs IBM many millions of dollars and man-years of effort in coding and particularly testing. Sure, new levels make available new Hypervisor functions that you could regard as optional, but the bulk of the work is to increase reliability, to ensure the integrity of your system, the data in memory and the contents of your disks (to avoid corrupting your RDBMS), to avoid unexpected outages and to increase performance.
Large machines running POWER Systems Firmware 720_064 are on the initial availability release of firmware dating back to September 2010. Running this is, in my humble opinion, "bonkers". Take a metaphor from maintaining your car:
- When a red light starts blinking on the dashboard of your car - Do you ignore it for two years?
- What if you have 4 red lights, 2 orange lights and the words "Warning!!" flashing on the dashboard?
- What would it take to get you to take the car to the garage to get it fixed? A complete breakdown 50 miles from town? In computer terms, a failure that takes days to fix. A major life-threatening accident with your family in the car? In computer terms, you are sacked for incompetence!
The details of POWER7 Systems Firmware history and fixes can be found here:
Low-End - ftp://ftp.boulder.ibm.com/software/server/firmware/AL-Firmware-Hist.html
Mid-range - ftp://ftp.boulder.ibm.com/software/server/firmware/AM-Firmware-Hist.html
High-End - ftp://ftp.boulder.ibm.com/software/server/firmware/AH-Firmware-Hist.html
Excellent webpage that defines all the terms we use in POWER land:
Exercise for the student
Find your current Firmware Level, then go up the web page counting the number of "Severity: HIPER - High Impact/PERvasive, Should be installed as soon as possible" warnings between it and the current best level. This is the number of red lights on the dashboard.
- My customer has five red lights on the dashboard and they are asking "Do we really need to update?". My customer is "bonkers". If they have an "incident" on these machines, the urge to say "we told you so!" will be overpowering.
- I am told they can non-disruptively update their 720 firmware to 720_108, the minimum sane level, without downtime.
- I have not checked the details, but that would get them quickly to a higher, safer level. Late last year we recommended 720_101, which had all sorts of improvements that the techies regarded as mandatory. I can't check the update process by trying this because none of my 20 machines are this back level, of course.
- The same goes for getting to the latest 730 firmware from early 730 levels.
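The counting exercise above can be sketched in a few lines of Python. This is only an illustration under loud assumptions: it assumes each entry in the history page text starts with a level name such as AM720_064 on its own line and carries a line beginning "Severity: HIPER" when it is a HIPER fix. The real page layout may differ, and the function name `count_red_lights` and the sample text (including which levels are HIPER) are made up for the example, not taken from IBM.

```python
import re

# Assumed entry heading: two letters, release, underscore, service pack
# (for example "AM720_064"). The real history page may be laid out differently.
LEVEL_RE = re.compile(r"^[A-Z]{2}(\d{3})_(\d{3})\b")


def count_red_lights(history_text, current, best):
    """Count HIPER entries newer than `current`, up to and including `best`.

    Levels are (release, service_pack) tuples, for example (720, 64).
    """
    red_lights = 0
    entry = None      # level of the entry currently being read
    counted = False   # count each entry at most once
    for line in history_text.splitlines():
        line = line.strip()
        heading = LEVEL_RE.match(line)
        if heading:
            entry = (int(heading.group(1)), int(heading.group(2)))
            counted = False
        elif (entry is not None and not counted
              and line.startswith("Severity: HIPER")
              and current < entry <= best):
            red_lights += 1
            counted = True
    return red_lights


# Invented sample text, NOT real IBM data: the severities are placeholders.
SAMPLE = """\
AM720_108
Severity: HIPER - High Impact/PERvasive
AM720_101
Severity: HIPER - High Impact/PERvasive
AM720_090
Severity: OTHER (placeholder)
AM720_064
Severity: HIPER - High Impact/PERvasive
"""

print(count_red_lights(SAMPLE, (720, 64), (720, 108)))  # prints 2
```

Comparing levels as tuples means a machine on an early 720 level can be checked against a 730 target in the same way, since (720, 64) sorts below any (730, ...) level.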
The Smart Money:
- Changes from 720 to 730 are disruptive, but I would very highly recommend this too. Plan it into the next update cycle and certainly before the end of the year at the very latest. There are many fixes and performance tweaks in the 730 Firmware that are not in the 720 levels, which matter especially for the large machines: Power 770, Power 780 and Power 795.
- I would put this in the "don't phone or email me with a performance issue unless you are on the latest 730 firmware because you would be wasting my time addressing already fixed issues" category.
- I know System Firmware updates are boring and hard to schedule, but ignoring the problem is not an option. Just as driving your car with your eyes closed so you can ignore the red lights on the dashboard is not an option.
- Live Partition Mobility has been available for 5 years and HACMP (now PowerHA SystemMirror) has been available for decades, and these make it relatively painless.
- If you don't regularly update firmware, you are playing a high-stakes poker game with your employer's assets and company profits, and your own career.
- I don't think I can express that more clearly.