Let urgent = urgent + 20 minutes
AnthonyEnglish
Around 7:15 one morning, a large Australian retailer called to say they couldn't access their production system, and most stores were opening at 9:00 am. We had to go on site in those days, so it was 8:00 am before we got there. The operating system was accessible via the console, but it was in a very sick state: many major files had simply been wiped out.
Preparing for a post mortem
We had to restore from a good mksysb and the store opening was approaching quickly. Before we did the restore, a colleague wanted to make a backup of the sick system if we could. I was against it because of the time it would take. He won the argument.
It took 20 minutes, and although we lost a little bit of time, the backup we took saved our skins. After recovering the live system, we restored the backup onto a spare system to find out what had gone wrong. A cleanup script called clean_dos was the culprit (see Resources below). Without that tape, which we used for a post mortem, we would never have known the cause. The system would have gone down again at 7 o'clock the next morning. All because I wanted a quick fix.
Better 20 minutes late than dead on time
How often we prefer the quick fix, which ends up being a slow and torturous path to destruction. Instead of fixing the pothole, we put up another detour sign. Well, a road made of detours leads to sys admin oblivion, because
there's nothing so permanent
In AIX terms, fix them workarounds.
Very often these temporary fixes exist only because someone shouted "Urgent!" and you didn't add 20 minutes to their expected resolution time.
Here are some common temporary fixes which end up being set in stone. A little thought and time would prevent them making systems unstable, insecure or unmanageable:
Prevention is better than cure, but if I do need to do a quick workaround, I like to follow this rule:
Every temporary fix should have a permanent undo scheduled within a reasonable time frame (plus 20 minutes)
You might like to fix some of these little workarounds, tidy up the cron and clean up those old file systems. You could even write a little script to do some of these things. Just don't call it clean_dos, OK?
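If you do write that little script, here is one cautious way it might look. This is only a sketch, assuming Python 3 is available on the box. The names find_old_files and clean are made up for illustration, and it is not a reconstruction of what clean_dos did. The idea is that the script refuses to run against the root directory, never follows symbolic links, and only reports what it would remove unless you explicitly ask it to delete.

```python
import os
import time
from pathlib import Path


def find_old_files(root, max_age_days, now=None):
    """Return regular files under root not modified in the last max_age_days days."""
    root_path = Path(root).resolve()
    # Refuse to work on the filesystem root or on something that is not a directory.
    if root_path == Path(root_path.anchor) or not root_path.is_dir():
        raise ValueError(f"refusing to clean {root_path}")
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    old = []
    # os.walk does not descend into symlinked directories by default.
    for dirpath, _dirnames, filenames in os.walk(root_path):
        for name in filenames:
            path = Path(dirpath) / name
            # Skip symlinks to files so we never touch anything outside root.
            if path.is_symlink() or not path.is_file():
                continue
            if path.stat().st_mtime < cutoff:
                old.append(path)
    return sorted(old)


def clean(root, max_age_days, delete=False, now=None):
    """List old files; remove them only when delete=True is passed explicitly."""
    old = find_old_files(root, max_age_days, now=now)
    for path in old:
        print(("removing " if delete else "would remove ") + str(path))
        if delete:
            path.unlink()
    return old
```

Run it first with the default delete=False and read the "would remove" list before letting it delete anything. That dry run is the script's own extra 20 minutes.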
And you can quote me
If it looks vain to be referring you back to my previous blog posts, then it's either because: