Around 7:15 one morning, a large Australian retailer called to say they couldn't access their production system, and most stores were opening at 9:00 am. We had to go on site in those days, so it was 8:00 am before we got there. The operating system was accessible via the console, but it was in a very sick state, as many major files had simply been wiped out.
Preparing for a Post mortem
We had to restore from a good mksysb and the store opening was approaching quickly. Before we did the restore, a colleague wanted to make a backup of the sick system if we could. I was against it because of the time it would take. He won the argument.
It took 20 minutes, and although we lost a little bit of time, the backup we took saved our skins. After recovering the live system, we restored the backup onto a spare system to find out the problem. The culprit was a cleanup script called clean_dos (see Resources below). Without that tape, which we used for the post mortem, we would never have known the cause. The system would have gone down again at 7 o'clock the next morning. All because I wanted a quick fix.
Better 20 minutes late than dead on time
How often we prefer the quick fix, which ends up being a slow and torturous path to destruction. Instead of fixing the pothole, we put up another detour sign. Well, a road made of detours leads to sysadmin oblivion, because
there's nothing so permanent as a temporary solution!
In AIX terms, fix them workarounds.
Very often these temporary fixes exist only because someone shouted "Urgent!" and you didn't add 20 minutes to their expected resolution time.
Here are some common temporary fixes which end up being set in stone. A little thought and time would prevent them making systems unstable, insecure or unmanageable:
- Breaking your security to allow easy access. For example:
  - Creating multiple users with id 0 (root authority)
  - Removing password restrictions (maxage is a common one)
  - Enabling services which your security policy says should be blocked (ftp)
  - Using local authentication when you could outsource your passwords to AD using Kerberos (see Resources)
- NFS Mounts, Samba shares for quick access to Prod data
  - How many spurious entries do you have in /etc/exports?
  - Do you have a chicken and egg configuration for NFS mounts, e.g.
    - File system A depends on B
    - B depends on C
    - C depends on A
- Temporary and obsolete entries in /etc/hosts
- VIO server local disk in a SAN / dual VIOS setup. That little bit of extra space you assigned from one VIO makes it a single point of failure.
- Ancient cron entries to fix permissions, copy unnecessary data and send emails to admins who left in 1960
- Alerting emails which end up in email clients' junk boxes - if you're not acting on it, you shouldn't receive it
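The chicken and egg NFS configuration in the list above is easy to miss by eye once there are more than a handful of mounts. Here is a minimal Python sketch of how you might spot one. It assumes you have already written down which file system depends on which as a simple dictionary; the names `find_mount_cycle` and `mounts` are made up for illustration, and nothing here reads a real /etc/filesystems or /etc/exports.

```python
from typing import Dict, List, Optional


def find_mount_cycle(depends_on: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency loop as a list of names, or None if there is none."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, fully explored
    state: Dict[str, int] = {}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        state[node] = GREY
        path.append(node)
        for dep in depends_on.get(node, []):
            if state.get(dep, WHITE) == GREY:
                # dep is already on the current path, so we have looped back
                return path[path.index(dep):] + [dep]
            if state.get(dep, WHITE) == WHITE:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        state[node] = BLACK
        return None

    for name in depends_on:
        if state.get(name, WHITE) == WHITE:
            cycle = visit(name)
            if cycle:
                return cycle
    return None


# Hypothetical data mirroring the A, B, C example above
mounts = {"A": ["B"], "B": ["C"], "C": ["A"]}
print(find_mount_cycle(mounts))  # ['A', 'B', 'C', 'A']
```

This is a plain depth-first search, so it reports the first loop it finds rather than all of them. Fix that one, run it again, and repeat until it returns None.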
Prevention is better than cure, but if I do need to do a quick workaround, I like to follow this rule:
Every temporary fix should have a permanent undo scheduled within a reasonable time frame (plus 20 minutes)
You might like to fix some of these little workarounds, tidy up the cron and clean up those old file systems. You could even write a little script to do some of these things. Just don't call it clean_dos, OK?
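As one example of such a little script, here is a rough Python sketch that looks for extra accounts with id 0, the first workaround on the list. It assumes the usual colon-separated passwd layout (name, password, uid, and so on) and works on a block of text rather than the live file, so it is safe to try anywhere. The function name `uid_zero_accounts` and the sample accounts are invented for illustration.

```python
from typing import List


def uid_zero_accounts(passwd_text: str) -> List[str]:
    """Return the names of all accounts whose UID field is 0."""
    names = []
    for line in passwd_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split(":")
        # Assumed layout: name:password:uid:gid:gecos:home:shell
        if len(fields) >= 3 and fields[2] == "0":
            names.append(fields[0])
    return names


# Made-up sample data; on a real system you would read /etc/passwd instead.
sample = """root:!:0:0::/:/usr/bin/ksh
daemon:!:1:1::/etc:
tempfix:!:0:0:added during an outage:/home/tempfix:/usr/bin/ksh
"""
extras = [name for name in uid_zero_accounts(sample) if name != "root"]
print(extras)  # ['tempfix']
```

It only reports; it deliberately changes nothing. Deciding what to do with each extra root account is a job for a human with a scheduled undo date.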
Resources
- For a brief history of that infamous script called clean_dos, see the section "Be aware of dependencies" in Cutting your cron down to size
- If you're wondering how to recover an AIX system, see 'rm -r' and your career
- Some tips about automating - The virtue of laziness
- Kerberos authentication
And you can quote me
If it looks vain to be referring you back to my previous blog posts, then it's either because:
- I want you to learn from my mistakes
- I'm following the advice of George Bernard Shaw, who said: "I often quote myself. It adds spice to my conversation."