The first symptom
AnthonyEnglish
There are no coincidences.
Whenever I hear of any sort of problem at all, I wait to see if an entirely unrelated problem is reported soon after. This is especially worthwhile for performance issues, or when something suddenly stops working.
I came across a site where a printer was down. Within half an hour a controller error was reported from a disk subsystem which wasn't used by the system whose printer had gone down. Coincidence? I was suspicious.
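The habit of watching for a second, seemingly unrelated report can be sketched in code. This is only an illustration: the event tuples, the host names and the helper functions are hypothetical, and the 30-minute window is borrowed from the half hour mentioned above, not from any monitoring tool.

```python
from datetime import datetime, timedelta


def cluster_events(events, window=timedelta(minutes=30)):
    """Group (timestamp, host, message) tuples into clusters.

    A new cluster starts whenever the gap since the previous event
    is longer than `window`.
    """
    clusters = []
    for event in sorted(events, key=lambda e: e[0]):
        if clusters and event[0] - clusters[-1][-1][0] <= window:
            clusters[-1].append(event)
        else:
            clusters.append([event])
    return clusters


def multi_host_clusters(clusters):
    """Keep only the clusters that involve more than one host."""
    return [c for c in clusters if len({host for _, host, _ in c}) > 1]


# Hypothetical incident reports (host names are made up for illustration).
reports = [
    (datetime(2024, 1, 8, 10, 0), "lpar1", "printer down"),
    (datetime(2024, 1, 8, 10, 20), "disk-subsystem-a", "controller error"),
    (datetime(2024, 1, 8, 15, 0), "lpar2", "user locked out"),
]

for cluster in multi_host_clusters(cluster_events(reports)):
    print("Possibly related:", [(host, msg) for _, host, msg in cluster])
```

A cluster that spans several hosts proves nothing by itself, but it is a prompt to go looking for something they share.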
Temporary problem: permanent solution
We soon found out that the printer was down because all printers on the virtual machine (the LPAR) were unable to print. And that was because you couldn't create any new files in /tmp, although you could still update existing files. This indicated the file system inode map was corrupted. That would require an fsck on /tmp, which would need a reboot of the LPAR.
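The symptom described here (existing files can be updated, new ones cannot be created) can be checked with a small probe. This is a minimal sketch, not the method used at the site: the function names are my own, and a failed create only tells you that something is wrong (it could equally be a full file system or a permissions problem), not that the inode map is corrupt.

```python
import os
import tempfile


def can_create_file(directory):
    """Return True if a new file can be created, then removed, in `directory`."""
    try:
        fd, path = tempfile.mkstemp(dir=directory, prefix="probe_")
    except OSError:
        return False
    os.close(fd)
    os.remove(path)
    return True


def can_update_file(path):
    """Return True if an existing file can be opened for appending.

    os.open() without O_CREAT never creates the file, so a missing
    file counts as a failure.
    """
    try:
        fd = os.open(path, os.O_WRONLY | os.O_APPEND)
    except OSError:
        return False
    os.close(fd)
    return True
```

For example, `can_create_file("/tmp")` returning False while `can_update_file()` returns True for a file that already exists in /tmp would match the behaviour we saw.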
We could fix /tmp, but that still didn't address what had corrupted the file system in the first place. Usually that indicates a link to the storage has been interrupted or the storage subsystem itself is damaged.
Sure enough, we soon found that many other AIX virtual machines and other Power or Wintel systems had storage errors at about the same time. Most of them had redundancy to the disk subsystems, but the ones that didn't were the ones that suffered some sort of data or OS impact, such as the corrupt /tmp file system.
As it turned out, a SAN switch module had rebooted itself and was the cause of all the problems, from disk subsystem errors to file system corruption to printers being down. There was some work involved in repairing the switch module and removing it as a single point of failure. And it all began with the report that a printer was down.