"Nothing has changed, but suddenly a service has stopped working." My colleague assured me that no changes had been made on AIX. I had good reason to believe him: he was a cautious, experienced sysadmin, and we were in the middle of a change freeze anyway in the lead-up to Christmas. In Australia, many people take their summer holidays at that time of year and companies run on skeleton staff, so configuration changes are kept to a minimum. Retail businesses, which may see increased sales around the end of December, have an equally good reason for a change freeze.
Something was broken, but nothing had changed. So, what were the symptoms?
1. Four AIX VMs suddenly stopped connecting to an application service
2. All four VMs tried to connect to localhost, using the same port 7090 on each VM.
3. All VMs pointed to the same Domain Name System (DNS) servers which were shared internationally.
Where to start?
Clue #1: All VMs Broke At Once
The first thing to notice is that all four VMs broke at the same time. One afternoon they were working; the next morning they were not. This fact alone indicated that it wasn't a configuration change on just one of the VMs.
Clue #2: Nothing Else Broke
The rest of the environment was working perfectly. No network issues or timeouts. Users were able to connect via their clients as long as they didn't try port 7090 via localhost. The main IP addresses for all VMs were pingable.
We checked that the application server process was running. It was. And it was still using port 7090, as this had not changed in the configuration. This showed the problem was not the service itself.
Then came the big clue: we tried a telnet to localhost on port 7090 and it failed:
telnet localhost 7090
telnet: connect: A remote host refused an attempted connect operation.
And yet if we used the VM's host name instead, the telnet worked. There was an entry in /etc/hosts for localhost and it pointed to 127.0.0.1, but what about bypassing name resolution altogether? We tried the IP address for localhost:
telnet 127.0.0.1 7090
Connected to 127.0.0.1.
Escape character is '^]'.
It worked! (How weird!)
Now a ping to localhost returned something very strange:
PING localhost.acme.com: (10.20.40.2): 56 data bytes
What was going on? localhost was resolving to some IP address on the company WAN instead of to 127.0.0.1.
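The check we did by eye can be sketched as a small shell helper. This is only an illustration under stated assumptions: the function names (ping_banner_ip, is_loopback, check_localhost_banner) are hypothetical, and the helper works on a captured ping banner line rather than running ping itself.

```shell
# Hypothetical helpers: inspect the first line of ping output and warn
# if "localhost" resolved to anything outside 127.0.0.0/8.

# Print the address found between parentheses in a ping banner line.
ping_banner_ip() {
    printf '%s\n' "$1" | sed -n 's/^PING [^(]*(\([0-9.]*\)).*/\1/p'
}

# Succeed only if the given IPv4 address is in the loopback range.
is_loopback() {
    case $1 in
        127.*) return 0 ;;
        *)     return 1 ;;
    esac
}

# Report on a banner line; returns 0 if OK, 1 if not loopback, 2 if unparsable.
check_localhost_banner() {
    ip=$(ping_banner_ip "$1")
    if [ -z "$ip" ]; then
        echo "could not find an address in the banner"
        return 2
    fi
    if is_loopback "$ip"; then
        echo "OK: localhost resolves to $ip"
    else
        echo "WARNING: localhost resolves to $ip, not a loopback address"
        return 1
    fi
}
```

Fed the banner shown above, check_localhost_banner would print the warning, because 10.20.40.2 is not a loopback address.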
From here, the solution was simple enough. We checked in the DNS and someone in another country had added an entry for a hostname "localhost". This had been propagated throughout the world and was picked up when AIX tried to connect to the app server via the name localhost.
The dodgy entry in DNS explained why all VMs broke at once. It also explained why the problem had occurred overnight. It was overnight at the customer's local site, but someone on the other side of the world had made this change during their own business hours.
What about /etc/hosts?
Now, I did point out that there was a valid entry in /etc/hosts for localhost:
127.0.0.1 loopback localhost # loopback (lo0) name/address
So why did AIX not look in /etc/hosts and ignore the spurious entry in DNS? Quite simple, really. By default, AIX looks in DNS, then NIS (not applicable in this case), then /etc/hosts, and the first source that returns an answer wins. This order can be overridden with the NSORDER environment variable or by editing /etc/netsvc.conf or /etc/irs.conf, but my colleague had kept the default setup.
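To see why the lookup order matters, here is a toy simulation in plain shell. It does not call the AIX resolver: the function names and the "name address" file standing in for DNS are assumptions made for illustration. Only the order tokens (bind, nis, local) mirror the ones NSORDER accepts.

```shell
# Toy model of ordered name resolution. Not the real AIX resolver.

# Look up a name in a hosts-format file: "address name [alias...]".
lookup_hosts() {
    awk -v n="$1" '
        { sub(/#.*/, "") }                       # drop trailing comments
        { for (i = 2; i <= NF; i++) if ($i == n) { print $1; exit } }
    ' "$2"
}

# Look up a name in a hypothetical stand-in for DNS: lines of "name address".
lookup_dns() {
    awk -v n="$1" '$1 == n { print $2; exit }' "$2"
}

# resolve_in_order ORDER NAME HOSTS_FILE DNS_FILE
# ORDER is comma-separated, e.g. "bind,nis,local" or "local,bind".
# Prints the first answer found; sources with no answer are skipped.
resolve_in_order() {
    order=$1 name=$2 hosts_file=$3 dns_file=$4
    old_ifs=$IFS
    IFS=,
    set -- $order
    IFS=$old_ifs
    for src in "$@"; do
        case $src in
            bind)  addr=$(lookup_dns "$name" "$dns_file") ;;
            local) addr=$(lookup_hosts "$name" "$hosts_file") ;;
            *)     addr= ;;                      # nis etc. not modelled
        esac
        if [ -n "$addr" ]; then
            echo "$addr"
            return 0
        fi
    done
    return 1
}
```

With a hosts file containing the loopback line above and a stand-in DNS file containing "localhost 10.20.40.2", the order bind,nis,local yields 10.20.40.2, while local,bind yields 127.0.0.1. On a real AIX system, the equivalent change for a single session would be something like export NSORDER=local,bind.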
You can find out more about name resolution on AIX with the following link: