Network problem determination: AIX tools for a system administrator

Part 2, Detailed diagnosis and troubleshooting

This article provides you with a set of commands available on IBM AIX®, many of which are also available on other flavors of UNIX®, that can help you get as much information as you can about exactly what is going on when your host has problems communicating with another. It also provides a logical step-by-step approach to diagnosing common issues.

For the purposes of this article, the target host system used in all sample commands and output is called testhost.

Tell me more

Depending on the nature of the network problem you're diagnosing, it's sometimes worth investigating whether the failing application or command has any kind of verbose, trace, or debug options. For example, both the ssh (Secure Shell) and scp (Secure Copy) commands have a verbose switch (-v) that can provide you with an extensive trace of the communication, key exchange, and authentication that takes place between client and server (see Listing 1).

Listing 1. Connecting to a remote host with a verbose ssh session
# ssh -v testhost
OpenSSH_4.2p1, OpenSSL 0.9.7d 17 Mar 2004
debug1: Reading configuration data /opt/freeware/etc/ssh_config
debug1: Connecting to testhost [10.217.1.206] port 22.
debug1: Connection established.
debug1: permanently_set_uid: 0/0
debug1: identity file /root/.ssh/identity type -1
debug1: identity file /root/.ssh/id_rsa type 1
debug1: identity file /root/.ssh/id_dsa type -1
debug1: Remote protocol version 1.99, remote software version OpenSSH_4.1
debug1: match: OpenSSH_4.1 pat OpenSSH*
debug1: Enabling compatibility mode for protocol 2.0
debug1: Local version string SSH-2.0-OpenSSH_4.2
debug1: SSH2_MSG_KEXINIT sent
debug1: SSH2_MSG_KEXINIT received
debug1: kex: server->client aes128-cbc hmac-md5 none
debug1: kex: client->server aes128-cbc hmac-md5 none
debug1: SSH2_MSG_KEX_DH_GEX_REQUEST(1024<1024<8192) sent
debug1: expecting SSH2_MSG_KEX_DH_GEX_GROUP
debug1: SSH2_MSG_KEX_DH_GEX_INIT sent
debug1: expecting SSH2_MSG_KEX_DH_GEX_REPLY
debug1: Host 'testhost' is known and matches the RSA host key.
debug1: Found key in /root/.ssh/known_hosts:14
debug1: ssh_rsa_verify: signature correct
debug1: SSH2_MSG_NEWKEYS sent
debug1: expecting SSH2_MSG_NEWKEYS
debug1: SSH2_MSG_NEWKEYS received
debug1: SSH2_MSG_SERVICE_REQUEST sent
debug1: SSH2_MSG_SERVICE_ACCEPT received
debug1: Authentications that can continue: publickey,password,keyboard-interactive
debug1: Next authentication method: publickey
debug1: Trying private key: /root/.ssh/identity
debug1: Offering public key: /root/.ssh/id_rsa
debug1: Authentications that can continue: publickey,password,keyboard-interactive
debug1: Trying private key: /root/.ssh/id_dsa
debug1: Next authentication method: keyboard-interactive
debug1: Authentications that can continue: publickey,password,keyboard-interactive
debug1: Next authentication method: password
root@testhost's password:
debug1: Authentication succeeded (password).
debug1: channel 0: new [client-session]
debug1: Entering interactive session.
Last unsuccessful login: Wed 27 Jan 13:30:23 2010 on ssh from 10.216.163.37
Last login: Wed 10 Feb 16:05:48 2010 on /dev/pts/0 from 10.216.163.37
******************************************************************************
*                                                                             *
*                                                                             *
*  Welcome to AIX Version 5.3!                                                *
*                                                                             *
*                                                                             *
*  Please see the README file in /usr/lpp/bos for information pertinent to    *
*  this release of the AIX Operating System.                                  *
*                                                                             *
*                                                                             *
******************************************************************************
#

If you have login access to the problematic host (ideally the server failing to service network requests to a particular port, although sometimes errors can be reported on the requesting client as well), you should check system logs for relevant messages. These include files such as /var/adm/messages, /var/log/syslog, and /var/log/mail, depending on how the system logging daemon is configured in /etc/syslog.conf, as well as daemon-specific logs if any exist (for example, ftpd, sshd, telnetd). Warnings, errors, and failures are often recorded in one or more of these logs, so they're a good place to look for information that might help identify the root cause.
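
To find out which files the logging daemon actually writes to, you can read the destinations out of the configuration file. The following is a minimal sketch, not a definitive tool: syslog_files is a hypothetical helper name, and it assumes the common layout in which each active line is a selector followed by a destination, with file destinations starting with a slash.

```sh
# Print the log files named as destinations in a syslog configuration
# file (default /etc/syslog.conf), one per line, without duplicates.
# Assumption: field 1 is the selector and field 2 is the destination.
syslog_files() {
    awk '
        { sub(/#.*/, "") }          # ignore comments
        $2 ~ /^\// { print $2 }     # keep only file destinations
    ' "${1:-/etc/syslog.conf}" | sort -u
}
```

Running syslog_files with no argument lists the files worth checking first; destinations such as remote log hosts are deliberately skipped.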

Some services allow for the configuration of verbose, debug, or trace-type logging so that more than the standard informational or error messages are logged. If the problem is reproducible, it's worth investigating the potential to use this type of diagnostic logging for the duration of the testing. However, it's not advisable to keep verbose logging on indefinitely, as doing so can cause disk and file system space issues.

To establish a better idea of what a process is doing, you can use truss to trace the system calls that a process makes. The truss command can either execute a specified command or attach to an existing process to produce a trace (assuming that you own the running process or have root privileges). In the case of the latter, you can stop the trace at any time by pressing Control-C.

Listing 2 shows a short extract of a basic trace of the command ssh testhost. The -l switch prefixes each trace entry with the ID of the thread (lightweight process) that made the call, while the -d switch displays a timestamp relative to the start of the trace.

Listing 2. Basic system call trace of a command
# truss -ld ssh testhost

2785347: 0.0000: execve("/usr/bin/ssh", 0x2FF22B70, 0x2FF22B7C) argc: 2
2785347: 0.0137: __loadx(0x03020000, 0x2FF22A40, 0x00000080, 0xDEADBEEF, 
                 0xDEADBEEF) = 0x00000000
2785347: 0.0141: __loadx(0x0C000000, 0xF0208964, 0xF1422004, 0xF020832C, 
                 0x00000001) = 0x00000000
2785347: 0.0143: thread_init(0x0000000000000000, 0x00000000D011A9BC) = 
2785347: 0.0146: sbrk(0x00000000)	= 0x20015B5C
2785347: 0.0148: vmgetinfo(0x2FF22958, 7, 16) = 0
2785347: 0.0151: sbrk(0x00000000)	= 0x20015B5C
2785347: 0.0153: vmgetinfo(0x2FF22410, 7, 16) = 0
2785347: 0.0156: sbrk(0x00000000)	= 0x20015B5C
2785347: 0.0158: sbrk(0x00000004)	= 0x20015B5C
2785347: 0.0160: __libc_sbrk(0x00000000)	= 0x20015B60
2785347: 0.0163: getrpid(-1, -1, 10)	= 475322
2785347: 0.0165: _getpid()	 = 475322
.
.
.
.
2785347: 35.9732: kioctl(0, 1074295912, 0x2FF22520, 0x00000000) = 0
2785347: 35.9735: getsockopt(3, 6, 1, 0x2FF22554, 0x2FF22550) = 0
2785347: 35.9737: setsockopt(3, 6, 1, 0x2FF22554, 4) = 0
2785347: 35.9739: ngetsockname(3, 0x2FF22498, 0x2FF22490) = 0
2785347: 35.9741: setsockopt(3, 0, 3, 0x2FF22560, 4) = 0
2785347: 35.9743: _select(7, 0x2004E738, 0x2004FA48, 0x00000000, 0x00000000) = 1
2785347: kwrite(3, " H14 l95121D i86 H Q o10".., 384) = 384
2785347: 35.9748: _select(7, 0x2004E738, 0x2004FA48, 0x00000000, 0x00000000) = 1
2785347: kread(3, " t x "0699841A E a S y\n".., 8192) = 768
2785347: 35.9837: _select(7, 0x2004E738, 0x2004FA48, 0x00000000, 0x00000000) = 2

Last unsuccessful login: Sat 13 Feb 22:16:28 2010 on ssh from myhost.testdomain.com
Last login: Sat 13 Feb 22:16:56 2010 on /dev/pts/4 from myhost.testdomain.com
*******************************************************************************
* *
* *
* Welcome to AIX Version 5.1! *
* *
* *
* Please see the README file in /usr/lpp/bos for information pertinent to *
* this release of the AIX Operating System. *
* *
* *
*******************************************************************************
2785347: kwrite(5, " x ".., 567) = 567
2785347: 35.9849: _select(7, 0x2004E738, 0x2004FA48, 0x00000000, 0x00000000) = 1
2785347: kread(3, " x d o e x10 # 0 A1C c17".., 8192) = 48
2785347: 36.1103: _select(7, 0x2004E738, 0x2004FA48, 0x00000000, 0x00000000) = 1
2785347: kwrite(5, " t e s t h o s t : r o o".., 17) = 17
testhost:root> 
2785347: 219.4781: _select(7, 0x2004E738, 0x2004FA48, 0x00000000, 0x00000000) 
                   (sleeping...)
2785347: 219.4781: _select(7, 0x2004E738, 0x2004FA48, 0x00000000, 0x00000000) = 1
2785347: kread(4, "04 n a80 n V\f a\0\0\010".., 16384) = 1
2785347: 220.1322: _select(7, 0x2004E738, 0x2004FA48, 0x00000000, 0x00000000) = 1
2785347: kwrite(3, " O8D d r 013 g1982 o\n i".., 48) = 48
2785347: 220.1327: _select(7, 0x2004E738, 0x2004FA48, 0x00000000, 0x00000000) = 1
2785347: kread(3, " p h1A 1 a I J E031D9D1C".., 8192) = 80
2785347: 220.1347: _select(7, 0x2004E738, 0x2004FA48, 0x00000000, 0x00000000) = 1

2785347: kwrite(5, "\r\n", 2)	 = 2
2785347: 220.1352: close(5)	 = 0
2785347: 220.1354: _select(7, 0x2004E738, 0x2004FA48, 0x00000000, 0x00000000) = 1
2785347: kread(3, " O a *901C ^81 . B e83 R".., 8192) = 96
2785347: 220.1358: close(4)	 = 0
2785347: 220.1360: kioctl(0, 22528, 0x00000000, 0x00000000) = 0
2785347: 220.1362: kioctl(0, 21507, 0x200151F8, 0x00000000) = 0
2785347: 220.1365: close(6)	 = 0
2785347: 220.1367: _select(7, 0x2004E738, 0x2004FA48, 0x00000000, 0x00000000) = 1
2785347: kwrite(3, "1A | B O # E c v9D e93 >".., 32) = 32
2785347: 220.1372: sigprocmask(2, 0xF1423790, 0x2FF22630) = 0
2785347: 220.1374: _sigaction(28, 0x00000000, 0x2FF226E0) = 0
2785347: 220.1375: thread_setmymask_fast(0x00000000, 0x00000000, 0x00000000, 0x102A8043, 
                   0x00000000, 0x00000135, 0x00000000, 0x00000000) = 0x00000000
2785347: 220.1377: sigprocmask(2, 0xF1423790, 0x2FF22630) = 0
2785347: 220.1379: _sigaction(28, 0x2FF226D0, 0x00000000) = 0
2785347: 220.1381: thread_setmymask_fast(0x00000000, 0x00000000, 0x00000000, 0x102A8043, 
                   0x00000000, 0x0000017C, 0x00000000, 0x00000000) = 0x00000000
2785347: 220.1383: kioctl(0, 22528, 0x00000000, 0x00000000) = 0
2785347: 220.1385: kioctl(1, 22528, 0x00000000, 0x00000000) Err#25 ENOTTY
2785347: 220.1387: kfcntl(1, F_GETFL, 0x00000000) = 67108869
2785347: 220.1389: kioctl(1, -2147195266, 0x2FF22640, 0x00000000) = 0
2785347: 220.1391: kioctl(1, -2147195267, 0x2FF22640, 0x00000000) = 0
2785347: 220.1393: kfcntl(1, F_SETFL, 0x04000001) = 0
2785347: 220.1395: kioctl(2, 22528, 0x00000000, 0x00000000) Err#25 ENOTTY
2785347: 220.1397: kfcntl(2, F_GETFL, 0x00000000) = 67108865
Connection to testhost closed.
2785347: kwrite(2, " C o n n e c t i o n t".., 32) = 32
2785347: 220.1402: shutdown(3, 2)	= 0
2785347: 220.1404: close(3)	 = 0
2785347: 220.1406: kfcntl(1, F_GETFL, 0x102A8043) = 67108865
2785347: 220.1408: kfcntl(2, F_GETFL, 0x102A8043) = 67108865
2785347: 220.1410: _exit(0)
#

Listing 3 shows a more verbose trace of a running process (process ID 976), written to an output file, along with a short extract of that file. The following switches were used:

Table 1. Switches used for more verbose trace of a running process
Switch    Description
-o        Specifies that the trace is written to an output file (/var/tmp/truss.out)
-a        Displays any parameters passed to a system call
-e        Displays any environment strings passed to a system call
-f        Follows all children created by the fork system call and includes their signals, faults, and system calls in the trace output
-l        Lists the ID of the thread (lightweight process) that made the call
-D        Displays a time delta representing the elapsed time since the previous event
-r all    Shows the contents of the I/O buffer for Read calls
-w all    Shows the contents of the I/O buffer for Write calls
-x all    Displays data from specified parameters of system calls in hex format
-p        Specifies the running process you want to trace (976, in the example shown)

Listing 3. More verbose trace of a running process to a file
# truss -o /var/tmp/truss.out -aeflD -r all -w all -x all -p 976
^C

# cat /var/tmp/truss.out
1003752:   psargs: sshd: testuser@pts/1 AIA,, ts/1
1003752:   1798193: 0.0000:        _select(0x00000008, 0x2002D788, 0x2002D798, 
                                   0x00000000, 0x00000000) (sleeping...)
1003752:   1798193: 0.0000:        _select(0x00000008, 0x2002D788, 0x2002D798, 
                                   0x00000000, 0x00000000) = 0x00000001
1003752:   1798193: 0.7196:        sigprocmask(0x00000000, 0x2FF22598, 0x2FF225A0) 
                                   = 0x00000000
1003752:   1798193: 0.0002:        sigprocmask(0x00000002, 0x2FF225A0, 0x00000000) 
                                   = 0x00000000
1003752:   1798193: kread(0x00000003, 0x2FF1E590, 0x00004000) = 0x00000034
1003752:    ? 2 q q A> Ao A' A,8E ) A, Au A?9D 8 A,87 {90 A1 A^ l p 0 !02 A— A% A4 A!9C\n
1003752:    A| | &8E A!9F G A2 )1C M1E A^ AZ / AE p AI Az A
1003752:   1798193: 0.0003:        _select(0x00000008, 0x2002D788, 0x2002D798, 
                                   0x00000000, 0x00000000) = 0x00000001
1003752:   1798193: 0.0003:        sigprocmask(0x00000000, 0x2FF22598, 0x2FF225A0) 
                                   = 0x00000000
1003752:   1798193: 0.0002:        sigprocmask(0x00000002, 0x2FF225A0, 0x00000000) 
                                   = 0x00000000
1003752:   1798193: kwrite(0x00000006, 0x200A6F98, 0x00000001) = 0x00000001
1003752:	   p
1003752:   1798193: 0.0003:        kioctl(0x00000006, 0x00005800, 0x00000000, 
                                   0x00000000) = 0x00000000
1003752:   1798193: 0.0002:        kioctl(0x00000006, 0x00005401, 0x2FF224C0, 
                                   0x00000000) = 0x00000000
1003752:   1798193: 0.0003:        _select(0x00000008, 0x2002D788, 0x2002D798, 
                                   0x00000000, 0x00000000) 
                                   = 0x00000001
1003752:   1798193: 0.1359:        sigprocmask(0x00000000, 0x2FF22598, 
                                   0x2FF225A0) = 0x00000000
1003752:   1798193: 0.0002:        sigprocmask(0x00000002, 0x2FF225A0,
                                   0x00000000) = 0x00000000
1003752:   1798193: kread(0x00000007, 0x2FF1E4F0, 0x00004000) = 0x00000013
1003752:      t e s t h o s t : t e s t u s e r >  
.
.
.
.
1003752:   1798193: kread(0x00000007, 0x2FF1E4F0, 0x00004000) = 0x0000003F
1003752:	                                                                  
1003752:	                                               M o z i l l a\r\n
1003752:   1798193: 0.0003:        _select(0x00000008, 0x2002D788, 0x2002D798, 
                                   0x00000000, 0x00000000) = 0x00000002
1003752:   1798193: 0.0003:        sigprocmask(0x00000000, 0x2FF22598, 
                                   0x2FF225A0) = 0x00000000
1003752:   1798193: 0.0002:        sigprocmask(0x00000002, 0x2FF225A0, 
                                   0x00000000) = 0x00000000
1003752:   1798193: kread(0x00000007, 0x2FF1E4F0, 0x00004000) = 0x00000051
1003752:	                                                                  
1003752:	                                                                  
1003752:	       t r u s s . t x t\r\n f p d g
1003752:   1798193: 2.0001:        _select(0x00000008, 0x2002D788, 0x2002D798, 
                                   0x00000000, 0x00000000) (sleeping...)
^C
#

You can use the system call trace to identify errors that may point to the root cause of your problem. Look for calls marked with Err# followed by an error number and name (for example, Err#25 ENOTTY in Listing 2), which indicate that the call failed; you can look up the error number in /usr/include/sys/errno.h. You can also use the trace to identify potential performance delays by looking for long deltas between calls when using the -D switch. For example, the elapsed time between the last two events in the sample trace output in Listing 3 was 2.0001 seconds.
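
Both checks are easy to script against a saved trace file. The sketch below is illustrative only: truss_errors and truss_slow_calls are hypothetical helper names, and the second function assumes the trace was taken with -D, so that each entry carries a delta of the form N.NNNN followed by a colon, as in Listing 3.

```sh
# Print trace entries for system calls that failed (marked Err#<number>).
truss_errors() {
    grep 'Err#[0-9]' "$1"
}

# Print trace entries whose time delta is at least $2 seconds.
# The first field that looks like "digits.digits:" is taken as the delta.
truss_slow_calls() {
    awk -v min="$2" '{
        for (i = 1; i <= NF; i++)
            if ($i ~ /^[0-9]+\.[0-9]+:$/) {
                if ($i + 0 >= min + 0) print
                break
            }
    }' "$1"
}
```

For example, truss_slow_calls /var/tmp/truss.out 1 would list only the entries that followed a pause of a second or more, which is where a delay is most likely to be hiding.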

Houston, we have a problem!

Now that you have a useful toolkit of network diagnostic aids, it's time to look at a logical, step-by-step approach to troubleshooting common problems. The following section lists a number of common AIX network-related issues and provides a guide for diagnosis and what to look for in each one.

Host unknown

If a host name used on a command line or by an application isn't recognized, remember that a host can be referenced by name only if one of the configured name resolution services can resolve that name to an IP address. Check the order in which those services are consulted by looking at the hosts record in /etc/irs.conf and /etc/netsvc.conf.

If local is specified in the hosts record, look for the host name in the /etc/hosts file. If you look at the example in Listing 4, you can see a simple grep of the host testhost from this file returning a successful match. Your host name must appear in any of the fields after the first field (the IP address). In the example shown, the server is also known by two aliases: testhost.testdomain.com and aixserver. This means that you can refer to this particular host by any of those three names when it comes to using commands that require a host name argument.

Listing 4. Looking for a host in /etc/hosts
# grep testhost /etc/hosts
10.217.1.206    testhost testhost.testdomain.com aixserver
#
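
A plain grep also matches comment lines and longer names that merely contain the string you searched for. If you need an exact match, a stricter check can be sketched as follows; host_in_hosts_file is a hypothetical helper name, and the sketch assumes the usual layout of an IP address followed by one or more names.

```sh
# Succeed if name $1 appears as a host name or alias (any field after
# the IP address) on a non-comment line of hosts file $2
# (default /etc/hosts).
host_in_hosts_file() {
    awk -v name="$1" '
        { sub(/#.*/, "") }      # drop comments
        { for (i = 2; i <= NF; i++) if ($i == name) found = 1 }
        END { exit(found ? 0 : 1) }
    ' "${2:-/etc/hosts}"
}
```

Because the function reports its result through the exit status, it can be used directly in a condition, for example: host_in_hosts_file aixserver && echo "aixserver resolves locally".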

If bind or dns is specified in the hosts record, use nslookup to ensure that the host name resolves through DNS. If you look at the example in Listing 5, you can see that resolution has been successful and the DNS server testdns.testdomain.com (shown with its IP address) has returned a known IP address for the host testhost of 10.217.1.206.

Listing 5. Resolving a host name via DNS
# nslookup testhost
Server:  testdns.testdomain.com
Address:  158.177.79.90
 
Name:    testhost.testdomain.com
Address:  10.217.1.206
#

Any additional name resolution services specified in your configuration files are outside the scope of this document and won't be discussed here.

Unresponsive host

If a host is known but you find that users are complaining that the host itself or an application running on it isn't responding, use ping and look for 0% packet loss (see Listing 6). Anything else means that there could be a problem with the target host or the network.

Listing 6. Pinging a responsive host
# ping testhost
PING testhost: (10.217.1.206): 56 data bytes
64 bytes from 10.217.1.206: icmp_seq=0 ttl=253 time=0 ms
64 bytes from 10.217.1.206: icmp_seq=1 ttl=253 time=0 ms
64 bytes from 10.217.1.206: icmp_seq=2 ttl=253 time=0 ms
64 bytes from 10.217.1.206: icmp_seq=3 ttl=253 time=0 ms
 
----testhost PING Statistics----
4 packets transmitted, 4 packets received, 0% packet loss
round-trip min/avg/max = 0/0/0 ms
#

Also, look for long response times in the time= field or a spike in the values reported. Both of these can indicate poor network or host performance, which may be causing the application that your users are complaining about to time out.
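
If you want to run this check from a script, the loss figure can be pulled out of the summary line. This is a minimal sketch that assumes the summary format shown in Listing 6; ping_loss_percent is a hypothetical helper name.

```sh
# Read ping output on standard input and print the packet-loss
# percentage as a bare number (nothing is printed if no summary
# line in the expected format is found).
ping_loss_percent() {
    sed -n 's/.* \([0-9][0-9]*\)% packet loss.*/\1/p'
}

# Example use (-c 4 sends four echo requests and then stops):
#   loss=$(ping -c 4 testhost | ping_loss_percent)
#   [ "$loss" = "0" ] || echo "testhost: packet loss detected"
```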

Ensure that there's a route out to the target host by using route get (see Listing 7), verifying with the network administrator that this is the correct gateway to use.

Listing 7. Getting routing table information for a host
# route get testhost

      route to: testhost
   destination: 10.203.35.128
          mask: 255.255.255.128
       gateway: 10.203.35.1
     interface: en2
   interf addr: myhost
         flags: <UP,GATEWAY,DONE,PRCLONING>
 recvpipe  sendpipe  ssthresh  rtt,msec    rttvar  hopcount      mtu     expire
        0         0         0         0         0         0        0   -9751026
#

Use ifconfig (see Listing 8) to make sure that the interface reported by route get is configured and flagged as UP and RUNNING.

Listing 8. Displaying network interface status
# ifconfig en1
en1: flags=7e080863,40<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,
	CHECKSUM_OFFLOAD,CHECKSUM_SUPPORT,PSEG>
        inet 10.216.163.37 netmask 0xffffff00 broadcast 10.216.163.255
         tcp_sendspace 131072 tcp_recvspace 65536

# ifconfig -a
en2: flags=7e080863,40<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,
  	CHECKSUM_OFFLOAD,CHECKSUM_SUPPORT,PSEG>
        inet 10.203.35.14 netmask 0xffffff80 broadcast 10.203.35.127
en1: flags=7e080863,40<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,
	CHECKSUM_OFFLOAD,CHECKSUM_SUPPORT,PSEG>
        inet 10.216.163.37 netmask 0xffffff00 broadcast 10.216.163.255
         tcp_sendspace 131072 tcp_recvspace 65536
en0: flags=7e080822,10<BROADCAST,NOTRAILERS,SIMPLEX,MULTICAST,GROUPRT,64BIT,
	CHECKSUM_OFFLOAD,CHECKSUM_SUPPORT,PSEG>
lo0: flags=e08084b<UP,BROADCAST,LOOPBACK,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT>
        inet 127.0.0.1 netmask 0xff000000 broadcast 127.255.255.255
        inet6 ::1/0
         tcp_sendspace 65536 tcp_recvspace 65536
#

Use ping to check that the gateway reported by route get is contactable from your host. If it isn't, there may be a problem with the physical connection from the network adapter on your host to the gateway (for example, a faulty switch port, cable, or network card).
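
The gateway address can be taken straight from the route get output rather than copied by hand. The following sketch assumes the field layout shown in Listing 7; route_gateway is a hypothetical helper name.

```sh
# Read "route get" output on standard input and print the gateway field.
route_gateway() {
    awk '$1 == "gateway:" { print $2 }'
}

# Example use:
#   gw=$(route get testhost | route_gateway)
#   [ -n "$gw" ] && ping -c 3 "$gw"
```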

Use traceroute (see Listing 9) to trace the route to the target host. A route to a host doesn't necessarily mean that there is a route back, so check with the network administrator to ensure that both exist and that the application traffic being sent or received isn't being blocked by any firewalls. If any of the hops on the route returns no data (marked by asterisks [*]), it could indicate a problem with the routing, although it might simply be that a firewall is blocking the packets that traceroute uses. The output from this command should help the network administrator determine whether there's a real routing issue.

Listing 9. Tracing a successful route to a host
# traceroute testhost
trying to get source for testhost
source should be 10.216.163.37
traceroute to testhost (10.217.1.206) from 10.216.163.37 (10.216.163.37), 30 hops max
outgoing MTU = 1500
 1  10.216.163.2 (10.216.163.2)  1 ms  0 ms  0 ms
 2  10.217.189.6 (10.217.189.6)  0 ms  0 ms  0 ms
 3  testhost (10.217.1.206)  1 ms  1 ms  1 ms
#

Unresponsive TCP port

If a host is known and responding to ping but a particular TCP port used by an application or remote service doesn't appear to be, use telnet to try to connect to that port on the target host. The example in Listing 10 attempts to connect to port 25 on host testhost.

Listing 10. Testing port 25 (SMTP) on a host (successful)
# telnet testhost 25
Trying...
Connected to testhost.
Escape character is '^]'.
220 testhost.testdomain.com ESMTP Sendmail Wed, 10 Feb 2010 15:52:28 GMT
^]
telnet> quit
Connection closed.
#

Common ports are listed in /etc/services. A successful connection should result in the message Escape character is '^]' and optionally a message from the remote service, such as the mail server, shown in Listing 10. If no such messages are received and the connection times out or is refused, then check with the network administrator that there are no firewalls en route blocking the type of traffic being sent. Also, check with the systems administrator of the target host that the application server or remote service is running and listening on the specified port and that firewalls running on the host are not blocking traffic.
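
When you only know the port number, the service name can be looked up in the same file. This is a minimal sketch assuming the standard services layout of a name followed by port/protocol; service_for_port is a hypothetical helper name.

```sh
# Print the service name(s) registered for port $1 and protocol $2
# (default tcp) in services file $3 (default /etc/services).
service_for_port() {
    awk -v want="$1/${2:-tcp}" '
        { sub(/#.*/, "") }          # drop comments
        $2 == want { print $1 }
    ' "${3:-/etc/services}"
}
```

For example, service_for_port 25 would normally print smtp, which tells you which daemon to ask the administrator of the target host about.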

Not connecting to a responsive TCP port

If a host is known, responds to ping, and responds on a particular TCP port to other hosts but not to yours, use telnet to try to connect to that port, following the steps in Unresponsive TCP port.

Use netstat to look for connections to the host and their state using the second example shown in Listing 11, which looks for all connections to a particular IP address and port (port 22 at 10.217.1.206). Unless the connection state is shown as ESTABLISHED, the connection is either still being made or has been terminated. For example, a status of SYN_SENT indicates that a three-way handshake has been initiated by your host, but as yet no acknowledgement has been received from the target host. This could mean that there's a route to the target but no route back for this type of traffic. In this situation, ask the network administrator whether any firewalls on the route back are blocking this type of traffic.

Listing 11. Displaying the status of connections to hosts
# netstat -an | grep 10.217.1.206
tcp4       0      0  10.203.35.14.22        10.217.1.206.1023      ESTABLISHED
tcp4       0      0  10.203.35.14.46183     10.217.1.206.22        ESTABLISHED

# netstat -an | grep 10.217.1.206.22
tcp4       0      0  10.203.35.14.46183     10.217.1.206.22        ESTABLISHED

# netstat -an | grep ESTABLISHED
tcp4       0      0  10.203.35.14.22        10.217.1.206.1023      ESTABLISHED
tcp4       0      0  10.203.35.14.46183     10.217.1.206.22        ESTABLISHED
tcp4       0      0  10.216.163.37.1521     10.216.163.37.44122    ESTABLISHED
tcp4       0      0  10.216.163.37.44122    10.216.163.37.1521     ESTABLISHED
tcp4       0      0  127.0.0.1.199          127.0.0.1.32769        ESTABLISHED
tcp4       0      0  127.0.0.1.32769        127.0.0.1.199          ESTABLISHED
tcp4       0      0  10.203.35.14.46183     10.203.35.170.22       ESTABLISHED
tcp4       0      0  10.216.163.37.32770    10.216.163.37.32771    ESTABLISHED
#
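
A grep on the address and port also matches longer port numbers that start with the same digits, and it treats each dot as a wildcard. If that matters, the comparison can be made exact. The sketch below assumes the netstat -an column layout shown in Listing 11; conn_states is a hypothetical helper name.

```sh
# Read "netstat -an" output on standard input. For every TCP connection
# whose foreign address is exactly $1 (address.port form), print the
# local address and the connection state.
conn_states() {
    awk -v remote="$1" '$1 ~ /^tcp/ && $5 == remote { print $4, $6 }'
}

# Example use:
#   netstat -an | conn_states 10.217.1.206.22
```

Any state other than ESTABLISHED in the second column is the cue to continue with the checks described above.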

Use tcpdump to display packets sent to and received from the host on the specified port using the example shown in Listing 12. If only packets sent by your host are shown, this is another indication that the problem is with traffic sent back by the target and therefore the route back.

Listing 12. Display packets destined for or sent by a specific host on a specific port
# tcpdump -i en2 host testhost port 22
12:15:38.033833162 myhost.47216 > testhost.22: . ack 610148954 win 17520 (DF) [tos 0x10]
12:15:38.113807903 myhost.47216 > testhost.22: P 145:193(48) ack 192 win 17520 (DF) 
					       [tos 0x10]
12:15:38.114291921 testhost.22 > myhost.47216: P 192:240(48) ack 193 win 24820 (DF) 
					       [tos 0x10]
12:15:38.241718122 myhost.47216 > testhost.22: P 193:241(48) ack 240 win 17520 (DF) 
					       [tos 0x10]
12:15:38.242344703 testhost.22 > myhost.47216: P 240:288(48) ack 241 win 24820 (DF) 
					       [tos 0x10]
12:15:38.243844593 myhost.47216 > testhost.22: . ack 288 win 17520 (DF) [tos 0x10]
12:15:38.497817604 myhost.47216 > testhost.22: P 241:289(48) ack 288 win 17520 (DF) 
					       [tos 0x10]
12:15:38.503088328 testhost.22 > myhost.47216: P 288:336(48) ack 289 win 24820 (DF)
					       [tos 0x10]
12:15:38.503154802 testhost.22 > myhost.47216: P 336:432(96) ack 289 win 24820 (DF)
					       [tos 0x10]
^C
145 packets received by filter
0 packets dropped by kernel
#
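
On a busy connection it can be hard to see at a glance whether anything is coming back. One way to summarize a capture is to count packets by direction. This is a rough sketch that assumes the text output format shown in Listing 12, with the source in the second field; count_directions is a hypothetical helper name.

```sh
# Read tcpdump text output on standard input and count packets sent by
# local host name $1 versus packets from any other source.
count_directions() {
    awk -v me="$1" '
        $3 == ">" {
            if ($2 == me || index($2, me ".") == 1) sent++
            else received++
        }
        END { printf "sent=%d received=%d\n", sent, received }
    '
}
```

A result with a non-zero sent count and received=0 matches the situation described above, in which only packets sent by your host are seen.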

Long login times

If users are complaining that response times logging in to a particular host are slow, log in to the host and use dig to perform a reverse lookup of the IP address of the user's computer using the example shown in Listing 13.

Listing 13. Reverse lookup of an IP address in DNS
# dig -x 10.217.1.206
; <<>> DiG 9.2.0 <<>> -x 10.217.1.206

;; global options:  printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 21351
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;206.1.217.10.in-addr.arpa.    IN      PTR

;; ANSWER SECTION:
206.1.217.10.in-addr.arpa. 3600 IN     PTR     testhost.testdomain.com.

;; Query time: 11 msec
;; SERVER: 10.217.1.206#53(10.217.1.206)
;; WHEN: Fri Feb 12 13:28:16 2010
;; MSG SIZE  rcvd: 82
#

During login, a host may perform a reverse lookup on the source IP address of the packets it receives. If no matching record exists, it can take some time, depending on the configuration of that host, for the lookup to fail and the login to continue. The user sees this as a long login delay.

Look in the ANSWER SECTION of the output that dig returns, specifically at the PTR record, which points to the host name. If none is returned, this could explain the delay.
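
The same check can be scripted by filtering the dig output for PTR answers. This sketch assumes the record layout shown in Listing 13 (name, TTL, class, type, target); ptr_names is a hypothetical helper name.

```sh
# Read dig output on standard input and print the target of each PTR
# record in the answer. Lines starting with ";" (comments and the
# question section) are ignored. No output means no PTR was returned.
ptr_names() {
    awk '$1 !~ /^;/ && $3 == "IN" && $4 == "PTR" { print $5 }'
}

# Example use:
#   dig -x 10.217.1.206 | ptr_names
```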

MAC address query

If an application requires a MAC address (for example, an ACL or firewall rule is based on one, or a configuration requires one, as with the KickStart and JumpStart installation services on Linux® and Sun® Solaris®), use entstat to find the MAC address of a local interface, as shown in Listing 14. Look for the Hardware Address field, which is the MAC address.

Listing 14. Displaying Ethernet statistics for a network adapter
# entstat -d en2
-------------------------------------------------------------
ETHERNET STATISTICS (en2) :
Device Type: 10/100/1000 Base-TX PCI-X Adapter (14106902) 
Hardware Address: 00:02:55:d3:37:be 
Elapsed Time: 114 days 22 hours 48 minutes 20 seconds

Transmit Statistics:           Receive Statistics:
--------------------           -------------------
Packets: 490645639             Packets: 3225432063
Bytes: 9251643184881           Bytes: 215598601362
Interrupts: 0                  Interrupts: 3144149248
Transmit Errors: 0             Receive Errors: 0
Packets Dropped: 0             Packets Dropped: 0
                               Bad Packets: 0

Max Packets on S/W Transmit Queue: 109 
S/W Transmit Queue Overflow: 0
Current S/W+H/W Transmit Queue Length: 0

Broadcast Packets: 442         Broadcast Packets: 10394992
Multicast Packets: 0           Multicast Packets: 349
No Carrier Sense: 0            CRC Errors: 0
DMA Underrun: 0                DMA Overrun: 0
Lost CTS Errors: 0             Alignment Errors: 0
Max Collision Errors: 0        No Resource Errors: 0
Late Collision Errors: 0       Receive Collision Errors: 0
Deferred: 0                    Packet Too Short Errors: 0
SQE Test: 0                    Packet Too Long Errors: 0
Timeout Errors: 0              Packets Discarded by Adapter: 0
Single Collision Count: 0      Receiver Start Count: 0
Multiple Collision Count: 0
Current HW Transmit Queue Length: 0

General Statistics:
-------------------
No mbuf Errors: 0
Adapter Reset Count: 0
Adapter Data Rate: 200
Driver Flags: Up Broadcast Running 
Simplex 64BitSupport ChecksumOffload 
PrivateSegment DataRateSet

10/100/1000 Base-TX PCI-X Adapter (14106902) Specific Statistics:
--------------------------------------------------------------------
Link Status: Up
Media Speed Selected: 100 Mbps Full Duplex 
Media Speed Running: 100 Mbps Full Duplex 
PCI Mode: PCI-X (100-133) 
PCI Bus Width: 64-bit
Jumbo Frames: Disabled
TCP Segmentation Offload: Enabled
TCP Segmentation Offload Packets Transmitted: 260772859
TCP Segmentation Offload Packet Errors: 0 
Transmit and Receive Flow Control Status: Disabled 
Transmit and Receive Flow Control Threshold (High): 32768 
Transmit and Receive Flow Control Threshold (Low): 24576 
Transmit and Receive Storage Allocation (TX/RX): 16/48
#

Use arp to find out the MAC address of a remote host (see Listing 15), assuming that the host in question is known to your host and therefore has an entry in the cache.

Listing 15. Displaying a host entry in the arp table
# arp testhost
testhost (10.217.1.206) at 0:c:29:44:90:28 [ethernet] stored in bucket 0
#

Use ping to force an entry for the remote host into the arp cache if one doesn't exist.
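
Note that the two commands print addresses differently: entstat in Listing 14 pads each octet to two digits, whereas arp in Listing 15 drops leading zeros. If you need to compare addresses or paste them into a configuration that expects the padded form, a small normalizing function helps. This is a sketch only; normalize_mac is a hypothetical helper name.

```sh
# Print a colon-separated MAC address with every octet zero-padded to
# two lower-case hexadecimal digits.
normalize_mac() {
    printf '%s\n' "$1" | awk -F: '{
        out = ""
        for (i = 1; i <= NF; i++) {
            octet = tolower($i)
            if (length(octet) == 1) octet = "0" octet
            out = out (i > 1 ? ":" : "") octet
        }
        print out
    }'
}
```

For example, normalize_mac 0:c:29:44:90:28 prints 00:0c:29:44:90:28.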

Are packets being sent?

If a host is not responding in some way (either completely or on a particular port) and you need to verify that your host is sending out packets, reproduce the problem with the local application or command. Then, use tcpdump to display packets sent to the host using the example shown in Listing 16.

Listing 16. Display packets destined for a specific host
# tcpdump -i en2 dst host testhost
tcpdump: listening on en2
10:08:24.912057892 myhost.46183 > testhost.22: P 1299060979:1299061027(48) 
					       ack 3373421618 win 17520 (DF) [tos 0x10]
10:08:25.009291439 myhost.46183 > testhost.22: P 1:49(48) ack 48 win 17520 (DF) 
					       [tos 0x10]
10:08:25.093832676 myhost.46183 > testhost.22: . ack 96 win 17520 (DF) 
					       [tos 0x10]
10:08:25.249319253 myhost.46183 > testhost.22: P 1299061075:1299061123(48) ack 3373421714 
					       win 17520 (DF) [tos 0x10]
^C
53 packets received by filter
0 packets dropped by kernel
#

If no packets are seen leaving your host, the problem lies either with the sending application or with the interface or routing (which you can diagnose using the previous diagnostic steps in this section).
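
When the problem affects only one port, the capture can be narrowed so that unrelated traffic does not hide the packets of interest. This is a minimal sketch, assuming the standard tcpdump filter primitives dst host and dst port; the helper name dst_filter is hypothetical.

```shell
# Build a tcpdump filter expression for packets going to one host and port.
# $1 = destination host, $2 = destination port
dst_filter() {
    printf 'dst host %s and dst port %s\n' "$1" "$2"
}
```

With this helper, tcpdump -i en2 $(dst_filter testhost 22) would show only the packets sent to the ssh port on testhost.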

Are packets being received?

If a host is not responding in some way (either completely or on a particular port) and you need to verify that packets are not being received from that host, reproduce the problem with the local application or command. Then, use tcpdump to display packets received from the host using the example shown in Listing 17.

Listing 17. Display packets sent by a specific host
# tcpdump -i en2 src host testhost
tcpdump: listening on en2
10:10:38.505848354 testhost.22 > myhost.46183: . ack 130 win 24820 (DF) [tos 0x10]
10:10:38.505916972 testhost.22 > myhost.46183: F 529:529(0) ack 225 win 24820 (DF) 
					       [tos 0x10]
10:10:43.855153846 testhost > myhost: icmp: echo reply
10:10:44.855224394 testhost > myhost: icmp: echo reply
^C
102 packets received by filter
0 packets dropped by kernel
#

If you have verified that your host is sending packets (see Are packets being sent?) but no packets are being received, then either the host is not responding, the service on the host is not responding, or there is no working route back (either no route exists, or a firewall along the route is blocking the traffic).
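
The two checks can also be combined by capturing traffic in both directions and counting the packets each way. The function below is a sketch only: the name summarize_capture is hypothetical, and it assumes tcpdump text output in the format of Listings 16 and 17, where the second field is the source and the fourth field is the destination followed by a colon.

```shell
# Count packets to and from a remote host in saved tcpdump text output.
# $1 = remote host name exactly as tcpdump prints it (for example, testhost)
# Wrapped continuation lines are skipped because their third field is not ">".
summarize_capture() {
    awk -v h="$1" '
        # True when addr is the host itself or the host followed by ".port"
        function is_host(addr) { return addr == h || index(addr, h ".") == 1 }
        $3 == ">" {
            dst = $4
            sub(/:$/, "", dst)            # drop the trailing colon
            if (is_host(dst)) sent++
            else if (is_host($2)) received++
        }
        END { printf "sent=%d received=%d\n", sent, received }
    '
}
```

A capture taken with tcpdump -i en2 host testhost > /tmp/capture.txt (the file name is an example) could then be summarized with summarize_capture testhost < /tmp/capture.txt. A result such as sent=53 received=0 would point to one of the causes listed above.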

Connection made but application or command fails

If users are complaining that an application or command appears to establish a successful connection but fails afterwards, check whether the command has a debug, trace, or verbose option, and rerun it to see whether the additional output points to a potential root cause. For example, both ssh and scp have a verbose switch (-v) that can provide details of the protocol exchange between client and server as follows:

  1. The connection is established to the remote host on TCP port 22.
  2. Local private key files are identified.
  3. Protocol versions are exchanged and agreed upon.
  4. A remote host key is identified and matched to the entry in the local known_hosts file for that host.
  5. Key authentication is tried for each private key type found locally.
  6. When these fail, the user is finally prompted for a password as authentication.
  7. The user successfully enters it, login is successful, and a shell prompt is presented.

The verbose option here can help identify whether any of those steps fails and the probable cause.
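
Because verbose output can be long, it may help to reduce it to the lines that mark the steps listed above. The filter below is a sketch: the function name ssh_stages is hypothetical, and the patterns are based on OpenSSH debug messages, whose exact wording can differ between versions.

```shell
# Keep only the ssh -v debug lines that mark the main connection stages.
# Assumption: message wording as printed by OpenSSH; adjust the patterns
# if your version words them differently.
ssh_stages() {
    grep -E 'Connecting to|Connection established|identity file|Remote protocol version|is known and matches|Next authentication method'
}
```

Since ssh writes its debug messages to standard error, a typical use would be ssh -v testhost 2>&1 | ssh_stages. The last stage printed before the failure shows where to look next.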

Use truss to trace the command or the process running the remote service (see Tracing a problem application or daemon).

Tracing a problem application or daemon

If the previous steps fail to uncover the exact root cause of the problem and you need to diagnose connectivity issues further, use truss to run a verbose system call trace of the command.

Also, use truss to run a verbose system call trace of the daemon that runs the remote service to which you are attempting to connect. You may need to ask the systems administrator of the remote host for help if you don't have access to that host, or cannot run commands as root or as the user that owns the process.

When a system call returns with an error, truss shows the non-zero return code marked with Err#, followed by the error number and the symbolic error code (for example, ENOSPC). Standard error codes can be found in /usr/include/sys/errno.h and can help indicate the cause of the error. For example, a system call returning with Err#2 ENOENT (No such file or directory) would indicate that the command is expecting to find a file or directory but can't and subsequently fails. A system call returning with Err#28 ENOSPC (No space left on device) would indicate that a disk or file system is full, potentially causing the daemon to fail to respond to service requests.
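
In a long trace, a tally of the failing calls can show which errors dominate. The function below is a minimal sketch that counts each Err# marker and the code that follows it; the name tally_errors is hypothetical, and it assumes the trace was saved to a file, for example with the truss -o option.

```shell
# Count how often each error appears in saved truss output.
# Looks for a field of the form Err#<number> and pairs it with the
# symbolic code in the next field (for example, "Err#2 ENOENT").
tally_errors() {
    awk '{
            for (i = 1; i < NF; i++)
                if ($i ~ /^Err#[0-9]+$/) count[$i " " $(i + 1)]++
         }
         END { for (k in count) print count[k], k }' | sort -rn
}
```

Running tally_errors < /tmp/truss.out (the file name is an example) prints one line per error, most frequent first.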

The verbose trace will display data from the parameters of system calls (-x all) and the contents of the buffer for both read (-r all) and write (-w all) calls. The contents of these buffers can sometimes identify the root cause as well.

You can also use the -D switch to display each system call with a time delta, representing the elapsed time in seconds since the last event. You should look for long time deltas, as these could indicate delays and lead to long response times and poor performance.
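
To pick out the long deltas from a saved trace, a small filter such as the following could be used. This is a sketch under stated assumptions: the name slow_calls is hypothetical, and it assumes the time delta is the first field on each line (optionally followed by a colon), which may not hold if other options add a prefix to each line.

```shell
# Print trace lines whose time delta is at least the given threshold.
# $1 = threshold in seconds (for example, 1 or 0.5)
# Assumption: the delta is the first field, with an optional trailing colon.
slow_calls() {
    awk -v t="$1" '{
        d = $1
        sub(/:$/, "", d)
        if (d ~ /^[0-9]+(\.[0-9]+)?$/ && d + 0 >= t + 0) print
    }'
}
```

For example, after saving a trace with truss -D -x all -r all -w all -o /tmp/truss.out ssh testhost (the output file name is an example), slow_calls 1 < /tmp/truss.out would list the calls that followed a pause of a second or more.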

Conclusion

Network problem diagnosis does not need to be the obstacle that some systems administrators feel it is. This series has shown that, armed with the correct knowledge, you can diagnose and analyze many problems you might normally take to a network administrator and identify their root cause, which makes the network administrator's job a lot easier when it comes to fixing them. In some cases, you'll even find that it's something you can fix yourself. Happy troubleshooting!


Zone=AIX and UNIX
ArticleID=502736
ArticleTitle=Network problem determination: AIX tools for a system administrator: Part 2, Detailed diagnosis and troubleshooting
publish-date=08032010