UNIX network performance analysis

Quick methods for finding UNIX performance problems

Knowing your UNIX® network layout goes a long way toward understanding your network and how it operates. But what happens when the performance of your UNIX network, and the speed at which you can transfer files or connect to services, suddenly drops? How do you diagnose the issues and work out where in your network the problems lie? This article looks at some quick methods for finding and identifying performance issues and the steps to start resolving them.

Martin Brown, Professional writer, Freelance

Martin Brown has been a professional writer for over eight years. He is the author of numerous books and articles across a range of topics. His expertise spans myriad development languages and platforms -- Perl, Python, Java, JavaScript, Basic, Pascal, Modula-2, C, C++, Rebol, Gawk, Shellscript, Windows, Solaris, Linux, BeOS, Mac OS/X and more -- as well as Web programming, systems management and integration. Martin is a regular contributor to ServerWatch.com, LinuxToday.com and IBM developerWorks, and a regular blogger at Computerworld, The Apple Blog and other sites, as well as a Subject Matter Expert (SME) for Microsoft. He can be contacted through his Web site at http://www.mcslp.com.



08 September 2009


Introduction

The performance of your network can have a significant impact on the general performance and reliability of the rest of your environment. If different applications and services are waiting for data over the network, or your clients are having trouble connecting or receiving the information, then you need to address these issues.

Performance issues can also affect the reliability of your applications and environment. They can be triggered by network faults and, in some cases, can even be the cause of a network fault. To understand and diagnose network issues, you first need to understand the nature of the issue; usually the problem is related to either latency or bandwidth.

In general, network performance issues are often tied to the underlying hardware; you cannot exceed the physical limits of the network environment. Performance issues are also usually specific to a particular protocol or system, such as NFS or Web access. But you can diagnose and identify the issues from within the operating system so that you can determine the correct course of action.

This article looks at the following steps involved in identifying performance issues:

  • Getting a baseline performance level
  • Determining where the problem lies
  • Getting statistics
  • Identifying the bottleneck

Understanding network metrics

To understand and diagnose performance issues, you first need to determine your baseline performance level. Let's first introduce two of the key concepts used in determining baseline performance: network latency and network bandwidth.

Network latency

The network latency is the time between sending a request to a destination and the destination actually receiving the sent packet. As a metric for network performance, increased latency is a good indicator of a busy network, as it either indicates that the number of packets being transmitted exceeds the capacity, or that the senders of data are having to wait before either transmission or re-transmission.

Network latency can also increase as the complexity of the network grows and packets have to travel through more hosts or gateways. The length of cable between points also affects latency. For long distances, traditional copper cable will generally be slower than a fibre optic connection.

Network latency is also different from application latency. Network latency deals exclusively with the transmission of packets over the network, while application latency refers to the delay between the application receiving a request and its ability to respond.

Network bandwidth

Bandwidth is a measure of the amount of data that can be transmitted over a network during a specific period of time. The bandwidth affects how much data can be transmitted, and will either limit the transmission of data to one host to the practical maximum supported by the network connection, or will limit the aggregate transmission rate when dealing with multiple simultaneous connections.

The network bandwidth should, in theory, never change, unless you change the networking interface and hardware. The major variable within network bandwidth is in the number of hosts using the network at any given time.

For example, a 1Gbps Ethernet interface can talk at 1Gbps to one other network host, at 100Mbps to each of ten simultaneous hosts, or at 10Mbps to each of 100 hosts. In reality, of course, sustained bandwidth is not often required. There will be many hundreds of smaller requests from a number of different hosts over a period of time, and so the available bandwidth of a server can appear much greater than the sum of the client bandwidth.
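
The even split in this example is simple arithmetic, sketched here in Python. The function name is hypothetical, and the sketch assumes that a gigabit link carries a nominal 1000 Mbit/s divided equally between hosts; real traffic is rarely shared this evenly.

```python
def per_host_share(link_mbps: float, hosts: int) -> float:
    """Return the even share of a link, in Mbit/s, for each simultaneous host."""
    if hosts < 1:
        raise ValueError("hosts must be at least 1")
    return link_mbps / hosts


if __name__ == "__main__":
    # Assumption: a gigabit Ethernet link is treated as 1000 Mbit/s.
    for hosts in (1, 10, 100):
        share = per_host_share(1000, hosts)
        print(f"{hosts:>3} hosts: {share:.0f} Mbit/s each")
```

The figures are upper limits only; protocol overhead and bursty traffic mean that measured rates will differ.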


Getting statistics

Before you can identify whether there is a problem within your network, you first need a baseline performance level against which to compare. To get one, you must measure the various parameters -- latency, bandwidth, and any tests relevant to your network application environment -- and then monitor and compare them over time.

When performing the baseline networking tests, you should do them under controlled conditions. Ideally, you should perform them both in isolation (meaning with no other network traffic) and with typical network traffic, to give you two baselines:

  • For the isolated monitoring, you should check the performance between the server and one or more clients when there is no other traffic on the network. This means either shutting down other services, or, ideally, putting the server and client into an isolated network environment completely separate from (but identical to) your standard network environment.
  • For the standard monitoring, you should have the clients and servers attached to your standard network, and have the normal background traffic working, but all application-specific traffic (such as e-mail, file serving, Web serving) disabled, except on the server that you are testing.

For the actual testing process, there are a number of standard tools and tests that you can perform to determine your baseline values.

Measuring latency

The ping tool is well known to all network administrators as a basic tool for checking the availability and latency of a network device. Ping should work with most machines, both clients and servers, providing they have been configured to respond to the ICMP packets that the ping tool sends to the device. Essentially, ping sends an echo packet to the device, and expects the device to echo the packet contents back.

During the process, ping can monitor the time it takes to send and receive the response, which can be an effective method of measuring the response time of the echo process. In the simplest form, you can send an echo request to a host and find out the response time (see Listing 1).

Listing 1. Using ping to determine latency
$ ping example

PING example.example.pri (192.168.0.2): 56 data bytes
64 bytes from 192.168.0.2: icmp_seq=0 ttl=64 time=0.169 ms
64 bytes from 192.168.0.2: icmp_seq=1 ttl=64 time=0.167 ms
^C
--- example.example.pri ping statistics ---
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max/stddev = 0.167/0.168/0.169/0.001 ms

You need to use Control-C to stop the ping process. On Solaris and AIX®, you need to use the -s option to send more than one echo packet and get the timing information. To get baseline figures, you can use the -c option (on Linux®) to specify the number of packets to send. On Solaris/AIX, you must specify the packet size (the default is 56 bytes) and the number of packets to send so that you do not have to manually terminate the process. You can then use this to extract the timing information automatically (see Listing 2).

Listing 2. Specifying the packet size when using ping on Solaris/AIX
$ ping -s example 56 10
PING example: 56 data bytes
64 bytes from example.example.pri (192.168.0.2): icmp_seq=0. time=0.143 ms
64 bytes from example.example.pri (192.168.0.2): icmp_seq=1. time=0.163 ms
64 bytes from example.example.pri (192.168.0.2): icmp_seq=2. time=0.146 ms
64 bytes from example.example.pri (192.168.0.2): icmp_seq=3. time=0.134 ms
64 bytes from example.example.pri (192.168.0.2): icmp_seq=4. time=0.151 ms
64 bytes from example.example.pri (192.168.0.2): icmp_seq=5. time=0.107 ms
64 bytes from example.example.pri (192.168.0.2): icmp_seq=6. time=0.142 ms
64 bytes from example.example.pri (192.168.0.2): icmp_seq=7. time=0.136 ms
64 bytes from example.example.pri (192.168.0.2): icmp_seq=8. time=0.143 ms
64 bytes from example.example.pri (192.168.0.2): icmp_seq=9. time=0.103 ms

----example PING Statistics----
10 packets transmitted, 10 packets received, 0% packet loss
round-trip (ms)  min/avg/max/stddev = 0.103/0.137/0.163/0.019
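
To extract the timing information automatically, the summary lines can be parsed with a short script. The following Python sketch is one possible approach, not part of the original tooling; the names parse_ping_summary and PingSummary are hypothetical. It handles the two summary formats shown in Listings 1 and 2, and also accepts the mdev label, which is an assumption about the output of some Linux versions of ping.

```python
import re
from typing import NamedTuple


class PingSummary(NamedTuple):
    loss_percent: float
    min_ms: float
    avg_ms: float
    max_ms: float


# Matches "0% packet loss" in the statistics block.
_LOSS = re.compile(r"([\d.]+)% packet loss")
# Matches "min/avg/max/stddev = a/b/c/d"; the last field is ignored.
_RTT = re.compile(
    r"min/avg/max/(?:stddev|mdev)\s*=\s*([\d.]+)/([\d.]+)/([\d.]+)/[\d.]+"
)


def parse_ping_summary(output: str) -> PingSummary:
    """Extract packet loss and round-trip times from ping output."""
    loss = _LOSS.search(output)
    rtt = _RTT.search(output)
    if loss is None or rtt is None:
        # No timing line is available, for example when every packet is lost.
        raise ValueError("no complete ping summary found")
    return PingSummary(float(loss.group(1)), *(float(v) for v in rtt.groups()))
```

You could feed this function the captured output of a ping command that sends a fixed number of packets, and store the average with a timestamp to build up a baseline.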

The example in Listing 2 was made during a quiet period on the network. If the host being checked (or the network itself) had been busy during the testing period, the ping times could have increased significantly. An increased ping time alone is not necessarily an indicator of a problem, but it can give you a quick indication that something needs to be investigated.

Hosts can be configured not to respond to ping, so you should confirm that a host normally answers ping requests before relying on ping to verify that the host is available.

Ideally, you should track the ping times between specific hosts over a period of time, and even continually, so that you can track the average response times and then identify where to start looking.
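
As a rough illustration of tracking ping times, the Python sketch below compares a new average against a stored history. The function name and the threshold factor of 3 are assumptions chosen for the example, not recommended values; pick a threshold that suits the normal variation on your own network.

```python
from statistics import mean
from typing import Sequence


def is_latency_anomalous(
    history_ms: Sequence[float], latest_ms: float, factor: float = 3.0
) -> bool:
    """Return True when latest_ms exceeds factor times the mean of history_ms.

    With no history there is no baseline, so nothing is flagged.
    """
    if not history_ms:
        return False
    return latest_ms > factor * mean(history_ms)


if __name__ == "__main__":
    # Round-trip times (ms) taken from Listing 2, used as the baseline.
    baseline = [0.143, 0.163, 0.146, 0.134, 0.151,
                0.107, 0.142, 0.136, 0.143, 0.103]
    for sample in (0.169, 2.5):
        print(sample, is_latency_anomalous(baseline, sample))
```

A flagged value is only a prompt to start looking; as noted above, a single slow ping is not proof of a fault.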

Using sprayd

The sprayd daemon and the associated spray tool send a large stream of packets to a specified host and determine how many of those packets get a response. It should not be relied on as a measure of network performance because it uses a connectionless transport mechanism. By definition, packets sent using connectionless transport are not guaranteed to reach their destination, and so dropped packets are a permitted part of the communication.

That said, using spray can tell you whether there is a lot of traffic on the network, because if the connectionless transport (UDP) is dropping packets, then it probably means the network (or the host) is too busy to carry the packets.

Spray is available on Solaris and AIX, and some other UNIX platforms. You may need to enable the spray daemon (usually through inetd) to use it. Once the sprayd daemon has been started, you can run spray specifying the hostname (see Listing 3).

Listing 3. Using spray
$ spray tiger
sending 1162 packets of length 86 to tiger ...
        101 packets (8.692%) dropped by tiger
        70 packets/sec, 6078 bytes/sec

As already mentioned, the speed should not be relied upon, but the dropped packet counts can be a useful metric.
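
If you want to record the dropped packet counts over time, the output can be parsed. This Python sketch assumes the output format shown in Listing 3; the function name parse_spray is hypothetical, and other spray implementations, or runs in which no packets are dropped, may print different text that this pattern does not cover.

```python
import re

# "sending 1162 packets of length 86 to tiger ..."
_SENT = re.compile(r"sending (\d+) packets of length (\d+) to (\S+)")
# "101 packets (8.692%) dropped by tiger"
_DROPPED = re.compile(r"(\d+) packets \(([\d.]+)%\) dropped by (\S+)")


def parse_spray(output: str) -> dict:
    """Return the sent and dropped packet counts reported by spray."""
    sent = _SENT.search(output)
    dropped = _DROPPED.search(output)
    if sent is None or dropped is None:
        raise ValueError("unrecognized spray output")
    return {
        "host": sent.group(3),
        "sent": int(sent.group(1)),
        "dropped": int(dropped.group(1)),
        "dropped_percent": float(dropped.group(2)),
    }
```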


Using simple network transfer tests

The best method for determining the bandwidth performance of your network is to check the actual speed when transferring data to or from the machine. There are lots of different tools that you can use to perform the tests across a number of different applications and protocols, but usually the simplest method is the most effective one.

For example, to determine the network bandwidth when transferring a file over the network using NFS, you can time a simple file transfer test. To create a simple test, create a large file using mkfile (for example, 2GB: $ mkfile 2g 2gbfile), and then time how long it takes to transfer the file over a network to another machine (see Listing 4).

Listing 4. Timing a file transfer over a network to another machine
$ time cp /nfs/mysql-live/transient/2gbfile .

real	3m45.648s
user	0m0.010s
sys	0m9.840s

You should run the tests multiple times and then average the transfer times to get an idea of the standard performance.
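
To turn a timing into a bandwidth figure, divide the amount of data by the elapsed time. The following Python sketch shows the arithmetic; the function names are hypothetical, and it assumes that the 2g file created by mkfile is 2 x 1024^3 bytes and that the real time is printed in the minutes-and-seconds form shown in Listing 4.

```python
def parse_time_real(value: str) -> float:
    """Convert a time value such as '3m45.648s' into seconds."""
    minutes, _, rest = value.partition("m")
    return int(minutes) * 60 + float(rest.rstrip("s"))


def throughput_mbit_per_s(size_bytes: int, seconds: float) -> float:
    """Return the average transfer rate in Mbit/s (1 Mbit = 1,000,000 bits)."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return size_bytes * 8 / seconds / 1_000_000


if __name__ == "__main__":
    size = 2 * 1024 ** 3                    # assumed size of the 2g test file
    elapsed = parse_time_real("3m45.648s")  # real time from Listing 4
    print(f"{throughput_mbit_per_s(size, elapsed):.1f} Mbit/s")
```

Comparing this figure with the nominal speed of the interface gives a quick sense of how much of the available bandwidth a single transfer achieves.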

You can automate the copy and timing process by using a Perl script, like the one in Listing 5.

Listing 5. Automate the copy and timing process with a Perl script
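#!/usr/bin/perl

use strict;
use warnings;
use Benchmark;
use File::Copy;

# Usage: timexfer.pl FILE SRCDIR [COUNT]
my $file = shift or die "Need a file to copy from\n";
my $srcdir = shift or die "Need a source directory to copy from\n";
my $count = shift || 10;

# Copy the file $count times; stop if a copy fails, so that a
# failed transfer is not reported as a fast one.
my $t = timeit($count, sub {
    copy("$srcdir/$file", $file) or die "Copy failed: $!\n";
});

# Element 0 of the Benchmark object is the elapsed wallclock time.
printf("Time is %.2fs\n", ($t->[0]/$count));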

To execute, supply the name of the source file and the source directory, and an optional count of the number of copies to make. You can then execute the script and get a time (see Listing 6).

Listing 6. Executing the Perl script
$ ./timexfer.pl 2gbfile /nfs/mysql-live/transient 20
Time is 28.45s

You can use this both to create a baseline figure and during normal operations to check the transfer performance.


Diagnosing a problem

Typically, you will identify a network problem only when a network-related application fails for some reason. However, it is important to confirm that the problem is network related and not a problem elsewhere.

First, you should try to reach the machine using ping. If the machine does not respond to a ping request, and other network communication does not work, then your first option should be to check the physical cables and make sure everything is still connected.

If you can still connect to the machine, but the ping time is increased, then you need to determine where the problem lies. An increase in ping times can in rare cases be related to the load on the machine, but more often than not indicates an issue with the network.

Once you get a long ping time from one machine, you should run ping from another machine on the network, ideally on a different network switch, to find out if the problem is related to the specific machine or the network.

Checking network stats

If the ping times are higher than you expect, then you should start to get some basic statistics about the network interface you are using to see if the problem is related to the network interface, or a specific protocol.

Under Linux, you can get some basic network statistic information by using the ifconfig tool (see Listing 7).

Listing 7. Getting basic network statistic information using the ifconfig tool
$ ifconfig eth1
eth1      Link encap:Ethernet  HWaddr 00:1a:ee:01:01:c0  
          inet addr:192.168.0.2  Bcast:192.168.3.255  Mask:255.255.252.0
          inet6 addr: fe80::21a:eeff:fe01:1c0/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:7916836 errors:0 dropped:78489 overruns:0 frame:0
          TX packets:6285476 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:11675092739 (10.8 GiB)  TX bytes:581702020 (554.7 MiB)
          Interrupt:16 Base address:0x2000

The important rows are those beginning RX and TX, which show information about the packets received and sent. The packets value is a simple count of the packets transferred. The errors, dropped, and overruns figures show how many of the packets indicated some kind of fault. A high number of dropped packets compared to the number of packets transferred probably indicates that the network is busy.
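
The comparison between dropped and transferred packets can be computed directly from the output. This Python sketch assumes the ifconfig format shown in Listing 7 (newer versions of the tool lay the counters out differently), and the function names are hypothetical.

```python
import re

# "RX packets:7916836 errors:0 dropped:78489 overruns:0 frame:0"
_COUNTERS = re.compile(
    r"(RX|TX) packets:(\d+) errors:(\d+) dropped:(\d+) overruns:(\d+)"
)


def parse_ifconfig_counters(output: str) -> dict:
    """Return the RX and TX packet counters from ifconfig output."""
    counters = {}
    for direction, packets, errors, dropped, overruns in _COUNTERS.findall(output):
        counters[direction] = {
            "packets": int(packets),
            "errors": int(errors),
            "dropped": int(dropped),
            "overruns": int(overruns),
        }
    return counters


def drop_ratio(counter: dict) -> float:
    """Return dropped packets as a fraction of all packets counted."""
    return counter["dropped"] / counter["packets"] if counter["packets"] else 0.0
```

For the interface in Listing 7, the receive side works out at just under 1% of packets dropped, while the transmit side shows none.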

You can also get extended statistics on all platforms by using the netstat tool. Under Linux, the tool provides more specific statistics for the base protocols, such as packet transmissions for TCP/IP and UDP (see Listing 8).

Listing 8. Using netstat
$ netstat -s
Ip:
    8437387 total packets received
    1 with invalid addresses
    0 forwarded
    0 incoming packets discarded
    8437383 incoming packets delivered
    6820934 requests sent out
    6 reassemblies required
    3 packets reassembled ok
Icmp:
    502 ICMP messages received
    3 input ICMP message failed.
    ICMP input histogram:
        destination unreachable: 410
        echo requests: 82
        echo replies: 10
    1406 ICMP messages sent
    0 ICMP messages failed
    ICMP output histogram:
        destination unreachable: 1313
        echo request: 11
        echo replies: 82
IcmpMsg:
        InType0: 10
        InType3: 410
        InType8: 82
        OutType0: 82
        OutType3: 1313
        OutType8: 11
Tcp:
    8361 active connections openings
    6846 passive connection openings
    1 failed connection attempts
    164 connection resets received
    33 connections established
    8305361 segments received
    6688553 segments send out
    640 segments retransmitted
    0 bad segments received.
    676 resets sent
Udp:
    126083 packets received
    1294 packets to unknown port received.
    0 packet receive errors
    130335 packets sent
UdpLite:
TcpExt:
    5 packets pruned from receive queue because of socket buffer overrun
    6792 TCP sockets finished time wait in fast timer
    5681 delayed acks sent
    Quick ack mode was activated 11637 times
    150861 packets directly queued to recvmsg prequeue.
    74333 bytes directly in process context from backlog
    9141882 bytes directly received in process context from prequeue
    3608274 packet headers predicted
    42627 packets header predicted and directly queued to user
    77132 acknowledgments not containing data payload received
    374105 predicted acknowledgments
    2 times recovered from packet loss by selective acknowledgements
    77 congestion windows recovered without slow start after partial ack
    1 TCP data loss events
    17 timeouts after SACK recovery
    2 fast retransmits
    8 retransmits in slow start
    236 other TCP timeouts
    1453 packets collapsed in receive queue due to low socket buffer
    11634 DSACKs sent for old packets
    2 DSACKs sent for out of order packets
    2 DSACKs received
    77 connections reset due to unexpected data
    50 connections aborted due to timeout
    TCPDSACKIgnoredNoUndo: 1
    TCPSackShiftFallback: 23
IpExt:
    InBcastPkts: 4126

Under Solaris and other UNIX variants, the information provided by netstat differs depending upon the platform. For example, under Solaris, you get detailed statistics for each protocol, and separate information for IPv4 and IPv6 connections (see Listing 9). The output in the listing has been truncated.

Listing 9. Using netstat on Solaris
$ netstat -s

RAWIP   rawipInDatagrams    =   440     rawipInErrors       =     0
        rawipInCksumErrs    =     0     rawipOutDatagrams   =    91
        rawipOutErrors      =     0

UDP     udpInDatagrams      = 15756     udpInErrors         =     0
        udpOutDatagrams     = 16515     udpOutErrors        =     0

TCP     tcpRtoAlgorithm     =     4     tcpRtoMin           =   400
        tcpRtoMax           = 60000     tcpMaxConn          =    -1
        tcpActiveOpens      =  1735     tcpPassiveOpens     =    54
        tcpAttemptFails     =     2     tcpEstabResets      =    35
        tcpCurrEstab        =     2     tcpOutSegs          =13771839
        tcpOutDataSegs      =13975728   tcpOutDataBytes     =1648876686
        tcpRetransSegs      = 90215     tcpRetransBytes     =130340273
        tcpOutAck           =151539     tcpOutAckDelayed    =  5570
        tcpOutUrg           =     0     tcpOutWinUpdate     =    31
        tcpOutWinProbe      =    86     tcpOutControl       =  3750
        tcpOutRsts          =    63     tcpOutFastRetrans   =     6
        tcpInSegs           =7548720
        tcpInAckSegs        =2882026    tcpInAckBytes       =1648874900
        tcpInDupAck         =4413016    tcpInAckUnsent      =     0
        tcpInInorderSegs    =415007     tcpInInorderBytes   =367832646
        tcpInUnorderSegs    =  7650     tcpInUnorderBytes   =10389516
        tcpInDupSegs        =   222     tcpInDupBytes       = 74649
        tcpInPartDupSegs    =     0     tcpInPartDupBytes   =     0
        tcpInPastWinSegs    =     0     tcpInPastWinBytes   =     0
        tcpInWinProbe       =     0     tcpInWinUpdate      =     2
        tcpInClosed         =    33     tcpRttNoUpdate      =   660
        tcpRttUpdate        =2880379    tcpTimRetrans       =  2262
        tcpTimRetransDrop   =    10     tcpTimKeepalive     =   630
        tcpTimKeepaliveProbe=   314     tcpTimKeepaliveDrop =    17
        tcpListenDrop       =     0     tcpListenDropQ0     =     0
        tcpHalfOpenDrop     =     0     tcpOutSackRetrans   = 69348
...

In all cases, you are looking for a high level of error packets, retransmissions, or dropped packets, all of which indicate that the network is busy. If the error rate is excessively high compared to the packets transmitted or received, then it may indicate a fault with the network hardware.
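
One way to put a number on the retransmission level is to compare retransmitted TCP segments with segments sent. The Python sketch below assumes the Linux netstat -s wording shown in Listing 8 and also accepts the spelling "sent out", which is an assumption about other versions; the function name is hypothetical. The Solaris output in Listing 9 uses different labels and would need its own pattern.

```python
import re

_SENT = re.compile(r"(\d+) segments sen[dt] out")
_RETRANS = re.compile(r"(\d+) segments retransmitted")


def tcp_retransmit_ratio(output: str) -> float:
    """Return retransmitted TCP segments as a fraction of segments sent."""
    sent = _SENT.search(output)
    retrans = _RETRANS.search(output)
    if sent is None or retrans is None:
        raise ValueError("TCP segment counters not found")
    total = int(sent.group(1))
    return int(retrans.group(1)) / total if total else 0.0
```

For Listing 8, this gives 640 retransmissions out of 6688553 segments, or roughly 0.01%. What counts as too high depends on your own baseline figures.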

Checking NFS stats

When checking problems related to NFS connections, and indeed most other network applications, you should first ensure that the issue is not related to a problem on the machine, such as high load (which will obviously affect the speed at which requests can be processed). A simple check using uptime and ps to identify the processes will tell you how busy the machine is.

You can also check the NFS statistics that are generated by the NFS service. The nfsstat command generates detailed stats for both the server and client side of the NFS service. For example, the statistics in Listing 10 show the detailed NFS v3 statistics for the server side of the NFS service, selected by using the -s command-line option and -v to specify the NFS version.

Listing 10. nfsstat command with -s and -v command-line options
$ nfsstat -s -v3  

Server rpc:
Connection oriented:
calls      badcalls   nullrecv   badlen     xdrcall    dupchecks  dupreqs    
36118      0          0          0          0          410        0          
Connectionless:
calls      badcalls   nullrecv   badlen     xdrcall    dupchecks  dupreqs
75         0          0          0          0          0          0          

Server NFSv3:
calls     badcalls  
35847     0         
Version 3: (35942 calls)
null        getattr     setattr     lookup      access      readlink
15 0%       190 0%      83 0%       3555 9%     21222 59%   0 0%        
read        write       create      mkdir       symlink     mknod       
9895 27%    300 0%      7 0%        0 0%        0 0%        0 0%        
remove      rmdir       rename      link        readdir     readdirplus 
0 0%        0 0%        0 0%        0 0%        37 0%       20 0%       
fsstat      fsinfo      pathconf    commit      
521 1%      2 0%        1 0%        94 0%       

Server nfs_acl:
Version 3: (0 calls)
null        getacl      setacl      getxattrdir 
0 0%        0 0%        0 0%        0 0%

A high badcalls value indicates that bad requests are being sent to the server, which may mean that a client is not functioning correctly and is submitting bad requests, either due to a software problem or faulty hardware.
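
To watch the badcalls figures over time, you can pair each header row with the row of values beneath it. This Python sketch assumes the column layout shown in Listing 10, where a row beginning with calls and badcalls is followed directly by its values; the function name is hypothetical and other nfsstat versions may format the output differently.

```python
def parse_call_tables(output: str) -> list:
    """Return one dict per 'calls badcalls ...' table found in nfsstat output."""
    lines = output.splitlines()
    tables = []
    # Walk over each line together with the line that follows it.
    for header, values in zip(lines, lines[1:]):
        names = header.split()
        if names[:2] == ["calls", "badcalls"]:
            tables.append(dict(zip(names, map(int, values.split()))))
    return tables


def badcall_ratio(table: dict) -> float:
    """Return bad calls as a fraction of all calls in one table."""
    return table["badcalls"] / table["calls"] if table["calls"] else 0.0
```

For the output in Listing 10 this yields three tables (connection oriented RPC, connectionless RPC, and NFSv3), each with a badcalls value of 0.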

Ping times in larger networks

If you can ping the machine, but the network performance is still a problem, then you need to determine where in your network the performance problem is located. In a larger network where you have different segments of your network separated by routers, you can use the traceroute tool to determine whether there is a specific point in the route between the two machines where there is a problem.

Related to the ping tool, the traceroute tool will normally provide you with the response times for each router that the network packets travel through to reach their destination. In a larger network this can help you isolate where the problem is. This can also be used to identify potential problems when sending packets over the Internet, where different routers are used at different points to transmit packets between different Internet Service Providers (ISPs).

For example, the trace shown in Listing 11 is between two offices in the UK that use two different ISPs. In this case, the destination machine cannot be reached due to a fault.

Listing 11. traceroute between two offices in the UK
$ traceroute gendarme.example.com
traceroute to gendarme.example.com (82.70.138.102), 30 hops max, 40 byte packets
 1  voyager.example.pri (192.168.1.1)  14.998 ms  95.530 ms  4.922 ms
 2  dsl.vispa.net.uk (83.217.160.18)  32.251 ms  95.674 ms  30.742 ms
 3  rt-gw1.tcm.vispa.net.uk (62.24.228.1)  49.178 ms  47.718 ms  123.261 ms
 4  195.50.119.249 (195.50.119.249)  47.036 ms  50.440 ms  143.123 ms
 5  ae-11-11.car1.Manchesteruk1.Level3.net (4.69.133.97)  92.398 ms  137.382 ms  52.780 ms
 6  PACKET-EXCH.car1.Manchester1.Level3.net (195.16.169.90)  45.791 ms  140.165 ms  35.312 ms
 7  spinoza-ae2-0.hq.zen.net.uk (62.3.80.54)  33.034 ms  39.442 ms  33.253 ms
 8  galileo-fe-3-1-172.hq.zen.net.uk (62.3.80.174)  34.341 ms  33.684 ms  33.703 ms
 9  * * *
10  * * *
11  * * *
12  * * *
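
In a trace such as Listing 11, the useful fact is the last hop that answered before the asterisks begin. The following Python sketch finds it, assuming the traceroute output format shown in the listing; the function name is hypothetical.

```python
import re
from typing import Optional, Tuple

# A hop line starts with the hop number followed by whitespace.
_HOP = re.compile(r"^\s*(\d+)\s+(\S.*)$")


def last_responding_hop(output: str) -> Optional[Tuple[int, str]]:
    """Return (hop number, host) for the last hop that replied, or None."""
    last = None
    for line in output.splitlines():
        match = _HOP.match(line)
        if match is None:
            continue  # header line or a wrapped continuation line
        rest = match.group(2).strip()
        if set(rest.split()) == {"*"}:
            continue  # no reply from this hop
        last = (int(match.group(1)), rest.split()[0])
    return last
```

For Listing 11 this reports hop 8, galileo-fe-3-1-172.hq.zen.net.uk, which suggests that the fault lies at or beyond that point on the route.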

In a smaller network you are unlikely to have routers separating the networks, and so traceroute will not be of any help. Both ping and traceroute rely on being able to reach a host to determine the problem.

You are now armed with some knowledge and techniques to deal with UNIX network performance.


Summary

UNIX network performance issues are hard to identify from a single machine when the problem is usually widespread across the network. It is usually possible, though, to use ping and/or traceroute to narrow down the location of the problem by looking at the performance from different points within your network. Once you have some starting points, you can use the other network tools to get more detailed information about the protocol or application that is causing the problem. This article looked at the basic methods to get baseline information and then the different tools that can be used to zero in on the issue.
