Tuning your Linux system for more efficient parallel job performance

The Linux default network and network device settings might not produce optimum throughput (bandwidth) and latency numbers for large parallel jobs. This information describes how to tune the Linux network and certain network devices for better parallel job performance.

This information is aimed at private networks with high-performance network devices, such as Gigabit Ethernet, and might not produce similar results for 10/100 public Ethernet networks.

The following table provides examples for tuning your Linux system for better job performance. By following these examples, it is possible to improve the performance of a parallel job that runs over an IP network.
Table 1. Network tuning
| Network tuning factor | Tuning for the current boot session | Modifying the system permanently |
|---|---|---|
| arp_ignore - With arp_ignore set to 1, a device answers an ARP request only if the target address is configured on that device. | echo '1' > /proc/sys/net/ipv4/conf/all/arp_ignore | Add this line to the /etc/sysctl.conf file: net.ipv4.conf.all.arp_ignore = 1 |
| arp_filter - With arp_filter set to 1, the kernel answers an ARP request on an interface only if it would route packets to the requesting address through that interface. | echo '1' > /proc/sys/net/ipv4/conf/all/arp_filter | Add this line to the /etc/sysctl.conf file: net.ipv4.conf.all.arp_filter = 1 |
| rmem_default - Defines the default receive window size, in bytes. | echo '1048576' > /proc/sys/net/core/rmem_default | Add this line to the /etc/sysctl.conf file: net.core.rmem_default = 1048576 |
| rmem_max - Defines the maximum receive window size, in bytes. | echo '2097152' > /proc/sys/net/core/rmem_max | Add this line to the /etc/sysctl.conf file: net.core.rmem_max = 2097152 |
| wmem_default - Defines the default send window size, in bytes. | echo '1048576' > /proc/sys/net/core/wmem_default | Add this line to the /etc/sysctl.conf file: net.core.wmem_default = 1048576 |
| wmem_max - Defines the maximum send window size, in bytes. | echo '2097152' > /proc/sys/net/core/wmem_max | Add this line to the /etc/sysctl.conf file: net.core.wmem_max = 2097152 |
| Set device txqueuelen - Sets the transmit queue length for each network device, for example, eth0, eth1, and so on. | /sbin/ifconfig device_interface_name txqueuelen 4096 | Not applicable |
| Turn off device interrupt coalescing - Improves latency. | See the sample script that follows this table. This script must be run after each reboot. | Not applicable |
This sample script unloads the e1000 Gigabit Ethernet device driver and reloads it with interrupt coalescing disabled:
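#!/bin/ksh
# Reload the e1000 driver with interrupt coalescing disabled. Run as root.
Interface=eth0
Device=e1000
Kernel_Version=$(uname -r)

# Take the interface down and unload the driver.
ifdown ${Interface}
rmmod ${Device}

# Reload the driver. A value of 0 turns interrupt throttling off; one value per adapter.
insmod /lib/modules/${Kernel_Version}/kernel/drivers/net/${Device}/${Device}.ko \
InterruptThrottleRate=0,0,0

# Bring the interface back up and return its status.
ifup ${Interface}
exit $?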
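Each sysctl key in Table 1 maps directly to a file under /proc/sys: the dots in the key become slashes in the path. The following is a minimal sketch of that mapping, assuming the values from Table 1. The function names (table1_settings, sysctl_key_to_path, print_session_commands, print_txqueuelen_commands) are hypothetical helpers introduced here for illustration. The script only prints commands and does not change the system.

```shell
#!/bin/sh
# Sketch only: prints the Table 1 commands; nothing is written to /proc.
# All function names are hypothetical helpers, not part of any product.

# The /etc/sysctl.conf lines from Table 1.
table1_settings() {
    cat <<'EOF'
net.ipv4.conf.all.arp_ignore = 1
net.ipv4.conf.all.arp_filter = 1
net.core.rmem_default = 1048576
net.core.rmem_max = 2097152
net.core.wmem_default = 1048576
net.core.wmem_max = 2097152
EOF
}

# Convert a sysctl key (net.core.rmem_max) to its /proc/sys path.
# Assumes the key contains no interface name with a dot in it.
sysctl_key_to_path() {
    printf '/proc/sys/%s\n' "$(printf '%s' "$1" | tr '.' '/')"
}

# Print the equivalent echo command for the current boot session.
print_session_commands() {
    table1_settings | while IFS=' =' read -r key value; do
        printf "echo '%s' > %s\n" "$value" "$(sysctl_key_to_path "$key")"
    done
}

# Print the txqueuelen command from Table 1 for each interface name given.
print_txqueuelen_commands() {
    for ifname in "$@"; do
        printf '/sbin/ifconfig %s txqueuelen 4096\n' "$ifname"
    done
}

print_session_commands
print_txqueuelen_commands eth0 eth1
```

Review the printed commands before running them as root. The interface names eth0 and eth1 are examples; substitute the devices that carry parallel job traffic.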

MPI jobs use shared memory to handle intranode communication. You might need to raise the system default for the maximum allowable shared memory size so that a large MPI job can successfully enable shared memory usage. Set the maximum allowable shared memory size to 256 MB (268435456 bytes) or larger to support large MPI jobs.

To modify this limit for the current boot session, run the following command as root:
echo "268435456" > /proc/sys/kernel/shmmax
To modify this limit permanently, add the following line to the /etc/sysctl.conf file and reboot the system:
kernel.shmmax = 268435456
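Before changing the limit, it can help to check whether the current value already meets the 256 MB recommendation. The following is a minimal sketch, assuming /proc/sys/kernel/shmmax is readable. The helper name shmmax_is_sufficient and the variable REQUIRED_SHMMAX are hypothetical and introduced here for illustration.

```shell
#!/bin/sh
# Sketch only: reports whether kernel.shmmax meets the 256 MB recommendation.
# shmmax_is_sufficient is a hypothetical helper name.

REQUIRED_SHMMAX=268435456   # 256 * 1024 * 1024 bytes

# Succeeds (exit status 0) if the value in $1 is at least $2 (default: 256 MB).
# awk does the comparison because shmmax on 64-bit kernels can exceed the
# range of shell integer arithmetic.
shmmax_is_sufficient() {
    awk -v cur="$1" -v req="${2:-$REQUIRED_SHMMAX}" \
        'BEGIN { exit !((cur + 0) >= (req + 0)) }'
}

# Report on the running system when the file is readable.
if [ -r /proc/sys/kernel/shmmax ]; then
    current=$(cat /proc/sys/kernel/shmmax)
    if shmmax_is_sufficient "$current"; then
        echo "kernel.shmmax=$current is large enough"
    else
        echo "kernel.shmmax=$current is below $REQUIRED_SHMMAX"
    fi
fi
```

The check reads the limit without modifying it, so it does not require root.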
Table 2. Network tuning: ARP entries
| Network tuning factor | Tuning for the current boot session and updating it into the boot image |
|---|---|
| gc_thresh3 - The value in /proc/sys/net/ipv4/neigh/default/gc_thresh3 should be the maximum number of compute operating system nodes, plus 300. | echo "5300" > /proc/sys/net/ipv4/neigh/default/gc_thresh3 |
| gc_thresh2 - The value in /proc/sys/net/ipv4/neigh/default/gc_thresh2 should be 100 less than gc_thresh3. | echo "5200" > /proc/sys/net/ipv4/neigh/default/gc_thresh2 |
| gc_thresh1 - The value in /proc/sys/net/ipv4/neigh/default/gc_thresh1 should be 100 less than gc_thresh2. | echo "5100" > /proc/sys/net/ipv4/neigh/default/gc_thresh1 |
| gc_interval - The ARP garbage collection interval on the compute nodes should be set high so that ARP cleanup is not processed. | echo "1000000000" > /proc/sys/net/ipv4/neigh/default/gc_interval |
| gc_stale_time - The ARP stale time should be set high so that ARP entries are not discarded. | echo "2147483647" > /proc/sys/net/ipv4/neigh/default/gc_stale_time |
| base_reachable_time_ms - The ARP valid entry time (in milliseconds) should be set high so that ARP entries are not discarded. | echo "2147483647" > /proc/sys/net/ipv4/neigh/default/base_reachable_time_ms |
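The three gc_thresh values in Table 2 follow from the node count: gc_thresh3 is the node count plus 300, and each lower threshold is 100 less than the one above it. The sample values 5300, 5200, and 5100 are consistent with a cluster of 5000 nodes, which is an inference from the table and not a figure the table states. The following is a minimal sketch of the arithmetic; print_gc_thresh is a hypothetical helper that only prints the commands.

```shell
#!/bin/sh
# Sketch only: derives the ARP gc_thresh values from a node count and
# prints the matching commands. print_gc_thresh is a hypothetical helper.

print_gc_thresh() {
    nodes=$1
    thresh3=$((nodes + 300))      # node count plus 300
    thresh2=$((thresh3 - 100))    # 100 less than gc_thresh3
    thresh1=$((thresh2 - 100))    # 100 less than gc_thresh2
    base=/proc/sys/net/ipv4/neigh/default
    printf 'echo "%s" > %s/gc_thresh3\n' "$thresh3" "$base"
    printf 'echo "%s" > %s/gc_thresh2\n' "$thresh2" "$base"
    printf 'echo "%s" > %s/gc_thresh1\n' "$thresh1" "$base"
}

# Assumed example: 5000 nodes reproduces the values shown in Table 2.
print_gc_thresh 5000
```

Replace 5000 with the maximum node count for your cluster, then run the printed commands as root.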

DNS caching should be enabled to minimize runtime host name resolution, especially if LDAP is also enabled in the cluster.