Tuning your Linux system for more efficient parallel job performance
The default Linux network and network device settings might not produce optimum throughput (bandwidth) and latency for large parallel jobs. This information describes how to tune the Linux network and certain network devices for better parallel job performance.
This information is aimed at private networks with high-performance network devices, such as Gigabit Ethernet, and might not produce similar results for 10/100 public Ethernet networks.
Network tuning factors | Tuning for the current boot session | Modifying the system permanently |
---|---|---|
arp_ignore - With arp_ignore set to 1, a device replies to an ARP request only if the target IP address is configured on the interface that received the request. | echo '1' > /proc/sys/net/ipv4/conf/all/arp_ignore | Add this line to the /etc/sysctl.conf file: net.ipv4.conf.all.arp_ignore = 1 |
arp_filter - With arp_filter set to 1, the kernel replies to an ARP request on an interface only if it would route traffic to the requesting IP address out through that same interface. | echo '1' > /proc/sys/net/ipv4/conf/all/arp_filter | Add this line to the /etc/sysctl.conf file: net.ipv4.conf.all.arp_filter = 1 |
rmem_default - Defines the default receive buffer (window) size, in bytes. | echo '1048576' > /proc/sys/net/core/rmem_default | Add this line to the /etc/sysctl.conf file: net.core.rmem_default = 1048576 |
rmem_max - Defines the maximum receive buffer (window) size, in bytes. | echo '2097152' > /proc/sys/net/core/rmem_max | Add this line to the /etc/sysctl.conf file: net.core.rmem_max = 2097152 |
wmem_default - Defines the default send buffer (window) size, in bytes. | echo '1048576' > /proc/sys/net/core/wmem_default | Add this line to the /etc/sysctl.conf file: net.core.wmem_default = 1048576 |
wmem_max - Defines the maximum send buffer (window) size, in bytes. | echo '2097152' > /proc/sys/net/core/wmem_max | Add this line to the /etc/sysctl.conf file: net.core.wmem_max = 2097152 |
Set device txqueuelen - Sets the transmit queue length for each network device (for example, eth0, eth1, and so on). | /sbin/ifconfig device_interface_name txqueuelen 4096 | Not applicable |
Turn off device interrupt coalescing - Improves latency. | See the sample script that follows this table. The script must be run after each reboot. | Not applicable |
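```shell
#!/bin/ksh
# Reload the network driver with interrupt throttling disabled.
Interface=eth0        # network interface to reconfigure
Device=e1000          # driver module for that interface
Kernel_Version=$(uname -r)

# Take the interface down and unload the driver.
ifdown ${Interface}
rmmod ${Device}

# Reload the driver with interrupt coalescing turned off
# (one InterruptThrottleRate value per adapter).
insmod /lib/modules/${Kernel_Version}/kernel/drivers/net/${Device}/${Device}.ko \
    InterruptThrottleRate=0,0,0

# Bring the interface back up.
ifup ${Interface}
exit $?
```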
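Each /proc/sys path in the tuning table earlier in this section corresponds to a sysctl key: drop the /proc/sys/ prefix and replace each slash with a dot. The following is a minimal sketch of that mapping, not an official tool. The function names proc_to_sysctl_key and print_sysctl_fragment are hypothetical, and the conversion is deliberately naive (it assumes no path component contains a dot, which holds for the paths listed here). It only prints lines suitable for /etc/sysctl.conf and changes nothing on the system.

```shell
#!/bin/sh
# Hypothetical helper: convert a /proc/sys path to its sysctl key.
proc_to_sysctl_key() {
    # Strip the /proc/sys/ prefix, then turn each "/" into ".".
    printf '%s\n' "${1#/proc/sys/}" | tr '/' '.'
}

# Hypothetical helper: print "key = value" lines for the table's settings.
print_sysctl_fragment() {
    while read -r path value; do
        [ -n "$path" ] || continue
        printf '%s = %s\n' "$(proc_to_sysctl_key "$path")" "$value"
    done <<EOF
/proc/sys/net/ipv4/conf/all/arp_ignore 1
/proc/sys/net/ipv4/conf/all/arp_filter 1
/proc/sys/net/core/rmem_default 1048576
/proc/sys/net/core/rmem_max 2097152
/proc/sys/net/core/wmem_default 1048576
/proc/sys/net/core/wmem_max 2097152
EOF
}

print_sysctl_fragment
```

Review the printed lines before appending them to /etc/sysctl.conf as root.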
MPI jobs use shared memory for intranode communication. A large MPI job might not be able to enable shared memory usage unless you raise the system default for the maximum allowable shared memory size. To support large MPI jobs, set this limit to 256 MB (268435456 bytes) or larger. To set the limit for the current boot session:
echo "268435456" > /proc/sys/kernel/shmmax
To modify this limit permanently, add the following line to the /etc/sysctl.conf file and reboot the system:
kernel.shmmax = 268435456
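Before changing the limit, it can help to check whether the current value already meets the 256 MB recommendation. The following is a minimal sketch under stated assumptions: the function name shmmax_is_sufficient and the SHMMAX_FILE override are hypothetical, and awk does the comparison because some kernels report a default shmmax too large for shell integer arithmetic.

```shell
#!/bin/sh
# 256 MB expressed in bytes (256 * 1024 * 1024).
RECOMMENDED_SHMMAX=268435456

# Hypothetical helper: succeed if the given value meets the recommendation.
# awk compares the values as floating-point numbers, so very large
# kernel defaults do not overflow the comparison.
shmmax_is_sufficient() {
    awk -v cur="$1" -v min="$RECOMMENDED_SHMMAX" \
        'BEGIN { exit !(cur + 0 >= min + 0) }'
}

# SHMMAX_FILE can be overridden; this is only a convenience for trying
# the sketch on a copy of the value rather than the live system file.
SHMMAX_FILE=${SHMMAX_FILE:-/proc/sys/kernel/shmmax}

if [ -r "$SHMMAX_FILE" ]; then
    current=$(cat "$SHMMAX_FILE")
    if shmmax_is_sufficient "$current"; then
        echo "shmmax is sufficient: $current"
    else
        echo "shmmax is too small: $current (need at least $RECOMMENDED_SHMMAX)"
    fi
else
    echo "cannot read $SHMMAX_FILE"
fi
```

The sketch only reads the current value; raising the limit still requires the echo command or the /etc/sysctl.conf entry, run as root.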
Network tuning factors | Tuning for the current boot session and updating it into the boot image |
---|---|
gc_thresh3 - The value in /proc/sys/net/ipv4/neigh/default/gc_thresh3 should be the maximum number of compute operating system nodes, plus 300. | echo "5300" >/proc/sys/net/ipv4/neigh/default/gc_thresh3 |
gc_thresh2 - The value in /proc/sys/net/ipv4/neigh/default/gc_thresh2 should be 100 less than gc_thresh3. | echo "5200" >/proc/sys/net/ipv4/neigh/default/gc_thresh2 |
gc_thresh1 - The value in /proc/sys/net/ipv4/neigh/default/gc_thresh1 should be 100 less than gc_thresh2. | echo "5100" > /proc/sys/net/ipv4/neigh/default/gc_thresh1 |
gc_interval - The ARP garbage collection interval on the compute nodes should be set high so that ARP cleanup effectively never runs. | echo "1000000000" > /proc/sys/net/ipv4/neigh/default/gc_interval |
gc_stale_time - The ARP stale time should be set high so that ARP entries are not discarded. | echo "2147483647" > /proc/sys/net/ipv4/neigh/default/gc_stale_time |
base_reachable_time_ms - The time (in milliseconds) that an ARP entry remains valid should be set high so that entries are not discarded. | echo "2147483647" > /proc/sys/net/ipv4/neigh/default/base_reachable_time_ms |
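The three gc_thresh values follow from the node count by simple arithmetic, and the example values in the table (5300, 5200, 5100) are consistent with a cluster of 5000 nodes. The following is a minimal sketch of that calculation. The function name gc_thresholds is hypothetical, and the sketch only prints the values rather than writing them to /proc.

```shell
#!/bin/sh
# Hypothetical helper: print "gc_thresh3 gc_thresh2 gc_thresh1"
# for a given maximum number of compute nodes.
gc_thresholds() {
    nodes=$1
    thresh3=$((nodes + 300))      # node count plus 300
    thresh2=$((thresh3 - 100))    # 100 less than gc_thresh3
    thresh1=$((thresh2 - 100))    # 100 less than gc_thresh2
    echo "$thresh3 $thresh2 $thresh1"
}

gc_thresholds 5000   # prints: 5300 5200 5100
```

To apply the results, write each value to the matching file under /proc/sys/net/ipv4/neigh/default as root, as the table shows.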
DNS caching should be enabled to minimize runtime host name resolution, especially if LDAP is also enabled in the cluster.