Practical experiences with OS Jitter


Aug 2014.   This article's relevance and currency may need to be improved due to out-of-date information.    Please consider helping to improve the contents or ask questions on the DeveloperWorks Forums.

 

 


These pages represent earlier work done on understanding, measuring, and mitigating "OS jitter" in an HPC cluster environment, specific to the IBM Power systems running Linux.    

This material is based upon work supported by the Defense Advanced Research Projects Agency under its Agreement No. HR0011-07-9-0002. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Defense Advanced Research Projects Agency.

Introduction

In the realm of clustered computing, especially when dealing with massive scale-out solutions, there are cases where work distributed across many systems (and a great many processor cores) needs to complete in fairly predictable time frames. The operating system, and the software stack being leveraged, can introduce variability in the run times of these "chunks" of work. This variability is often referred to as "OS Jitter".

"OS Jitter" is often an area rich in potential for researchers, graduate students, and theoretical behavior characteristics. In this paper, we take a slightly different approach focused on the practical usage of a Linux operating system available for Power 6 systems.

In an ideal environment, users would like nothing better than to see predictable behavior - e.g. the same application workload or benchmark performing a series of calculations completes in the same amount of time. The reason this does not occur, particularly as more nodes and CPUs are added into the mix, is often OS Jitter. OS Jitter can range from something as simple as background daemons waking up occasionally to much more elusive sources such as firmware or processor chip behavior.

The following report is intended to provide a practical view and better understanding of OS Jitter, in the context of Power and Linux systems.

 


A Practical View of OS Jitter

Here we define "OS Jitter" as interference that an application (be it a process or a thread) experiences, preventing it from obtaining nearly identical, predictable, and repeatable results from a run-time completion perspective (i.e., how long the task takes to complete).

It is a given that applications can provide the same computed results on a repeatable basis. What is usually not considered is the variation in the precise amount of time the calculations require. In the HPC/MPI world, where hundreds (or thousands) of parallel compute processes are performing calculations and synchronizing via MPI calls across multiple CPUs and multiple nodes, each step can be only as fast as its slowest participant: if even one rank is delayed by a jitter event, every other rank waits for it at the next synchronization point. On today's systems, where results can be timed in microseconds, OS Jitter can cause interference spikes measured in milliseconds.

The name "OS Jitter" implies that jitter is caused by operating system events. In reality, applications can be interrupted by a variety of other sources: other applications, kernel events, hypervisor (virtualized partition) events, firmware, driver modules, hardware interrupts, and even processor chip "features". This is why jitter can be difficult to trace; even identifying which event triggered the jitter can require a much deeper understanding of the entire stack before the jitter source can be eliminated.

osjitter_tool:

vpsum:

  • vpsum provides improved granularity of what's happening on each core for each iteration
  • We expect to provide more details on vpsum in the near future

 

 


Running and tuning vpsum on OpenMPI

The syntax for running ./vpsum depends on whether machine/rank files are used. Beginning with OpenMPI 1.3, CPU affinity can be accomplished using machine/rank files. Prior to that, MPI processes must be launched via taskset commands, one per CPU, in order for the processes to be correctly affinitized.

 

Basic vpsum and quantum period tuning

The basic syntax for vpsum is:

./vpsum -n DURATION -i ITERATION -t TAGGED_DESCRIPTION

	where:    -n DURATION  number of calcs-per-iteration required to reach a quantum period
                               (1ms, 10ms, 100ms, etc.)
                  -i ITERATION identifies how many iterations to perform and determines how long
                               the measurement will run (generally 5-10 minutes).
                  -t TAGGED_DESCRIPTION characters used to identify the output files.
                               This is a descriptive reference for the run.

Generally, the DURATION is dependent on CPU clock speeds and compiler optimization levels. The general tuning approach is to 1) pick a target quantum and 2) run vpsum several times, adjusting the DURATION until each iteration takes approximately that quantum.

For example, 1ms is a good quantum to use for tuning.

When vpsum is run, output is produced containing the compute times for each iteration by CPU. A file for each MPI process will be generated in the vpsum_timing_files subdirectory.

Consider our test system, a dual-core 3826 MHz POWER6 machine (without InfiniBand connections).

Using vpsum with the following parameters:

./vpsum -n 500000 -i 3000 -t mytest

hpc@js12c:~> ./vpsum -n 500000 -i 3000 -t mytest
                   tag = mytest
            iterations = 3000
   calcs_per_iteration = 500000
           total calcs = 1500000000
         elapsed_msecs = 1964
averaged 763747 calculations per msec

You can see from these results that the average was 763747 calculations per millisecond.

Rerunning vpsum with a DURATION of 763747 results in the following:

hpc@js12c:~> ./vpsum -n 763747 -i 3000 -t mytest
                   tag = mytest
            iterations = 3000
   calcs_per_iteration = 763747
           total calcs = 2291241000
         elapsed_msecs = 3000
averaged 763747 calculations per msec

The elapsed_msecs value is exactly the amount of time we would like: 3000 iterations at roughly 1 ms each. Examining the output from vpsum_timing_files/vpsum.out.mytest.0 we see...

iteration, min_compute_time_taskID, min_compute_usec_time, min_compute_timebase_time, 
   max_compute_time_taskID, max_compute_usec_time, max_compute_timebase_time
0       999     1335497723742398        1335497724254047
1       999     1335497724254287        1335497724765807
2       998     1335497724766040        1335497725277567
3       1000    1335497725277788        1335497725789324
4       999     1335497725789535        1335497726301058
5       999     1335497726301295        1335497726812816
6       1000    1335497726813034        1335497727324575
7       999     1335497727324783        1335497727836319
8       1001    1335497727836557        1335497728348463

and we can say that a DURATION of 763747 yields compute times close to our 1 ms (1000 microsecond) target.

The same method can be followed for 10ms and 100ms (verifying the output in the vpsum_timing_files subdirectory). Another method is to multiply the 1ms calibrated value by 10 (10ms) or 100 (100ms) to give a general estimate of the duration value. This estimate won't be as precise as tuning for each specific quantum.
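
The calibration above can also be scripted. The sketch below assumes the "averaged N calculations per msec" output line shown earlier; it runs a short seed measurement, takes the reported rate as the 1ms DURATION, and scales it for the 10ms and 100ms quanta (the scaled values are estimates and should still be verified against the timing files):

#!/bin/bash
# Calibrate vpsum DURATION values from a short seed run.
SEED=500000
CALCS_PER_MSEC=$(./vpsum -n ${SEED} -i 3000 -t calibrate | \
                 awk '/averaged/ {print $2}')

echo "1ms   DURATION: ${CALCS_PER_MSEC}"
echo "10ms  DURATION: $(( CALCS_PER_MSEC * 10 ))"
echo "100ms DURATION: $(( CALCS_PER_MSEC * 100 ))"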

 


Calculating run length

Generally, the ./vpsum program should be run for 8 to 10 minutes; this measurement period is long enough to catch OS jitter events within a reasonable time frame. Shorter runs risk missing key OS jitter events, while longer runs increase the amount of output, and large iteration values for the smaller quanta (e.g., 1 ms, 0.1 ms) will produce a significant number of output rows.
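
Since each iteration takes roughly one quantum, the iteration count is simply the target run length divided by the quantum; a small sketch assuming an 8-minute run:

#!/bin/bash
# Rough iteration count for a given quantum and target run length.
RUN_MINUTES=8
QUANTUM_MS=1        # use 10 or 100 for the larger quanta
ITERATIONS=$(( RUN_MINUTES * 60 * 1000 / QUANTUM_MS ))
echo "use -i ${ITERATIONS} for a ${QUANTUM_MS}ms quantum over ~${RUN_MINUTES} minutes"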

 


Using mpirun with vpsum with rankfile

Examples for running vpsum with mpirun are as follows.

mpirun -np 4 --hostfile machines --rankfile myrankfile   \
       ./vpsum -n DURATION -i ITERATION -t DESCRIPTION

If nodes do not have InfiniBand connections, you can use the following syntax for ethernet connectivity:

mpirun --gmca btl sm,tcp,self -np 4 --hostfile machines --rankfile myrankfile     \ 
       ./vpsum -n DURATION -i ITERATION -t DESCRIPTION

Examples of a hostfile and rankfile are as follows:

machines
--------
node1 slots=4 (or total number of CPUs)
node2 slots=4


rankfile
--------
rank 0=js12a slot=1  (slot=1,3 identify odd CPUs)
rank 1=js12a slot=3
rank 2=js12c slot=1
rank 3=js12c slot=3
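
For larger clusters the rankfile can be generated rather than typed by hand. A minimal sketch, assuming the node names are in the first column of the "machines" hostfile and that CPUS_PER_NODE matches your systems:

#!/bin/bash
# Generate a rankfile binding one MPI rank per odd CPU (T1 thread)
# on every node listed in the hostfile.
CPUS_PER_NODE=4
rank=0
> myrankfile
for node in $(awk '{print $1}' machines); do
    for (( slot=1; slot<CPUS_PER_NODE; slot+=2 )); do
        echo "rank ${rank}=${node} slot=${slot}" >> myrankfile
        rank=$(( rank + 1 ))
    done
done
cat myrankfile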

Using mpirun with vpsum without rankfile

vpsum using OpenMPI versions prior to 1.3 requires specific taskset commands for each MPI process. An example follows:

For simple two-core systems with SMT on, you have to spell out the binding for each CPU.

mpirun --gmca btl sm,tcp,self    \
       -np 1 --host js12a taskset -c 1 ./vpsum -n DURATION -i ITERATION -t DESCRIPTION :     \
       -np 1 --host js12a taskset -c 3 ./vpsum -n DURATION -i ITERATION -t DESCRIPTION :     \
       -np 1 --host js12c taskset -c 1 ./vpsum -n DURATION -i ITERATION -t DESCRIPTION :     \
       -np 1 --host js12c taskset -c 3 ./vpsum -n DURATION -i ITERATION -t DESCRIPTION

The syntax for large numbers of nodes and CPUs becomes unmanageable without a script to construct the command sequence; a sketch of such a script follows.
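
A minimal sketch of such a script, assuming the node names are in the first column of the "machines" hostfile and that DURATION, ITERATION, DESCRIPTION, and CPUS_PER_NODE are filled in for your run:

#!/bin/bash
# Build the mpirun command line that launches one taskset-bound vpsum
# process per odd CPU on every node.
CPUS_PER_NODE=4
VPSUM="./vpsum -n DURATION -i ITERATION -t DESCRIPTION"

CMD="mpirun --gmca btl sm,tcp,self"
SEP=""
for node in $(awk '{print $1}' machines); do
    for (( cpu=1; cpu<CPUS_PER_NODE; cpu+=2 )); do
        CMD="${CMD}${SEP} -np 1 --host ${node} taskset -c ${cpu} ${VPSUM}"
        SEP=" :"
    done
done
echo "${CMD}"
eval "${CMD}"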

 


OS Jitter mitigation steps

Dedicating SMT threads to the workload is good for jitter mitigation, but may yield less performance overall.

In this section, we introduce four primary areas of OS Jitter mitigation:

  1. Basic HPC compute node tunings: good recommendations for any HPC system compute node
  2. Linux scheduler tunings and interrupt mitigation: scheduler tunables, IRQ placement, and related settings
  3. Future mainline kernel mitigation techniques: things under development by the Linux community
  4. Engineering (development lab) mitigation techniques: potential micro-jitter from hardware and firmware settings, which can be deduced from hardware counter information and traces but requires engineering-level access and tools

 


Basic HPC compute node tunings

These tunings are often recommended for any HPC compute node:

  1. Dedicated full-core partitions
    • Shared CPU cycles, and micro-partitions, are not recommended for performance optimized applications
  2. Use the latest distro versions
    • SLES 11
  3. Control unnecessary daemons
    • Disable postfix
    • Disable nscd
    • Disable cron
    • Disable cdrom polling
  4. Turn off smt_snooze_delay

 

"basic" script

  • Copy and paste into "basic"
    #!/bin/bash
    
    echo "Making sure postfix is off"
    chkconfig postfix off
    /etc/init.d/postfix stop
    
    echo "Making sure name service caching is off"
    chkconfig nscd off
    /etc/init.d/nscd stop
    
    echo "Making sure cron daemon is off"
    chkconfig cron off
    /etc/init.d/cron stop
    
    echo "Making sure cdrom polling is off"
    # adjust the device path to match your system (e.g. /dev/sr0)
    hal-disable-polling --device /dev/sr
    
    # a very large snooze delay effectively turns off SMT snoozing
    ppc64_cpu --smt-snooze-delay=4000000000
    ppc64_cpu --smt-snooze-delay
    ppc64_cpu --smt
    # confirm the partition is capped and not in shared-processor mode
    cat /proc/ppc64/lparcfg | grep capped
    cat /proc/ppc64/lparcfg | grep shared
    uname -r
    
  • Running it:
    # ./basic
    Making sure postfix is off
    Shutting down mail service (Postfix)            done
    Making sure name service caching is off
    Shutting down Name Service Cache Daemon         done
    Making sure cron daemon is off
    Shutting down CRON daemon                       done
    
    Making sure cdrom polling is off
    Cannot find device /dev/sr.
    smt_snooze_delay is 4000000000
    smt is on
    capped=1
    shared_processor_mode=0
    2.6.27.19-5-ppc64
    

 


Linux scheduler tunables and interrupt mitigation

These steps are recommended as the next level of jitter mitigation. In this realm we begin considering whether workloads are running on one of the two hardware threads of a processor core, or on both.

For jitter mitigation, the following approaches are useful:

  1. Make sure the IRQ balancer is off
    • Shut down irqbalance so interrupts are not periodically redistributed across CPUs
  2. Disable realtime bandwidth reservation
    • Setting sched_rt_runtime_us to -1 removes the limit on CPU time reserved for realtime scheduling groups
  3. Reduce hung_task_check_count
    • Limits the number of tasks checked by the hung task detector
  4. Disable the software watchdog
    • By default, SLES11 looks for hung tasks every 60 seconds
  5. Change the CPU frequency governor to "performance"
    • Avoids frequency-scaling transitions while the workload runs
  6. Reduce vmstat polling
    • Lengthens the interval at which per-CPU VM statistics are refreshed
#!/bin/bash

echo "Making sure the IRQ balancer is off"
chkconfig irq_balancer off
/etc/init.d/irq_balancer stop

echo "Disable realtime bandwidth reservation"
echo -1 > /proc/sys/kernel/sched_rt_runtime_us

echo "Reduce hung_task_check_count"
echo 1 > /proc/sys/kernel/hung_task_check_count

echo "Disable software watchdog"
echo -1 > /proc/sys/kernel/softlockup_thresh

echo "Reduce vmstat polling"
echo 20 > /proc/sys/vm/stat_interval

if [ -d /sys/devices/system/cpu/cpu0/cpufreq/ ];then
        echo "Change CPU frequency governor to performance"
        for file in /sys/devices/system/cpu/*/cpufreq/scaling_governor;
                do echo performance > $file ; done
fi

Running it..

Disable realtime bandwidth reservation
Reduce hung_task_check_count
Disable software watchdog
reduce vmstat polling
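
Before launching a run, the settings applied by the script above can be verified; a small sketch:

#!/bin/bash
# Verify the scheduler/interrupt tunables set by the previous script.
echo "sched_rt_runtime_us   : $(cat /proc/sys/kernel/sched_rt_runtime_us)"   # expect -1
echo "hung_task_check_count : $(cat /proc/sys/kernel/hung_task_check_count)" # expect 1
echo "softlockup_thresh     : $(cat /proc/sys/kernel/softlockup_thresh)"     # expect -1
echo "vm stat_interval      : $(cat /proc/sys/vm/stat_interval)"             # expect 20
cat /sys/devices/system/cpu/*/cpufreq/scaling_governor 2>/dev/null | sort -u # expect "performance"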

 


MPI Task Affinity on Odd CPUs

Previous experience has shown that the best mitigation technique is to run the MPI applications on odd CPUs, which represent thread 1 (T1) of each POWER6 core. The concept is as follows:

  1. Prior to launching the MPI application, all existing processes are moved to the T0 CPU threads (thread 0). This is accomplished by disabling (offlining) all the T1 CPU threads.
  2. The T1 threads are then onlined again.
  3. When the MPI application is run, its processes are assigned to the odd CPUs.
  4. For additional mitigation, all nodes can be booted with the boot parameter "isolcpus=1,3,5,etc..." (odd CPUs). The isolcpus parameter removes the specified CPUs from the kernel scheduler's SMP balancing and scheduling algorithms.

The following script can be used to isolate the odd CPUs.

#!/bin/bash

echo "setting isolated odd cpus"
START_CPU=1
NR_CPUS=$(grep -c '^processor' /proc/cpuinfo)
echo "isolating odd cpus"
# mount the cpuset filesystem if it is not already available
grep -q cpuset /proc/mounts
if [ $? -ne 0 ]; then
	mkdir -p /dev/cpuset
	mount -t cgroup -o cpuset none /dev/cpuset
fi
# offline the odd (T1) CPUs so existing tasks migrate to the T0 threads
for (( c=${START_CPU} ; c<${NR_CPUS} ; c+=2 )); do
	/bin/echo 0 > /sys/devices/system/cpu/cpu${c}/online
done
# stop the scheduler from balancing tasks back across CPUs
echo "disabling sched_load_balance"
/bin/echo 0 > /dev/cpuset/cpuset.sched_load_balance
# bring the odd CPUs back online; the MPI tasks will be bound to them
echo "onlining odd cpus"
for (( c=${START_CPU} ; c<${NR_CPUS} ; c+=2 )); do
	/bin/echo 1 > /sys/devices/system/cpu/cpu${c}/online
done
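
After the script completes, a quick check (assuming the /dev/cpuset mount point used above) confirms that load balancing is off and the odd CPUs are back online:

#!/bin/bash
# Confirm the odd-CPU isolation took effect.
echo "sched_load_balance: $(cat /dev/cpuset/cpuset.sched_load_balance)"   # expect 0
echo "online CPUs: $(cat /sys/devices/system/cpu/online)"                 # expect all CPUs listed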

 


IRQ interrupts

On an untuned system, IRQ interrupts can occur on any CPU allowed by the /proc/irq/.../smp_affinity mask fields, which by default typically include all CPUs.

Experiments have shown that isolating IRQ interrupts to thread 0 (T0) on each core provides better mitigation results for our vpsum runs. The explanation is that if IRQ interrupts occur on the same thread as the application thread, the application task is preempted to service the interrupt.

Alternatively, IRQ interrupts can all be assigned to CPU 0. For our environment, we saw slightly better results with this technique; the reasoning is that if IRQs occur on the same core, there is still some disruption of the application thread. However, binding all the interrupts to CPU 0 could cause a performance problem in environments where IRQ interrupts are heavy, such as applications with high I/O or network activity. A sketch of the CPU 0 approach follows.
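
The smp_affinity files take a hexadecimal CPU bitmask (a value of 1 selects CPU 0); some interrupts, such as the timer, cannot be redirected, so errors are suppressed in this minimal sketch:

#!/bin/bash
# Steer all re-assignable IRQs to CPU 0 (hex bitmask "1").
# For T0-only placement, use a mask with only the even bits set
# (e.g. 55555555 on a 32-CPU partition).
for irq in /proc/irq/[0-9]*; do
    echo 1 > ${irq}/smp_affinity 2>/dev/null
done
grep . /proc/irq/*/smp_affinity | head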

 


OS Jitter Boot Parameters

Several boot parameters are available to mitigate OS jitter:

1) rhash-entries=8192

  • The default kernel route cache hash table can consume over 20MB. Limiting the table size will reduce the length of specific OS Jitter events.

2) isolcpus=1,3,5,7,9,11,...etc

  • Having targeted the MPI processes to run on odd CPUs, the next step is booting the kernel with "isolcpus=", thereby ensuring that the Linux scheduler will remove the specified CPUs from the SMP balancing and scheduling algorithms.
  • Pieced together, the boot command will have the following:
{existing boot parameters...} rhash-entries=8192 isolcpus=1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,etc....

To confirm that the parameters have been correctly applied after the system has been rebooted, use "cat /proc/cmdline" or dmesg | grep -E "route|isol".
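
For partitions with many CPUs, the odd-CPU list can be generated rather than typed out; a small sketch:

#!/bin/bash
# Build the isolcpus= string for all odd CPUs in the partition.
NR_CPUS=$(grep -c '^processor' /proc/cpuinfo)
echo "isolcpus=$(seq -s, 1 2 $(( NR_CPUS - 1 )))"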

 


Leveraging the 2.6.32 kernel

In newer mainline kernels, several features have been added that aid the identification of OS jitter. First, upstream Power architecture patches support additional enhancements to the OSjitter tool: OSjitter events can now track static tracepoints, tasklets, workqueues, and PHYP (POWER Hypervisor) calls.

Second, substantial work on Performance Counters for Linux (PCL) has been added in 2.6.32. In addition to the previous hardware counters, software-driven counters such as page faults, task migrations, and tracepoints are now available. With the new perf tools, you can isolate hardware/software counters down to the detailed code level, thereby determining the occurrence of performance events at a granularity as small as individual loop iterations.

The key component for the latest kernels is inclusion of the following config parameter:

CONFIG_PERF_EVENTS=y
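
A quick way to confirm that the running kernel was built with this option and to exercise the new software counters is sketched below; the config file location varies by distro, and the vpsum parameters are just the single-process example from earlier:

# Check the running kernel's configuration (one of these usually exists)
zgrep CONFIG_PERF_EVENTS /proc/config.gz 2>/dev/null
grep CONFIG_PERF_EVENTS /boot/config-$(uname -r) 2>/dev/null

# Count software events around a single-process vpsum run
perf stat -e page-faults,cpu-migrations,context-switches \
    ./vpsum -n 763747 -i 3000 -t perftest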

 


Engineering level observations

Compugard is an RTAS service that monitors system activity from PHYP. In the engineering lab, we have observed that disabling compugard can help mitigate OS jitter.

The compugard programs will be made available; when extracted, the sources are disable_compugard.c, set_compugard.c, and get_compugard.c, and they require compilation.

Compiling:

gcc -O2 -o disable_compugard disable_compugard.c
gcc -O2 -o get_compugard get_compugard.c
gcc -O2 -o set_compugard set_compugard.c

To check the current compugard setting, use "get_compugard" which returns either a 0 (disabled) or 1 (enabled).

To disable compugard, use "disable_compugard".

Note: In order to disable the compugard, the LPAR must have appropriate access.

To re-enable compugard, use "set_compugard 1".
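
For measurement runs, a hypothetical wrapper along the following lines can disable compugard beforehand and restore it afterward (it assumes the compiled binaries are in the current directory and that the LPAR has the required access):

#!/bin/bash
# Disable compugard for the duration of a jitter measurement, then restore it.
./disable_compugard
./get_compugard              # should now report 0 (disabled)

# ... launch vpsum / the MPI workload here ...

./set_compugard 1
./get_compugard              # should report 1 (enabled) again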

 


Graphing vpsum output

Each run of vpsum can produce huge amounts of output in the vpsum_timing_files subdirectory. The output is essential for identifying compute timing variations.

Graphing scripts are also available to consolidate and present the results.

Since the vpsum_timing_files/vpsum* files can be spread across multiple nodes (unless using a shared file system), the first task is to consolidate the files into a single output directory (copy via scp, for example). After this consolidation step is completed, the graphs can be processed.
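
A minimal consolidation sketch using scp, assuming the node names are in the first column of the "machines" hostfile and that vpsum was run from the user's home directory on each node:

#!/bin/bash
# Gather vpsum output files from all nodes into one local directory.
DEST=/data/vpsum_timing_files
mkdir -p ${DEST}
for node in $(awk '{print $1}' machines); do
    scp ${node}:vpsum_timing_files/vpsum.out.* ${DEST}/
done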

Note: The graph scripts require Octave and gnuplot to be installed. Once the required components have been installed (we suggest using an x86 client), you can run the scripts.

The following scripts are available for graphing the vpsum results:

processvpsumdata.m -  primary octave graphing script to process and graph vpsum output
hdrloadmn.m - loads header information and data from vpsum files

The following parameters are passed to the processvpsumdata.m program:

  1. vpsum output file directory - directory containing consolidated files
  2. vpsum file suffix - the file suffix identifies the output files and consists of the characters after "vpsum.out." up to, but not including, the final file number (the process identifier). Be sure to include the trailing period.
  3. number of total processes - total number of processes used for the run
  4. iterations - number of iterations specified for the run (outer loop value)

For example, vpsum files with the name vpsum.out.nt_nrq_nsl_od.9393188.30000.0.(PROCESS NO), would use the following syntax:

./processvpsumdata.m /data/vpsum_timing_files nt_nrq_nsl_od.9393188.30000.0. 128 30000

Output will be created in the target directory under the following three files:

  1. jitter_histogram - shows a histogram of the jitter percentage by iteration
  2. jitter_slowdown - OS Jitter graph - shows the amount of slowdown by iteration
  3. jitter_summary - text file showing calculated values for the mean slowdown (OS Jitter %) and other descriptive statistics

These files can be viewed from a web browser.

Example files:

jitter_summary:

Mean slow down 0.248488%

Median Min 9999
Absol. Min 9995
Mean   Min 9999
Mean   Max 10023
Absol. Max 10980

jitter_slowdown: (graph of slowdown by iteration)

jitter_histogram: (histogram of jitter percentage by iteration)