OS Jitter Mitigation Techniques


These pages represent previous work done on understanding, measuring, and mitigating "OS jitter" in an HPC cluster environment, specific to IBM Power systems running Linux.

This material is based upon work supported by the Defense Advanced Research Projects Agency under its Agreement No. HR0011-07-9-0002. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Defense Advanced Research Projects Agency.


Introduction

As systems scale both up and out, "OS jitter" (sometimes referred to as operating system noise) continues to be an important performance consideration, particularly in MPI and real-time environments. However, OS jitter is not easy to measure, understand, or address.

In a perfect world, randomness aside, you would like nothing better than predictable behavior - e.g. the same application workload or benchmark completing a series of calculations in precisely the same amount of time every run. In cluster computing, one reason this does not occur, particularly as more nodes and CPUs are added into the mix, is often OS jitter. OS jitter can range from something as simple as background daemons running periodically to things much more elusive such as firmware or even processor chip behavior.

The following text provides a better understanding of OS jitter. We discuss a tool that can be used to identify OS jitter, describe various Linux mitigation techniques that help reduce it, and present several comparison results from implementing those techniques.

 


What is OS jitter?

Here we define OS jitter as any interference that an application process or thread experiences. An application typically runs on specified CPUs; when the application is preempted on the CPU it is running on in favor of other services, this can prevent the application from obtaining predictable and repeatable results. It is a given that applications can provide the same computed results on a repeatable basis. What is usually not considered is the variation in the precise amount of time the calculations require. In the HPC/MPI world, where hundreds (or thousands) of parallel compute processes are performing calculations and synchronizing via MPI functions across multiple CPUs and multiple nodes, the overall cluster calculation can be only as fast as its slowest participant. On today's systems, where results can be timed in microseconds, OS jitter can cause interference measured in milliseconds.

The name OS jitter implies that jitter is related to OS (Linux in our case) events. In reality, applications can be interrupted by a variety of other sources, such as other applications, kernel events, hypervisor (virtualized partition) events, firmware, driver modules, hardware interrupts, etc. This is why jitter can be difficult to trace: even identifying which event triggered the jitter, let alone eliminating its source, can require a much deeper understanding of the entire stack.

 


Scheduler, interrupts and event spikes

At a high level, it helps to think of OS jitter in terms of three things:

1) the Linux OS scheduler,
2) hardware and software interrupts, and
3) event spikes.

The Linux OS scheduler, among other things, manages multiple tasks running on many CPUs. A large MPI workload, for example, could be running compute tasks on every core across multiple nodes. The scheduler on each node is responsible for providing each task with an allocated time slice of CPU resource. If a task does not complete in its allocated time slice, it is rescheduled on either the same or a different CPU. The scheduler also manages the priority of kernel tasks, daemons, and other application tasks requiring scheduling, any of which can either preempt an MPI task or extend the time an MPI task waits to run. If a non-MPI task runs within a short time slice, the MPI task is quickly rescheduled and runs. If the non-MPI task takes considerable time to complete, the MPI task sees additional delay. These delays appear as spikes and can be measured in terms of microseconds (us) to milliseconds (ms). The result is that the overall MPI job takes longer to complete. If the MPI tasks perform frequent MPI barrier synchronizations, the delay is felt across all the CPUs and nodes.
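
As a quick way to see what else is competing for the same CPUs as the compute tasks, a minimal sketch using standard ps fields (the psr column reports the processor each task last ran on):

# List every task with the processor it last ran on, sorted by CPU, so
# kernel threads and daemons sharing a CPU with MPI ranks stand out.
ps -eo pid,psr,pri,comm | sort -k2 -n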

Interrupts play a key role in OS jitter. Interrupt handlers must respond to hardware interrupts as quickly as possible. The handlers usually have two processing components: a top half and a bottom half. The top half is performance critical and contains only the code necessary to acknowledge the interrupt and set up the bottom half, which can use deferred handling mechanisms such as work queues, softirqs and tasklets. Interrupt processing can occur on the CPUs, or CPU thread siblings, running the MPI processes, displacing and impacting them. Several interrupt mitigation techniques are discussed below to reduce the impact of these events.
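
A simple way to observe where hardware interrupts are landing, independent of the OSjitter tool, is to sample /proc/interrupts before and after a run and compare the per-CPU counts. A minimal sketch:

# Snapshot per-CPU interrupt counts, run the workload, snapshot again,
# then diff to see which IRQs fired on which CPUs during the run.
cat /proc/interrupts > irq.before
# ... run the MPI job ...
cat /proc/interrupts > irq.after
diff irq.before irq.after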

 


Measuring OS jitter

 


An "OSjitter" tool

A tool to measure OS jitter was developed by Anton Blanchard; it essentially logs tasks as they enter and exit each CPU's run queue. After logging this data, post-processing reporting tools can provide both detailed and summarized views. The summarized view reports the task id (PID), process name, a count (the number of times the task ran), the total amount of time it ran, the shortest amount of time it ran on the CPU (min), the longest amount of time it ran on the CPU (max), and the average time it ran on the CPU (avg). All of the times are reported in milliseconds (ms); microseconds (us) can be read from the decimal places (i.e. 0.769 is 769 microseconds).

 

Sample OSjitter tool output

 

pid   name                       count    total(ms)   min(ms)   max(ms)   avg(ms)   period(ms)
    0 idle                       37477  4376964.954     0.000  8277.306   116.791      228.037
22049 vpsum                       3229    25034.494     0.001    10.177     7.753       36.704
22070 vpsum                       3165    25047.976     0.001    10.158     7.914       40.630
22071 vpsum                       3179    25044.649     0.001    10.107     7.878       39.646
22063 vpsum                       3187    25045.697     0.001    10.093     7.859       39.647
22074 vpsum                          7       63.689     5.554    10.065     9.098        0.000
22085 vpsum                          8       65.037     0.769    10.057     8.130        0.000
22079 vpsum                          7       62.331     4.626    10.031     8.904        0.000
22080 vpsum                          8       65.231     0.806    10.027     8.154        0.000
22064 vpsum                       2396    23122.308     0.001    10.027     9.650      366.265
22044 vpsum                       3143    25053.088     0.001    10.013     7.971       42.200

See the OSjitter tool documentation for usage details.

 


VPSUM - a sample benchmark for observing run-time variations

We used a program named vpsum developed by John DiVirgilio (with help from many others) as our benchmark workload and subsequently used the OSjitter tool to investigate the behavior of scheduled tasks running during vpsum execution.

Our intent is to make the vpsum program available shortly, along with additional observations on the characteristics of using a program like vpsum.

Vpsum is an excellent program for investigating OS jitter (see the OS jitter workload characteristics section below). As an MPI application, vpsum does the following:

  1. Initiates N processes, where N is the number of CPU cores times the number of nodes.
  2. Initiates MPI communications between all of the MPI tasks started by vpsum.
  3. Each vpsum process executes an outer and an inner loop. For each outer loop iteration, an MPI_Barrier(MPI_COMM_WORLD) call synchronizes all of the processes across all nodes. The inner loop counter (duration) is tuned so that the inner loop takes approximately a specific quantum of time (1, 10 or 100 milliseconds) to complete; additional studies with a 100us (microsecond) quantum were also performed. The outer loop count (iteration) can be tuned so that the overall run lasts 5 to 10 minutes.
  4. The inner loop performs a simple calculation (square a value and place the result into a 1024-byte array).
  5. The compute time for each inner loop iteration is reported to a per-task output file. A run with 256 processes will have 256 output files, one for each process.
  6. All of the output files are used as input to post-processing scripts.
  7. The scripts provide a graphic representation of the percent change between the timing values; when viewed, the percent changes appear as "spikes". By matching the approximate iteration against OSjitter tool events, we can investigate the events or anomalies that occur within a run.
  8. The post-processing scripts also calculate a "Mean slow down" value, which is, in effect, the value we use to report the effectiveness of the mitigation techniques.

Sample output results for 1ms timing loops are illustrated below. The first column is the iteration, the second is the compute time (in microseconds), and the third is the timebase clock value.

 

Sample vpsum program output

 

iteration, min_compute_time_taskID, min_compute_usec_time, min_compute_timebase_time,
    max_compute_time_taskID, max_compute_usec_time, max_compute_timebase_time
0  1001  512131
1  1001  512136
2  1000  512122
3  1001  512130
4  1003  513470
5  1001  512133
6  1000  512123
7  1001  512136
8  1000  512137
9  1000  512122
10  1001  512136
11  1001  512134
12  1001  512130

 


OS jitter workload characteristics

We were interested in a workload that was portable across operating systems, MPI implementations, compilers (XLC/gcc) and optimization levels, and that could provide identical compute time results. However, we found different results, both in compute times and OS jitter levels, depending on which combination was used. A one-size-fits-all program is therefore difficult to achieve, and the inner loop iteration count must be tuned per combination to achieve the desired quantum.

A second observation about OS jitter and vpsum is that, based on previous best-practice experience, vpsum processes are run on a process-per-core basis. The two primary program functions are the MPI barrier sync and the tight compute loop; the tight compute loop essentially pegs the core at 100%. On a P6 SMT system there are two threads per CPU core (T0 and T1). Some of the mitigation techniques are general purpose (i.e. disabling services, reducing kernel activity); however, several of the mitigation techniques depend on the process-per-core approach.

In reality, we have seen other benchmarks deliver better performance using a process per CPU thread, especially when the workload exhibits wait characteristics. As nodes are scaled significantly out, it is conceivable that, due to OS jitter, workloads running one process per core will eventually outperform those running multiple processes per core. For general workloads, we therefore suggest an experimental approach when implementing mitigation techniques, as some techniques may or may not work for your application and environment.

 


psnap and vpsum

In addition to vpsum, we also experimented with psnap.

PSNAP is a frequently used open source benchmark for measuring OS jitter and has some similarities to VPSUM. Both are MPI applications that run tight, tuned compute loops over a time period. PSNAP provides more built-in tunables, such as automatic convergence to a specified granularity or quantum (e.g. 1ms, 10ms), barrier count control, and warm-up period specification.

Rather than providing detailed iteration output, psnap provides a histogram binned by rank, time, count and hostname.

my_count= 94161 global_min= 94161 min_loc= 0 global_max= 94161 max_loc= 0
Using Global max for calibration
# 0 5016986 p6ihhpc5
0 100 37174 p6ihhpc5
0 101 11265 p6ihhpc5
0 102 979 p6ihhpc5
0 103 57 p6ihhpc5
0 104 33 p6ihhpc5
0 105 152 p6ihhpc5
0 106 186 p6ihhpc5
0 107 41 p6ihhpc5
0 108 27 p6ihhpc5
0 109 18 p6ihhpc5

PSNAP is very useful for obtaining an overall jitter factor; however, the lack of detailed per-iteration output by rank makes it difficult to identify when event spikes occur. vpsum, on the other hand, is better suited for determining OS jitter patterns, particularly when vpsum iteration time stamps are reconciled against the OSjitter tool event logs.

 


OS Jitter Mitigation Techniques

The following tables list various OS jitter mitigation techniques which can be implemented. Most of the OS jitter events were identified using the OSjitter tool. Each entry lists the mitigation action, the commands used to apply it, and relevant remarks. Some of the techniques are discussed in more detail following the tables.

 


Basic Mitigation Techniques

 

In these examples, we profiled SLES 11 running on a POWER6 IBM Power 575 system. The average times listed are for relative reference purposes.

Action: Full core dedicated partitions
Command sequence: HMC LPAR definition
Remarks: Shared CPU micro-partitions are not ideal for HPC applications, which require minimal CPU sharing.

Action: Disable postfix
Command sequence:
  chkconfig postfix off
  /etc/init.d/postfix stop
Remarks: Postfix is an alternative email program often installed by default with the OS. Disable it if not required. Seen as PID qmgr/pickup, >100 usecs (avg 47 usecs).

Action: Disable nscd
Command sequence:
  chkconfig nscd off
  /etc/init.d/nscd stop
Remarks: nscd is a daemon that caches name service requests. Seen as the nscd PID, 7-11 usecs.

Action: Disable cron
Command sequence:
  chkconfig cron off
  /etc/init.d/cron stop
Remarks: Alternatively, remove unnecessary cron scripts in /etc/cron*. Seen as the CRON PID, avg of 48 usecs.

Action: Disable CDROM polling
Command sequence: hal-disable-polling --device /dev/sr
Remarks: Seen as 18-22 usec "PID hald-addon-stor" events.

Action: Set smt-snooze-delay to a very large value, in effect disabling the cede to PHYP
Command sequence: ppc64_cpu --smt-snooze-delay=4000000000
Remarks: The default snooze delay is 100. Setting a high value keeps CPU threads spinning, so idle CPUs are not ceded to PHYP (see General Discussions).
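
As a convenience, the service-related entries above can be applied in one pass. This is a minimal sketch for the SLES 11 init scripts named in the table; review each service against your site's requirements before disabling it:

# Disable and stop the background daemons identified above as jitter sources.
for svc in postfix nscd cron; do
    chkconfig $svc off
    /etc/init.d/$svc stop
done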

 


Linux scheduler mitigation techniques

 

Action: Disable irqbalance
Command sequence:
  chkconfig irq_balancer off
  /etc/init.d/irq_balancer stop
Remarks: Daemon that distributes IRQ interrupts over CPU cores (see General Discussions). Seen as "PID irqbalance" events, max >1 ms (avg 148 usecs).

Action: Disable realtime bandwidth reservation
Command sequence: echo -1 > /proc/sys/kernel/sched_rt_runtime_us
Remarks: Limits the CPU time allocated towards scheduling group threads and disables realtime bandwidth reservation.

Action: Slow down the vmstat update interval
Command sequence: echo 20 > /proc/sys/vm/stat_interval
Remarks: Seen as "EVENT vmstat_update/cache_reap" events. Lower or higher values (e.g. 5/10/30) can be used. The default is to update stats every 1 second.

Action: Reduce hung_task_check_count
Command sequence: echo 1 > /proc/sys/kernel/hung_task_check_count
Remarks: Limits the number of tasks checked by the hung task daemon [khungtaskd]. The default on IH nodes is 4194304.

Action: Disable the software watchdog
Command sequence: echo -1 > /proc/sys/kernel/softlockup_thresh
Remarks: By default, SLES11 looks for hung tasks every 60 seconds.

Action: Change the CPU frequency governor to "performance"
Command sequence:
  for file in $(find /sys/devices/system/cpu/*/cpufreq/scaling_governor); do echo performance > $file; done
Remarks: The default value "ondemand" cannot always respond to loads from real-time tasks since the governor itself may not get a chance to run.

Action: SMT enabled - all application processes affinitized to odd CPUs
Command sequence: commands/scripts depend on the application and launch method
Remarks: see General Discussions
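
Because most of these are runtime tunables that revert at reboot, it can help to capture the current values before changing anything. A minimal sketch, assuming the SLES 11 paths shown in the table:

# Record the current settings so the changes can be reviewed or reverted later.
for f in /proc/sys/vm/stat_interval \
         /proc/sys/kernel/sched_rt_runtime_us \
         /proc/sys/kernel/hung_task_check_count \
         /proc/sys/kernel/softlockup_thresh; do
    echo "$f = $(cat $f)"
done > jitter_tunables.before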

 


Single thread Mitigation

 

Action: Set IRQ interrupt affinity to CPU 0
Command sequence:
  for i in /proc/irq/*/smp_affinity; do echo 1 > $i; done
Remarks: see General Discussions

Action: Set IRQ interrupt affinity to thread 0 (T0) on all cores
Command sequence:
  for i in /proc/irq/*/smp_affinity; do echo "55555555,55555555" > $i; done
Remarks: Cannot be combined with setting all interrupts to CPU 0 (see General Discussions).

Action: Reduce the size of the IP route cache hash table
Command sequence: add "rhash_entries=8192" to the boot command line arguments
Remarks: Seen as "EVENT rt_worker_func" events; the max can be >14 ms.

Action: Remove CPUs from scheduler consideration
Command sequence: add "isolcpus=1,3,5..(odd cpus)" to the boot command line arguments
Remarks: If affinitizing MPI tasks to the odd CPUs, the specified CPUs are removed from general kernel SMP balancing and scheduler algorithms. CPU affinity syscalls are the only way to move tasks onto the isolated CPUs.
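
After applying the interrupt affinity and isolcpus changes, a quick verification pass over the standard procfs interfaces confirms they took effect. A minimal sketch:

# Confirm the isolcpus= (and rhash_entries=) arguments made it onto the kernel command line.
cat /proc/cmdline
# Show the affinity mask now programmed for each IRQ.
for i in /proc/irq/*/smp_affinity; do
    echo "$i: $(cat $i)"
done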

 


Engineering level Mitigation

 

Action: Disable compugard
Command sequence:
  gcc -o disable_compugard disable_compugard.c -lrtas
  ./disable_compugard
Remarks: Compugard is an RTAS service that monitors system events from PHYP. Download the compugard utilities. See General Discussions.

Action: Elevate application threads to real-time priority
Command sequence: chrt -p 80 PID
Remarks: May or may not provide a benefit. The argument against real-time priority is that the application can potentially starve needed kernel services.
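
If you do experiment with real-time priority, the single chrt command above can be applied across all of an application's tasks with a small loop. This is a sketch only; the "./vpsum" pattern is our benchmark and should be replaced with your application, and as noted the change may starve needed kernel services:

# Raise every vpsum process to real-time priority 80 (chrt uses SCHED_RR by default).
for pid in `ps -ef | grep "./vpsum" | grep -v grep | awk '{print $2}'`; do
    chrt -p 80 $pid
done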

 


General Discussion on Several Mitigation Techniques

 

Some of the jitter mitigation techniques such as affinitizing applications and IRQ interrupts to CPUs were initially identified in previous best practices.

 


Affinitizing processes to odd CPUs

 

Best practices have shown (for programs similar to vpsum) that affinitizing the MPI application threads to the odd CPUs provides the best mitigation results. Various methods can be used to affinitize the MPI tasks. For IBM PE, a launcher program was used to bind the processes.

With OpenMPI v1.2, a single mpirun invocation initiated a command stream sequence in which each process was separately launched onto the appropriate CPU, as follows:

mpirun -np 1 --host node1 taskset -c 1 ./vpsum ... --host node1 taskset -c 3 ./vpsum ... --host node2 taskset -c 1 .....

OpenMPI v1.3 provides robust machine file and rankfile features better suited to launching mpirun with process affinity. For example, a much more condensed command stream is shown below:

mpirun -np 32 --hostfile machines --rankfile rankfile ./vpsum ...
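
For illustration only (node1 and node2 are placeholder hostnames, and the exact syntax can vary between OpenMPI releases), a rankfile that binds successive ranks to the odd CPUs of each node might look like:

rank 0=node1 slot=1
rank 1=node1 slot=3
rank 2=node1 slot=5
rank 3=node1 slot=7
rank 4=node2 slot=1
rank 5=node2 slot=3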

A third option is to execute affinity binding scripts on all the nodes after the processes have been started. The PIDs can be traversed and each one bound to an odd CPU, as follows:

cpu=1       # first odd CPU
procs=1
# Walk the application PIDs and bind them, moving to the next odd CPU
# after every two processes.
for x in `ps -ef | grep "./xhpl" | grep -v grep | awk '{print $2}'`; do
    taskset -pc $cpu $x
    mod=$(($procs % 2))
    if [ $mod -eq 0 ]; then
        let cpu=cpu+2
    fi
    let procs=procs+1
done
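
To confirm the bindings, taskset can also report the affinity now in effect for each PID. A quick sketch; adjust the grep pattern to the application being launched:

# Print the CPU affinity list currently set for each application PID.
for x in `ps -ef | grep "./xhpl" | grep -v grep | awk '{print $2}'`; do
    taskset -pc $x
done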

 


IRQ Interrupts

 

Best practices have shown that isolating IRQ interrupts to thread 0 (T0) on each core provides good results. The explanation is that if IRQ interrupts occur on the same thread as the application thread, the application task is preempted to service the interrupt. In the "55555555,55555555" mask written to smp_affinity above, every even-numbered bit is set; with two threads per core and T0 on the even-numbered CPUs, this directs interrupts to thread 0 of each core while the application runs on the odd (T1) CPUs.

Alternatively, all IRQ interrupts can be assigned to CPU0. We saw slightly better results with this technique. The reasoning is that if IRQ interrupts occur on the same core as the application thread, there is still disruption for the application thread.

However, binding all the interrupts to CPU0 could cause a potential performance problem in an environment where IRQ interrupts are heavy, such as applications with high I/O or network activity.

 


SMT Snooze Delay

 

P6 machines have software and firmware layers between the OS and the hardware resources. Known as the hypervisor or PHYP, these layers provide memory, CPU and device services to the Linux OS. If a CPU thread is not doing useful work, the thread can be ceded (via an h_cede call) to the hypervisor. For environments where virtual partitions share resources, this is beneficial and allows other partitions to make use of idle CPUs. The default snooze delay on SLES11 is 100 usecs. By setting this value extremely high (i.e. 4 billion microseconds), the thread avoids the h_cede call.
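
A sketch of applying and checking the setting; the per-CPU sysfs attribute shown is what we would expect on SLES 11 era powerpc kernels and may not exist on other kernel versions:

# Use a very large snooze delay so threads keep spinning instead of ceding to PHYP.
ppc64_cpu --smt-snooze-delay=4000000000
# Read back the per-CPU value (path assumed for this kernel generation).
cat /sys/devices/system/cpu/cpu0/smt_snooze_delay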

 


Disabling compugard

 

Within the engineering team, code was developed to disable compugard on Power6 systems. We plan on making these tools available shortly. The programs include disable_compugard, set_compugard and get_compugard. In order to disable compugard, the partition must have the appropriate permission.

To check the current compugard setting, use "get_compugard" which returns either a 0 (disabled) or 1 (enabled).

To disable compugard, use "disable_compugard".

To re-enable compugard, use "set_compugard 1".
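
Putting the three utilities together, a typical sequence (assuming they have been built as shown in the mitigation table and the partition has the required permission) would be:

./get_compugard        # prints 1 (enabled) or 0 (disabled)
./disable_compugard    # turn compugard off for the jitter-sensitive run
./get_compugard        # should now print 0
# After the run, restore the original setting:
./set_compugard 1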

 


VPSUM OSjitter Results

We look at several different graphs representing OS jitter using the vpsum benchmark. We normally use a test harness that lets us set up automated runs, with the mpirun command executed within a fairly complex script. The graphs show a non-mitigated run, a mitigated run (both within our test harness environment), a mitigated run with no test harness, and a final run executing mpirun directly from the shell (no bash script).

 


10ms Non-mitigated run

 

In the above graph, we see some very large variations. After applying our mitigation techniques as shown in the following graph, most of our variations are significantly reduced.

 


10ms Mitigated run

 

 


10ms Mitigated with no test harness

 

When we remove our test harness, we see an even further reduction in variations. Now we are left with just a bash script that we pass a variety of parameters to.

 


10ms Mitigated with no script (launch ./mpirun directly without using any scripts)

 

And finally, for the above graph, we execute the vpsum program directly from the mpirun command (without any test harness or shell script). Notice the further improvement. A main point of the above charts is that bash scripting itself can be a primary source of OS jitter.

 


Results from NAS Benchmark Suite (Conjugate Gradient test)

We illustrate the effects of mitigating OS jitter on the NAS Benchmark Suite (Conjugate Gradient test) using 4 compute nodes (32 cores per node). This particular workload is a good candidate for mitigation techniques because of its high computational intensity and its use of many CPU processes. With the four nodes, a total of 128 processes were used.

The variations were as follows:

  • Non-mitigated
  • CPU affinity (each process assigned to a specific CPU)
  • CPU affinity plus all mitigation techniques

We ran the workload in sets of three consecutive runs. The results are as follows:

Run  Non-mitigated             CPU Affinity              CPU Affinity + Mitigation
     Duration     Mop/s        Duration     Mop/s        Duration     Mop/s
1    7.1          20176.26     5.11         28067.27     5.14         27908.99
2    355.41       403.33       5.15         27813.68     5.14         27908.23
3    109.38       1310.57      5.14         27900.57     5.15         27847.47

Significant run-time variations occur when processes are launched without using CPU affinity.

Using affinity (OpenMPI v1.3, machine and rankfiles) provides more consistent durations and results.

Lastly, employing all the mitigation techniques discussed provides the most consistent results.