The XLSMPOPTS environment variable allows you to specify options that
affect SMP execution. You can declare XLSMPOPTS by using the following
bash command format:
                      .-:-------------------------------------------.
                      V                                             |
>>-XLSMPOPTS=--+---+----runtime_option_name-- =----option_setting---+--+---+-><
               '-"-'                                                   '-"-'
You can specify option names and settings in uppercase or lowercase.
You can add blanks before and after the colons and equal signs to
improve readability. However, if the XLSMPOPTS option
string contains embedded blanks, you must enclose the entire option
string in double quotation marks (").
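As a minimal sketch of the quoting rule, the following commands set the same two options with and without embedded blanks. The option values (4 threads, static scheduling) are chosen only for illustration.

```shell
# No embedded blanks: quotation marks are optional.
export XLSMPOPTS=parthds=4:schedule=static

# Blanks around the colons and equal signs: the entire option
# string must be enclosed in double quotation marks.
export XLSMPOPTS="parthds = 4 : schedule = static"

# Show the value that a program started from this shell inherits.
printf '%s\n' "$XLSMPOPTS"
```

Without the quotation marks, the shell would split the second string at the blanks and the assignment would not contain the full option list.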
You can specify the following runtime options with the XLSMPOPTS
environment variable:
- schedule
- Selects the scheduling type and chunk size to be used as the default
at run time. The scheduling type that you specify is used only
for loops that were not already marked with a scheduling type at
compilation time.
Work is assigned to threads in a different manner, depending
on the scheduling type and chunk size used. A brief description of
the scheduling types and their influence on how work is assigned follows:
- dynamic or guided
- The runtime library dynamically schedules parallel work for threads
on a "first-come, first-do" basis. "Chunks" of the remaining work
are assigned to available threads until all work has been assigned.
Work is not assigned to threads that are asleep.
- static
- Chunks of work are assigned to the threads in a "round-robin"
fashion. Work is assigned to all threads, both active and asleep.
The system must activate sleeping threads in order for them to complete
their assigned work.
- affinity
- The runtime library performs an initial division of the iterations
into number_of_threads partitions. The number
of iterations that these partitions contain is:
CEILING(number_of_iterations / number_of_threads)
These partitions are then assigned to the threads, one partition per
thread, and each partition is subdivided into chunks of iterations. If a
thread is asleep, the threads that are active will complete their
assigned partition of work.
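The partition size for affinity scheduling can be worked out with integer arithmetic. The sketch below uses a hypothetical loop of 100 iterations and 8 threads. It assumes that the last partition receives whatever iterations remain, which the description above does not state explicitly.

```shell
# Hypothetical loop: 100 iterations shared by 8 threads.
number_of_iterations=100
number_of_threads=8

# Integer form of CEILING(a / b): (a + b - 1) / b
partition_size=$(( (number_of_iterations + number_of_threads - 1) / number_of_threads ))

# Assumption: the first 7 partitions are full and the last takes the rest.
last_partition=$(( number_of_iterations - (number_of_threads - 1) * partition_size ))

echo "partition size: $partition_size, last partition: $last_partition"
```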
Choosing chunking granularity is a tradeoff between overhead
and load balancing. The syntax for this option is schedule=suboption,
where the suboptions are defined as follows:
- affinity[=n]
- As described previously, the iterations of a loop are initially
divided into partitions, which are then preassigned to the threads.
Each of these partitions is then further subdivided into chunks that
contain n iterations. If you have not specified n, a chunk consists of
CEILING(number_of_iterations_left_in_local_partition / 2) loop
iterations.
When a thread becomes available, it takes the next chunk from its
preassigned partition. If there are no more chunks in that partition,
the thread takes the next available chunk from a partition preassigned
to another thread.
- auto
- With auto, scheduling is delegated to the compiler
and runtime system, which can choose any possible mapping of
iterations to threads (including all possible valid schedules). The
mapping can differ from loop to loop. Do not specify a chunk size (n)
when you use auto; if you do, the compiler issues a severe error
message.
Note: When both the -qsmp=schedule option
and OMP_SCHEDULE are used, the option will override
the environment variable.
- dynamic[=n]
- The iterations of a loop are divided into chunks that contain n iterations
each. If you have not specified n, a chunk
consists of CEILING(number_of_iterations / number_of_threads) iterations.
- guided[=n]
- The iterations of a loop are divided into progressively smaller
chunks until a minimum chunk size of n loop
iterations is reached. If you have not specified n,
the default value for n is 1 iteration.
The first chunk contains CEILING(number_of_iterations / number_of_threads)
iterations. Subsequent chunks consist of CEILING(number_of_iterations_left
/ number_of_threads) iterations.
- static[=n]
- The iterations of a loop are divided into chunks that contain n iterations.
Threads are assigned chunks in a "round-robin" fashion. This is known
as block cyclic scheduling. If the value of n is
1, the scheduling type is specifically referred to as cyclic scheduling.
If you have not specified n, the chunks
will contain CEILING(number_of_iterations / number_of_threads) iterations.
Each thread is assigned one of these chunks. This is known as block
scheduling.
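To make the guided formula concrete, the following sketch lists the chunk sizes for a hypothetical loop of 20 iterations, 4 threads, and schedule=guided=2. It assumes that a chunk never shrinks below n and that the final chunk is capped at the iterations that remain. The actual runtime library may differ in these edge cases.

```shell
number_of_iterations=20
number_of_threads=4
n=2                       # minimum chunk size, as in schedule=guided=2

left=$number_of_iterations
chunks=""
while [ "$left" -gt 0 ]; do
    # CEILING(number_of_iterations_left / number_of_threads)
    chunk=$(( (left + number_of_threads - 1) / number_of_threads ))
    # Assumption: chunks never shrink below n ...
    if [ "$chunk" -lt "$n" ]; then chunk=$n; fi
    # ... and the final chunk is capped at the iterations that remain.
    if [ "$chunk" -gt "$left" ]; then chunk=$left; fi
    chunks="$chunks $chunk"
    left=$(( left - chunk ))
done

echo "guided chunk sizes:$chunks"
```

With these values the sizes shrink from 5 down to the minimum of 2, which shows how guided scheduling starts with large chunks to limit overhead and ends with small ones to balance the load.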
If you have not specified schedule, the default is schedule=static,
resulting in block scheduling. For more information, see the
description of the SCHEDULE directive in the XL Fortran Language
Reference.
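The default block scheduling can be pictured as one contiguous range of iterations per thread. The sketch below uses a hypothetical loop of 10 iterations numbered from 1 and 4 threads numbered from 0. The numbering and the handling of the shorter final block are assumptions made for illustration.

```shell
number_of_iterations=10
number_of_threads=4

# Chunk size when n is not specified:
# CEILING(number_of_iterations / number_of_threads)
chunk=$(( (number_of_iterations + number_of_threads - 1) / number_of_threads ))

thread=0
first=1
assignment=""
while [ "$thread" -lt "$number_of_threads" ] && [ "$first" -le "$number_of_iterations" ]; do
    last=$(( first + chunk - 1 ))
    # The final block may be shorter than the others.
    if [ "$last" -gt "$number_of_iterations" ]; then
        last=$number_of_iterations
    fi
    assignment="$assignment thread$thread=$first-$last"
    thread=$(( thread + 1 ))
    first=$(( last + 1 ))
done

echo "block scheduling:$assignment"
```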
- Parallel execution options
- parthds=num
- Specifies the number of threads (num) to be used for parallel
execution of code that you compiled with the -qsmp option. The default
value for num is the number of online processors on the machine, or 1
if you did not specify -qsmp. This option gives you full control over
the number of execution threads: some applications cannot use more
than some maximum number of processors, while others can achieve
performance gains if they use more threads than there are processors.
For more information, see the NUM_PARTHDS intrinsic function.
- usrthds=num
- Specifies the maximum number of threads (num) that you expect
your code to create explicitly, if the code does explicit thread
creation. The default value for num is 0. For more information, see the
NUM_PARTHDS intrinsic function in the XL Fortran Language Reference.
- stack=num
- Specifies the largest amount of space in bytes (num) that a thread's
stack will need. The default value for num is 4194304.
Set stack=num so that it is within the acceptable upper limit. num can
be up to the limit imposed by system resources or the stack size
ulimit, whichever is smaller. An application that exceeds the upper
limit may cause a segmentation fault.
- stackcheck[=num]
- Enables stack overflow checking for worker threads at
run time. num is the size in bytes that you specify; when the
remaining stack size is less than num, a runtime warning message
is issued. If you do not specify a value for num, the default
value is 4096 bytes. This option has an effect only when
-qsmp=stackcheck has also been specified at compile time. See -qsmp
for more information.
- startproc=cpu_id
- Enables thread binding and specifies the cpu_id to
which the first thread binds. If the value provided is outside the
range of available processors, the SMP run time issues a warning message
and no threads are bound.
- procs=cpu_id[,cpu_id,...]
- Enables thread binding and specifies a list of cpu_id values to
which the threads are bound. If the number of CPU IDs specified is
less than the number of threads used by the program, the remaining
threads are not bound.
- stride=num
- Specifies the increment used to determine the cpu_id to
which subsequent threads bind. num must
be greater than or equal to 1. If the value provided causes a thread
to bind to a CPU outside the range of available processors, a warning
message is issued and no threads are bound.
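The interaction of startproc and stride can be sketched as simple arithmetic: thread i is expected to bind to startproc + i * stride. The values below (4 threads, first CPU 0, stride 2) are hypothetical, and the thread numbering from 0 is an assumption made for illustration.

```shell
parthds=4
startproc=0
stride=2
export XLSMPOPTS="parthds=$parthds : startproc=$startproc : stride=$stride"

# Compute the cpu_id that each thread is expected to bind to.
thread=0
binding=""
while [ "$thread" -lt "$parthds" ]; do
    binding="$binding $(( startproc + thread * stride ))"
    thread=$(( thread + 1 ))
done

echo "XLSMPOPTS=$XLSMPOPTS"
echo "expected cpu_id per thread:$binding"
```

If the largest computed cpu_id is outside the range of available processors, the description above implies that a warning is issued and no threads are bound, so it is worth checking the last value against the machine's CPU count before running.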
- Performance tuning options
- When a thread
completes its work and there is no new work to do, it can go into
either a "busy-wait" state or a "sleep" state. In "busy-wait", the
thread keeps executing in a tight loop looking for additional new
work. This state is highly responsive but harms the overall utilization
of the system. When a thread sleeps, it completely suspends execution
until another thread signals it that there is work to do. This state
provides better utilization of the system but introduces extra overhead
for the application.
The xlsmp runtime
library routines use both "busy-wait" and "sleep" states in their
approach to waiting for work. You can control these states with the spins, yields,
and delays options.
During the busy-wait search for work, the thread repeatedly scans the
work queue, up to the number of times set by the spins option. If a
thread cannot find work during a given scan, it intentionally wastes
cycles in a delay loop that executes the number of times set by the
delays option. This delay loop consists of a single meaningless
iteration, and the actual time it takes varies among processors. If the
value of spins is exceeded and the thread still cannot find work, the
thread yields the current time slice (the time allocated by the
processor to that thread) to the other threads. The thread yields its
time slice up to the number of times set by the yields option. If the
value of yields is exceeded, the thread goes to sleep.
In summary, the ordered approach to looking for work consists of the
following steps:
- Scan the work queue up to spins times. If no work is found in a
scan, loop delays times before starting a new scan.
- If work has not been found, yield the current time slice.
- Repeat the above steps up to yields times.
- If work has still not been found, go to sleep.
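The steps above can be sketched as a small simulation. This is not the runtime library's code: work_available is a hypothetical stand-in that never finds work, the option values are deliberately tiny, and the special meaning of zero for spins and yields is not modeled.

```shell
spins=3     # scans of the work queue before each yield
yields=2    # yields before the thread goes to sleep
delays=5    # delay-loop iterations after each unsuccessful scan

# Hypothetical stand-in for scanning the work queue: never finds work.
work_available() { return 1; }

scans=0
delay_iterations=0
yielded=0
state=""
y=0
while [ "$y" -lt "$yields" ] && [ -z "$state" ]; do
    s=0
    while [ "$s" -lt "$spins" ]; do
        scans=$(( scans + 1 ))
        if work_available; then
            state="working"
            break
        fi
        # Stands in for the delay loop that wastes cycles.
        delay_iterations=$(( delay_iterations + delays ))
        s=$(( s + 1 ))
    done
    if [ -z "$state" ]; then
        yielded=$(( yielded + 1 ))   # give up the current time slice
        y=$(( y + 1 ))
    fi
done
if [ -z "$state" ]; then
    state="asleep"
fi

echo "scans=$scans delay_iterations=$delay_iterations yields=$yielded state=$state"
```

With no work ever arriving, the thread performs spins * yields scans in total before it sleeps, which is why larger values keep a thread responsive for longer at the cost of more wasted cycles.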
The syntax for specifying these options is as follows:
- spins[=num]
- where num is the number of spins before
a yield. The default value for spins is 100.
- yields[=num]
- where num is the number of yields before
a sleep. The default value for yields is 10.
- delays[=num]
- where num is the number of delays while
busy-waiting. The default value for delays is 500.
Zero is a special value for spins and yields,
as it can be used to force complete busy-waiting. Normally, in a benchmark
test on a dedicated system, you would set both options to zero. However,
you can set them individually to achieve other effects.
For instance, on a dedicated 8-way SMP, setting these options to the
following:
parthds=8 : schedule=dynamic=10 : spins=0 : yields=0
results in one thread per CPU, with each thread assigned chunks
consisting of 10 iterations each, with busy-waiting when there is no
immediate work to do.
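The option string in this example contains blanks, so in a shell it has to be quoted as a whole. A minimal sketch of setting it before a run follows; the program name in the final comment is hypothetical.

```shell
# Quote the entire string because it contains embedded blanks.
export XLSMPOPTS="parthds=8 : schedule=dynamic=10 : spins=0 : yields=0"

# Confirm the value that the program will see.
printf '%s\n' "$XLSMPOPTS"

# A program compiled with -qsmp would then be started from this shell,
# for example: ./my_smp_program   (hypothetical name)
```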
- Options to enable and control dynamic profiling
- You can use dynamic profiling to reevaluate the compiler's decision
to parallelize loops in a program. The three options you can use to
do this are: parthreshold, seqthreshold,
and profilefreq.
- parthreshold=num
- Specifies the time, in milliseconds, below which each loop must
execute serially.
If you set parthreshold to 0, every loop
that has been parallelized by the compiler will execute in parallel.
The default setting is 0.2 milliseconds, meaning that if a loop requires
fewer than 0.2 milliseconds to execute in parallel, it should be serialized.
Typically, parthreshold is set to be
equal to the parallelization overhead. If the computation in a parallelized
loop is very small and the time taken to execute these loops is spent
primarily in the setting up of parallelization, these loops should
be executed sequentially for better performance.
- seqthreshold=num
- Specifies the time, in milliseconds, beyond which a loop that was
previously serialized by the dynamic profiler should revert to being a
parallel loop. The default setting is 5 milliseconds, meaning that if a
loop requires more than 5 milliseconds to execute serially, it should
be parallelized. seqthreshold acts as the reverse of parthreshold.
- profilefreq=num
- Specifies the frequency with which a loop should be revisited by the
dynamic profiler to determine its appropriateness for parallel or
serial execution. Loops in a program can be data dependent. A loop
that was chosen to execute serially in one pass of dynamic profiling
may benefit from parallelization in subsequent executions of the loop,
due to different data input. Therefore, these loops need to be examined
periodically to reevaluate the decision to serialize a parallel loop
at run time.
The allowed values for this option are the numbers from 0 to 32. The
value that you set for profilefreq has the following effects:
- If profilefreq is 0, all profiling is
turned off, regardless of other settings. The overheads that occur
because of profiling will not be present.
- If profilefreq is 1, loops parallelized
automatically by the compiler will be monitored every time they are
executed.
- If profilefreq is 2, loops parallelized
automatically by the compiler will be monitored every other time they
are executed.
- If profilefreq is n, where n is greater than or equal
to 2 but less than or equal to 32, each loop will be monitored once
every nth time it is executed.
- If profilefreq is greater than 32, then
32 is assumed.
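The rules in this list can be summarized in a small helper. describe_profilefreq is a hypothetical function written for illustration, not part of any XL tool, and it assumes a non-negative integer argument.

```shell
# Hypothetical helper: describe how a profilefreq value is interpreted.
describe_profilefreq() {
    freq=$1
    # Values greater than 32 are treated as 32.
    if [ "$freq" -gt 32 ]; then
        freq=32
    fi
    case "$freq" in
        0) echo "profiling off" ;;
        1) echo "monitored on every execution" ;;
        *) echo "monitored once every $freq executions" ;;
    esac
}

describe_profilefreq 0
describe_profilefreq 1
describe_profilefreq 8
describe_profilefreq 50
```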
It is important to note that dynamic profiling is not
applicable to user-specified parallel loops (for example, loops for
which you specified the PARALLEL DO directive).
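For compiler-parallelized loops, the parthreshold and seqthreshold rules described earlier can be read as a two-way switch. The sketch below is one interpretation of those rules, not the profiler's implementation. next_mode is a hypothetical function, and times are expressed in microseconds so that the default thresholds (0.2 and 5 milliseconds) fit integer shell arithmetic.

```shell
parthreshold_us=200     # default parthreshold: 0.2 milliseconds
seqthreshold_us=5000    # default seqthreshold: 5 milliseconds

# Hypothetical helper: given a loop's current mode and its measured
# execution time in microseconds, print the mode for its next execution.
next_mode() {
    mode=$1
    elapsed_us=$2
    if [ "$mode" = parallel ] && [ "$elapsed_us" -lt "$parthreshold_us" ]; then
        echo serial       # too short to pay for parallelization
    elif [ "$mode" = serial ] && [ "$elapsed_us" -gt "$seqthreshold_us" ]; then
        echo parallel     # long enough to revert to a parallel loop
    else
        echo "$mode"
    fi
}

next_mode parallel 150
next_mode parallel 900
next_mode serial 3000
next_mode serial 7500
```

Setting parthreshold_us to 0 in this sketch means that no parallel loop is ever serialized, which matches the documented behavior of parthreshold=0.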