XLSMPOPTS

Runtime options affecting parallel processing can be specified with the XLSMPOPTS environment variable. This environment variable must be set before you run an application, and it uses basic syntax of the form:

                         .-:-------------------------------------------.          
                         V                                             |          
>>-XLSMPOPTS-- = -+---+----runtime_option_name-- = ---option_setting---+--+---+-><
                  '-"-'                                                   '-"-'   

You can specify option names and settings in uppercase or lowercase. You can add blanks before and after the colons and equal signs to improve readability. However, if the XLSMPOPTS option string contains embedded blanks, you must enclose the entire option string in double quotation marks (").

For example, to have the runtime create 4 threads and use dynamic scheduling with a chunk size of 5, you would set the XLSMPOPTS environment variable as follows:
XLSMPOPTS=PARTHDS=4:SCHEDULE=DYNAMIC=5
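If you add blanks around the colons or equal signs for readability, you must enclose the entire option string in double quotation marks. For example, the following setting is equivalent to the previous one:
XLSMPOPTS="PARTHDS = 4 : SCHEDULE = DYNAMIC = 5"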

The following are the available runtime option settings for the XLSMPOPTS environment variable:

Scheduling options are as follows:
schedule
Specifies the type of scheduling algorithm and chunk size (n) that are used for loops to which no other scheduling algorithm has been explicitly assigned in the source code.
Work is assigned to threads in a different manner, depending on the scheduling type and chunk size used. Choosing the chunking granularity is a tradeoff between overhead and load balancing. The syntax for this option is schedule=suboption, where the suboptions are defined as follows (an example follows these suboptions):
affinity[=n]
The iterations of a loop are initially divided into number_of_threads partitions, each containing ceiling(number_of_iterations/number_of_threads) iterations. Each partition is initially assigned to a thread and is then further subdivided into chunks that each contain n iterations. If n is not specified, then the chunks consist of ceiling(number_of_iterations_left_in_partition / 2) loop iterations.

When a thread becomes free, it takes the next chunk from its initially assigned partition. If there are no more chunks in that partition, then the thread takes the next available chunk from a partition initially assigned to another thread.

The work in a partition initially assigned to a sleeping thread will be completed by threads that are active.

The affinity scheduling type is not part of the OpenMP API standard.

Note: This suboption has been deprecated and might be removed in a future release. You can use the guided suboption for similar functionality.
dynamic[=n]
The iterations of a loop are divided into chunks that contain n contiguous iterations each. The final chunk might contain fewer than n iterations. If n is not specified, the default chunk size is one.

Each thread is initially assigned one chunk. After threads complete their assigned chunks, they are assigned remaining chunks on a "first-come, first-do" basis.

guided[=n]
The iterations of a loop are divided into progressively smaller chunks until a minimum chunk size of n loop iterations is reached. If n is not specified, the default value for n is 1 iteration.

Active threads are assigned chunks on a "first-come, first-do" basis. The first chunk contains ceiling(number_of_iterations/number_of_threads) iterations. Subsequent chunks consist of ceiling(number_of_iterations_left / number_of_threads) iterations. The final chunk might contain fewer than n iterations.

static[=n]
The iterations of a loop are divided into chunks containing n iterations each. Each thread is assigned chunks in a "round-robin" fashion. This is known as block cyclic scheduling. If the value of n is 1, then the scheduling type is specifically referred to as cyclic scheduling.

If n is not specified, the chunks will contain floor(number_of_iterations/number_of_threads) iterations. The first remainder(number_of_iterations/number_of_threads) chunks have one more iteration. Each thread is assigned one of these chunks. This is known as block scheduling.

If a thread is asleep and it has been assigned work, it will be awakened so that it may complete its work.

n
Must be an integral assignment expression of value 1 or greater.

Specifying schedule with no suboption is equivalent to schedule=runtime.
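As an example of these suboptions, the following setting (the chunk size is illustrative) requests guided scheduling with a minimum chunk size of 10 iterations for loops that have no explicit scheduling in the source code:
XLSMPOPTS=schedule=guided=10
With this setting, a loop of 1000 iterations run by 4 threads would be divided into a first chunk of ceiling(1000/4) = 250 iterations, a second chunk of ceiling(750/4) = 188 iterations, and so on, shrinking toward the minimum chunk size of 10 (the final chunk might be smaller).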

Parallel environment options are as follows:
parthds=num
Specifies the number of threads (num) requested, which is usually equivalent to the number of processors available on the system.

Some applications cannot use more threads than the maximum number of processors available. Other applications can experience significant performance improvements if they use more threads than there are processors. This option gives you full control over the number of user threads used to run your program.

The default value for num is the number of processors available on the system.
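For example, to run a program with 8 threads regardless of the number of processors available (the value is illustrative), you could specify:
XLSMPOPTS=parthds=8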

usrthds=num
Specifies the maximum number of threads (num) that you expect your code to create explicitly, if the code does its own thread creation. The default value for num is 0.
stack=num
Specifies the largest amount of space in bytes (num) that a thread's stack needs. The default value for num is 4194304.

Set num so it is within the acceptable upper limit. num can be up to 256 MB for 32-bit mode, or up to the limit imposed by system resources for 64-bit mode. An application that exceeds the upper limit may cause a segmentation fault.
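For example, to give each thread an 8 MB stack (the value is illustrative), you could specify the size in bytes as follows:
XLSMPOPTS=stack=8388608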

stackcheck[=num]
Enables stack overflow checking for slave threads at runtime when -qsmp=stackcheck is in effect. num is the size of the stack in bytes, and it must be a nonzero positive number. When the remaining stack size is less than this value, a runtime warning message is issued. If you do not specify a value for num, the default value is 4096 bytes. Note that this option takes effect only when -qsmp=stackcheck has also been specified at compile time. See -qsmp for more information.
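For example, assuming the program was compiled with -qsmp=stackcheck, the following setting (the threshold is illustrative) issues a runtime warning when a slave thread has less than 65536 bytes of stack remaining:
XLSMPOPTS=stackcheck=65536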
startproc=cpu_id
Enables thread binding and specifies the cpu_id to which the first thread binds. If the value provided is outside the range of available processors, a warning message is issued and no threads are bound.
procs=cpu_id[,cpu_id,...]
Enables thread binding and specifies a list of cpu_id to which the threads are bound.
stride=num
Specifies the increment used to determine the cpu_id to which subsequent threads bind. num must be greater than or equal to 1. If the value provided causes a thread to bind to a CPU outside the range of available processors, a warning message is issued and no threads are bound.
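For example, the following setting (CPU numbers chosen for illustration) binds the first thread to CPU 0 and each subsequent thread two CPUs further along, that is, to CPUs 0, 2, 4, and so on:
XLSMPOPTS=startproc=0:stride=2
Alternatively, you could list the target CPUs explicitly:
XLSMPOPTS=procs=0,2,4,6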
bind=SDL=n1,n2,n3
Specifies different system detail levels to bind threads by using the Resource Set API. This suboption supports binding a thread to multiple logical processors.

SDL stands for System Detail Level and can be MCM, L2CACHE, PROC_CORE, or PROC. If the SDL value is not specified, or an incorrect SDL value is specified, the SMP runtime issues an error message.

The list of three integers n1,n2,n3 determines how to divide threads among resources (one of SDLs). n1 is the starting resource_id, n2 is the number of requested resources, and n3 is the stride, which specifies the increment used to determine the next resource_id to bind. n1,n2,n3 must all be specified; otherwise, the SMP runtime issues an error message and default binding rules apply.

When the number of resources specified in bind is greater than the number of threads, the extra resources are ignored.

When the number of threads t is greater than the number of resources x, the t threads are divided among the x resources as follows:

Each of the first (t mod x) resources is assigned ceil(t/x) threads; each of the remaining resources is assigned floor(t/x) threads.

With the XLSMPOPTS environment variable set as in the following example, a program that runs with 16 threads binds them to PROC 0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, and 30.
XLSMPOPTS="bind=PROC=0,16,2"
Notes:
  • The bind suboption takes precedence over the startproc/stride and procs suboptions. However, bindlist takes precedence over bind.
  • Resource Set can only be used by a user account with the CAP_NUMA_ATTACH and CAP_PROPAGATE capabilities. These capabilities are set on a per-user basis by using the chuser command as follows:
    chuser "capabilities=CAP_PROPAGATE,CAP_NUMA_ATTACH" username
  • If the resource_id specified in bind is outside the range of 0 to INT32_MAX, where INT32_MAX is 2147483647 as defined in stdint.h, the SMP runtime issues an error message and default binding rules apply.
  • The SMP runtime verifies that the resource_id exists. If the resource_id does not exist, a warning message is issued and the thread is left unbound.
  • If you change the number of threads inside the program, for example, through omp_set_num_threads() or the num_threads clause, the following occurs:
    • If the number of threads in the application is increased, rebinding takes place based on the environment variable settings.
    • If the number of threads is reduced after binding, the original binding remains.
bindlist=SDL=i1,i2,...ix
Specifies different system detail levels to bind threads by using the Resource Set API. This suboption supports binding a thread to multiple logical processors.

SDL stands for System Detail Level and can be MCM, L2CACHE, PROC_CORE, or PROC. If the SDL value is not specified, or an incorrect SDL value is specified, the SMP runtime issues an error message.

The list of x integers i1,i2...ix enumerates the resources (one of SDLs) to be used during binding. When the number of integers in the list is greater than or equal to the number of threads, the position in the list determines the thread ID that will be bound to the resource.

When the number of resources specified in bindlist is greater than the number of threads, the extra resources are ignored.

When the number of threads t is greater than the number of resources x, the t threads are divided among the x resources as follows:

Each of the first (t mod x) resources is assigned ceil(t/x) threads; each of the remaining resources is assigned floor(t/x) threads.

For example:
XLSMPOPTS="bindlist=MCM=0,1,2,3"
This example binds threads to MCMs 0, 1, 2, and 3. When the program runs with four threads, thread 0 is bound to MCM 0, thread 1 is bound to MCM 1, thread 2 is bound to MCM 2, and thread 3 is bound to MCM 3. When the program runs with six threads, threads 0 and 1 are bound to MCM 0, threads 2 and 3 are bound to MCM 1, thread 4 is bound to MCM 2, and thread 5 is bound to MCM 3.
With the XLSMPOPTS environment variable set as in the following example, a program that runs with eight (or fewer) threads binds all even-numbered threads to L2CACHE 0 and all odd-numbered threads to L2CACHE 1.
XLSMPOPTS="bindlist=L2CACHE=0,1,0,1,0,1,0,1"
Notes:
  • The bindlist suboption takes precedence over the startproc/stride, procs, and bind suboptions.
  • Resource Set can only be used by a user account with the CAP_NUMA_ATTACH and CAP_PROPAGATE capabilities. These capabilities are set on a per-user basis by using the chuser command as follows:
    chuser "capabilities=CAP_PROPAGATE,CAP_NUMA_ATTACH" username
  • The SMP runtime verifies that the thread ID specified for a resource is not less than 0 nor greater than the available resources. Otherwise, the SMP runtime issues a warning message and the thread is left unbound.
  • If you change the number of threads inside the program, for example, through omp_set_num_threads() or the num_threads clause, the following occurs:
    • If the number of threads in the application is increased, rebinding takes place based on the environment variable settings.
    • If the number of threads is reduced after binding, the original binding remains.
Performance tuning options are as follows:
spins=num
Specifies the number of loop spins, or iterations, before a yield occurs.

When a thread completes its work, the thread continues executing in a tight loop looking for new work. One complete scan of the work queue is done during each busy-wait state. An extended busy-wait state can make a particular application highly responsive, but can also harm the overall responsiveness of the system unless the thread is given instructions to periodically scan for and yield to requests from other applications.

A complete busy-wait state for benchmarking purposes can be forced by setting both spins and yields to 0, as shown in the example after these options.

The default value for num is 100.

yields=num
Specifies the number of yields before a sleep occurs.

When a thread sleeps, it completely suspends execution until another thread signals that there is work to do. This provides better system utilization, but also adds extra system overhead for the application.

The default value for num is 100.

delays=num
Specifies a period of do-nothing delay time between each scan of the work queue. Each unit of delay is achieved by running a single no-memory-access delay loop.

The default value for num is 500.
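For example, as noted under spins, the following setting forces a complete busy-wait state for benchmarking purposes:
XLSMPOPTS=spins=0:yields=0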

Dynamic profiling options are as follows:
profilefreq=num
Specifies the frequency with which a loop should be revisited by the dynamic profiler to determine its appropriateness for parallel or serial execution. The runtime library uses dynamic profiling to dynamically tune the performance of automatically parallelized loops. Dynamic profiling gathers information about loop running times to determine whether the loop should be run sequentially or in parallel the next time through. Threshold running times are set by the parthreshold and seqthreshold dynamic profiling options, described below; a combined example follows these options.

The allowed values for this option are the numbers from 0 to 32. If num is 0, all profiling is turned off, and the overhead associated with profiling is avoided. If num is greater than 0, the running time of the loop is monitored once every num passes through the loop. The default for num is 16. Values of num exceeding 32 are changed to 32.

Note that dynamic profiling is not applicable to user-specified parallel loops.

parthreshold=num
Specifies the time, in milliseconds, below which each loop must execute serially. If you set num to 0, every loop that has been parallelized by the compiler will execute in parallel. The default setting is 0.2 milliseconds, meaning that if a loop requires fewer than 0.2 milliseconds to execute in parallel, it should be serialized.

Typically, num is set to be equal to the parallelization overhead. If the computation in a parallelized loop is very small and the time taken to execute these loops is spent primarily in the setting up of parallelization, these loops should be executed sequentially for better performance.

seqthreshold=num
Specifies the time, in milliseconds, beyond which a loop that was previously serialized by the dynamic profiler should revert to being a parallel loop. The default setting is 5 milliseconds, meaning that if a loop requires more than 5 milliseconds to execute serially, it should be parallelized.

seqthreshold acts as the reverse of parthreshold.
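For example, the following setting (the values are illustrative) profiles every eighth pass through each automatically parallelized loop, serializes parallel loops that take less than 0.5 milliseconds to execute in parallel, and reverts to parallel execution for serialized loops that take more than 10 milliseconds to execute serially:
XLSMPOPTS=profilefreq=8:parthreshold=0.5:seqthreshold=10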


