
OpenMP support in IBM XL compilers

The IBM® XL compilers help applications exploit the characteristics of the underlying hardware platform to maximize performance. The compilers also provide performance-oriented features that assist programmers in optimizing and tuning their applications. Because most modern computers are built on multicore hardware, parallelization has become one of the most common approaches to improving application performance.

The OpenMP application program interface (API) is an industry standard that allows users to annotate their sequential code with pragmas and directives to exploit parallelism. The base languages of the specification are Fortran, C, and C++. The specification is developed by the OpenMP language committee, whose members include vendors, users, and researchers. For more information, refer to the OpenMP website.

OpenMP support in XL C/C++ and XL Fortran compilers for Linux on Power little endian

IBM XL C/C++ V13.1.2 and XL Fortran V15.1.2 for Linux on Power little endian support the OpenMP API V3.1 specification for parallel programming, as well as selected features of the latest (at the time of this publication) OpenMP API V4.0. OpenMP provides a simple and flexible interface for parallel application development. The OpenMP specification consists of three components: compiler directives and pragmas, runtime library routines, and environment variables. Applications that conform to the OpenMP specification are portable across platforms. The specification supports applications that run both as parallel programs (multiple threads of execution and a full OpenMP support library) and as sequential programs (directives and pragmas are ignored and a stub library is linked).

OpenMP parallelization is enabled by the -qsmp compiler option. If -qsmp=omp is specified, strict OpenMP compliance is enforced for the programs being compiled. For more information, refer to the Compiler Reference of the XL compilers (see the Resources section).

What’s new in OpenMP V3.1

The OpenMP API V3.1 extends existing features to allow users to fine-tune their applications. This revision also relaxes some restrictions to provide more flexibility for expressing different parallel programming scenarios.

Task parallelism extensions

In early versions of the OpenMP specification, the parallelism was mostly regular: for example, loop parallelism, where the number of iterations can be determined, or the parallel sections construct, where the number of independent sections is fixed. Task parallelism was introduced in V3.0 to support irregular parallelism. The task construct enables parallelization of irregular algorithms, such as pointer chasing or recursive algorithms. However, as the problem size becomes smaller and smaller, the cost of creating a task can become significant compared to the computation that the task performs. The final and mergeable clauses are introduced to control whether a task is run immediately and whether a new data environment is created for it.

Example 1: Generate final tasks

#pragma omp parallel
{
    #pragma omp single
    {
        while (list) {
            #pragma omp task final(list->size < THRESHOLD)
            {
                compute(list);
            }
            list = list->next;
        }
    }
}

Example 1 shows tasks being generated and run in a parallel region. If a node's size is smaller than THRESHOLD, the generated task is a final task. A final task is undeferred, that is, it is run immediately by the encountering thread, which avoids the task scheduling overhead.

Example 2: Generate mergeable tasks

void compute(struct S *p, int level)
{
    #pragma omp task final(level >= DEPTH) mergeable
    {
        if (p->next)
            compute(p->next, level + 1);
    }
}

In Example 2, the recursive function traverses the list to perform the computation. When the traversal reaches a certain depth, creating a new task might not be worthwhile because the amount of computation is small. The mergeable clause instructs the implementation that it need not create a new data environment for a merged task, which reduces the cost of task generation. Together, these two extensions allow fine tuning of task parallelism to reduce overhead when the amount of computation per task gets smaller.

Nested parallelism

The OMP_NUM_THREADS environment variable specifies the number of threads used in parallel regions. However, if there is a nested parallel region, there was previously no direct way to control the number of threads used in the inner parallel regions. If programmers are not careful, an oversubscription of threads might occur and hurt performance. In OpenMP V3.1, the OMP_NUM_THREADS environment variable is extended to specify the number of threads used at each level of nested parallel regions. The values are specified as a comma-separated list, for example, OMP_NUM_THREADS=8,4. A group of query routines is added: the level of nested parallel regions (omp_get_level), the ancestor thread number (omp_get_ancestor_thread_num), the team size at a given level (omp_get_team_size), and the level of nested active parallel regions (omp_get_active_level). Runtime routines for setting and getting the maximum number of active levels (omp_set_max_active_levels and omp_get_max_active_levels) are also included.

Example 3: Nested parallel constructs

#include <stdio.h>
#include <omp.h>

int main() {
    #pragma omp parallel
    {
        if (omp_get_thread_num() == 0)
            printf("outer parallel: %d\n", omp_get_num_threads());

        #pragma omp parallel
        {
            if (omp_get_thread_num() == 0)
                printf("inner parallel: %d\n", omp_get_num_threads());
        }
    }
    return 0;
}

Compiling Example 3 with the XL C/C++ compiler gives the output shown in Example 4.

Example 4: Compilation and output of the nested parallel program

$ xlc -qsmp nested_par.c -o nested_par
$ export OMP_NESTED=true
$ export OMP_NUM_THREADS=4,2
$ ./nested_par
outer parallel: 4
inner parallel: 2
inner parallel: 2
inner parallel: 2
inner parallel: 2

As specified in the OMP_NUM_THREADS environment variable, "4,2", the outer parallel region is run by four threads and the inner region is run by two threads.

In addition, the XL C/C++ V13.1.2 and XL Fortran V15.1.2 for Linux on Power little endian compilers support OpenMP style nested parallelism, which is enabled by setting the OMP_NESTED environment variable to true. By default, nested parallelism is disabled. This setting applies to the whole program. Programmers can instead call the omp_set_nested runtime routine to selectively enable nested parallelism for certain parallel regions in the code.

With this feature, applications can easily be adjusted to the running environment to avoid creating more threads than expected, which would hurt performance.

Atomic construct extension

The atomic construct now supports more atomic operations. The read, write, and capture clauses are added to express atomic read, write, and capture operations respectively. The existing atomic update form can also be written explicitly by using the update clause.

Refer to Example 5 for some of the atomic constructs.

Example 5: OpenMP atomic operations

! atomic read of variable x
!$omp atomic read
v = x
!$omp end atomic

! atomic write of variable x
!$omp atomic write
x = y
!$omp end atomic

! atomic capture: pre-update value of x is captured and then updated
!$omp atomic capture
v = x
x = x + 1
!$omp end atomic

Thread bind policy

Some applications might require dedicated resources to achieve the required performance, and migrating threads from one processor to another can cause an unexpected performance impact. The OMP_PROC_BIND environment variable is introduced to allow users to control whether the execution environment may move OpenMP threads between processors.

The IBM XL compilers also provide the startproc/stride and procs suboptions of the XLSMPOPTS environment variable for finer control of how OpenMP threads are bound to processors. This feature is an IBM extension. If portability of your application is important, use only the OMP_PROC_BIND environment variable to control thread binding.
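The two approaches might be set up as follows; the exact XLSMPOPTS suboption syntax shown here is an assumption and should be verified against the Compiler Reference:

```shell
# Portable: only state whether the runtime may migrate threads.
export OMP_PROC_BIND=true

# IBM extension (not portable): bind threads starting at logical processor 0,
# stepping by 2. Suboption syntax assumed; check the Compiler Reference.
export XLSMPOPTS="startproc=0:stride=2"

# Alternatively, list the target processors explicitly (also an IBM extension):
# export XLSMPOPTS="procs=0,2,4,6"
```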

Miscellaneous enhancements

Several enhancements provide more flexibility in expressing parallelism in various scenarios.

In Fortran, intent(in) dummy arguments are now allowed in the firstprivate clause. This avoids having to create temporary variables in the procedure before entering the parallel construct. Fortran pointers are now also allowed in the firstprivate and lastprivate clauses.

In C/C++, the min and max reduction operators are added so that reductions can compute minimum and maximum values.

Selected OpenMP V4.0 features and other enhancements

The XL C/C++ V13.1.2 and XL Fortran V15.1.2 for Linux on Power little endian compilers add selected OpenMP V4.0 features: the atomic construct extension and the OMP_DISPLAY_ENV environment variable.

Atomic construct extension

An atomic swap can be expressed by the atomic construct with the capture clause. The following example illustrates an atomic swap: the original value of x is captured into v, and x is then updated with the value of y.

Example 6: An atomic swap operation

#pragma omp atomic capture
{
    v = x;
    x = y;
}

In addition, the atomic capture construct accepts more forms of expressions, which gives users more flexibility in expressing their code.

The OpenMP atomic support in XL C/C++ V13.1.2 and XL Fortran V15.1.2 for Linux on Power little endian is reimplemented for better performance. The old implementation used a lock to ensure exclusive access to a memory location. The new implementation uses hardware instructions available on the IBM PowerPC® architecture and avoids the lock mechanism. As a result, the atomic operations are more efficient and might improve the overall performance of applications that use them.

Displaying OpenMP runtime settings

Implementations set the OpenMP internal control variables (ICVs) to default values. These ICVs can be modified by setting environment variables or by calling runtime routines in the source code. Although you can query the settings of the ICVs by calling the corresponding runtime routines, this process involves inserting calls in the source and recompiling the application, which can be time consuming. OpenMP V4.0 adds the OMP_DISPLAY_ENV environment variable to make the OpenMP runtime display the ICV settings. This feature can help programmers debug code by examining the runtime settings. Programmers can also use it to check the version of the runtime library that is used, either at link time if it is statically linked or at run time if it is dynamically linked. Another scenario is porting: code might behave unexpectedly because of different default settings on a new platform. This feature helps programmers quickly compare the settings, identify the differences, and adjust accordingly.

Example 7: Using the OMP_DISPLAY_ENV environment variable to display runtime settings

$ export OMP_DISPLAY_ENV=true
$ ./a.out

OPENMP DISPLAY ENVIRONMENT BEGIN
OMP_DISPLAY_ENV='TRUE'

    _OPENMP='201107'
    OMP_DYNAMIC='FALSE'
    OMP_MAX_ACTIVE_LEVELS='5'
    OMP_NESTED='FALSE'
    OMP_NUM_THREADS='96'
    OMP_PROC_BIND='FALSE'
    OMP_SCHEDULE='STATIC,0'
    OMP_STACKSIZE='4194304'
    OMP_THREAD_LIMIT='96'
    OMP_WAIT_POLICY='PASSIVE'
OPENMP DISPLAY ENVIRONMENT END

In Example 7, the OMP_DISPLAY_ENV variable is set to true, and the OpenMP runtime displays the default settings of the ICVs that have corresponding environment variables. If the environment variable is set to verbose, vendor-specific settings are also included in the display.

Example 8: Using the OMP_DISPLAY_ENV environment variable to display runtime settings including vendor specific information

$ export OMP_DISPLAY_ENV=verbose
$ ./a.out

OPENMP DISPLAY SWITCHES BEGIN
    LOMP_AUTO_PASSIVE_HALF_THREAD='1'
    LOMP_CACHE_LINE_SIZE='256'
    LOMP_CHECK_STACKS='1'
    LOMP_CLEANUP_ON_PROCESS_EXIT='0'
    LOMP_CLEANUP_TO_FORCE_RESCAN='0'
    LOMP_COUNTER_BARRIER='0'
    LOMP_DEBUG='0'
    LOMP_DEFAULT_DELAY='1000'
    LOMP_DEFAULT_SPIN='64'
    LOMP_DEFAULT_YIELD='64'
    LOMP_ENABLE_INLINING='1'
    LOMP_ENABLE_WAIT_PASSIVE_BARRIER='0'
    LOMP_ENABLE_WAIT_PASSIVE_WORKER='1'
    LOMP_FUSSY_INIT='0'
    LOMP_GUIDED_SHARED='1'
    LOMP_ILDE_THREAD_EXIT='0'
    LOMP_XL_LEGACY='0'
    LOMP_G_LEGACY='0'
    LOMP_AUTOPAR_LEGACY='1'
    LOMP_LOOP_CACHE='0'
    LOMP_MASTER_BARRIER_MSYNC='0'
    LOMP_MAX_THREAD='65535'
    LOMP_PARALEL_DISABLE_FAST_PATH='0'
    LOMP_PROC_BIND_40='1'
    LOMP_PROC_BIND_WHEN_OFF='0'
    LOMP_PROC_BIND_WHEN_ON='1'
    LOMP_SEQENTIAL_FAST='1'
    LOMP_TASK_DISABLE_STEAL='0'
    LOMP_TEST='1'
    LOMP_WAIT_LOW_PRIO='1'
    LOMP_WAIT_WITH_YIELD='1'
    OMPT_TIER='0'
    LOMP_TARGET_PPC='0'
    LOMP_TARGET_CUDA='0'
    LOMP_ARCH_POWER='8'
    LOMP_BARRIER_SWMR_DEGREE='2'
    LOMP_BARRIER_WITH_IO_SYNC='1'
OPENMP DISPLAY SWITCHES END

OPENMP DISPLAY RUNTIME BEGIN
    LOMP_VERSION='0.35 for OpenMP 3.1'
    BUILD_LEVEL='OpenMP Runtime Version: 13.1.2(C/C++) and 
    15.1.2(Fortran) Level: 150417 ID: _v1mpguSSEeSbzZ-i2Itj4A'
    TARGET='Linux, 64 bit LE'
OPENMP DISPLAY RUNTIME END

OPENMP DISPLAY ENVIRONMENT BEGIN
    OMP_DISPLAY_ENV='VERBOSE'
    
    _OPENMP='201107'
    OMP_DYNAMIC='FALSE'
    OMP_MAX_ACTIVE_LEVELS='5'
    OMP_NESTED='FALSE'
    OMP_NUM_THREADS='96'
    OMP_PROC_BIND='FALSE'
    OMP_SCHEDULE='STATIC,0'
    OMP_STACKSIZE='4194304'
    OMP_THREAD_LIMIT='96'
    OMP_WAIT_POLICY='PASSIVE'
    XLSMPOPTS=' DELAYS=1000'
    XLSMPOPTS=' NOSTACKCHECK'
    XLSMPOPTS=' PARTHDS=96'
    XLSMPOPTS=' PARTHRESHOLD=    inf'
    XLSMPOPTS=' PROFILEFREQ=16'
    XLSMPOPTS=' SCHEDULE=STATIC=0'
    XLSMPOPTS=' SEQTHRESHOLD=    inf'
    XLSMPOPTS=' SPINS=64'
    XLSMPOPTS=' STACK=4194304'
    XLSMPOPTS=' USRTHDS=0'
    XLSMPOPTS=' YIELDS=64'
OPENMP DISPLAY ENVIRONMENT END

Example 8 shows the output from the XL compiler runtime. The first section displays the internal switches of the OpenMP runtime. The second section contains build-specific information; if your system contains multiple versions of the OpenMP runtime, this information helps you identify which version the application is linked against. The third section shows the ICV settings and the IBM extension settings.

Summary

The OpenMP API support in the XL compilers gives programmers a way to parallelize sequential applications by annotating their C, C++, or Fortran programs with pragmas or directives. Applications parallelized with the XL compilers can take advantage of multicore hardware such as IBM POWER8™. The addition of selected OpenMP V4.0 features in XL C/C++ V13.1.2 and XL Fortran V15.1.2 for Linux on Power little endian gives users more ways to express atomic operations and helps them inspect the runtime settings when debugging or porting code to different platforms.

Resources

  1. OpenMP Architecture Review Board: OpenMP Application Program Interface Version 3.1 (July 2011)
  2. OpenMP Architecture Review Board: OpenMP Application Program Interface Version 4.0 (July 2013)
  3. Compiler Reference: XL C/C++ for Linux 13.1.2, for little endian distributions
  4. Compiler Reference: XL Fortran for Linux 15.1.2, for little endian distributions


Published: June 29, 2015