Limitations

Note these limitations when you offload computations to NVIDIA GPUs with IBM® XL C/C++ for Linux, V16.1.

Workaround to get debugging support on GPUs

To debug an OpenMP application, you need to compile it with -g -qsmp=noopt.

Due to a limitation in NVVM, IBM XL C/C++ for Linux, V16.1 cannot generate debug information when GPU runtime inlining is enabled. Specifying -qsmp=noopt turns off GPU runtime inlining; at all other optimization levels, GPU runtime inlining remains enabled.
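For example, assuming a source file named app.c that contains target regions, a debug build might look like the following. The file name and the xlc_r invocation are illustrative; -qoffload enables offloading to the GPU:

```shell
# Compile with debug information and with GPU runtime inlining
# disabled (-qsmp=noopt), so that debug information can be generated.
xlc_r -g -qsmp=noopt -qoffload app.c -o app
```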

Limitation related to compute mode

If the compute mode of a target device is not set to DEFAULT, a CUDA context created by your code might cause the OpenMP runtime to fail. The failure occurs because the target device cannot create a second context, which might be required depending on which OpenMP thread initializes the device. If you suspect a segmentation fault or deadlock, check the compute mode to verify whether this situation is the cause.
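One way to inspect the compute mode is with the nvidia-smi utility. This is a sketch; the exact options depend on your driver version, and changing the mode requires administrator privileges:

```shell
# Query the compute mode of each GPU
# (Default, Exclusive_Process, or Prohibited).
nvidia-smi --query-gpu=compute_mode --format=csv

# Reset the compute mode to DEFAULT if necessary.
sudo nvidia-smi -c DEFAULT
```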

Performance of nested parallelism on Volta

On Volta, a nested level-1 parallel region or a nested SIMD region runs sequentially when only a subset of the threads in a warp are active at the point where the nested region is reached. This behavior accommodates Volta's independent thread scheduling and ensures forward progress in the presence of thread divergence.

Limitation caused by warp divergence and thread dependency

When the control paths of two or more CUDA threads in the same warp diverge, those CUDA threads are serialized. IBM XL C/C++ for Linux generates one CUDA thread for each OpenMP thread. As a result, if the progress of OpenMP thread A depends on OpenMP thread B performing some action, the program might hang when thread A executes first while thread B stalls because of warp divergence.

Warp divergence occurs when two threads of the same warp take different execution paths at a branch instruction: one thread takes the branch and the other does not. The CUDA hardware then serializes the two threads until their execution paths converge again.

The following two examples demonstrate the problem:

Example 1
#include <omp.h>

int main() {
  #pragma omp target 
  {
    int i = 0;
    #pragma omp parallel
    {
      if (omp_get_thread_num() == 0) {   // threadIdx.x 0 takes this path.
        #pragma omp atomic
        i++;
      } 
      else {                       // The other threads take this path.
        int local_i;
        do {
          #pragma omp atomic read
          local_i = i;
        } while (local_i == 0);
      }
    }
  }
}
Example 2
int main() {
  #pragma omp target 
  {
    int i = 0;
    #pragma omp parallel
    #pragma omp sections
    {
      #pragma omp section   // threadIdx.x 0 executes this section.
        #pragma omp atomic
        i++;
  
      #pragma omp section  // threadIdx.x 1 executes this section.
      {
        int local_i;
        do {
          #pragma omp atomic read
          local_i = i;
        } while (local_i == 0);
      }
    }
  }
}

