What you missed from IBM XL Compilers by not attending Supercomputing
Nicole Trudeau
The IBM XL Compilers team attended the SuperComputing 2017 conference in November to promote the latest OpenMP compiler support available in the newest compiler versions, released December 15. The IBM XL compilers have been key in accelerating several CORAL benchmarks written using the OpenMP 4.5 programming model. These benchmarks were accelerated on IBM POWER systems with NVIDIA GPUs, offloading key computation to the GPUs easily and with excellent performance. If you haven't heard of CORAL (the Collaboration of Oak Ridge, Argonne, and Livermore), it is a joint effort among three of the US Department of Energy's national laboratories to build the Summit and Sierra high-performance computing systems.
OFFLOAD TO GPU FOR A PERFORMANCE BOOST
We showcased the latest GPU-accelerated performance results for the LLNL CORAL benchmark LULESH, using OpenMP 4.5 for offloading. Using the IBM XL compilers, we demonstrated a ~10x performance improvement by offloading computation synchronously to the GPUs, compared to CPU-only execution. Using the latest compiler features to offload execution asynchronously, we achieved an additional 24% performance gain, for a ~12x overall speedup over CPU-only (see Figure 2).
Figure 2: Performance of the LLNL CORAL benchmark LULESH when compiled with the IBM XL C/C++ compiler V13.1.6 on a single node for CPU-only (2 POWER8s), GPU (synchronous offload to 4 Pascal P100 GPUs), and GPU (asynchronous offload to 4 Pascal P100 GPUs). Larger is better for this Zones/Second figure-of-merit metric.
WRITE IN OPENMP 4.5 FOR EASY GPU OFFLOADING
In client briefings and at our booth, we met many potential customers with a strong interest in our OpenMP 4.5 compiler support who wanted to experiment with offloading computation in their own applications to IBM POWER9+NVIDIA Volta systems. They wondered what it takes to program CPU/GPU interaction: is it complicated? What performance can be expected? Most knew of the low-level CUDA programming language and were worried about the complexity of programming at that level.
Like CUDA, OpenMP 4.5 lets you parallelize a sequential application and offload its compute-intensive parts to the GPU. Unlike CUDA, however, OpenMP 4.5 is not a separate low-level language: it is a directive-based model, expressed as pragmas (in C and C++) and directives (in Fortran) added to your existing code. This higher-level approach can be incorporated into existing programs far more easily.
We were happy to demonstrate that you can use OpenMP 4.5 to offload to the GPU with minimal overhead compared to CUDA. Using the BabelStream benchmark, we showed that OpenMP 4.5 and CUDA performance are within 1% of each other (see Figure 3).
Figure 3: OpenMP 4.5 and CUDA performance are within 1% of each other. The BabelStream benchmark exists in two forms - one written with OpenMP C/C++ and one written with CUDA C/C++. Performance is measured as throughput in MB/sec. Experiments were done on an IBM POWER8+NVIDIA Pascal system using the IBM XL C/C++ for Linux V13.1.6 compiler and the NVIDIA nvcc V9.1.76 compiler.
GPU PERFORMANCE IS SCALABLE