Deep Learning on OpenPOWER: Building Optimized Libraries for Deep Learning on OpenPOWER Linux Systems
The Machine Learning and Deep Learning project in IBM Systems is a broad effort to build a co-optimized stack of hardware and software to make IBM Power Systems the best platform to develop and deploy cognitive applications. As part of this project, IBM has developed new processors, systems, and a co-optimized software stack uniquely optimized for AI applications.
The first offerings for this new era of cognitive computing are the S822LC, our first server designed from the ground up for cognitive computing, and the PowerAI distribution of AI tools and libraries for the Ubuntu and Red Hat Linux operating systems. Most data scientists and AI practitioners building cognitive solutions prefer to use the pre-built, pre-optimized deep learning frameworks of the PowerAI distribution.
In addition to creating the binary distribution of DL frameworks, we have also been working with the Open Source community to enable the open source frameworks and libraries to be built directly from their repositories, so that Deep Learning users can harness the power of the OpenPOWER ecosystem. With the introduction of little-endian OpenPOWER Linux, installation of open source applications on Power has never been easier.
If you need to build optimized libraries from source, this blog provides instructions on building and installing Optimized Libraries for Deep Learning on (little-endian) OpenPOWER Linux, such as Red Hat Enterprise Linux 7.1, SUSE Linux Enterprise Server 12, Ubuntu 14.04, and subsequent releases. These instructions are primarily focused on providing improved numeric libraries of importance for Deep Learning frameworks: libraries implementing the BLAS basic linear algebra interfaces (in particular ATLAS and OpenBLAS), and an accelerated Power math library providing optimized scalar and vectorized implementations of common mathematics functions.
While mathematics libraries (such as the system library libm) and BLAS libraries, e.g., based on ATLAS or OpenBLAS, are available with many Linux operating system distributions, these distributions often lack many of the newest and best code improvements, which are particularly important for high-performance computing applications such as Deep Learning. Thus, you can significantly improve Deep Learning performance by installing the advanced and highly tuned libraries described here.
Installing the Mathematical Acceleration Subsystem (MASS) for Linux
To accelerate base mathematics functions by exploiting the advanced capabilities of the Power vector-scalar instruction set, IBM has made the MASS vector library freely available. MASS implements the common libmvec interfaces used by the GNU Compiler Collection and can be accessed from GCC compilers using the -mveclibabi=mass option. You can find out more about MASS at the MASS for Linux Home Page.
For example, on Ubuntu 16.04 (also known under the distribution name “xenial”) use the following commands to configure the MASS repository and install MASS 8.1.4:
$ sudo aptitude install software-properties-common
$ sudo apt-add-repository "deb http://public.dhe.ibm.com/software/server/POWER/Linux/xl-compiler/eval/ppc64le/ubuntu/ xenial main"
$ sudo aptitude update
$ sudo aptitude install libxlmass-devel.8.1.4
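You can verify the installation by listing the MASS library directory (a quick check, assuming the default install location used throughout this post); you should see the libmass, libmassvp8, and libmass_simdp8 libraries referenced in the link commands below:
$ ls /opt/ibm/xlmass/8.1.4/lib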
For example, on Red Hat or CentOS, after obtaining the repomd.xml.key and ibm-xl-compiler-eval.repo files from the IBM XL compiler evaluation repository, use the following commands to configure the MASS repository and install MASS 8.1.3:
$ sudo rpm --import repomd.xml.key
$ sudo cp ibm-xl-compiler-eval.repo /etc/yum.repos.d/
$ sudo yum install libxlmass-devel.8.1.3
To compile applications to use the MASS libraries, invoke the GNU C compiler with the -mveclibabi=mass option. To link, specify the link-time options -L/opt/ibm/xlmass/8.1.4/lib -lmass -lmassvp8 -lmass_simdp8. Thus, a program may be compiled and linked to use MASS as follows:
$ gcc -O3 -mveclibabi=mass -c example.c
$ gcc -o example example.o -L/opt/ibm/xlmass/8.1.4/lib -lmass -lmassvp8 -lmass_simdp8
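As a minimal illustration of the kind of code that benefits, the following example.c (an illustrative sketch, not part of the MASS documentation) contains a loop over scalar math functions that GCC can route to the MASS vector routines when compiled and linked with the commands above:

/* example.c: a loop over math functions that GCC may vectorize through
 * the MASS vector library when built with -O3 -mveclibabi=mass.
 * Depending on the GCC version, -ffast-math (or -funsafe-math-optimizations)
 * may also be required for the vectorized MASS calls to be generated. */
#include <math.h>
#include <stdio.h>

#define N 4096

int main(void)
{
    static double x[N], y[N];
    int i;

    for (i = 0; i < N; i++)
        x[i] = (double)i / N;

    /* This loop is a candidate for calling the vectorized exp and sin
     * from the MASS vector library instead of the scalar versions from libm. */
    for (i = 0; i < N; i++)
        y[i] = exp(x[i]) * sin(x[i]);

    printf("y[%d] = %f\n", N - 1, y[N - 1]);
    return 0;
}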
Many of the Deep Learning packages already include code to use the MASS libraries to improve performance on Power. For example, when building Caffe after installing MASS on your system, you can enable MASS by setting the USE_MASS flag in the build configuration file Makefile.config around line 12:
# MASS switch (uncomment to build with IBM Mathematical Acceleration Subsystem)
USE_MASS := 1
Also, when building Caffe, use the Makefile.config configuration file to specify the location of the MASS libraries on your system around line 72:
# MASS lib directories
MASS_LIB := /opt/ibm/xlmass/8.1.4/lib
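With USE_MASS enabled and MASS_LIB pointing at your MASS installation, you can then rebuild Caffe in the usual way, for example (standard Caffe Makefile targets; adjust the parallelism to your machine):
$ make all -j8
$ make test -j8
$ make runtest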
Installing OpenBLAS
OpenBLAS is a high-performance open source implementation of the BLAS basic linear algebra interfaces. You can find more information about OpenBLAS at the OpenBLAS Project Homepage. Recent versions of OpenBLAS contain significant enhancements, including support for the POWER vector-scalar instruction set, which was designed to accelerate numerically intensive algorithms such as the BLAS routines. To get the best performance for these libraries, download and build the latest release of OpenBLAS. Starting with release 0.2.19, the OpenBLAS master repository includes these enhancements.
To install the latest version of OpenBLAS, download the OpenBLAS source code as follows:
$ git clone https://github.com/xianyi/OpenBLAS.git
$ cd OpenBLAS
You can then build OpenBLAS for POWER8 with the command:
$ make TARGET=POWER8
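After the build completes, you can install the library to a prefix of your choice and run a quick link-and-correctness check. The prefix and the small test program below are illustrative assumptions, not official OpenBLAS instructions:

$ sudo make PREFIX=/opt/openblas install

/* dgemm_test.c: minimal sanity check that the newly built OpenBLAS links
 * and returns the expected matrix product. */
#include <stdio.h>
#include <cblas.h>

int main(void)
{
    /* Row-major 2x2 matrices: C = 1.0 * A * B + 0.0 * C */
    double A[4] = { 1.0, 2.0, 3.0, 4.0 };
    double B[4] = { 5.0, 6.0, 7.0, 8.0 };
    double C[4] = { 0.0, 0.0, 0.0, 0.0 };

    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                2, 2, 2, 1.0, A, 2, B, 2, 0.0, C, 2);

    /* Expected output: 19 22 on the first row and 43 50 on the second */
    printf("%g %g\n%g %g\n", C[0], C[1], C[2], C[3]);
    return 0;
}

Compile and run the test against the installed library (you may need to add the library directory to LD_LIBRARY_PATH):

$ gcc -O2 dgemm_test.c -I/opt/openblas/include -L/opt/openblas/lib -lopenblas -o dgemm_test
$ ./dgemm_test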
Building OpenBLAS with MASS Support
The IBM MASS library consists of a set of mathematical functions for C, C++, and Fortran-language applications that are tuned for optimum performance on POWER architectures. Start by installing MASS on your system as described in the section on installing and using MASS.
Depending on the version of MASS installed on your system, you may have to update the MASSPATH variable used by the build process around line 46 of Makefile.power:
MASSPATH = /opt/ibm/xlmass/8.1.3/lib
Then, enable MASS by setting the variable USE_MASS to 1, either by editing Makefile.power or by specifying its value on the make command line:
$ make USE_MASS=1 TARGET=POWER8
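As with the plain build, the resulting library can then be installed to a prefix of your choosing (the path below is only an example):
$ sudo make PREFIX=/opt/openblas install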
Building ATLAS on OpenPOWER
The ATLAS (Automatically Tuned Linear Algebra Software) project is an ongoing research effort focused on applying empirical techniques to provide portable performance. To achieve this, ATLAS includes a self-tuning framework that optimizes its high-performance open source implementation of the BLAS interfaces for the system on which it is being installed. You can find more information about ATLAS at the ATLAS Project Homepage.
Recent versions of ATLAS contain significant enhancements, including support for the POWER vector-scalar instruction set, which was designed to accelerate numerically intensive algorithms such as the BLAS routines. The recently released ATLAS 3.10.3 is the most recent stable distribution and adds many improvements for little-endian OpenPOWER Linux systems. The ATLAS developer branch includes the most recent enhancements; use release 3.11.16 or later to get optimized support for the POWER vector-scalar instruction set.
To get the best performance for the ATLAS libraries, download and build the latest release of ATLAS on the OpenPOWER Linux system on which you will be using it; the installation framework will then optimize the ATLAS library for that particular configuration, including CPU generation, cache and memory sizes, and latencies. In particular, the ATLAS project lead reports significant speedups starting with ATLAS 3.11.36, e.g., for general matrix-matrix multiply (GEMM), which is critically important for Deep Learning performance:
I have just released 3.11.36. The only performance improvement over
3.11.35 is for power, where my single precision performance went from
around 66% to 86%. More specifically. 3.11.36 for serial gemm of
N=6000, my power8 gets (% of peak):
dgemm : 87%
zgemm : 88%
sgemm : 86%
cgemm : 88%
Start by downloading the compressed tar archive. You can access the ATLAS source repository with your browser via the ATLAS Project Homepage; the software link on that page allows you to download the tarfile, and the explicit download link is https://sourceforge.net/project/showfiles.php?group_id=23725. Once you have obtained the tarfile, untar it in the directory where you want to keep the ATLAS source directory. The tarfile creates a subdirectory called ATLAS, which you may want to rename to something less generic. For instance, assuming you have saved the tarfile to ~/dload and want to put the source in ~/numerics, you can create ATLAS's source directory (SRCdir) with the following commands:
$ cd ~/numerics
$ bunzip2 -c ~/dload/atlas3.10.3.tar.bz2 | tar xfm -
$ mv ATLAS ATLAS3.10.3
To build and install ATLAS, turn off CPU throttling and follow these basic steps of an ATLAS install:
bunzip2 -c ~/dload/atlas3.10.3.tar.bz2 | tar xfm -   # create SRCdir
mv ATLAS ATLAS3.10.3                                  # give SRCdir a unique name
cd ATLAS3.10.3                                        # enter SRCdir
mkdir Linux_POWER8                                    # create BLDdir
cd Linux_POWER8                                       # enter BLDdir
# configure command; --prefix selects the install dir
../configure -b 64 --force-tids="4 0 8 16 24" -D c -DWALL \
    --prefix=/home/whaley/lib/atlas \
    --with-netlib-lapack-tarfile=/home/whaley/dload/lapack-3.4.1.tgz
make build      # tune & build lib
make check      # sanity check correct answer
make ptcheck    # sanity check parallel
make time       # check if lib is fast
make install    # copy libs to install dir
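Once make install has completed, you can link programs against the installed ATLAS libraries. As a sketch, reusing the hypothetical dgemm_test.c from the OpenBLAS section above (ATLAS also provides the CBLAS interface, and the prefix matches the --prefix used in the configure step):

$ gcc -O2 dgemm_test.c -I/home/whaley/lib/atlas/include \
      -L/home/whaley/lib/atlas/lib -lcblas -latlas -o dgemm_test
$ ./dgemm_test

To use the threaded BLAS instead, link with -lptcblas -latlas -lpthread.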
For example, an IBM POWER7 system with 8 physical cores offers 64 SMT threads. If you install with the default flags, your parallel speedup for moderate-sized DGEMMs is around 4.75. On the other hand, if you add:
--force-tids="8 0 8 16 24 32 40 48 56"
Then the parallel DGEMM speedup for moderate sized problems is more like 6.5.
If you build on a POWER8 machine with four physical cores that are again shared 8-way, you similarly need to add the following to the configure command:
--force-tids="4 0 8 16 24"
When using the --force-tids option, the first number specifies the number of physical cores (in the examples above, 8 physical POWER7 cores and 4 physical POWER8 cores), and the following numbers are the thread IDs to use.
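One way to determine the right values for your system is to inspect the core and thread topology before configuring ATLAS, for example with the standard lscpu and powerpc-utils tools (shown here as a sketch):

$ lscpu | grep -E '^CPU\(s\)|Thread|Core|Socket'
$ ppc64_cpu --smt

On a POWER8 with four cores running in SMT-8 mode, the 32 logical CPUs are numbered consecutively per core, so the first hardware thread of each core is CPU 0, 8, 16, and 24, which is exactly what the --force-tids example above encodes.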
Building netlib-java for Spark on OpenPOWER
In addition to the native dynamic libraries described above, we have also ported netlib-java for Spark to OpenPOWER Linux systems. We have submitted those modifications to the maintainer of the fommil netlib-java project. In the meantime, we have also created a PowerPC-enabled fork of netlib-java at https://github.com/ibmsoe/netlib-java.
Build instructions for the unmodified netlib-java project may be found at the following blog:
See what you can do with Deep Learning on OpenPOWER
I invite you to explore Deep Learning on OpenPOWER systems and exploit the full potential of an open ecosystem built on collaborative innovation, as we continue to co-optimize and expand the hardware and software stack for Deep Learning on Power.
I look forward to hearing about the performance you get from Deep Learning on OpenPOWER. Share how you plan to use Deep Learning on OpenPOWER and how it will enable you to build the next generation of cognitive applications by posting in the comments section below.
Dr. Michael Gschwind is Chief Engineer for Machine Learning and Deep Learning for IBM Systems where he leads the development of hardware/software integrated products for cognitive computing. During his career, Dr. Gschwind has been a technical leader for IBM’s key transformational initiatives, leading the development of the OpenPOWER Hardware Architecture as well as the software interfaces of the OpenPOWER Software Ecosystem. In previous assignments, he was a chief architect for Blue Gene, POWER8, POWER7, and Cell BE. Dr. Gschwind is a Fellow of the IEEE, an IBM Master Inventor and a Member of the IBM Academy of Technology.