
IBM Power Systems

Machine Learning / Deep Learning performance proof-points

Training performance for deep learning networks using common frameworks.

IBM Power System AC922 using Snap ML versus tested scikit-learn on x86 systems

For the systems and workloads compared, Snap ML running on IBM® Power® System AC922 servers (based on IBM POWER9™ processor technology) with NVIDIA Tesla V100 GPUs (NVLink 2.0) delivers up to an 80x speedup (for example, Lasso on a large model/data set) when training machine learning algorithms to accuracy, compared to tested scikit-learn on the x86 combination.¹

  • Snap ML works efficiently even when the model’s memory footprint or data set exceeds the GPU memory.
  • scikit-learn was used on x86 because stand-alone cuML on x86 fails when the data set or run-time artifacts exceed the GPU memory during testing.
  • The Price Prediction data set from Kaggle (large, feature-rich, and sparse) was used.³

System configuration

IBM Power System AC922
  • 40 cores (two 20-core chips), POWER9 with NVLink 2.0
  • 3.8 GHz, 1 TB memory
  • Single Tesla V100 GPU, 16 GB GPU memory
  • Red Hat Enterprise Linux 7.7 for Power Little Endian (POWER9) with CUDA 10.1.243
  • nvidia-driver-418.67
  • Software: WML CE 1.6.2; pai4sk 1.5.0, NumPy 1.16.5

2x Intel Xeon Gold 6150
  • 36 cores (two 18-core chips)
  • 2.70 GHz, 512 GB memory
  • Single Tesla V100 GPU, 16 GB GPU memory
  • Ubuntu 18.04.3 LTS (4.15.0-54-generic) with CUDA 10.1.243
  • nvidia-driver-418.67
  • Software: scikit-learn 0.21.3, NumPy 1.17.3

Notes:

  1. The results were obtained by running each experiment (each script) 10 times and taking the training times of the best 5 of those 10 runs. (A sketch of this timing methodology follows these notes.)
  2. Refer to the benchmarking and preprocessing scripts at: https://github.com/IBM/powerai/tree/master/benchmarks/SnapML/linear_models
  3. Price Prediction data set details: nrows=1185328, cols=17052. The data was passed in compressed sparse row (CSR) format to both Snap ML and scikit-learn. Source: https://www.kaggle.com/c/mercari-price-suggestion-challenge/data
  4. The results are for runs on a single GPU.
  5. Results can vary with different data sets with different levels of sparseness and model parameters.
  6. Test date: December 1, 2019
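
As a rough illustration of the "best 5 of 10 runs" methodology in note 1, the Python sketch below times a scikit-learn Lasso fit on synthetic CSR data; the Snap ML side of the comparison would swap in the equivalent pai4sk estimator. The actual benchmarking code is in the repository from note 2 — the synthetic data and estimator settings here are placeholders, not the published configuration.

    # Minimal sketch of the "best 5 of 10 runs" timing methodology (note 1).
    # Synthetic CSR data stands in for the Kaggle Price Prediction set (note 3).
    import time
    import numpy as np
    from scipy.sparse import random as sparse_random
    from sklearn.linear_model import Lasso

    X = sparse_random(100000, 1000, density=0.01, format="csr", random_state=0)
    y = np.random.RandomState(0).rand(100000)

    times = []
    for _ in range(10):
        model = Lasso(alpha=1.0)   # the Snap ML run would use the pai4sk estimator instead
        start = time.time()
        model.fit(X, y)
        times.append(time.time() - start)

    best5 = sorted(times)[:5]
    print("mean training time, best 5 of 10 runs: %.3f s" % np.mean(best5))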

IBM Power System AC922 using Snap ML versus tested cuML on x86 systems

For the systems and workload compared, the Snap ML library on IBM® Power® System AC922 servers (based on IBM POWER9™ processor technology) with an NVIDIA Tesla V100 GPU (NVLink 2.0) provides a speedup when training machine learning algorithms (such as Ridge, Lasso, and Logistic regression) to accuracy, compared to tested cuML on x86 systems.¹

Snap ML with Power AC922 outperforms tested cuML with x86 combinations on:

  • Feature-rich data sets such as Price Prediction and Epsilon (utilizing more features enables faster time to accuracy).
  • Sparse data sets such as Taxi and Price Prediction (handled natively by the Snap ML generalized linear models).

System configuration

IBM Power System AC922
  • 40 cores (two 20-core chips), POWER9 with NVLink 2.0
  • 3.8 GHz, 1 TB memory
  • Single Tesla V100 GPU, 16 GB GPU memory
  • Red Hat Enterprise Linux 7.7 for Power Little Endian (POWER9) with CUDA 10.1.243
  • nvidia-driver-418.67
  • Software: WML CE 1.6.2; pai4sk 1.5.0, NumPy 1.16.5

2x Intel Xeon Gold 6150
  • 36 cores (two 18-core chips)
  • 2.70 GHz, 512 GB memory
  • Single Tesla V100 GPU, 16 GB GPU memory
  • Ubuntu 18.04.3 LTS (4.15.0-54-generic) with CUDA 10.1.243
  • nvidia-driver-418.67
  • Software: cuML 0.10.0, NumPy 1.17.3

Notes:

  1. The results were obtained by running each experiment (each script) 10 times and taking the training times of the best 5 of those 10 runs.
  2. Refer to the benchmarking and preprocessing scripts at: https://github.com/IBM/powerai/tree/master/benchmarks/SnapML/linear_models
  3. The location of the data sets is specified as part of the above URL. The cropped Price Prediction data set was obtained by setting the max_features parameter to 500.
  4. Epsilon data set: Snap ML/Power AC922 speedup over cuML/x86 is 0.92X with Ridge regression, 1.85X with Lasso regression, and 3.15X with Logistic regression.
  5. Price Prediction data set: Snap ML/Power AC922 speedup over cuML/x86 is 164X with Ridge regression, 278X with Lasso regression, and 68X with Logistic regression.
  6. Taxi data set: Snap ML/Power AC922 speedup over cuML/x86 is 1.7X with Ridge regression, 3.9X with Lasso regression, and 9.8X with Logistic regression.
  7. Higgs data set: Snap ML/Power AC922 speedup over cuML/x86 is 0.13X with Ridge regression, 1.16X with Lasso regression, and 0.79X with Logistic regression.
  8. The results are for runs on a single GPU.
  9. The experiment used the ndarray format for cuML for all data sets; for Snap ML, it used the ndarray format for dense data sets and the compressed sparse row (CSR) format for sparse data sets. (A sketch of the dense-versus-CSR footprint follows these notes.)
  10. Results can vary with different data sets with different levels of sparseness and model parameters.
  11. Test date: December 1, 2019
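
Note 9's input-format distinction matters because sparse data stored densely can dwarf GPU memory. The short sketch below (with sizes chosen only for illustration) compares the footprint of a dense ndarray against CSR storage for the same matrix:

    # Rough memory comparison: dense ndarray vs. compressed sparse row (CSR)
    # storage for the same sparse matrix (see note 9). Sizes are illustrative.
    import numpy as np
    from scipy.sparse import random as sparse_random

    X = sparse_random(50000, 17052, density=0.001, format="csr",
                      dtype=np.float64, random_state=0)

    dense_bytes = X.shape[0] * X.shape[1] * 8                       # every zero stored
    csr_bytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes  # non-zeros only

    print("dense ndarray: %.2f GB" % (dense_bytes / 1e9))   # ~6.8 GB
    print("CSR storage:   %.3f GB" % (csr_bytes / 1e9))     # ~0.01 GB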

DeepLabv3+ image segmentation model with TFLMSv2

For the systems and workload compared, IBM® Power® System AC922 servers (based on IBM POWER9™ processor technology) with NVIDIA Tesla V100 GPUs connected through NVLink 2.0, along with WML-CE TensorFlow Large Model Support (LMS), can provide:

  • 3.13x increase in throughput compared to tested x86 systems with four GPUs
  • 2.77x increase in throughput compared to tested x86 systems with eight GPUs
    Training a DeepLabv3+ based model with distributed deep learning (DDL) and TensorFlow LMS on Power AC922 using the PASCAL Visual Object Classes (VOC) 2012 data set, with an image resolution of 2100 x 2100 and batch size 1
  • Critical machine learning (ML) capabilities such as regression, nearest neighbor, recommendation systems, and clustering can utilize system memory across NVLink 2.0:
    • NVLink 2.0 enables enhanced host-to-GPU communication
    • IBM's LMS for deep learning enables seamless use of host and GPU memory for improved performance
 


System configuration

IBM Power System AC922
  • 40 cores (two 20-core chips), POWER9 with NVLink 2.0
  • 3.8 GHz, 1 TB memory
  • Four Tesla V100 GPUs, 16 GB per GPU
  • Red Hat Enterprise Linux (RHEL) 7.6 for Power Little Endian (POWER9) with CUDA 10.1.168 / cuDNN 7.5.1
  • nvidia-driver 418.67
  • Software: IBM TFLMS (POWER9); TFLMSv2: WML-CE 1.6.1, tensorflow-large-model-support 2.0.1

2x Intel Xeon E5-2698
  • 40 cores (two 20-core chips)
  • 2.40 GHz, 768 GB memory
  • Eight Tesla V100 GPUs, 16 GB per GPU
  • Ubuntu 16.04.5 with CUDA 10.1.168 / cuDNN 7.5.1
  • nvidia-driver 418.39
  • Software: TFLMSv2: WML-CE 1.6.1, tensorflow-large-model-support 2.0.1

Notes:

  1. Results are based on IBM internal measurements running 1000 iterations of the DeepLabv3+ model (batch size=1, resolution=2100 x 2100) on the PASCAL VOC 2012 data set (http://host.robots.ox.ac.uk/pascal/VOC/voc2012/).
  2. DeepLabv3+: https://github.com/naveenmiriyalu/powerai/tree/wmlce-1.6.1/examples/performance_models
  3. TFLMSv2 parameters: swapout_threshold=1, swapin_ahead=1, swapin_groupby=0, sync_mode=0. (A usage sketch follows these notes.)
  4. Results can vary with different data sets and model parameters.
  5. Test date: July 19 and July 20, 2019
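
The note 3 parameters correspond to arguments of the LMS object in the tensorflow-large-model-support 2.0.1 package shipped with WML CE 1.6.1. The sketch below shows how they would be wired into a Keras-style training run; treat the attachment mechanism and the model/data names as assumptions, and consult the WML CE documentation for the authoritative usage.

    # Hedged sketch: applying the TFLMSv2 parameters from note 3.
    # Assumes WML CE 1.6.1's tensorflow_large_model_support package, whose
    # LMS object can be passed as a Keras callback; the model and data
    # below are placeholders, not the DeepLabv3+ benchmark itself.
    from tensorflow_large_model_support import LMS

    lms = LMS(swapout_threshold=1,  # swap tensors out of GPU memory aggressively
              swapin_ahead=1,       # start swapping a tensor back in 1 op ahead of use
              swapin_groupby=0,     # do not group swap-in operations
              sync_mode=0)          # fully asynchronous host<->GPU copies

    # model = build_deeplabv3_plus(...)   # hypothetical model constructor
    # model.fit(train_data, callbacks=[lms])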

DeepLabv3+ image segmentation model with PyTorch LMS

For the systems and workload compared, IBM® Power® System AC922 servers (based on IBM POWER9™ processor technology) with NVIDIA Tesla V100 GPUs connected through NVLink 2.0, along with WML-CE PyTorch Large Model Support (LMS), can provide:

  • 2.9x increase in throughput compared to tested x86 systems with four GPUs
  • 2.4x increase in throughput compared to tested x86 systems with eight GPUs
    Training a DeepLabv3+ based model with distributed deep learning (DDL) and PyTorch LMS on Power AC922 using the PASCAL Visual Object Classes (VOC) 2012 data set, with an image resolution of 2200 x 2200 and batch size 2
  • Critical machine learning (ML) capabilities such as regression, nearest neighbor, recommendation systems, and clustering can utilize system memory across NVLink 2.0:
    • NVLink 2.0 enables enhanced host-to-GPU communication
    • IBM's LMS for deep learning enables seamless use of host and GPU memory for improved performance

System configuration

IBM Power System AC922
  • 40 cores (two 20-core chips), POWER9 with NVLink 2.0
  • 3.8 GHz, 1 TB memory
  • Four Tesla V100 GPUs, 16 GB per GPU
  • Red Hat Enterprise Linux (RHEL) 7.6 for Power Little Endian (POWER9) with CUDA 10.1.168 / cuDNN 7.5.1
  • nvidia-driver 418.67
  • Software: IBM PyTorch (POWER9); WML-CE 1.6.1, PyTorch 1.1.0

2x Intel Xeon E5-2698
  • 40 cores (two 20-core chips)
  • 2.40 GHz, 768 GB memory
  • Eight Tesla V100 GPUs, 16 GB per GPU
  • Ubuntu 18.04.2 with CUDA 10.1.168 / cuDNN 7.5.1
  • nvidia-driver 418.67
  • Software: WML-CE 1.6.1, PyTorch 1.1.0

Notes:

  1. Results are based on IBM internal measurements running 1000 iterations of the DeepLabv3+ model (batch size=2, resolution=2200 x 2200) on the PASCAL VOC 2012 data set (http://host.robots.ox.ac.uk/pascal/VOC/voc2012/).
  2. DeepLabv3+: https://github.com/naveenmiriyalu/powerai/tree/wmlce-1.6.1/examples/performance_models
  3. PyTorch LMS parameters: limit_lms=0, size_lms=1MB. (A usage sketch follows these notes.)
  4. Results can vary with different data sets and model parameters.
  5. Test date: July 27 and July 28, 2019
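
Note 3's limit_lms and size_lms map onto the LMS controls that the WML CE build of PyTorch adds under torch.cuda. These calls are IBM additions and are not present in stock PyTorch, so the sketch below assumes the WML CE 1.6.1 build.

    # Hedged sketch: enabling PyTorch Large Model Support as shipped in
    # WML CE 1.6.1 (the torch.cuda.*_lms calls are IBM extensions and do
    # not exist in community PyTorch builds).
    import torch

    torch.cuda.set_enabled_lms(True)          # turn on large model support
    torch.cuda.set_limit_lms(0)               # limit_lms=0: no cap on LMS-managed allocations (note 3)
    torch.cuda.set_size_lms(1 * 1024 * 1024)  # size_lms=1MB: minimum allocation size LMS manages (note 3)

    # ... then build and train DeepLabv3+ as usual; LMS pages tensors
    # between host and GPU memory transparently during training.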

IBM Data Science Experience (DSX) Local on IBM® Power® System AC922

Accelerate data scientist productivity and drive faster insights with IBM DSX Local on IBM Power System AC922

For the systems and workload compared:

  • Power AC922 (based on the IBM POWER9™ processor technology with NVIDIA GPUs) completes running GPU-accelerated K-means clustering with 15 GB data in half the time of tested x86 systems (Skylake 6150 with NVIDIA GPUs).
  • Power AC922 delivers 2x faster insights for GPU-accelerated K-means clustering workload than Intel® Xeon® SP Gold 6150-based servers.
  • IBM Power Systems™ cluster with Power LC922 (CPU optimized) and Power AC922 (GPU accelerated) provides an optimized infrastructure for DSX Local.

System configuration

Power AC922
  • IBM POWER9, 2x 20 cores / 3.78 GHz, 4x NVIDIA Tesla V100 GPUs with NVLink
  • 1 TB memory, each user assigned 180 GB in DSX Local
  • 2x 960 GB SSD
  • 10 GbE two-port
  • RHEL 7.5 for POWER9
  • Data Science Experience Local 1.2 fp3

Two-socket Intel Xeon Gold 6150
  • Gold 6150, 2x 18 cores / 2.7 GHz, 4x NVIDIA Tesla V100 GPUs
  • 768 GB memory, each user assigned 180 GB in DSX Local
  • 2x 960 GB SSD
  • 10 GbE two-port
  • RHEL 7.5
  • Data Science Experience Local 1.2 fp3

Notes:

  • The results are based on IBM internal testing of the core computational step to form five clusters using a 5270410 x 301 float64 data set (15 GB), running the K-means algorithm using Apache Python and TensorFlow. Results are valid as of June 13, 2018, and the test was conducted under laboratory conditions with speculative execution controls to mitigate user-to-kernel and user-to-user side-channel attacks on both systems. Individual results can vary based on workload size, use of storage subsystems, and other conditions.
  • Apache, Apache Python, and associated logos are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks.
  • Download the workload and try it yourself: https://github.com/theresax/DSX_perf_eval/tree/master/clustering. Note: You need to use your own data with dimensions similar to those described in the README.md file. (A simplified sketch of the K-means core step follows these notes.)
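
For a sense of what the benchmark times, the core computational step of K-means (assign each point to the nearest of five centroids, then recompute the centroids) is sketched below in plain NumPy with a small synthetic stand-in for the 15 GB data set; the actual GPU-accelerated workload is in the repository linked above.

    # Minimal NumPy sketch of the K-means core step (Lloyd's algorithm, k=5).
    # The real benchmark runs a GPU-accelerated version on a 5270410 x 301
    # float64 data set; the synthetic matrix here is only a stand-in.
    import numpy as np

    rng = np.random.RandomState(0)
    X = rng.rand(10000, 301)
    centroids = X[rng.choice(len(X), 5, replace=False)]

    for _ in range(10):
        # squared Euclidean distance from every point to every centroid
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # recompute each centroid as the mean of its assigned points
        centroids = np.stack([X[labels == k].mean(axis=0) for k in range(5)])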

IBM Data Science Experience (DSX) Local on IBM® Power® System LC922

Accelerate Data Scientist productivity and drive faster insights with DSX Local on IBM Power System LC922

For the systems and workload compared:

  • Power LC922 running K-means clustering with 1 GB of data scales to 2x more users than tested x86 systems.
  • Power LC922 supports 2x more users at a faster response time than Intel® Xeon® SP Gold 6140-based servers.
  • Power LC922 delivers over 41% faster insights for the same number of users (four to eight).

System configuration

Power LC922
  • IBM POWER9™, 2x 20 cores / 2.6 GHz, 512 GB memory
  • 10x 4 TB HDD
  • 10 GbE two-port
  • RHEL 7.5 for POWER9
  • Data Science Experience Local 1.1.2

Two-socket Intel Xeon SP Gold 6140
  • Gold 6140, 2x 18 cores / 2.4 GHz, 512 GB memory
  • 10x 4 TB HDD
  • 10 GbE two-port
  • RHEL 7.5
  • Data Science Experience Local 1.1.2

Notes:

  • The test results are based on IBM internal testing of the core computational step to form five clusters using a 350694 x 301 float64 data set (1 GB), running the K-means algorithm using Apache Python and TensorFlow. Results are valid as of April 21, 2018, and the test was conducted under laboratory conditions with speculative execution controls to mitigate user-to-kernel and user-to-user side-channel attacks on both systems. Individual results can vary based on workload size, use of storage subsystems, and other conditions.
  • Apache, Apache Python, and associated logos are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks.

Chainer on IBM POWER9™ with Nvidia Tesla V100 delivers a 3.7X reduction in AI model training time versus tested x86 systems

Large Model Support (LMS) uses system memory and GPU memory to support more complex and higher resolution data. Maximize research productivity running training for medical/satellite images with Chainer with LMS on POWER9 with Nvidia V100 GPUs.

For the systems and workload compared:

  • 3.7X reduction versus tested x86 systems in the runtime of 1000 iterations to train on medical/satellite images
  • Critical machine learning (ML) capabilities such as regression, nearest neighbor, recommendation systems, clustering, and so on operate on more than just the GPU memory
    • NVLink 2.0 enables enhanced host-to-GPU communication
    • LMS for deep learning from IBM enables seamless use of host and GPU memory for improved performance

System configuration

IBM Power System AC922
  • POWER9 with NVLink 2.0
  • 40 cores (two 20-core chips)
  • 2.25 GHz, 1024 GB memory
  • Four Tesla V100 GPUs
  • RHEL 7.4 for Power LE (POWER9)
  • CUDA 9.1 / cuDNN 7

2x Intel Xeon E5-2640 v4
  • Xeon E5-2640 v4
  • 20 cores (two 10-core chips) / 40 threads
  • 2.4 GHz, 1024 GB memory
  • Four Tesla V100 GPUs
  • Ubuntu 16.04
  • CUDA 9.0 / cuDNN 7

Caffe on IBM POWER9™ with Nvidia Tesla V100 delivers a 3.8X reduction in AI model training time versus tested x86 systems

Large Model Support (LMS) uses system memory and GPU memory to support more complex and higher resolution data. Maximize research productivity running training for medical/satellite images with Caffe with LMS on POWER9 with Nvidia V100 GPUs.

For the systems and workload compared:

  • 3.8X reduction versus tested x86 systems in the runtime of 1000 iterations to train on 2240 x 2240 images
  • Critical machine learning (ML) capabilities such as regression, nearest neighbor, recommendation systems, clustering, and so on operate on more than just the GPU memory
    • NVLink 2.0 enables enhanced host-to-GPU communication
    • LMS for deep learning from IBM enables seamless use of host and GPU memory for improved performance

System configuration

IBM Power System AC922
  • POWER9 with NVLink 2.0
  • 40 cores (two 20-core chips)
  • 2.25 GHz, 1024 GB memory
  • Four Tesla V100 GPUs
  • RHEL 7.4 for Power LE (POWER9)
  • CUDA 9.1 / cuDNN 7

2x Intel Xeon E5-2640 v4
  • Xeon E5-2640 v4
  • 20 cores (two 10-core chips) / 40 threads
  • 2.4 GHz, 1024 GB memory
  • Four Tesla V100 GPUs
  • Ubuntu 16.04
  • CUDA 9.0 / cuDNN 7

Notes:

  • Results are based on IBM internal measurements running 1000 iterations of the Enlarged GoogleNet model (mini-batch size=5) on the Enlarged ImageNet data set (2240 x 2240).
  • Software: IBM Caffe with LMS. Source code: https://github.com/ibmsoe/caffe/tree/master-lms (an illustrative invocation follows these notes)
  • Date of testing: November 26, 2017
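
IBM's Caffe fork enables LMS from the caffe command line. The invocation below is illustrative only: the -lms option (a size threshold, in KB, above which allocations are LMS-managed) is described in the PowerAI documentation, but the threshold value and solver file here are placeholders, so check the linked repository for the exact switches this build accepts.

    # Illustrative (unverified for this exact build) IBM Caffe LMS run;
    # solver.prototxt and the 8192 KB threshold are placeholders.
    caffe train -solver=solver.prototxt -gpu all -lms 8192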

IBM POWER9™ with Nvidia Tesla V100 delivers 35% more images/second on TensorFlow versus tested x86 systems

Maximize research productivity by training on more images in the same time with TensorFlow 1.4.0 running on IBM Power System AC922 servers with Nvidia Tesla V100 GPUs connected via NVLink 2.0.

For the systems and workload compared:

  • 35% more images processed per second vs tested x86 systems
  • ResNet50 testing on the ILSVRC 2012 data set (also known as ImageNet 2012)
    • Training on 1.2M images
    • Validation on 50K images

System configuration

IBM Power System AC922
  • POWER9 with NVLink 2.0
  • 40 cores (two 20-core chips)
  • 2.25 GHz, 1024 GB memory
  • Four Tesla V100 GPUs
  • RHEL 7.4 for Power LE (POWER9)
  • TensorFlow 1.4.0 framework and HPM ResNet50

2x Intel Xeon E5-2640 v4
  • Xeon E5-2640 v4
  • 20 cores (two 10-core chips) / 40 threads
  • 2.4 GHz, 1024 GB memory
  • Four Tesla V100 GPUs
  • Ubuntu 16.04
  • TensorFlow 1.4.0 framework and HPM ResNet50

Notes:

  • Results are based on IBM internal measurements running 1000 iterations of HPM ResNet50, training on 1.2M images and validating on 50K images, with the data set from ILSVRC 2012 (also known as ImageNet 2012).
  • Software: TensorFlow 1.4.0 framework and HPM ResNet50 (https://github.com/tensorflow/benchmarks.git, commit f5d85aef) with the following parameters: batch size: 64 per GPU; iterations: 1100; data: ImageNet; local-parameter-device: gpu; variable-update: replicated. (A representative invocation follows these notes.)
  • Date of testing: November 26, 2017
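
The parameters in the software note map directly onto flags of the tf_cnn_benchmarks script referenced above, so a representative invocation (ImageNet data paths omitted) would look like this:

    # Representative tf_cnn_benchmarks invocation matching the note above
    # (batch size is per GPU; --data_dir for ImageNet is omitted here).
    python tf_cnn_benchmarks.py --model=resnet50 --batch_size=64 \
        --num_batches=1100 --num_gpus=4 --data_name=imagenet \
        --local_parameter_device=gpu --variable_update=replicated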

Distributed Deep Learning: IBM POWER9™ with Nvidia Tesla V100 delivers 2.3X more data processed on TensorFlow versus tested x86 systems

Maximize research productivity by training on more images in the same time with TensorFlow 1.4.0 running on a cluster of IBM Power System AC922 servers with Nvidia Tesla V100 GPUs connected via NVLink 2.0.

For the systems and workload compared:

  • 2.3X more images processed per second vs tested x86 systems
  • The PowerAI Distributed Deep Learning (DDL) library provides innovative distribution methods that enable AI frameworks to scale to multiple servers, leveraging all attached GPUs.
  • ResNet50 testing on ILSVRC 2012 dataset (also known as Imagenet 2012)
    • Training on 1.2M images
    • Validation on 50K images

System configuration

4 nodes of IBM Power System AC922
  • POWER9 with NVLink 2.0
  • 40 cores (two 20-core chips) per node
  • 2.25 GHz, 1024 GB memory
  • Four Tesla V100 GPUs per node
  • RHEL 7.4 for Power LE (POWER9)
  • TensorFlow 1.4.0 framework and HPM ResNet50

4 nodes of 2x Intel Xeon E5-2640 v4
  • Xeon E5-2640 v4
  • 20 cores (two 10-core chips) / 40 threads per node
  • 2.4 GHz, 1024 GB memory
  • Four Tesla V100 GPUs per node
  • Ubuntu 16.04
  • TensorFlow 1.4.0 framework and HPM ResNet50

Notes:

  • Results are based on IBM internal measurements running 5000 iterations of HPM+DDL ResNet50 on Power and 500 iterations of HPM ResNet50 on x86, training on 1.2M images and validating on 50K images, with the data set from ILSVRC 2012 (also known as ImageNet 2012).
  • Software: TensorFlow 1.4.0 framework and HPM ResNet50 (https://github.com/tensorflow/benchmarks.git, commit f5d85aef) with the following parameters: batch size: 64 per GPU; data: ImageNet; variable-update: distributed_replicated. (A representative launch command follows these notes.)
  • Date of testing: December 2, 2017
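
PowerAI DDL launches the same benchmark across the cluster through an MPI-based launcher; in later WML CE releases this is the ddlrun command, and an invocation across four hosts would look roughly like the following (host names are placeholders, and the 2017-era PowerAI release may have used mpirun with DDL options instead):

    # Hedged sketch: a DDL launch across four nodes (host names are
    # placeholders; exact launcher syntax varies by PowerAI/WML CE release).
    ddlrun -H host1,host2,host3,host4 \
        python tf_cnn_benchmarks.py --model=resnet50 --batch_size=64 \
            --data_name=imagenet --variable_update=distributed_replicated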

© IBM Corporation 2020

IBM, the IBM logo, ibm.com, POWER and POWER8 are trademarks of the International Business Machines Corp., registered in many jurisdictions worldwide. Intel, Intel logo, Intel Inside, Intel Inside logo, Intel Centrino, Intel Centrino logo, Celeron, Intel Xeon, Intel SpeedStep, Itanium, and Pentium are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both. Java and all Java-based trademarks and logos are trademarks or registered trademarks of Oracle and/or its affiliates. Other product and service names may be the trademarks of IBM or other companies.

The content in this document (including any pricing references) is current as of July 22, 2015, and may be changed by IBM at any time. Not all offerings are available in every country in which IBM operates.

THE INFORMATION CONTAINED ON THIS WEBSITE IS PROVIDED ON AN "AS IS" BASIS WITHOUT ANY WARRANTY, EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, AND ANY WARRANTY OR CONDITION OF NON-INFRINGEMENT.

In no event will IBM be liable for damages arising directly or indirectly from any use of the information contained in this document.

All information contained on this website is subject to change without notice. The information contained in this website does not affect or change IBM product specifications or warranties. IBM’s products are warranted according to the terms and conditions of the agreements under which they are provided. Nothing in this website shall operate as an express or implied license or indemnity under the intellectual property rights of IBM or third parties.