For the systems and workloads compared, Snap ML running on IBM® Power® System AC922 servers² (based on IBM POWER9™ processor technology) with NVIDIA Tesla V100 GPUs (NVLink 2.0) delivers up to an 80x speedup (for example, Lasso on a large model or data set) when training machine learning algorithms to accuracy, compared with the tested scikit-learn on x86 combination.¹
Snap ML works efficiently even when the model’s memory footprint or data set exceeds the GPU memory.
scikit-learn is used as the x86 baseline because stand-alone cuML on x86 failed during testing when the data set or runtime artifacts exceeded the GPU memory.
The Price Prediction data set from Kaggle, which is large, feature-rich, and sparse, is used.³
System configuration
IBM Power System AC922 | 2x Intel Xeon Gold 6150
40 cores (two 20c chips), POWER9 with NVLink 2.0 | 36 cores (two 18c chips)
3.8 GHz, 1 TB memory | 2.70 GHz, 512 GB memory
Single Tesla V100 GPU, 16 GB GPU | Single Tesla V100 GPU, 16 GB GPU
Red Hat Enterprise Linux 7.7 for Power Little Endian (POWER9) with CUDA 10.1.243 | Ubuntu 18.04.3 LTS (4.15.0-54-generic) with CUDA 10.1.243
nvidia-driver-418.67 | nvidia-driver-418.67
Software: WML CE 1.6.2; pai4sk-1.5.0, NumPy 1.16.5 | Software: scikit-learn 0.21.3, NumPy 1.17.3
Notes:
The results were obtained by running each experiment (each script) and by taking the training times of the best 5 among the 10 runs.
Refer to the benchmarking and preprocessing scripts at: https://github.com/IBM/powerai/tree/master/benchmarks/SnapML/linear_models
The Price Prediction data set (nrows=1185328, cols=17052) was passed in compressed sparse row (CSR) format to both Snap ML and scikit-learn. Source: https://www.kaggle.com/c/mercari-price-suggestion-challenge/data
The results are for runs on a single GPU.
Results can vary with different data sets with different levels of sparseness and model parameters.
Test date: 1st Dec 2019
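The "best 5 among the 10 runs" selection described in the notes can be sketched as follows. This is a minimal illustration, not the benchmark script: the run times are hypothetical, and averaging the five retained times is an assumption, because the notes do not say how the five values were combined.

```python
def best_runs(times_s, keep=5):
    """Return the `keep` shortest training times, fastest first."""
    if keep > len(times_s):
        raise ValueError("keep cannot exceed the number of runs")
    return sorted(times_s)[:keep]


# Hypothetical wall-clock training times (seconds) for 10 runs of one script.
run_times = [12.4, 11.9, 13.1, 12.0, 15.6, 11.8, 12.2, 14.3, 12.1, 11.7]

best_five = best_runs(run_times)

# Assumption: the retained times are summarized by their mean.
mean_best = sum(best_five) / len(best_five)
print(best_five, round(mean_best, 2))
```

Keeping only the fastest runs reduces the influence of outliers such as the 15.6 s run in this made-up sample.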
For the systems and workloads compared, the Snap ML library combined with IBM® Power® System AC922 servers (based on IBM POWER9™ processor technology) with an NVIDIA Tesla V100 GPU (NVLink 2.0) provides a speedup when training machine learning algorithms (such as Ridge, Lasso, and Logistic regression) to accuracy, compared with the tested cuML on x86 systems.¹
Snap ML with Power AC922 outperforms tested cuML with x86 combinations on:
Feature-rich data sets such as Price Prediction and Epsilon (using more features enables faster time to accuracy).
Sparse data sets such as Taxi and Price Prediction (can be handled natively by the Snap ML generalized linear models).
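For readers unfamiliar with the compressed sparse row (CSR) layout that the notes mention for sparse data sets, the sketch below builds the three CSR arrays by hand in plain Python. It is illustrative only: the function names are hypothetical, the tiny matrix is made up, and real workloads would use a library implementation rather than this code.

```python
def dense_to_csr(rows):
    """Convert a list of dense rows to CSR arrays (data, indices, indptr)."""
    data, indices, indptr = [], [], [0]
    for row in rows:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)    # non-zero values, row by row
                indices.append(col)   # column index of each stored value
        indptr.append(len(data))      # where the next row starts in `data`
    return data, indices, indptr


def csr_row(data, indices, indptr, i):
    """Return the non-zero (column, value) pairs of row i."""
    start, end = indptr[i], indptr[i + 1]
    return list(zip(indices[start:end], data[start:end]))


# Made-up 3 x 4 matrix with only three non-zero entries.
dense = [
    [0.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [1.5, 0.0, 0.0, 3.0],
]
data, indices, indptr = dense_to_csr(dense)
print(data, indices, indptr)
```

Only the non-zero entries are stored, which is why a sparse data set can take far less memory in CSR form than as a dense array.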
System configuration
IBM Power System AC922 | 2x Intel Xeon Gold 6150
40 cores (two 20c chips), POWER9 with NVLink 2.0 | 36 cores (two 18c chips)
3.8 GHz, 1 TB memory | 2.70 GHz, 512 GB memory
Single Tesla V100 GPU, 16 GB GPU | Single Tesla V100 GPU, 16 GB GPU
Red Hat Enterprise Linux 7.7 for Power Little Endian (POWER9) with CUDA 10.1.243 | Ubuntu 18.04.3 LTS (4.15.0-54-generic) with CUDA 10.1.243
nvidia-driver-418.67 | nvidia-driver-418.67
Software: WML CE 1.6.2; pai4sk 1.5.0, NumPy 1.16.5 | Software: cuML 0.10.0, NumPy 1.17.3
Notes:
The results were obtained by running each experiment (each script) and by taking the training times of the best 5 among the 10 runs.
Refer to the benchmarking and preprocessing scripts at: https://github.com/IBM/powerai/tree/master/benchmarks/SnapML/linear_models
The location of the data sets is specified as part of the above URL. The cropped Price Prediction data set was obtained by setting the max_features parameter to 500.
Epsilon data set: Snap ML/Power AC922 speedup over cuML/x86 is 0.92X with Ridge regression, 1.85X with Lasso regression, and 3.15X with Logistic regression.
Price Prediction data set: Snap ML/Power AC922 speedup over cuML/x86 is 164X with Ridge regression, 278X with Lasso regression, and 68X with Logistic regression.
Taxi data set: Snap ML/Power AC922 speedup over cuML/x86 is 1.7X with Ridge regression, 3.9X with Lasso regression, and 9.8X with Logistic regression.
Higgs data set: Snap ML/Power AC922 speedup over cuML/x86 is 0.13X with Ridge regression, 1.16X with Lasso regression, and 0.79X with Logistic regression.
The results are for runs on a single GPU.
For cuML, the experiment used the ndarray format for all data sets. For Snap ML, it used the ndarray format for dense data sets and the compressed sparse row (CSR) format for sparse data sets.
Results can vary with different data sets with different levels of sparseness and model parameters.
Test date: 1st Dec 2019
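The per-data-set speedup figures quoted in the notes can be put into a small lookup table to make them easier to read. The sketch below assumes the usual definition of speedup (baseline training time divided by Snap ML training time), under which a value below 1.0 means the cuML/x86 combination was faster for that case. The function name is hypothetical.

```python
# Speedup of Snap ML/Power AC922 over cuML/x86, as quoted in the notes.
speedups = {
    "Epsilon": {"Ridge": 0.92, "Lasso": 1.85, "Logistic": 3.15},
    "Price Prediction": {"Ridge": 164, "Lasso": 278, "Logistic": 68},
    "Taxi": {"Ridge": 1.7, "Lasso": 3.9, "Logistic": 9.8},
    "Higgs": {"Ridge": 0.13, "Lasso": 1.16, "Logistic": 0.79},
}


def cases_below_parity(table):
    """List (data set, model) pairs whose quoted speedup is below 1.0."""
    return sorted(
        (data_set, model)
        for data_set, models in table.items()
        for model, value in models.items()
        if value < 1.0
    )


below = cases_below_parity(speedups)
largest = max(
    ((value, data_set, model)
     for data_set, models in speedups.items()
     for model, value in models.items())
)
print(below)
print(largest)
```

Under that assumed definition, the largest gains are on the sparse, feature-rich Price Prediction data set, while three dense cases fall below parity.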
For the systems and workload compared, IBM® Power® System AC922 servers (based on the IBM POWER9 processor technology) with NVIDIA Tesla V100 GPUs connected through NVLink 2.0 along with WML-CE TensorFlow Large Model Support (LMS) can provide:
3.13x increase in throughput compared to tested x86 systems with four GPUs
2.77x increase in throughput compared to tested x86 systems with eight GPUs
Workload: training a DeepLabv3+ based model with distributed deep learning (DDL) and TensorFlow LMS on Power AC922, using the PASCAL Visual Object Classes (VOC) 2012 data set with an image resolution of 2100 x 2100 and a batch size of 1.
Critical machine learning (ML) capabilities such as regression, nearest neighbor, recommendation systems, and clustering can use system memory across NVLink 2.0.
NVLink 2.0 enables enhanced host-to-GPU communication
IBM's LMS for deep learning enables seamless use of host and GPU memory for improved performance
System configuration
IBM Power System AC922 | 2x Intel Xeon E5-2698
40 cores (two 20c chips), POWER9 with NVLink 2.0 | 40 cores (two 20c chips)
3.8 GHz, 1 TB memory | 2.40 GHz, 768 GB memory
Four Tesla V100 GPUs, 16 GB GPU | Eight Tesla V100 GPUs, 16 GB GPU
Red Hat Enterprise Linux (RHEL) 7.6 for Power Little Endian (POWER9) with CUDA 10.1.168/CUDNN 7.5.1 | Ubuntu 16.04.5 with CUDA 10.1.168/CUDNN 7.5.1
nvidia-driver 418.67 | nvidia-driver 418.39
Software: IBM TFLMS (POWER9), TFLMSv2: WML-CE 1.6.1 tensorflow-large-model-support 2.0.1 | Software: TFLMSv2: WML-CE 1.6.1 tensorflow-large-model-support 2.0.1
Notes:
Results are based on IBM internal measurements running 1000 iterations of DeepLabv3+ model (batch size=1, resolution=2100^2) on PASCAL VOC 2012 data set (http://host.robots.ox.ac.uk/pascal/VOC/voc2012/).
DeepLabv3+: https://github.com/naveenmiriyalu/powerai/tree/wmlce-1.6.1/examples/performance_models
TFLMSv2 parameters (swapout_threshold=1, swapin_ahead=1, swapin_groupby=0, sync_mode=0).
Results can vary with different data sets and model parameters.
Test date: July 19 and July 20, 2019
For the systems and workload compared, IBM® Power® System AC922 servers (based on the IBM POWER9™ processor technology) with NVIDIA Tesla V100 GPUs connected through NVLink 2.0 along with WML-CE PyTorch Large Model Support (LMS) can provide:
2.9x increase in throughput compared to tested x86 systems with four GPUs
2.4x increase in throughput compared to tested x86 systems with eight GPUs
Workload: training a DeepLabv3+ based model with distributed deep learning (DDL) and PyTorch LMS on Power AC922, using the PASCAL Visual Object Classes (VOC) 2012 data set with an image resolution of 2200 x 2200 and a batch size of 2.
Critical machine learning (ML) capabilities such as regression, nearest neighbor, recommendation systems, and clustering can use system memory across NVLink 2.0.
NVLink 2.0 enables enhanced host-to-GPU communication
IBM's LMS for deep learning enables seamless use of host and GPU memory for improved performance
System configuration
IBM Power System AC922 | 2x Intel Xeon E5-2698
40 cores (two 20c chips), POWER9 with NVLink 2.0 | 40 cores (two 20c chips)
3.8 GHz, 1 TB memory | 2.40 GHz, 768 GB memory
Four Tesla V100 GPUs, 16 GB GPU | Eight Tesla V100 GPUs, 16 GB GPU
Red Hat Enterprise Linux (RHEL) 7.6 for Power Little Endian (POWER9) with CUDA 10.1.168/CUDNN 7.5.1 | Ubuntu 18.04.2 with CUDA 10.1.168/CUDNN 7.5.1
nvidia-driver 418.67 | nvidia-driver 418.67
Software: IBM PyTorch (POWER9), WML-CE 1.6.1 PyTorch 1.1.0 | Software: WML-CE 1.6.1 PyTorch 1.1.0
Notes:
Results are based on IBM internal measurements running 1000 iterations of DeepLabv3+ model (batch size=2, resolution=2200^2) on PASCAL VOC 2012 data set (http://host.robots.ox.ac.uk/pascal/VOC/voc2012/).
DeepLabv3+ (https://github.com/naveenmiriyalu/powerai/tree/wmlce-1.6.1/examples/performance_models)
PyTorch parameters (limit_lms=0, size_lms=1MB).
Results can vary with different data sets and model parameters.
Test date: July 27 and July 28, 2019
Accelerate Data Scientist productivity and drive faster insights with IBM DSX Local on IBM Power System AC922
For the systems and workload compared:
Power AC922 (based on the IBM POWER9™ processor technology with NVIDIA GPUs) completes running GPU-accelerated K-means clustering with 15 GB data in half the time of tested x86 systems (Skylake 6150 with NVIDIA GPUs).
Power AC922 delivers 2x faster insights for GPU-accelerated K-means clustering workload than Intel® Xeon® SP Gold 6150-based servers.
IBM Power Systems™ cluster with Power LC922 (CPU optimized) and Power AC922 (GPU accelerated) provides an optimized infrastructure for DSX Local.
System configuration
Power AC922 | Two-socket Intel Xeon Gold 6150
IBM POWER9, 2x 20 cores/3.78 GHz, and 4x NVIDIA Tesla V100 GPUs with NVLink | Gold 6150, 2x 18 cores/2.7 GHz, and 4x NVIDIA Tesla V100 GPUs
1 TB memory, each user assigned 180 GB in DSX Local | 768 GB memory, each user assigned 180 GB in DSX Local
2x 960 GB SSD | 2x 960 GB SSD
10 GbE two-port | 10 GbE two-port
RHEL 7.5 for POWER9 | RHEL 7.5
Data Science Experience Local 1.2 fp3 | Data Science Experience Local 1.2 fp3
Notes:
The results are based on IBM internal testing of the core computational step to form five clusters using a 5270410 x 301 float64 data set (15 GB) running the K-means algorithm using Apache Python and TensorFlow. Results are valid as of 6/13/2018 and the test was conducted under laboratory conditions with speculative execution controls to mitigate user-to-kernel and user-to-user side-channel attacks on both systems. Individual results can vary based on workload size, use of storage subsystems, and other conditions.
Apache, Apache Python, and associated logos are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks.
Download the workload and try it yourself: https://github.com/theresax/DSX_perf_eval/tree/master/clustering. Note: You will need to use your own data with similar dimensions as described in the README.md file.
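As a reminder of what the "core computational step" of K-means clustering involves, here is a minimal pure-Python sketch of Lloyd's algorithm. It is a CPU stand-in for illustration only, not the GPU-accelerated TensorFlow workload that was benchmarked. The function name, the deterministic initialization (the first k points), and the toy data are assumptions.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; initial centroids are the first k points."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, labels


# Toy 2-D data with two well-separated groups (the benchmark formed five
# clusters on a far larger data set).
points = [(0.0, 0.0), (10.0, 10.0), (0.2, 0.1),
          (9.8, 10.1), (0.1, 0.3), (10.2, 9.9)]
centroids, labels = kmeans(points, k=2)
print(labels)
print(centroids)
```

The assignment step dominates the cost because every point is compared with every centroid on each iteration, which is the part that benefits most from GPU acceleration.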
Accelerate Data Scientist productivity and drive faster insights with DSX Local on IBM Power System LC922
For the systems and workload compared:
Power LC922 running K-means clustering with 1 GB data scales to 2X more users than tested x86 systems
Power LC922 supports 2x more users at a faster response time than Intel® Xeon® SP Gold 6140-based servers.
Power LC922 delivers over 41% faster insights for the same number of users (four to eight).
System configuration
Power LC922 | Two-socket Intel Xeon SP Gold 6140
IBM POWER9™, 2x 20 cores/2.6 GHz/512 GB memory | Gold 6140, 2x 18 cores/2.4 GHz/512 GB memory
10x 4 TB HDD | 10x 4 TB HDD
10 GbE two-port | 10 GbE two-port
RHEL 7.5 for POWER9 | RHEL 7.5
Data Science Experience Local 1.1.2 | Data Science Experience Local 1.1.2
Notes:
The test results are based on IBM internal testing of the core computational step to form five clusters using a 350694 x 301 float64 data set (1 GB) running the K-means algorithm using Apache Python and TensorFlow. Results are valid as of 4/21/18 and the test was conducted under laboratory conditions with speculative execution controls to mitigate user-to-kernel and user-to-user side-channel attacks on both systems. Individual results can vary based on workload size, use of storage subsystems, and other conditions.
Apache, Apache Python, and associated logos are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks.
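The two K-means data sets described in the notes (5270410 x 301 and 350694 x 301, both float64) can be compared with simple arithmetic. The sketch below computes only the raw array payload at 8 bytes per float64 value. The notes do not say how the quoted 15 GB and 1 GB sizes were measured, so the raw payload is not expected to match them exactly; an on-disk format with extra overhead would be one possible explanation, but that is an assumption.

```python
BYTES_PER_FLOAT64 = 8


def dense_float64_bytes(rows, cols):
    """Raw payload of a dense rows x cols float64 array, in bytes."""
    return rows * cols * BYTES_PER_FLOAT64


large = dense_float64_bytes(5270410, 301)   # data set quoted as 15 GB
small = dense_float64_bytes(350694, 301)    # data set quoted as 1 GB

print(large, round(large / 1e9, 2))   # payload in bytes and decimal GB
print(small, round(small / 1e9, 2))
print(round(large / small, 2))        # size ratio between the two data sets
```

The ratio between the two payloads is about 15, which is consistent with the 15 GB and 1 GB labels used for the two tests.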
Large Model Support (LMS) uses system memory and GPU memory to support more complex and higher resolution data. Maximize research productivity running training for medical/satellite images with Chainer with LMS on POWER9 with NVIDIA V100 GPUs.
For the systems and workload compared:
3.7X reduction in the runtime of 1000 iterations, versus tested x86 systems, when training on medical/satellite images
Critical machine learning (ML) capabilities such as regression, nearest neighbor, recommendation systems, and clustering operate on more than just the GPU memory
NVLink 2.0 enables enhanced host-to-GPU communication
LMS for deep learning from IBM enables seamless use of host and GPU memory for improved performance
System configuration
IBM Power System AC922 | 2x Intel Xeon E5-2640 v4
POWER9 with NVLink 2.0 | Xeon E5-2640 v4
40 cores (2 x 20c chips) | 20 cores (2 x 10c chips) / 40 threads
2.25 GHz, 1024 GB memory | 2.4 GHz, 1024 GB memory
(4) Tesla V100 GPUs | (4) Tesla V100 GPUs
RHEL 7.4 Power LE (POWER9) | Ubuntu 16.04
CUDA 9.1/CUDNN 7 | CUDA 9.0/CUDNN 7
Notes:
Results are based on IBM internal measurements running 1000 iterations of the Enlarged GoogleNet model (mini-batch size=5) on the Enlarged Imagenet data set (2560x2560).
Software: Chainer v3/LMS/Out of Core with CUDA 9/CuDNN 7, with patches found at https://github.com/cupy/cupy/pull/694 and https://github.com/chainer/chainer
Test date: November 26, 2017
Large Model Support (LMS) uses system memory and GPU memory to support more complex and higher resolution data. Maximize research productivity running training for medical/satellite images with Caffe with LMS on POWER9 with NVIDIA V100 GPUs.
For the systems and workload compared:
3.8X reduction in the runtime of 1000 iterations, versus tested x86 systems, when training on 2240 x 2240 images
Critical machine learning (ML) capabilities such as regression, nearest neighbor, recommendation systems, and clustering operate on more than just the GPU memory
NVLink 2.0 enables enhanced host-to-GPU communication
LMS for deep learning from IBM enables seamless use of host and GPU memory for improved performance
System configuration
IBM Power System AC922 | 2x Intel Xeon E5-2640 v4
POWER9 with NVLink 2.0 | Intel Xeon E5-2640 v4
40 cores (2 x 20c chips) | 20 cores (2 x 10c chips) / 40 threads
2.25 GHz, 1024 GB memory | 2.4 GHz, 1024 GB memory
(4) Tesla V100 GPUs | (4) Tesla V100 GPUs
RHEL 7.4 Power LE (POWER9) | Ubuntu 16.04
CUDA 9.1/CUDNN 7 | CUDA 9.0/CUDNN 7
Notes:
Results are based on IBM internal measurements running 1000 iterations of the Enlarged GoogleNet model (mini-batch size=5) on the Enlarged Imagenet data set (2240x2240).
Software: IBM Caffe with LMS. Source code: https://github.com/ibmsoe/caffe/tree/master-lms
Date of testing: November 26, 2017
Maximize research productivity by training on more images in the same time with TensorFlow 1.4.0 running on IBM Power System AC922 servers with NVIDIA Tesla V100 GPUs connected through NVLink 2.0
For the systems and workload compared:
35% more images processed per second vs tested x86 systems
ResNet50 testing on ILSVRC 2012 data set (aka Imagenet 2012)
Training on 1.2M images
Validation on 50K images
System configuration
IBM Power System AC922 | 2x Intel Xeon E5-2640 v4
POWER9 with NVLink 2.0 | Intel Xeon E5-2640 v4
40 cores (2 x 20c chips) | 20 cores (2 x 10c chips) / 40 threads
2.25 GHz, 1024 GB memory | 2.4 GHz, 1024 GB memory
(4) Tesla V100 GPUs | (4) Tesla V100 GPUs
RHEL 7.4 Power LE (POWER9) | Ubuntu 16.04
TensorFlow 1.4.0 framework and HPM ResNet50 | TensorFlow 1.4.0 framework and HPM ResNet50
Notes:
Results are based on IBM internal measurements running 1000 iterations of HPM ResNet50 on 1.2M images and validation on 50K images with the data set from ILSVRC 2012 (also known as Imagenet 2012).
Software: TensorFlow 1.4.0 framework and HPM ResNet50 https://github.com/tensorflow/benchmarks.git (commit: f5d85aef) with the following parameters: Batch-Size: 64 per GPU; Iterations: 1100; Data: Imagenet; local-parameter-device: gpu; variable-update: replicated
Date of testing: November 26, 2017
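The "images processed per second" metric follows directly from the benchmark parameters listed in the notes (batch size 64 per GPU, four GPUs, and the Iterations: 1100 parameter). The sketch below shows the arithmetic. The elapsed time is a hypothetical placeholder, since the source reports only the relative 35% figure and no absolute timings, and the function names are illustrative.

```python
def images_processed(batch_per_gpu, gpus, iterations):
    """Total images consumed when every GPU processes one batch per iteration."""
    return batch_per_gpu * gpus * iterations


def images_per_second(total_images, elapsed_s):
    """Throughput in images per second."""
    return total_images / elapsed_s


# Parameters taken from the notes: 64 images per GPU, 4 GPUs, 1100 iterations.
total = images_processed(batch_per_gpu=64, gpus=4, iterations=1100)

# Hypothetical elapsed time for the baseline system (not from the source).
baseline_throughput = images_per_second(total, elapsed_s=400.0)

# "35% more images processed per second" means 1.35 times the baseline rate.
improved_throughput = baseline_throughput * 1.35

print(total, baseline_throughput, round(improved_throughput, 1))
```

With these parameters each iteration consumes 256 images, so a 35% higher rate means roughly 35% more images in the same wall-clock time.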
Maximize research productivity by training on more images in the same time with TensorFlow 1.4.0 running on a cluster of IBM Power System AC922 servers with NVIDIA Tesla V100 GPUs connected through NVLink 2.0
For the systems and workload compared:
2.3X more images processed per second vs tested x86 systems
PowerAI Distributed Deep Learning (DDL) library provides innovative distribution methods enabling AI frameworks to scale to multiple servers leveraging all attached GPUs
ResNet50 testing on ILSVRC 2012 data set (also known as Imagenet 2012)
Training on 1.2M images
Validation on 50K images
System configuration
4 nodes of IBM Power System AC922 | 4 nodes of 2x Intel Xeon E5-2640 v4
POWER9 with NVLink 2.0 | Intel Xeon E5-2640 v4
40 cores (2 x 20c chips) | 20 cores (2 x 10c chips) / 40 threads
2.25 GHz, 1024 GB memory | 2.4 GHz, 1024 GB memory
(4) Tesla V100 GPUs | (4) Tesla V100 GPUs
RHEL 7.4 Power LE (POWER9) | Ubuntu 16.04
TensorFlow 1.4.0 framework and HPM ResNet50 | TensorFlow 1.4.0 framework and HPM ResNet50
Notes:
Results are based on IBM internal measurements running 5000 iterations of HPM+DDL ResNet50 on Power and 500 iterations of HPM ResNet50 on x86, on 1.2M images and validation on 50K images with the data set from ILSVRC 2012 (also known as Imagenet 2012).
Software: TensorFlow 1.4.0 framework and HPM ResNet50 https://github.com/tensorflow/benchmarks.git (commit: f5d85aef) with the following parameters: Batch-Size: 64 per GPU; Data: Imagenet; variable-update: distributed_replicated
Date of testing: December 2, 2017