Getting started with RAPIDS
WML CE supports RAPIDS cuDF, a dataframe manipulation library similar to pandas, and cuML, a collection of machine learning libraries that provide GPU versions of scikit-learn algorithms. RAPIDS packages are available only on Power® architecture in WML CE.
Overview
This release of WML CE has cudf 0.11.0 and cuml 0.11.0. The cudf and cuml conda packages are supported on Python 3.6 or 3.7. More information on RAPIDS can be found at https://rapids.ai/index.html.
Install powerai-rapids by using conda install powerai-rapids to get the following RAPIDS packages:
- cuDF: https://github.com/rapidsai/cudf
cuDF is a GPU DataFrame library that provides a pandas-like API for loading, joining, aggregating, filtering, and manipulating data.
- cuML: https://github.com/rapidsai/cuml
cuML provides scikit-learn-like APIs that run traditional tabular machine learning tasks on GPUs without going into the details of CUDA programming. It also features multi-GPU and multi-node-multi-GPU operation (using Dask) for a growing list of algorithms.
- CuPy: https://cupy.chainer.org/
CuPy is an open-source NumPy-compatible matrix library accelerated by CUDA.
- dask-cuda: https://github.com/rapidsai/dask-cuda
This library provides utilities to improve deployment and management of Dask workers on CUDA-enabled systems.
- dask-cudf: https://github.com/rapidsai/dask-cudf
This package brings Dask support for distributed GPU DataFrames (cuDF).
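After installation, it can be useful to confirm which of these packages are visible to the active Python environment. The following is a minimal sketch and not part of WML CE: the helper name rapids_package_status is hypothetical, and the import names (cudf, cuml, cupy, dask_cuda, dask_cudf) are assumed to match the packages listed above. It only checks whether each module can be located and does not import it, so it also runs on a machine without a GPU.

```python
import importlib.util

# Import names assumed for the packages listed above.
RAPIDS_MODULES = ("cudf", "cuml", "cupy", "dask_cuda", "dask_cudf")


def rapids_package_status(modules=RAPIDS_MODULES):
    """Return a dict mapping each module name to True if it can be located."""
    status = {}
    for name in modules:
        # find_spec returns None when a top-level module is not installed.
        status[name] = importlib.util.find_spec(name) is not None
    return status


if __name__ == "__main__":
    for name, found in rapids_package_status().items():
        print(f"{name}: {'installed' if found else 'not found'}")
```

A module reported as "not found" usually means that the conda environment in which powerai-rapids was installed is not the active one.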
cuDF
You can pass a cuDF dataframe into pai4sk APIs. Both data preparation and model training then run entirely on the GPU.
To use cuDF with pai4sk APIs, follow these steps:
import cudf
from cudf import DataFrame
# pdf_trainX and pdf_trainY are pandas DataFrames that hold the data used for training
df_trainX = DataFrame.from_pandas(pdf_trainX)
df_trainY = DataFrame.from_pandas(pdf_trainY)
# Create C-contiguous DeviceNDArrays from the cuDF dataframes
from pai4sk.sml_io import copy_as_gpu_cmatrix
X_train = copy_as_gpu_cmatrix(df_trainX)
y_train = copy_as_gpu_cmatrix(df_trainY)
# Train the model on the GPU
from pai4sk import LogisticRegression
lr = LogisticRegression(use_gpu=True)
lr.fit(X_train, y_train)
Currently, DeviceNDArray as input is supported in pai4sk for the following APIs:
Example programs for each of these APIs are provided as part of the conda package. To find out how to run the sample programs, refer to the README placed under $CONDA_PREFIX/pai4sk/local-examples/cudf-examples.
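As a convenience, the README files can be located from Python. This is a minimal sketch under stated assumptions: the helper name find_example_readmes is hypothetical, the file name is assumed to start with README, and the search covers the examples directory and its subdirectories. It returns an empty list when no conda environment is active or the directory does not exist.

```python
import os
from pathlib import Path


def find_example_readmes(suite="cudf-examples", conda_prefix=None):
    """List README files under $CONDA_PREFIX/pai4sk/local-examples/<suite>."""
    prefix = conda_prefix or os.environ.get("CONDA_PREFIX")
    if not prefix:
        return []  # no active conda environment
    root = Path(prefix) / "pai4sk" / "local-examples" / suite
    if not root.is_dir():
        return []  # examples are not installed under this prefix
    # Assumption: the file name starts with "README"; it may sit in the
    # directory itself or in one of its subdirectories.
    return sorted(p for p in root.rglob("README*") if p.is_file())


if __name__ == "__main__":
    for readme in find_example_readmes("cudf-examples"):
        print(readme)
```

Passing a different suite name, such as "cuml-examples", searches the corresponding directory in the same way.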
cuML
The cuml APIs can be used directly in a Python program. Some of the pai4sk APIs are modified to use cuml APIs if the cuML conda package is installed. This module automatically falls back to the original scikit-learn behavior when cuML does not provide the necessary support. The following APIs are affected:
- Clustering KMeans
- Clustering DBSCAN
- Decomposition PCA
- Decomposition TruncatedSVD
- K-Nearest Neighbors
Example programs for each of these APIs are provided as part of the conda package. To find out how to run the sample programs, refer to the README placed under the subdirectories of $CONDA_PREFIX/pai4sk/local-examples/cuml-examples.
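The "use the GPU implementation if installed, otherwise fall back" idea can be illustrated with a small import-time sketch. This is not how pai4sk implements its fallback, which is not described here; the helper resolve_class and the KMEANS_CANDIDATES table are hypothetical names, and the sketch only covers the case where a package is missing, not the case where cuML lacks support for a particular option.

```python
import importlib

# (backend label, module path) pairs tried in order: GPU first, CPU second.
KMEANS_CANDIDATES = (
    ("cuml", "cuml"),
    ("scikit-learn", "sklearn.cluster"),
)


def resolve_class(class_name, candidates):
    """Return (backend_label, class) for the first candidate providing class_name.

    Returns (None, None) when no candidate can be imported.
    """
    for label, module_path in candidates:
        try:
            module = importlib.import_module(module_path)
        except Exception:
            # ImportError when the package is absent; a GPU library may also
            # raise other errors at import time on a machine without CUDA.
            continue
        cls = getattr(module, class_name, None)
        if cls is not None:
            return label, cls
    return None, None


if __name__ == "__main__":
    backend, KMeans = resolve_class("KMeans", KMEANS_CANDIDATES)
    print("KMeans backend:", backend)
```

Keeping the candidates in an ordered table makes the preference explicit and lets the same helper serve other estimators, such as DBSCAN or PCA, by supplying the appropriate module paths.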
- Dask support for GPU-backed dataframes (dask-cuda and dask-cudf) and multi-GPU machine learning algorithms is in technology preview.
- In WML CE, RAPIDS packages are available only on IBM Power architecture.
- For some of the packages, like cuml, you need to use powerai-release=1.7.0 explicitly in the conda install command in order to pick up the latest versions of packages. For example: conda install cuml powerai-release=1.7.0.