Getting started with RAPIDS

WML CE supports RAPIDS cuDF, a dataframe manipulation library similar to pandas, and cuML, a collection of machine learning libraries that provide GPU versions of scikit-learn algorithms. RAPIDS packages are available only on Power® architecture in WML CE.

Overview

This release of WML CE includes cudf 0.11.0 and cuml 0.11.0. The cudf and cuml conda packages are supported on Python 3.6 and 3.7. For more information about RAPIDS, see https://rapids.ai/index.html.
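
To confirm that an environment matches the versions listed above, a small check like the following can help. This is only a sketch: the helper names python_is_supported and describe_package are invented for illustration, and the script assumes that cudf and cuml expose a __version__ attribute, as most Python packages do.

```python
import importlib
import importlib.util
import sys

# Python minor versions that this release documents for cudf and cuml.
SUPPORTED_PYTHONS = {(3, 6), (3, 7)}


def python_is_supported(version_info=sys.version_info):
    """Return True if the given interpreter version is Python 3.6 or 3.7."""
    return (version_info[0], version_info[1]) in SUPPORTED_PYTHONS


def describe_package(module_name):
    """Return a short status string for a top-level package name."""
    if importlib.util.find_spec(module_name) is None:
        return "not installed"
    try:
        module = importlib.import_module(module_name)
    except Exception as exc:  # for example, no usable GPU or driver
        return "installed, but import failed: {}".format(exc)
    # Assumption: the package exposes __version__.
    return getattr(module, "__version__", "unknown version")


if __name__ == "__main__":
    print("Python version supported:", python_is_supported())
    for name in ("cudf", "cuml"):
        print(name, "->", describe_package(name))
```

On a correctly installed system, the cudf and cuml lines would be expected to report 0.11.0.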

To get the following RAPIDS packages, install the powerai-rapids meta package by running conda install powerai-rapids:
  • cuDF: https://github.com/rapidsai/cudf

    cuDF is a GPU DataFrame library that provides a pandas-like API for loading, joining, aggregating, filtering, and manipulating data.

  • cuML: https://github.com/rapidsai/cuml

    cuML provides scikit-learn-like APIs that run traditional tabular machine learning tasks on GPUs without requiring knowledge of CUDA programming. It also supports multi-GPU and multi-node multi-GPU operation (using Dask) for a growing list of algorithms.

  • CuPy: https://cupy.chainer.org/

    CuPy is an open-source NumPy-compatible matrix library accelerated by CUDA.

  • dask-cuda: https://github.com/rapidsai/dask-cuda

    This library provides utilities to improve deployment and management of Dask workers on CUDA-enabled systems.

  • dask-cudf: https://github.com/rapidsai/dask-cudf

    This package brings Dask support for distributed GPU DataFrames (cuDF).
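
As a rough illustration of the pandas-like API that cuDF provides, the following sketch filters, joins, and aggregates two small tables. The data, the column names, and the xdf alias are invented for illustration. The pandas fallback is an assumption added so that the sketch also runs on a machine without cuDF. It is not part of WML CE.

```python
try:
    import cudf as xdf  # GPU DataFrame library (requires a CUDA-capable GPU)
except ImportError:
    import pandas as xdf  # CPU fallback so that the sketch still runs

# Two small illustrative tables.
orders = xdf.DataFrame({"store_id": [1, 2, 1, 3], "units": [3, 5, 2, 7]})
stores = xdf.DataFrame({"store_id": [1, 2, 3], "region": [10, 10, 20]})

# Filter: keep orders with more than 2 units.
large_orders = orders[orders["units"] > 2]

# Join: attach the region of each store.
joined = large_orders.merge(stores, on="store_id")

# Aggregate: total units per region.
totals = joined.groupby("region").sum()

# cuDF results live on the GPU; copy them to the host before using them as
# plain Python objects. pandas objects have no to_pandas method.
if hasattr(totals, "to_pandas"):
    totals = totals.to_pandas()

units_by_region = totals["units"].to_dict()
print(units_by_region)
```

The same three calls (boolean-mask indexing, merge, and groupby) are written identically for both libraries, which is the main point of the pandas-like design.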

cuDF

You can pass a cuDF dataframe into pai4sk APIs. When you do, both data preparation and model training run entirely on the GPU.

To use cuDF with pai4sk APIs, follow these steps:

from cudf import DataFrame
from pai4sk import LogisticRegression
from pai4sk.sml_io import copy_as_gpu_cmatrix

# pdf_trainX and pdf_trainY are existing pandas DataFrames that hold the
# training features and labels. Convert them to cuDF dataframes.
df_trainX = DataFrame.from_pandas(pdf_trainX)
df_trainY = DataFrame.from_pandas(pdf_trainY)

# Create C-contiguous DeviceNDArrays from the cuDF dataframes for training.
X_train = copy_as_gpu_cmatrix(df_trainX)
y_train = copy_as_gpu_cmatrix(df_trainY)

# Train the model on the GPU.
lr = LogisticRegression(use_gpu=True)
lr.fit(X_train, y_train)
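
The steps above assume that two pandas DataFrames, pdf_trainX and pdf_trainY, already exist. For experimentation, inputs of that shape could be built from synthetic data as in the following sketch. The sample count, the feature names, the labeling rule, and the float32 dtype are all assumptions made for illustration. This document does not state which dtypes pai4sk expects.

```python
import numpy as np
import pandas as pd

# Fixed seed so that the synthetic data is reproducible.
rng = np.random.RandomState(0)
n_samples, n_features = 1000, 4

# Random features in [0, 1).
features = rng.rand(n_samples, n_features).astype(np.float32)

# Illustrative binary label: 1.0 when the feature sum exceeds its midpoint.
labels = (features.sum(axis=1) > n_features / 2).astype(np.float32)

# Feature table with one column per feature, and a single-column label table.
feature_names = ["f{}".format(i) for i in range(n_features)]
pdf_trainX = pd.DataFrame(features, columns=feature_names)
pdf_trainY = pd.DataFrame({"label": labels})

print(pdf_trainX.shape, pdf_trainY.shape)
```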

Currently, DeviceNDArray as input is supported in pai4sk for the following APIs:

Example programs for each of these APIs are provided as part of the conda package. To find out how to run the sample programs, see the README file in $CONDA_PREFIX/pai4sk/local-examples/cudf-examples.

cuML

The cuML APIs can be used directly in a Python program.

Some pai4sk APIs are modified to use cuML APIs if the cuML conda package is installed. These APIs automatically fall back to the original scikit-learn behavior when cuML does not provide the necessary support. The following links list these APIs:

Example programs for each of these APIs are provided as part of the conda package. To find out how to run the sample programs, see the README files in the subdirectories of $CONDA_PREFIX/pai4sk/local-examples/cuml-examples.
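
The general idea of preferring a GPU implementation and falling back to a CPU one can be sketched in user code as follows. This is not how pai4sk implements its fallback, which this document does not describe. The sketch covers only the simplest case, where the GPU library is missing at import time. The helper name load_first_available is hypothetical.

```python
import importlib


def load_first_available(candidates):
    """Return (module_path, attribute) for the first candidate that imports.

    candidates is an ordered list of (module_path, attribute_name) pairs,
    with the preferred implementation listed first.
    """
    for module_path, attr_name in candidates:
        try:
            module = importlib.import_module(module_path)
        except ImportError:
            continue  # This library is not installed; try the next one.
        if hasattr(module, attr_name):
            return module_path, getattr(module, attr_name)
    raise ImportError("none of the candidate implementations is available")


if __name__ == "__main__":
    try:
        # Prefer the cuML KMeans; otherwise use the scikit-learn KMeans.
        source, KMeans = load_first_available(
            [("cuml", "KMeans"), ("sklearn.cluster", "KMeans")]
        )
        print("Using KMeans from", source)
    except ImportError as exc:
        print("No KMeans implementation found:", exc)
```

Because the selection happens once at import time, this pattern cannot react to individual parameters or inputs that the GPU library does not support. The fallback described above for pai4sk covers that case as well.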

Notes:
  • Dask support for GPU-backed dataframes (dask-cuda and dask-cudf) and multi-GPU machine learning algorithms is in technology preview.
  • In WML CE, RAPIDS packages are available only on IBM Power architecture.
  • For some packages, such as cuml, you must specify powerai-release=1.7.0 explicitly in the conda install command to pick up the latest package versions. For example: conda install cuml powerai-release=1.7.0.