Getting started with Snap Machine Learning (Snap ML) on Apache Spark

Find information about getting started with the Snap ML Spark APIs.

This release of WML CE includes Snap Machine Learning (Snap ML) APIs that work with Apache Spark. Snap ML is a library for training generalized linear models. It is being developed at IBM® with the vision of removing training time as a bottleneck for machine learning applications. Snap ML supports many classical machine learning models and scales gracefully to data sets with billions of examples or features. It also offers distributed training and GPU acceleration, and supports sparse data structures.

The Snap ML Spark APIs are shipped in the conda package snapml-spark. This package offers distributed training of models across a cluster of machines. The library exposes a SparkML-like interface and can be seamlessly integrated into existing pySpark applications.
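The SparkML-like workflow can be sketched as follows. This is illustrative pseudocode only: the module path snap_ml_spark, the class name LogisticRegression, and all parameter names are assumptions, not a verified API. Consult the shipped example programs for the exact interface.

```
# Illustrative pseudocode -- module, class, and parameter names are assumptions.
from pyspark.sql import SparkSession
from snap_ml_spark import LogisticRegression   # hypothetical import path

spark = SparkSession.builder.appName("snapml-demo").getOrCreate()

# Load training data in a Spark-supported format (for example, libsvm).
train = spark.read.format("libsvm").load("train.libsvm")
test = spark.read.format("libsvm").load("test.libsvm")

# Train a generalized linear model; the estimator follows the
# familiar SparkML fit/predict pattern.
lr = LogisticRegression(max_iter=100, regularizer=1.0, use_gpu=True)
model = lr.fit(train)
predictions = model.predict(test)
```

The point of the sketch is the shape of the workflow: data is loaded into Spark DataFrames and the Snap ML estimator is driven the same way a SparkML estimator would be.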

The following APIs are supported:

To load the data, the following API is provided:

Snap ML uses a proprietary data format, snap, for efficient data loading in both single-node and multi-node training. The following APIs are provided to load and store data sets in snap format:

Example programs for both of these API sets are provided as part of the conda package. To find out how to run the sample programs, see the README files under $CONDA_PREFIX/snap-ml-spark/examples/.
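Running one of the example programs on a cluster follows the usual spark-submit workflow. The following is a command-line sketch only: the master URL, resource options, and the example path are placeholders, not verified values; the README files give the exact invocations.

```
# Illustrative only -- master, resources, and script path are placeholders.
spark-submit \
  --master yarn \
  --num-executors 4 \
  --executor-memory 8g \
  $CONDA_PREFIX/snap-ml-spark/examples/<example-dir>/<example>.py
```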

This package can be used only with an Apache Spark or Hortonworks Data Platform (HDP) installation and requires the SPARK_HOME environment variable to be set. To add the required libraries to the corresponding environment variables, run the following command before using the library:

source $CONDA_PREFIX/bin/snap-ml-spark-activate
Table 1. Supported versions

Application              Version
Spark                    2.3
Python                   2.7 or 3.6
OpenJDK / AdoptOpenJDK   1.8.0_181 / 1.8.0_202
HDP                      2.6.5.0-292
Note: This package is a technology preview in this release.