Getting started with Snap Machine Learning (SnapML) on Apache Spark

Snap ML is a library for training generalized linear models. It is developed at IBM® with the vision of removing training time as a bottleneck for machine learning applications. Snap ML supports many classical machine learning models and scales gracefully to data sets with billions of examples or features. It also offers distributed training and GPU acceleration, and supports sparse data structures.

The Snap ML Spark APIs are shipped in the conda package snapml-spark. This package offers distributed training of models across a cluster of machines. The Snap ML APIs can be used in Spark ML machine learning pipelines and accept Spark DataFrames as input.
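For illustration, here is a minimal sketch of this usage pattern in a PySpark application. It assumes the snap_ml_spark package exposes a LogisticRegression estimator with fit and predict methods; the parameter names used (max_iter, regularizer) are illustrative only, so refer to the bundled examples for the exact API.

# Minimal sketch: train a Snap ML estimator on a Spark DataFrame.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from snap_ml_spark import LogisticRegression  # assumed estimator name; see the bundled examples

spark = SparkSession.builder.appName("snap-ml-spark-example").getOrCreate()

# Toy DataFrame with a label column and two numeric feature columns.
df = spark.createDataFrame(
    [(0.0, 1.0, 2.0), (1.0, 3.0, 4.0), (0.0, 0.5, 1.5), (1.0, 2.5, 3.5)],
    ["label", "f1", "f2"])

# Assemble the raw columns into the single vector column expected by Spark ML estimators.
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
train = assembler.transform(df).select("label", "features")

# Train and score with the (assumed) Snap ML estimator; parameter names are illustrative.
lr = LogisticRegression(max_iter=100, regularizer=1.0)
lr.fit(train)
lr.predict(train).show()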

The following Spark Estimators are introduced:
The following Spark Transformers are introduced:
Snap ML also supports a proprietary data format named snap for efficient data loading in both single-node and multi-node training. The following APIs are provided to load and store data sets in snap format:
The following APIs are available to perform machine learning on snap-formatted data in a PySpark application:
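As a rough illustration of this workflow, the sketch below loads a snap-format data set and trains a model on it in a PySpark application. The DatasetReader name, its format and load methods, the file path, and the estimator parameter names are hypothetical placeholders standing in for the APIs listed above; consult the examples under $CONDA_PREFIX/snap-ml-spark/examples/ for the actual names.

# Rough sketch only: loader and estimator names below are hypothetical placeholders.
from pyspark.sql import SparkSession
from snap_ml_spark import DatasetReader, LogisticRegression  # assumed module contents

spark = SparkSession.builder.appName("snap-format-example").getOrCreate()

# Load a training set previously stored in snap format (path is illustrative).
train = DatasetReader().format("snap").load("/path/to/train.snap")

# Train a model directly on the snap-formatted data; parameter names are illustrative.
lr = LogisticRegression(max_iter=50, regularizer=1.0)
lr.fit(train)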
Example programs for all of these APIs are provided as part of this conda package and can be found under $CONDA_PREFIX/snap-ml-spark/examples/. To submit these sample programs to a Spark cluster, add the following configuration to $SPARK_HOME/conf/spark-defaults.conf (after replacing <CONDA_PREFIX> with its actual path):
spark.jars                         <CONDA_PREFIX>/snap-ml-spark/lib/snap-ml-spark-v1.4.0-ppc64le.jar
spark.executor.extraLibraryPath    <CONDA_PREFIX>/lib

Additionally, pass the --driver-library-path <CONDA_PREFIX>/lib option to the spark-submit command. To learn how to run the sample programs, refer to the README files under $CONDA_PREFIX/snap-ml-spark/examples/.

This package can be used only with an Apache Spark or Hortonworks Data Platform (HDP) installation and requires the SPARK_HOME environment variable to be set. To add the required libraries to the corresponding environment variables, run the following command before using this library:
source $CONDA_PREFIX/bin/snap-ml-spark-activate
Table 1. Supported versions

Application    Version
Spark          2.3
Python         3.6 or 3.7
HDP            2.6.5.0-292
OpenJDK        1.8.0_181