Getting started with Snap Machine Learning (SnapML) on Apache Spark
Snap ML is a library for training generalized linear models. It is being developed at IBM® with the vision to remove training time as a bottleneck for machine learning applications. Snap ML supports many classical machine learning models and scales gracefully to data sets with billions of examples or features. It also offers distributed training, GPU acceleration, and support for sparse data structures.
SnapML Spark APIs are shipped under the conda package snapml-spark.
This package offers distributed training of models across a cluster of machines. SnapML APIs can be used in SparkML machine learning pipelines and accept Spark DataFrames as input.
SnapML also supports a special data format, snap, for efficient data loading in both single-node and multi-node training. APIs are provided to load and store data sets in snap format. Sample programs, including pyspark applications that use snap formatted data, are placed under $CONDA_PREFIX/snap-ml-spark/examples/.

To submit these sample programs to a Spark cluster, make sure you include the following configuration in $SPARK_HOME/conf/spark-defaults.conf (after resolving the <CONDA_PREFIX> keyword):

spark.jars <CONDA_PREFIX>/snap-ml-spark/lib/snap-ml-spark-v1.4.0-ppc64le.jar
spark.executor.extraLibraryPath <CONDA_PREFIX>/lib
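As a minimal sketch, the two settings above can be appended to spark-defaults.conf with the <CONDA_PREFIX> keyword resolved from the active conda environment. The default paths below are illustrative placeholders, not values from this document:

```shell
# Resolve <CONDA_PREFIX> and append the Snap ML settings to
# spark-defaults.conf. The fallback values are placeholders so the
# sketch is self-contained; in practice both variables are already set.
CONDA_PREFIX="${CONDA_PREFIX:-/opt/anaconda3}"
SPARK_HOME="${SPARK_HOME:-$(mktemp -d)}"
mkdir -p "$SPARK_HOME/conf"
cat >> "$SPARK_HOME/conf/spark-defaults.conf" <<EOF
spark.jars ${CONDA_PREFIX}/snap-ml-spark/lib/snap-ml-spark-v1.4.0-ppc64le.jar
spark.executor.extraLibraryPath ${CONDA_PREFIX}/lib
EOF
```

Using `>>` rather than `>` preserves any Spark settings already present in the file.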
Additionally, provide the --driver-library-path <CONDA_PREFIX>/lib option with the spark-submit command. To learn how to run the sample programs, refer to the READMEs placed under $CONDA_PREFIX/snap-ml-spark/examples/.
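A submission command might then look like the following sketch. Here example.py is a hypothetical stand-in for one of the shipped sample programs, the CONDA_PREFIX fallback is a placeholder, and the command is only printed for inspection rather than executed:

```shell
# Build a spark-submit command line for a Snap ML example program.
# example.py is a placeholder name, not an actual shipped script.
CONDA_PREFIX="${CONDA_PREFIX:-/opt/anaconda3}"
CMD="spark-submit --driver-library-path ${CONDA_PREFIX}/lib \
${CONDA_PREFIX}/snap-ml-spark/examples/example.py"
echo "$CMD"  # printed instead of executed in this sketch
```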
To set up the environment for snap-ml-spark, run:

source $CONDA_PREFIX/bin/snap-ml-spark-activate
Supported software versions:

| Application | Version |
|---|---|
| Spark | 2.3 |
| Python | 3.6 or 3.7 |
| HDP | 2.6.5.0-292 |
| OpenJDK | 1.8.0_181 |