Getting started with Snap Machine Learning (SnapML) on Apache Spark
Find information about getting started with SnapML Spark APIs
This release of WML CE includes Snap Machine Learning (Snap ML) APIs that work with Apache Spark. Snap ML is a library for training generalized linear models, developed at IBM® with the vision of removing training time as a bottleneck for machine learning applications. Snap ML supports many classical machine learning models and scales gracefully to data sets with billions of examples or features. It also offers distributed training and GPU acceleration, and supports sparse data structures.
SnapML Spark APIs are shipped in the conda package snapml-spark. This package offers distributed training of models across a cluster of machines. The library is available through a SparkML-like interface and can be seamlessly integrated into existing pySpark applications.
The following APIs are supported:
To load the data, the following API is provided:
SnapML uses a proprietary data format named snap for efficient data loading in both single-node and multi-node training. The following APIs are provided to load and store datasets in snap format:
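For illustration, a minimal sketch of what data loading and training can look like in a pySpark application. This is not runnable as-is: the class, method, and parameter names used here (DatasetReader, LogisticRegression, max_iter, regularizer) are assumptions based on the SparkML-like interface described above; consult the bundled example programs for the exact API.

```python
# Illustrative sketch only -- names below are assumptions based on the
# SparkML-like interface; see the bundled examples for the exact API.
from snap_ml_spark import DatasetReader, LogisticRegression  # assumed module layout

# Load datasets stored in snap format (assumed reader API)
train_data = DatasetReader().format("snap").load("train.snap")
test_data = DatasetReader().format("snap").load("test.snap")

# Train a logistic regression model, distributed across the cluster
lr = LogisticRegression(max_iter=50, regularizer=1.0)  # assumed parameters
lr.fit(train_data)

# Score the held-out set
predictions = lr.predict(test_data)
```

A script of this shape would be submitted to the cluster with spark-submit so that the Spark context and the activated environment are picked up.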
Example programs for both of these APIs are provided as part of the conda package. To find out how to run the sample programs, refer to the READMEs under $CONDA_PREFIX/snap-ml-spark/examples/.
This package can be used only with an Apache Spark or Hortonworks Data Platform installation and requires the SPARK_HOME environment variable to be set. To add the required libraries to the corresponding environment variables, run the following command before using this library:

source $CONDA_PREFIX/bin/snap-ml-spark-activate
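A typical session might then look like the following. This is an invocation sketch, not a verified recipe: the script name and master setting are illustrative, and the actual example file names are listed in the READMEs mentioned above.

```shell
# Activate the Snap ML Spark environment (sets the required library paths)
source $CONDA_PREFIX/bin/snap-ml-spark-activate

# Submit an application to the cluster; SPARK_HOME must already be set.
# The script name and --master value below are illustrative.
$SPARK_HOME/bin/spark-submit --master yarn my_snapml_app.py
```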
Application | Version
---|---
Spark | 2.3
Python | 2.7 or 3.6
OpenJDK / AdoptOpenJDK | 1.8.0_181 / 1.8.0_202
HDP | 2.6.5.0-292