Scripting with Python for Spark

IBM® SPSS® Modeler can execute Python scripts using the Apache Spark framework to process data. This documentation describes the Python API for the interfaces provided.

The IBM SPSS Modeler installation includes a Spark distribution (for example, IBM SPSS Modeler 18.1.1 includes Spark 2.1.0). If you prefer using a different version of Spark, you can configure it by adding the eas_spark_path and eas_spark_version parameters to options.cfg. Possible values for the eas_spark_version parameter are 1.x or 2.x. For example:
eas_spark_path, "C:/spark_1.5"
eas_spark_version, "1.x"

Prerequisites

  • If you plan to execute Python/Spark scripts against IBM SPSS Analytic Server, you must have a connection to Analytic Server, and Analytic Server must have access to a compatible installation of Apache Spark. Refer to your IBM SPSS Analytic Server documentation for details about using Apache Spark as the execution engine.
  • If you plan to execute Python/Spark scripts against IBM SPSS Modeler Server (or the local server included with IBM SPSS Modeler Client, which requires Windows 64 or Mac64), you no longer need to install Python and edit options.cfg to use your Python installation. Starting with version 18.1, IBM SPSS Modeler now includes a Python distribution. However, if you require a certain module that is not included with the default IBM SPSS Modeler Python distribution, you can go to <Modeler_installation_directory>/python and install additional packages.
    Even though a Python distribution is now included with IBM SPSS Modeler, you can still point to your own Python installation as in previous releases if desired by adding the following option to options.cfg:
    # Set to the full path to the python executable
    # (including the executable name) to enable use of PySpark.
    eas_pyspark_python_path, ""
    For example:
    eas_pyspark_python_path, "C:/Your_Python_Install/python.exe"

The IBM SPSS Analytic Server context object

The execution context for a Python/Spark script is defined by an Analytic Server context object. When running against IBM SPSS Modeler Server, the context object is for the embedded version of Analytic Server that is included with the IBM SPSS Modeler Server installation. To obtain the context object, the script must include the following:
import spss.pyspark.runtime
asContext = spss.pyspark.runtime.getContext()
From the Analytic Server context, you can obtain the Spark context and the SQL context:
sparkContext = asContext.getSparkContext()
sqlContext = asContext.getSparkSQLContext()

Refer to your Apache Spark documentation for information about the Spark context and the SQL context.
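
For example, the following sketch (the data and column names are illustrative only, not part of the interface) uses the Spark context to create an RDD and the SQL context to convert it into a data frame:
rdd = sparkContext.parallelize([("a", 1.0), ("b", 2.0)])
df = sqlContext.createDataFrame(rdd, ["key", "value"])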

Accessing data

Data is transferred between a Python/Spark script and the execution context in the form of a Spark SQL DataFrame. A script that consumes data (that is, any node except a source node) must retrieve the data frame from the context:
inputData = asContext.getSparkInputData()
A script that produces data (that is, any node except a terminal node) must return a data frame to the context:
asContext.setSparkOutputData(outputData)
You can use the SQL context to create an output data frame from an RDD where required:
outputData = sqlContext.createDataFrame(rdd)
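For example, a minimal transformation script (a sketch only; the field name sales is an assumption about the input data) might filter the input rows and return the result:
inputData = asContext.getSparkInputData()
outputData = inputData.filter(inputData["sales"] > 0)
asContext.setSparkOutputData(outputData)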

Defining the data model

A node that produces data must also define a data model that describes the fields visible downstream of the node. In Spark SQL terminology, the data model is the schema.

A Python/Spark script defines its output data model in the form of a pyspark.sql.types.StructType object. A StructType describes a row in the output data frame and is constructed from a list of StructField objects. Each StructField describes a single field in the output data model.

You can obtain the data model for the input data using the schema attribute of the input data frame:
inputSchema = inputData.schema
Fields that are passed through unchanged can be copied from the input data model to the output data model. Fields that are new or modified in the output data model can be created using the StructField constructor:
field = StructField(name, dataType, nullable=True, metadata=None)

Refer to your Spark documentation for information about the constructor.

You must provide at least the field name and its data type. Optionally, you can specify metadata to provide a measure, role, and description for the field (see Data metadata).
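
For example, the following sketch copies the input fields and appends one new field (the field name predicted_value is illustrative only, not part of the interface):
from pyspark.sql.types import StructType, StructField, DoubleType

inputSchema = inputData.schema
outputFields = list(inputSchema.fields)
outputFields.append(StructField("predicted_value", DoubleType(), nullable=True))
outputSchema = StructType(outputFields)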

DataModelOnly mode

IBM SPSS Modeler needs to know the output data model for a node, before the node is executed, in order to enable downstream editing. To obtain the output data model for a Python/Spark node, IBM SPSS Modeler executes the script in a special "data model only" mode where there is no data available. The script can identify this mode using the isComputeDataModelOnly method on the Analytic Server context object.

The script for a transformation node can follow this general pattern:
if asContext.isComputeDataModelOnly():
    inputSchema = asContext.getSparkInputSchema()
    outputSchema = ... # construct the output data model
    asContext.setSparkOutputSchema(outputSchema)
else:
    inputData = asContext.getSparkInputData()
    outputData = ... # construct the output data frame
    asContext.setSparkOutputData(outputData)
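
As a concrete sketch, a node that appends a computed field might look like the following (the field names total, price, and quantity are assumptions about the input data, not part of the interface):
from pyspark.sql.types import StructType, StructField, DoubleType
from pyspark.sql.functions import col

if asContext.isComputeDataModelOnly():
    # Construct the output schema by copying the input fields and adding one new field
    inputSchema = asContext.getSparkInputSchema()
    newField = StructField("total", DoubleType(), nullable=True)
    outputSchema = StructType(list(inputSchema.fields) + [newField])
    asContext.setSparkOutputSchema(outputSchema)
else:
    # Construct the output data frame by computing the new field from existing fields
    inputData = asContext.getSparkInputData()
    outputData = inputData.withColumn("total", col("price") * col("quantity"))
    asContext.setSparkOutputData(outputData)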

Building a model

A node that builds a model must return to the execution context content that describes the model in enough detail that the node which applies the model can re-create it exactly at a later time.

Model content is defined in terms of key/value pairs; the meaning of the keys and values is known only to the build and score nodes and is not interpreted by Modeler in any way. Optionally, the node can assign a MIME type to a value so that Modeler can display values with known types to the user in the model nugget.

A value in this context may be PMML, HTML, an image, etc. To add a value to the model content (in the build script):
asContext.setModelContentFromString(key, value, mimeType=None)
To retrieve a value from the model content (in the score script):
value = asContext.getModelContentToString(key)
As a shortcut, where a model or part of a model is stored to a file or folder in the file system, you can bundle all the content stored to that location in one call (in the build script):
asContext.setModelContentFromPath(key, path)

Note that in this case there is no option to specify a MIME type because the bundle may contain various content types.

If you need a temporary location to store the content while building the model, you can obtain an appropriate location from the context:
path = asContext.createTemporaryFolder()
To retrieve existing content to a temporary location in the file system (in the score script):
path = asContext.getModelContentToPath(key)
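
For example, a build script might store a simple model serialized as JSON, and the corresponding score script would read it back (the key name and model content shown here are illustrative only):
# In the build script:
import json
model = {"coefficients": [0.5, 1.2], "intercept": 0.3}
asContext.setModelContentFromString("model.json", json.dumps(model), mimeType="application/json")

# In the score script:
import json
model = json.loads(asContext.getModelContentToString("model.json"))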

Error handling

To raise an error, throw an exception from the script; the exception is displayed to the IBM SPSS Modeler user. Some exceptions are predefined in the module spss.pyspark.exceptions. For example:
from spss.pyspark.exceptions import ASContextException
if ... some error condition ...:
     raise ASContextException("message to display to user")
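As a more concrete sketch, a script might verify that a required field is present in the input data and raise an exception otherwise (the field name sales is illustrative only):
from spss.pyspark.exceptions import ASContextException

inputData = asContext.getSparkInputData()
if "sales" not in inputData.columns:
    raise ASContextException("The input data must contain a field named 'sales'")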