Scripting with Python for Spark
IBM® SPSS® Modeler can run Python scripts using the Apache Spark framework to process data. This documentation provides the Python API description for the interfaces provided.
The IBM SPSS Modeler installation includes a Spark distribution (for example, IBM SPSS Modeler 18.5 includes Spark 3.4.0).
Prerequisites
- If you plan to run Python/Spark scripts against IBM SPSS Analytic Server, you must have a connection to Analytic Server, and Analytic Server must have access to a compatible installation of Apache Spark. Refer to your IBM SPSS Analytic Server documentation for details about using Apache Spark as the execution engine.
- If you plan to run Python/Spark scripts against IBM SPSS Modeler Server (or the local server that is included with IBM SPSS Modeler Client, which requires 64-bit Windows or 64-bit macOS), you no longer need to install Python and edit options.cfg to use your Python installation. Starting with version 18.1, IBM SPSS Modeler includes a Python distribution.
However, if you require a module that is not included with the default IBM SPSS Modeler Python distribution, go to <Modeler_installation_directory>/python and install extra packages.
Even though a Python distribution is now included with IBM SPSS Modeler, you can still point to your own Python installation, as in previous releases, by adding the following option to options.cfg:
# Set to the full path to the python executable (including the executable name) to enable use of PySpark.
eas_pyspark_python_path, ""
Windows example:
eas_pyspark_python_path, "C:\\Your_Python_Install\\python.exe"
Linux example:
eas_pyspark_python_path, "/Your_Python_Install/bin/python"
Note: If you point to your own Python installation, it must be version 3.8.x. IBM SPSS Modeler was tested with Anaconda for Python 3.8 and Python 3.8.6.
- To determine the PySpark version that is used by IBM SPSS Modeler, you can run the following script.
import pkg_resources
print("pandas version")
print(pkg_resources.get_distribution("pandas").version)
print("pyspark version")
print(pkg_resources.get_distribution("pyspark").version)
Based on your OS, run the following command in the corresponding path before you run the script to determine the PySpark version.
- For Windows
- Path: <Modeler-InstallationDirectory>\18.5\spark\python
- Command: "<Modeler-InstallationDirectory>\18.5\python\python.exe" setup.py sdist
- For Mac
- Path: <Modeler-InstallationDirectory>/18.5/IBM SPSS Modeler.app/Contents/spark/python
- Command: "<Modeler-InstallationDirectory>/18.5/IBM SPSS Modeler.app/Contents/python/bin/python3" setup.py sdist
- For Linux
- Path: <Modeler-InstallationDirectory>/18.5/spark/python
- Command: <Modeler-InstallationDirectory>/18.5/python/bin/python3 setup.py sdist
Note: <Modeler-InstallationDirectory> corresponds to the IBM SPSS Modeler Thick Client installation for local server, whereas it corresponds to the IBM SPSS Modeler Server installation for remote server.
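The checks described in the prerequisites above can also be scripted. The following is a minimal sketch and not part of the IBM SPSS Modeler API: interpreter_version and installed_version are hypothetical helper names, and importlib.metadata (in the standard library since Python 3.8) is used here as an alternative to pkg_resources.

```python
import subprocess
import sys
from importlib import metadata


def interpreter_version(python_path):
    """Return (major, minor) for the Python interpreter at python_path."""
    result = subprocess.run(
        [python_path, "-c", "import sys; print('%d.%d' % sys.version_info[:2])"],
        capture_output=True,
        text=True,
        check=True,
    )
    major, minor = result.stdout.strip().split(".")
    return int(major), int(minor)


def installed_version(package):
    """Return the installed version string of package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # sys.executable is the interpreter running this sketch; substitute the
    # path that you intend to set in eas_pyspark_python_path.
    print("interpreter:", interpreter_version(sys.executable))
    for name in ("pandas", "pyspark"):
        print(name, "version:", installed_version(name))
```

Under the version requirement stated in the note above, a custom interpreter is suitable only if interpreter_version returns (3, 8).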
The IBM SPSS Analytic Server context object
import spss.pyspark.runtime
asContext = spss.pyspark.runtime.getContext()
sparkContext = asContext.getSparkContext()
sqlContext = asContext.getSparkSQLContext()
Refer to your Apache Spark documentation for information about the Spark context and the SQL context.
Accessing data
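# Retrieve the input data frame from the context.
inputData = asContext.getSparkInputData()
# Where required, create the output data frame from an RDD.
outputData = sqlContext.createDataFrame(rdd)
# Return the output data frame to the context.
asContext.setSparkOutputData(outputData)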
Defining the data model
A node that produces data must also define a data model that describes the fields visible downstream of the node. In Spark SQL terminology, the data model is the schema.
A Python/Spark script defines its output data model in the form of a pyspark.sql.types.StructType object. A StructType describes a row in the output data frame and is constructed from a list of StructField objects. Each StructField describes a single field in the output data model.
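You can obtain the input data model from the schema attribute of the input data frame:
inputSchema = inputData.schema
You can create a field for the output data model with the StructField constructor:
field = StructField(name, dataType, nullable=True, metadata=None)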
Refer to your Spark documentation for information about the constructor.
Provide at least the field name and its data type. Optionally, you can specify metadata to provide a measure, role, and description for the field (see Data metadata).
DataModelOnly mode
IBM SPSS Modeler needs to know the output data model for a node before the node is executed, in order to enable downstream editing. To obtain the output data model for a Python/Spark node, IBM SPSS Modeler runs the script in a special "data model only" mode where no data is available. The script can identify this mode using the isComputeDataModelOnly method on the Analytic Server context object.
if asContext.isComputeDataModelOnly():
    inputSchema = asContext.getSparkInputSchema()
    outputSchema = ... # construct the output data model
    asContext.setSparkOutputSchema(outputSchema)
else:
    inputData = asContext.getSparkInputData()
    outputData = ... # construct the output data frame
    asContext.setSparkOutputData(outputData)
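One way to keep the two branches consistent is to wrap them in a single function. The following is a sketch only: run_node, make_schema, make_data, and the StubContext class are hypothetical names introduced for illustration, and StubContext uses plain Python lists in place of Spark schemas and data frames so that the control flow can be exercised outside IBM SPSS Modeler. Only the context method names shown above are taken from the API.

```python
class StubContext:
    """Hypothetical stand-in for the Analytic Server context (not the real object)."""

    def __init__(self, schema, data=None):
        self._schema = schema
        self._data = data  # None simulates "data model only" mode
        self.outputSchema = None
        self.outputData = None

    def isComputeDataModelOnly(self):
        return self._data is None

    def getSparkInputSchema(self):
        return self._schema

    def getSparkInputData(self):
        return self._data

    def setSparkOutputSchema(self, schema):
        self.outputSchema = schema

    def setSparkOutputData(self, data):
        self.outputData = data


def run_node(asContext, make_schema, make_data):
    """Set only the schema in data-model-only mode, otherwise only the data."""
    if asContext.isComputeDataModelOnly():
        inputSchema = asContext.getSparkInputSchema()
        asContext.setSparkOutputSchema(make_schema(inputSchema))
    else:
        inputData = asContext.getSparkInputData()
        asContext.setSparkOutputData(make_data(inputData))


# Illustrative transformation: append one derived field named "total".
def add_total_field(schema):
    return schema + ["total"]


def add_total_value(rows):
    return [row + [sum(row)] for row in rows]


# Data-model-only run: no data is available, so only the schema is set.
modelOnly = StubContext(schema=["a", "b"])
run_node(modelOnly, add_total_field, add_total_value)

# Normal run: data is available, so only the output data is set.
withData = StubContext(schema=["a", "b"], data=[[1, 2], [3, 4]])
run_node(withData, add_total_field, add_total_value)
```

Keeping the schema logic and the data logic in two small functions makes it easier to check that the fields declared in data-model-only mode match the fields that the data branch actually produces.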
Building a model
A node that builds a model must return to the execution context some content that describes the model sufficiently that the node which applies the model can recreate it exactly at a later time.
Model content is defined in terms of key/value pairs where the meaning of the keys and the values is known only to the build and score nodes and is not interpreted by Modeler in any way. Optionally, the node may assign a MIME type to a value with the intent that Modeler might display those values which have known types to the user in the model nugget.
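# Build node: assign a value to a key.
asContext.setModelContentFromString(key, value, mimeType=None)
# Score node: retrieve the value that is stored under a key.
value = asContext.getModelContentToString(key)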
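As an illustration, a build script might serialize its model parameters to JSON and store them under a single key. This is a sketch under assumptions: the key name "model", the parameter dictionary, and the in-memory StubContext class are hypothetical, and application/json is chosen here only as an example MIME type. Inside IBM SPSS Modeler, the real context object would be used instead of the stub.

```python
import json


class StubContext:
    """Hypothetical in-memory stand-in for the Analytic Server context."""

    def __init__(self):
        self._content = {}

    def setModelContentFromString(self, key, value, mimeType=None):
        self._content[key] = (value, mimeType)

    def getModelContentToString(self, key):
        return self._content[key][0]


asContext = StubContext()

# Build side: serialize the parameters and store them under one key.
params = {"intercept": 0.5, "coefficients": [1.25, -2.0]}
asContext.setModelContentFromString(
    "model", json.dumps(params), mimeType="application/json"
)

# Score side: read the string back and rebuild the parameters.
restored = json.loads(asContext.getModelContentToString("model"))
```

A text format such as JSON keeps the stored value readable, but any string encoding that the build and score scripts agree on would serve the same purpose.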
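# Bundle all content that is stored in a file or folder in one call.
asContext.setModelContentFromPath(key, path)
Note that in this case there is no option to specify a MIME type because the bundle may contain various content types.
# Obtain a temporary location in which to store content while building the model.
path = asContext.createTemporaryFolder()
# Retrieve existing content to a location in the file system.
path = asContext.getModelContentToPath(key)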
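The path-based calls can be combined into a build and score round trip. The following sketch makes several assumptions: StubContext is a hypothetical stand-in that copies folders with the standard library so that the example runs outside IBM SPSS Modeler, and the key name "model" and the file name weights.txt are invented for illustration. The behavior of the real context object may differ in detail.

```python
import os
import shutil
import tempfile


class StubContext:
    """Hypothetical stand-in that keeps model content in a private folder."""

    def __init__(self):
        self._store = tempfile.mkdtemp()

    def createTemporaryFolder(self):
        return tempfile.mkdtemp()

    def setModelContentFromPath(self, key, path):
        # Copy everything under path into the store, filed under key.
        shutil.copytree(path, os.path.join(self._store, key))

    def getModelContentToPath(self, key):
        # Copy the stored content to a fresh location and return that path.
        target = os.path.join(tempfile.mkdtemp(), key)
        shutil.copytree(os.path.join(self._store, key), target)
        return target


asContext = StubContext()

# Build side: write the model files to a temporary folder, then bundle it.
buildPath = asContext.createTemporaryFolder()
with open(os.path.join(buildPath, "weights.txt"), "w") as f:
    f.write("1.25,-2.0")
asContext.setModelContentFromPath("model", buildPath)

# Score side: retrieve the bundle and read the model files back.
scorePath = asContext.getModelContentToPath("model")
with open(os.path.join(scorePath, "weights.txt")) as f:
    weights = [float(x) for x in f.read().split(",")]
```

This pattern suits libraries that save a model as a folder of files, because the whole folder travels under one key.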
Error handling
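To report an error to the user, raise an exception from the script. Some exceptions are predefined in the module spss.pyspark.exceptions. For example:
from spss.pyspark.exceptions import ASContextException
if ... some error condition ...:
    raise ASContextException("message to display to user")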
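A concrete error condition might be a missing input field. The following sketch assumes a hypothetical helper named require_fields and invented field names. The fallback class exists only so that the example can run where the spss package is not installed; it is not the real ASContextException.

```python
try:
    from spss.pyspark.exceptions import ASContextException
except ImportError:
    # Stand-in for running outside IBM SPSS Modeler (an assumption,
    # not the real class).
    class ASContextException(Exception):
        pass


def require_fields(available, required):
    """Raise ASContextException if any required field name is missing."""
    missing = [name for name in required if name not in available]
    if missing:
        raise ASContextException("Missing input fields: " + ", ".join(missing))


# All required fields are present, so no exception is raised.
require_fields(["age", "income"], ["age"])

# One required field is absent, so the helper raises.
try:
    require_fields(["age", "income"], ["age", "region"])
except ASContextException as error:
    message = str(error)
```

In a real script, the list of available names could be taken from the input schema, and the exception would be left uncaught so that the message reaches the user.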