Importing a Spark, Scikit-learn, or XGBoost model into WML for z/OS

WML for z/OS provides a model utility library that you can download and install into your own Python environment. You can use the library to save a Spark, Scikit-learn, or XGBoost model locally on your distributed system and then import the saved model into WML for z/OS.

Before you begin

  • Verify that the Python environment on your local system supports the XGBoost 0.90 and Scikit-learn 0.22.x releases. A version-check sketch follows this list.
  • Create the TENTDATA table and load the test data as described in Preparing data for a model in Db2 for z/OS.
  • Locate the following information:
    • User name and password for the WML for z/OS web user interface.
    • JDBC connection information, authorization ID, and password for the Db2 subsystem where the TENTDATA sample table is created.
    • IP address of the host system where your WML for z/OS metadata service runs.
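
  To confirm the package levels, you can print the installed versions. This is a minimal sketch, assuming both packages are already installed in your local Python environment:

    import sklearn
    import xgboost

    # Print the installed versions to confirm they match the supported levels.
    print("scikit-learn:", sklearn.__version__)   # expect 0.22.x
    print("XGBoost:", xgboost.__version__)        # expect 0.90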

Procedure

  1. Locate the wmlz_model_utils-2.2.202006011818-py3-none-any.whl package in the $IML_INSTALL_DIR/imlpython/iml-pkgs directory.

    As its name indicates, the wmlz_model_utils-2.2.202006011818-py3-none-any.whl package contains the WML for z/OS Python utility library.

  2. Download the package file to the local system where you run your own Python environment.
  3. Install the package into that Python environment, as shown in the sketch below.
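
    For example, assuming SSH access to the z/OS host and that pip is available on your local system. Here <host> is a placeholder for your host name and <install_dir> stands for the value of $IML_INSTALL_DIR on the host:

      scp <userid>@<host>:<install_dir>/imlpython/iml-pkgs/wmlz_model_utils-2.2.202006011818-py3-none-any.whl .
      pip install wmlz_model_utils-2.2.202006011818-py3-none-any.whl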
  4. Create a new Spark, Scikit-learn, or XGBoost model in your Python environment.
    • To create a Spark model:
      1. Import PySpark packages:
        
        from pyspark.ml import Pipeline
        from pyspark.ml.classification import LogisticRegression
        from pyspark.ml.feature import HashingTF, Tokenizer, StringIndexer, VectorAssembler
        from pyspark.ml.evaluation import RegressionEvaluator
        from pyspark.sql import SparkSession
      2. Read data from the MLZ.TENTDATA table in your Db2 subsystem and list the first five rows from the DataFrame as a preview:
        
        spark = (SparkSession.builder
            .appName("Python Spark SQL basic example")
            .master("<spark-master-URL>")
            .config("spark.some.config.option", "<value>")
            .getOrCreate())
        
        df = (spark.read.format("jdbc")
                .options(driver='com.ibm.db2.jcc.DB2Driver',
                    url='jdbc:db2://<url>:<port>/<location>',
                    user='<userid>', password='<password>',
                    dbtable='MLZ.TENTDATA')
                .load())
        
        df.show(5)

        Set <url>, <port>, <location>, <userid>, and <password> to values appropriate for the Db2 installation in your environment, and set <spark-master-URL> to the URL of your Spark master.

      3. Transform the data, split it into training and testing subsets, construct the feature vector, and then use LogisticRegression to train the model:
        
        [trainingDF, validationDF] = df.randomSplit([0.7, 0.3])
        
        genderIndexer = StringIndexer(inputCol="GENDER", outputCol="GENDER_Index")
        maritalStatusIndexer = StringIndexer(inputCol="MARITAL_STATUS", 
                  outputCol="MARITAL_STATUS_Index")
        professionIndexer = StringIndexer(inputCol="PROFESSION", outputCol="PROFESSION_Index")
        assembler = VectorAssembler(inputCols=["GENDER_Index", "MARITAL_STATUS_Index", 
                  "PROFESSION_Index", "AGE"], outputCol="features")
        
        lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
        pipeline = Pipeline(stages=[genderIndexer, maritalStatusIndexer, professionIndexer, 
                  assembler, lr])
        tentModelPY = pipeline.fit(trainingDF)
        
      4. Evaluate the model:
        
        predictions = tentModelPY.transform(validationDF)
        evaluator = RegressionEvaluator(metricName="rmse", 
                  labelCol="label", predictionCol="prediction")
        
        rmse = evaluator.evaluate(predictions)
        print("Root-mean-square error = " + str(rmse))
    • To create a Scikit-learn model:
      1. Import Scikit-learn and PySpark packages:
        
        import pandas
        from pyspark import SparkContext
        from pyspark.sql import SQLContext
        
        from sklearn import preprocessing
        from sklearn.model_selection import train_test_split
        from sklearn.pipeline import Pipeline
        from sklearn.preprocessing import LabelEncoder
        from sklearn.tree import DecisionTreeClassifier
        from sklearn import metrics
        
      2. Read data from the MLZ.TENTDATA table in your Db2 subsystem and list the first five rows from the DataFrame as a preview:
        
        sc = SparkContext.getOrCreate()
        sqlContext = SQLContext(sc)
        
        df = sqlContext.read.format("jdbc").options(
            driver='com.ibm.db2.jcc.DB2Driver',
            url='jdbc:db2://<url>:<port>/<location>',
            user='<userid>', password='<password>',
            dbtable='MLZ.TENTDATA').load().toPandas()
        
        print(df.head(5))

        Set <url>, <port>, <location>, <userid>, and <password> to values appropriate for the Db2 installation in your environment.

      3. Transform the data, split it into training and testing subsets, construct the feature vector, and then use DecisionTreeClassifier to train the model:
        
        df['GENDER_INDEX'] = LabelEncoder().fit_transform(df['GENDER'])
        df['AGE_INDEX'] = LabelEncoder().fit_transform(df['AGE'])
        df['MARITAL_STATUS_INDEX'] = LabelEncoder().fit_transform(df['MARITAL_STATUS'])
        df['PROFESSION_INDEX'] = LabelEncoder().fit_transform(df['PROFESSION'])
        df['COUNTRY_INDEX'] = LabelEncoder().fit_transform(df['COUNTRY'])
        
        X, y = df[["GENDER_INDEX", "AGE_INDEX", "MARITAL_STATUS_INDEX", "PROFESSION_INDEX", 
                "COUNTRY_INDEX"]], df["TENT_LABEL"]
        X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
        print ("Training set has %d rows, test set has %d rows." %(X_train.shape[0], X_test.shape[0]))
        
        pipeline = Pipeline([('clf',DecisionTreeClassifier())])
        tentModelDT = pipeline.fit(X_train, y_train)
        
      4. Evaluate the model:
        
        expected = y_test
        predicted = tentModelDT.predict(X_test)
        
        print(metrics.classification_report(expected, predicted))
        print(metrics.confusion_matrix(expected, predicted))
        
    • To create an XGBoost model:
      1. Import PySpark, Scikit-learn, and XGBoost packages:
        
        import pandas
        from pyspark import SparkContext
        from pyspark.sql import SQLContext
        
        from sklearn.model_selection import train_test_split
        from sklearn.preprocessing import LabelEncoder
        import xgboost as xgb
        from sklearn.metrics import mean_squared_error
      2. Read data from the MLZ.TENTDATA table in your Db2 subsystem and list the first five rows from the DataFrame as a preview:
        
        sc = SparkContext.getOrCreate()
        sqlContext = SQLContext(sc)
        
        df = sqlContext.read.format("jdbc").options(
            driver='com.ibm.db2.jcc.DB2Driver',
            url='jdbc:db2://<url>:<port>/<location>',
            user='<userid>', password='<password>',
            dbtable='MLZ.TENTDATA').load().toPandas()
        
        print(df.head(5))

        Set <url>, <port>, <location>, <userid>, and <password> to values appropriate for the Db2 installation in your environment.

      3. Transform the data, split it into training and testing subsets, construct the DMatrix, and then use XGBoost to train the model:
        
        df['GENDER_INDEX'] = LabelEncoder().fit_transform(df['GENDER'])
        df['AGE_INDEX'] = LabelEncoder().fit_transform(df['AGE'])
        df['MARITAL_STATUS_INDEX'] = LabelEncoder().fit_transform(df['MARITAL_STATUS'])
        df['PROFESSION_INDEX'] = LabelEncoder().fit_transform(df['PROFESSION'])
        df['COUNTRY_INDEX'] = LabelEncoder().fit_transform(df['COUNTRY'])
        
        X, y = df[["GENDER_INDEX", "AGE_INDEX", "MARITAL_STATUS_INDEX", "PROFESSION_INDEX", 
               "COUNTRY_INDEX"]], df["TENT_LABEL"]
        X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
        
        dtrain = xgb.DMatrix(X_train, label=y_train)
        dtest = xgb.DMatrix(X_test, label=y_test)
        print("Training set has %d rows, test set has %d rows." % (X_train.shape[0], X_test.shape[0]))
        
        param = {'max_depth': 2, 'eta': 1, 'objective': 'reg:squarederror'}
        
        model = xgb.train(param, dtrain)
      4. Evaluate the model:
        
        expected = y_test
        predicted = model.predict(dtest)
        
        print('mse is: ' + str(mean_squared_error(expected, predicted)))
  5. Save the model to your local file system by using the WMLz Python model utility. An optional check of the saved archive follows these examples.
    • To save the Spark model:
      
      from wmlz.ml_model_util import MLModelUtil
      
      model_util = MLModelUtil()
      model_util.save(tentModelPY,
          "./tentModelPY.tar.gz",
          training_data=trainingDF)
      print("Model saved to local file system successfully")
    • To save the Scikit-learn model:
      
      from wmlz.ml_model_util import MLModelUtil
      
      model_util = MLModelUtil()
      model_util.save(tentModelDT,
          "./tentModelDT.tar.gz",
          training_data=X_train,
          training_target=y_train)
      print("Model saved to local file system successfully")
    • To save the XGBoost model and the native XGBoost model schema:
      
      from wmlz.ml_model_util import MLModelUtil
      
      model_util = MLModelUtil()
      model_util.save(model,
          "./XGBModel.tar.gz",
          dmatrix=dtrain)
      print("Model saved to local file system successfully")
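
    Optionally, you can confirm that the archive was written by listing its contents with Python's standard tarfile module. This check is not part of the WMLz utility; the file name below matches the XGBoost save example and can be replaced with tentModelPY.tar.gz or tentModelDT.tar.gz:

      import tarfile

      # List the files inside the saved model archive as a sanity check.
      with tarfile.open("./XGBModel.tar.gz", "r:gz") as archive:
          for name in archive.getnames():
              print(name)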
  6. Import the model from your local file system into WML for z/OS.
    1. Sign in to the WML for z/OS web user interface with your user name and password.
    2. From the side bar, go to the Model Management page, select the Models tab, and click Import model.
    3. Specify a name for the new model and select MLz Model as the model format.
    4. Browse, select, and upload the Spark (tentModelPY.tar.gz), Scikit-learn (tentModelDT.tar.gz), or XGBoost (XGBModel.tar.gz) model file.
    5. Click Import model to import the Spark, Scikit-learn, or XGBoost model.
  7. Verify that the imported Spark, Scikit-learn, or XGBoost model appears on the Models tab of the Model Management page.