Importing a Spark, Scikit-learn, or XGBoost model into WML for z/OS
WML for z/OS provides a model utility library that you can download and install into your own Python environment. You can use the library to save a Spark, Scikit-learn, or XGBoost model locally on your distributed system and then import that model into WML for z/OS.
Before you begin
- Verify that the Python environment on your local system supports XGBoost 0.90 and Scikit-learn 0.22.x releases.
- Create the TENTDATA table and load the test data as described in Preparing data for a model in Db2 for z/OS.
- Locate the following information:
- User name and password for the WML for z/OS web user interface.
- JDBC connection information, authorization ID, and password for the Db2 subsystem where the TENTDATA sample table is created.
- IP address of the host system where your WML for z/OS metadata service runs.
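The version prerequisite above can be checked from Python before you go further. The following is a minimal sketch, not part of the WML for z/OS library: the helper names `matches_release` and `check_environment` are hypothetical, and it assumes Python 3.8 or later so that `importlib.metadata` is available.

```python
from importlib import metadata

# Releases named in the prerequisites; matched on leading dotted components.
REQUIRED = {"xgboost": "0.90", "scikit-learn": "0.22"}


def matches_release(installed, required):
    """Return True if `installed` starts with the dotted components of `required`."""
    want = required.split(".")
    return installed.split(".")[: len(want)] == want


def check_environment(required=REQUIRED):
    """Map each package name to (installed version or None, whether it matches)."""
    report = {}
    for package, release in required.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            installed = None
        ok = installed is not None and matches_release(installed, release)
        report[package] = (installed, ok)
    return report


if __name__ == "__main__":
    for package, (installed, ok) in check_environment().items():
        print(f"{package}: installed={installed}, ok={ok}")
```

Matching on dotted components rather than on a string prefix keeps a release such as 0.220.1 from being mistaken for 0.22.x.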
Procedure
- Locate the wmlz_model_utils-2.2.202006011818-py3-none-any.whl package in the $IML_INSTALL_DIR/imlpython/iml-pkgs directory. As its name indicates, the package contains the WML for z/OS Python utility library.
- Download the package file onto your local system where you run your own Python environment.
- Install the package into the Python environment.
- Create a new Spark, Scikit-learn, or XGBoost model in your Python environment.
- To create a Spark model:
- Import PySpark packages:
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer, StringIndexer, VectorAssembler
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.sql import SparkSession
- Read data from the MLZ.TENTDATA table in your Db2 subsystem and list the first five rows from the DataFrame as a preview:
# Parentheses let the builder chain span several lines
spark = (SparkSession.builder
         .appName("Python Spark SQL basic example")
         .master("<spark-master-url>")
         .config("spark.some.config.option", "<value>")
         .getOrCreate())
df = (spark.read.format("jdbc")
      .options(driver='com.ibm.db2.jcc.DB2Driver',
               url='jdbc:db2://<url>:<port>/<location>',
               user='<userid>',
               password='<password>',
               dbtable='MLZ.TENTDATA')
      .load())
df.show(5)
Set <spark-master-url> to the URL of your Spark master, and set <url>, <port>, <location>, <userid>, and <password> to appropriate values based on the Db2 installation in your environment.
- Transform data, split the data into training and test subsets, construct the feature vector, and then use LogisticRegression to train the model:
# Split the data into 70% training and 30% validation rows
[trainingDF, validationDF] = df.randomSplit([0.7, 0.3])
# Index the categorical columns
genderIndexer = StringIndexer(inputCol="GENDER", outputCol="GENDER_Index")
maritalStatusIndexer = StringIndexer(inputCol="MARITAL_STATUS", outputCol="MARITAL_STATUS_Index")
professionIndexer = StringIndexer(inputCol="PROFESSION", outputCol="PROFESSION_Index")
# Assemble the indexed columns and AGE into a single feature vector
assembler = VectorAssembler(inputCols=["GENDER_Index", "MARITAL_STATUS_Index", "PROFESSION_Index", "AGE"], outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
pipeline = Pipeline(stages=[genderIndexer, maritalStatusIndexer, professionIndexer, assembler, lr])
tentModelPY = pipeline.fit(trainingDF)
- Evaluate the model:
predictions = tentModelPY.transform(validationDF)
evaluator = RegressionEvaluator(metricName="rmse", labelCol="label", predictionCol="prediction")
rmse = evaluator.evaluate(predictions)
print("Root-mean-square error = " + str(rmse))
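The Spark, Scikit-learn, and XGBoost examples in this procedure all read MLZ.TENTDATA with the same JDBC options. As an illustration only, the options can be built in one place. The helper name `db2_jdbc_options` and the host, port, location, and credential values below are hypothetical placeholders, not values from your installation.

```python
def db2_jdbc_options(host, port, location, user, password, table="MLZ.TENTDATA"):
    """Build the keyword options passed to the Spark JDBC reader."""
    return {
        "driver": "com.ibm.db2.jcc.DB2Driver",
        "url": f"jdbc:db2://{host}:{port}/{location}",
        "user": user,
        "password": password,
        "dbtable": table,
    }


# Hypothetical values for illustration; substitute your own Db2 settings.
opts = db2_jdbc_options("db2host.example.com", 446, "LOCDB2", "myuser", "mypassword")
print(opts["url"])  # jdbc:db2://db2host.example.com:446/LOCDB2

# With an active SparkSession named spark, the dictionary can be unpacked:
# df = spark.read.format("jdbc").options(**opts).load()
```

Keeping the options in one dictionary means that a change to the Db2 location or credentials is made once rather than in each reader call.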
- To create a Scikit-learn model:
- Import Scikit-learn and PySpark packages:
import pandas
from pyspark import SparkContext
from pyspark.sql import SQLContext
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics
- Read data from the MLZ.TENTDATA table in your Db2 subsystem and list the first five rows from the DataFrame as a preview:
sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)
# Read the table through JDBC and convert the result to a pandas DataFrame
df = sqlContext.read.format("jdbc").options(
    driver='com.ibm.db2.jcc.DB2Driver',
    url='jdbc:db2://<url>:<port>/<location>',
    user='<userid>',
    password='<password>',
    dbtable='MLZ.TENTDATA').load().toPandas()
print(df.head(5))
Set <url>, <port>, <location>, <userid> and <password> to appropriate values based on the Db2 installation in your environment.
- Transform data, split the data into training and test subsets, construct the feature vector, and then use DecisionTreeClassifier to train the model:
# Encode each input column as integer labels
df['GENDER_INDEX'] = LabelEncoder().fit_transform(df['GENDER'])
df['AGE_INDEX'] = LabelEncoder().fit_transform(df['AGE'])
df['MARITAL_STATUS_INDEX'] = LabelEncoder().fit_transform(df['MARITAL_STATUS'])
df['PROFESSION_INDEX'] = LabelEncoder().fit_transform(df['PROFESSION'])
df['COUNTRY_INDEX'] = LabelEncoder().fit_transform(df['COUNTRY'])
X, y = df[["GENDER_INDEX", "AGE_INDEX", "MARITAL_STATUS_INDEX", "PROFESSION_INDEX", "COUNTRY_INDEX"]], df["TENT_LABEL"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
print("Training set has %d rows, test set has %d rows." % (X_train.shape[0], X_test.shape[0]))
pipeline = Pipeline([('clf', DecisionTreeClassifier())])
tentModelDT = pipeline.fit(X_train, y_train)
- Evaluate the model:
expected = y_test
predicted = tentModelDT.predict(X_test)
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))
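To help read the confusion matrix printed by the evaluation step, the following sketch rebuilds one by hand using only the standard library. It follows the Scikit-learn convention that rows are true labels and columns are predicted labels. The function names and the sample label lists are made up for illustration and are not taken from TENTDATA.

```python
from collections import Counter


def confusion_matrix_by_hand(expected, predicted, labels):
    """Rows are true labels, columns are predicted labels."""
    counts = Counter(zip(expected, predicted))
    return [[counts[(true, pred)] for pred in labels] for true in labels]


def accuracy(expected, predicted):
    """Fraction of rows where the prediction equals the true label."""
    hits = sum(e == p for e, p in zip(expected, predicted))
    return hits / len(expected)


# Made-up labels for illustration only
expected = [0, 0, 1, 1, 1, 2]
predicted = [0, 1, 1, 1, 0, 2]

matrix = confusion_matrix_by_hand(expected, predicted, labels=[0, 1, 2])
for row in matrix:
    print(row)
print("accuracy:", accuracy(expected, predicted))
```

The diagonal entries count correct predictions, so the accuracy equals the sum of the diagonal divided by the number of rows in the test set.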
- To create an XGBoost model:
- Import PySpark, Scikit-learn, and XGBoost packages:
import pandas
from pyspark import SparkContext
from pyspark.sql import SQLContext
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import xgboost as xgb
from sklearn.metrics import mean_squared_error
- Read data from the MLZ.TENTDATA table in your Db2 subsystem and list the first five rows from the DataFrame as a preview:
sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)
# Read the table through JDBC and convert the result to a pandas DataFrame
df = sqlContext.read.format("jdbc").options(
    driver='com.ibm.db2.jcc.DB2Driver',
    url='jdbc:db2://<url>:<port>/<location>',
    user='<userid>',
    password='<password>',
    dbtable='MLZ.TENTDATA').load().toPandas()
print(df.head(5))
Set <url>, <port>, <location>, <userid> and <password> to appropriate values based on the Db2 installation in your environment.
- Transform data, split the data into training and test subsets, construct the DMatrix objects, and then use XGBoost to train the model:
# Encode each input column as integer labels
df['GENDER_INDEX'] = LabelEncoder().fit_transform(df['GENDER'])
df['AGE_INDEX'] = LabelEncoder().fit_transform(df['AGE'])
df['MARITAL_STATUS_INDEX'] = LabelEncoder().fit_transform(df['MARITAL_STATUS'])
df['PROFESSION_INDEX'] = LabelEncoder().fit_transform(df['PROFESSION'])
df['COUNTRY_INDEX'] = LabelEncoder().fit_transform(df['COUNTRY'])
X, y = df[["GENDER_INDEX", "AGE_INDEX", "MARITAL_STATUS_INDEX", "PROFESSION_INDEX", "COUNTRY_INDEX"]], df["label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# Wrap the training and test subsets in DMatrix objects
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)
print("Training set has %d rows, test set has %d rows." % (X_train.shape[0], X_test.shape[0]))
param = {'max_depth': 2, 'eta': 1, 'silent': 1, 'objective': 'reg:linear'}
plst = param.items()
model = xgb.train(plst, dtrain)
- Evaluate the model:
expected = y_test
predicted = model.predict(dtest)
print('mse is: ' + str(mean_squared_error(expected, predicted)))
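The XGBoost evaluation reports the mean squared error (MSE), and the Spark evaluation reports its square root (RMSE). As a small worked example of what those numbers measure, the sketch below computes both by hand. The function name `mse_by_hand` and the sample values are illustrative assumptions, not output from the TENTDATA model.

```python
import math


def mse_by_hand(expected, predicted):
    """Average of the squared differences between expected and predicted values."""
    if len(expected) != len(predicted):
        raise ValueError("expected and predicted must have the same length")
    return sum((e - p) ** 2 for e, p in zip(expected, predicted)) / len(expected)


# Made-up values for illustration only
expected = [1.0, 0.0, 1.0, 0.0]
predicted = [0.5, 0.0, 1.0, 1.0]

mse = mse_by_hand(expected, predicted)  # (0.25 + 0 + 0 + 1) / 4
rmse = math.sqrt(mse)
print("mse is:", mse)
print("rmse is:", rmse)
```

Because the errors are squared before averaging, a single large miss (the last row here) dominates the result.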
- Save the model to your local file system by using the WMLz Python model utility.
- To save the Spark model:
from wmlz.ml_model_util import MLModelUtil
model_util = MLModelUtil()
# tentModelPY is the fitted Spark pipeline model created in the previous step
model_util.save(tentModelPY, "./tentModelPY.tar.gz", training_data=trainingDF)
print("Model saved to local file system successfully")
- To save the Scikit-learn model:
from wmlz.ml_model_util import MLModelUtil
model_util = MLModelUtil()
model_util.save(tentModelDT, "./tentModelDT.tar.gz", training_data=X_train, training_target=y_train)
print("Model saved to local file system successfully")
- To save the XGBoost model and the native XGBoost model schema:
from wmlz.ml_model_util import MLModelUtil
model_util = MLModelUtil()
model_util.save(model, "./XGBModel.tar.gz", dmatrix=dtrain)
print("Model saved to local file system successfully")
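Before uploading, you might want to confirm that the saved file exists and is readable. The sketch below uses only the Python standard library and assumes, based solely on the .tar.gz extension, that the saved model is a regular tar archive. The internal layout of the archive is not described here, so the helper (the name `list_model_archive` is hypothetical) only lists the member names.

```python
import tarfile
from pathlib import Path


def list_model_archive(path):
    """Return the member names of a saved model archive.

    Raises FileNotFoundError if the file is missing and ValueError if it
    cannot be read as a tar archive.
    """
    archive = Path(path)
    if not archive.is_file():
        raise FileNotFoundError(f"No such model archive: {archive}")
    if not tarfile.is_tarfile(str(archive)):
        raise ValueError(f"Not a readable tar archive: {archive}")
    # "r:*" lets tarfile detect the compression (gzip here) automatically
    with tarfile.open(str(archive), "r:*") as tar:
        return tar.getnames()


# Example usage after one of the save calls above:
# print(list_model_archive("./tentModelDT.tar.gz"))
```

A check like this catches a wrong path or a truncated file locally, before the upload in the next step.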
- Import the model from your local file system into WML for z/OS.
- Sign in to the WML for z/OS web user interface with your user name and password.
- From the side bar, go to the Model Management page, select the Models tab, and click Import model.
- Specify a name for the new model and select MLz Model for model format.
- Browse, select, and upload the Spark (tentModelPY.tar.gz), Scikit-learn (tentModelDT.tar.gz), or XGBoost (XGBModel.tar.gz) model file.
- Click Import model to import the Spark, Scikit-learn, or XGBoost model.
- Verify that the imported Spark, Scikit-learn, or XGBoost model shows up on the Models tab of the Model Management page.