Importing a Spark, Scikit-learn, or XGBoost model into WML for z/OS

WML for z/OS provides a model utility library that you can download and install into your own Python environment. You can use the library to save a Spark, Scikit-learn, or XGBoost model locally on your distributed system and then import the saved model into WML for z/OS.

Before you begin

  • Verify that the Python environment on your local system supports the XGBoost 0.90 and Scikit-learn 0.22.x releases. A version-check sketch follows this list.
  • Create the TENTDATA table and load the test data as described in Preparing data for a model in Db2 for z/OS.
  • Locate the following information:
    • User name and password for the WML for z/OS web user interface.
    • JDBC connection information, authorization ID, and password for the Db2 subsystem where the TENTDATA sample table is created.
    • IP address of the host system where your WML for z/OS metadata service runs.
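
  To confirm the package levels, you can print the installed versions. This is a minimal sketch, assuming both packages are already installed in your local Python environment:

    import sklearn
    import xgboost

    # Print the installed versions to confirm they match the supported levels.
    print("scikit-learn:", sklearn.__version__)   # expect 0.22.x
    print("XGBoost:", xgboost.__version__)        # expect 0.90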

Procedure

  1. Locate the wmlz_model_utils-2.2.202006011818-py3-none-any.whl package in the $IML_INSTALL_DIR/imlpython/iml-pkgs directory.

    As its name indicates, the wmlz_model_utils-2.2.202006011818-py3-none-any.whl package contains the WML for z/OS Python utility library.

  2. Download the package file to the local system where you run your own Python environment.
  3. Install the package into that Python environment, as shown in the sketch below.
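
    For example, assuming SSH access to the z/OS host and that pip is available on your local system. Here <host> is a placeholder for your host name and <install_dir> stands for the value of $IML_INSTALL_DIR on the host:

      scp <userid>@<host>:<install_dir>/imlpython/iml-pkgs/wmlz_model_utils-2.2.202006011818-py3-none-any.whl .
      pip install wmlz_model_utils-2.2.202006011818-py3-none-any.whl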
  4. Create a new Spark, Scikit-learn, or XGBoost model in your Python environment.
    • To create a Spark model:
      1. Import PySpark packages:
        
        from pyspark.ml import Pipeline
        from pyspark.ml.classification import LogisticRegression
        from pyspark.ml.feature import HashingTF, Tokenizer, StringIndexer, VectorAssembler
        from pyspark.ml.evaluation import RegressionEvaluator
        from pyspark.sql import SparkSession
      2. Read data from the MLZ.TENTDATA table in your Db2 subsystem and list the first five rows from the DataFrame as a preview:
        
        spark = (SparkSession.builder
            .appName("Python Spark SQL basic example")
            .master("<spark-master-URL>")
            .config("spark.some.config.option", "<value>")
            .getOrCreate())
        
        df = (spark.read.format("jdbc")
                .options(driver='com.ibm.db2.jcc.DB2Driver',
                    url='jdbc:db2://<url>:<port>/<location>',
                    user='<userid>', password='<password>',
                    dbtable='MLZ.TENTDATA')
                .load())
        
        df.show(5)

        Set <url>, <port>, <location>, <userid>, and <password> to values appropriate for the Db2 installation in your environment, and set <spark-master-URL> to the URL of your Spark master.

      3. Transform the data, split it into training and testing subsets, construct the feature vector, and then use LogisticRegression to train the model:
        
        [trainingDF, validationDF] = df.randomSplit([0.7, 0.3])
        
        genderIndexer = StringIndexer(inputCol="GENDER", outputCol="GENDER_Index")
        maritalStatusIndexer = StringIndexer(inputCol="MARITAL_STATUS", 
                  outputCol="MARITAL_STATUS_Index")
        professionIndexer = StringIndexer(inputCol="PROFESSION", outputCol="PROFESSION_Index")
        assembler = VectorAssembler(inputCols=["GENDER_Index", "MARITAL_STATUS_Index", 
                  "PROFESSION_Index", "AGE"], outputCol="features")
        
        lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
        pipeline = Pipeline(stages=[genderIndexer, maritalStatusIndexer, professionIndexer, 
                  assembler, lr])
        tentModelPY = pipeline.fit(trainingDF)
        
      4. Evaluate the model:
        
        predictions = tentModelPY.transform(validationDF)
        evaluator = RegressionEvaluator(metricName="rmse", 
                  labelCol="label", predictionCol="prediction")
        
        rmse = evaluator.evaluate(predictions)
        print("Root-mean-square error = " + str(rmse))
    • To create a Scikit-learn model:
      1. Import Scikit-learn and PySpark packages:
        
        import pandas
        from pyspark import SparkContext
        from pyspark.sql import SQLContext
        
        from sklearn import preprocessing
        from sklearn.model_selection import train_test_split
        from sklearn.pipeline import Pipeline
        from sklearn.preprocessing import LabelEncoder
        from sklearn.tree import DecisionTreeClassifier
        from sklearn import metrics
        
      2. Read data from the MLZ.TENTDATA table in your Db2 subsystem and list the first five rows from the DataFrame as a preview:
        
        sc = SparkContext.getOrCreate()
        sqlContext = SQLContext(sc)
        
        df = sqlContext.read.format("jdbc").options(
            driver='com.ibm.db2.jcc.DB2Driver',
            url='jdbc:db2://<url>:<port>/<location>',
            user='<userid>', password='<password>',
            dbtable='MLZ.TENTDATA').load().toPandas()
        
        print(df.head(5))

        Set <url>, <port>, <location>, <userid>, and <password> to values appropriate for the Db2 installation in your environment.

      3. Transform the data, split it into training and testing subsets, construct the feature vector, and then use DecisionTreeClassifier to train the model:
        
        df['GENDER_INDEX'] = LabelEncoder().fit_transform(df['GENDER'])
        df['AGE_INDEX'] = LabelEncoder().fit_transform(df['AGE'])
        df['MARITAL_STATUS_INDEX'] = LabelEncoder().fit_transform(df['MARITAL_STATUS'])
        df['PROFESSION_INDEX'] = LabelEncoder().fit_transform(df['PROFESSION'])
        df['COUNTRY_INDEX'] = LabelEncoder().fit_transform(df['COUNTRY'])
        
        X, y = df[["GENDER_INDEX", "AGE_INDEX", "MARITAL_STATUS_INDEX", "PROFESSION_INDEX", 
                "COUNTRY_INDEX"]], df["TENT_LABEL"]
        X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
        print ("Training set has %d rows, test set has %d rows." %(X_train.shape[0], X_test.shape[0]))
        
        pipeline = Pipeline([('clf',DecisionTreeClassifier())])
        tentModelDT = pipeline.fit(X_train, y_train)
        
      4. Evaluate the model:
        
        expected = y_test
        predicted = tentModelDT.predict(X_test)
        
        print(metrics.classification_report(expected, predicted))
        print(metrics.confusion_matrix(expected, predicted))
        
    • To create an XGBoost model:
      1. Import PySpark, Scikit-learn, and XGBoost packages:
        
        import pandas
        from pyspark import SparkContext
        from pyspark.sql import SQLContext
        
        from sklearn.model_selection import train_test_split
        from sklearn.preprocessing import LabelEncoder
        import xgboost as xgb
        from sklearn.metrics import mean_squared_error
      2. Read data from the MLZ.TENTDATA table in your Db2 subsystem and list the first five rows from the DataFrame as a preview:
        
        sc = SparkContext.getOrCreate()
        sqlContext = SQLContext(sc)
        
        df = sqlContext.read.format("jdbc").options(
            driver='com.ibm.db2.jcc.DB2Driver',
            url='jdbc:db2://<url>:<port>/<location>',
            user='<userid>', password='<password>',
            dbtable='MLZ.TENTDATA').load().toPandas()
        
        print(df.head(5))

        Set <url>, <port>, <location>, <userid>, and <password> to values appropriate for the Db2 installation in your environment.

      3. Transform the data, split it into training and testing subsets, construct the DMatrix, and then use XGBoost to train the model:
        
        df['GENDER_INDEX'] = LabelEncoder().fit_transform(df['GENDER'])
        df['AGE_INDEX'] = LabelEncoder().fit_transform(df['AGE'])
        df['MARITAL_STATUS_INDEX'] = LabelEncoder().fit_transform(df['MARITAL_STATUS'])
        df['PROFESSION_INDEX'] = LabelEncoder().fit_transform(df['PROFESSION'])
        df['COUNTRY_INDEX'] = LabelEncoder().fit_transform(df['COUNTRY'])
        
        X, y = df[["GENDER_INDEX", "AGE_INDEX", "MARITAL_STATUS_INDEX", "PROFESSION_INDEX", 
               "COUNTRY_INDEX"]], df["TENT_LABEL"]
        X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
        
        dtrain = xgb.DMatrix(X_train, label=y_train)
        dtest = xgb.DMatrix(X_test, label=y_test)
        print("Training set has %d rows, test set has %d rows." % (X_train.shape[0], X_test.shape[0]))
        
        param = {'max_depth': 2, 'eta': 1, 'objective': 'reg:squarederror'}
        
        model = xgb.train(param, dtrain)
      4. Evaluate the model:
        
        expected = y_test
        predicted = model.predict(dtest)
        
        print('mse is: ' + str(mean_squared_error(expected, predicted)))
  5. Save the model to your local file system by using the WMLz Python model utility. An optional check of the saved archive follows these examples.
    • To save the Spark model:
      
      from wmlz.ml_model_util import MLModelUtil
      
      model_util = MLModelUtil()
      model_util.save(tentModelPY,
          "./tentModelPY.tar.gz",
          training_data=trainingDF)
      print("Model saved to local file system successfully")
    • To save the Scikit-learn model:
      
      from wmlz.ml_model_util import MLModelUtil
      
      model_util = MLModelUtil()
      model_util.save(tentModelDT,
          "./tentModelDT.tar.gz",
          training_data=X_train,
          training_target=y_train)
      print("Model saved to local file system successfully")
    • To save the XGBoost model and the native XGBoost model schema:
      
      from wmlz.ml_model_util import MLModelUtil
      
      model_util = MLModelUtil()
      model_util.save(model,
          "./XGBModel.tar.gz",
          dmatrix=dtrain)
      print("Model saved to local file system successfully")
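
    Optionally, you can confirm that the archive was written by listing its contents with Python's standard tarfile module. This check is not part of the WMLz utility; the file name below matches the XGBoost save example and can be replaced with tentModelPY.tar.gz or tentModelDT.tar.gz:

      import tarfile

      # List the files inside the saved model archive as a sanity check.
      with tarfile.open("./XGBModel.tar.gz", "r:gz") as archive:
          for name in archive.getnames():
              print(name)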
  6. Import the model from your local file system into WML for z/OS.
    1. Sign in to the WML for z/OS web user interface with your user name and password.
    2. From the side bar, go to the Model Management page, select the Models tab, and click Import model.
    3. Specify a name for the new model and select MLz Model as the model format.
    4. Browse, select, and upload the Spark (tentModelPY.tar.gz), Scikit-learn (tentModelDT.tar.gz), or XGBoost (XGBModel.tar.gz) model file.
    5. Click Import model to import the Spark, Scikit-learn, or XGBoost model.
  7. Verify that the imported Spark, Scikit-learn, or XGBoost model appears on the Models tab of the Model Management page.