Developing a model in the integrated Notebook Editor

Jupyter Notebook is a popular open source application for writing and executing code for data exploration and machine learning modeling. WML for z/OS integrates and enhances this easy-to-use interface so that you can develop, train, evaluate, and save a model directly in the Notebook Editor.

Before you begin

  • Create the TENTDATA table and load the test data as described in Preparing data for a model in Db2 for z/OS.
  • Locate the following information:
    • User name and password for the WML for z/OS web user interface.
    • JDBC connection information, authorization ID, and password for the Db2 subsystem where the TENTDATA sample table is created.
    • IP address of the host system where your WML for z/OS metadata service runs.

Procedure

  1. Sign in to the WML for z/OS web user interface with your user name and password.
  2. From the sidebar, navigate to Projects > View all Projects.
  3. If the Tent-Example-Project project does not already exist, click Create Project. Enter Tent-Example-Project as the project name and click Create.
    The new project opens to the overview page with links to Assets, Environments, Data Sources, and Collaborators.
  4. Click Assets (shown as a count followed by the label) to open the All view of the assets. The assets are grouped by type, such as Notebooks, RStudios, Models, SPSS Modeler Flows, and Data Sets, each in its own section and tab. The same user actions for an asset type are available both in the section and on the tab.
  5. In the Data Sets section, click add data set to create a data source for the new project. Verify that the new data set appears in the list.
  6. Add a new notebook that uses the Scala library. You can create a new notebook from scratch or by importing an existing notebook file.
    1. In the Notebooks section, click add notebooks to create a new notebook.
    2. Enter a name for the notebook, such as Tent-Notebook. Select Scala for Language. Click Create.
      The notebook is saved and opens in the Notebook Editor.
    3. Select the new notebook and then Insert project context from the ACTIONS menu.
    4. In the notebook, enter the following sample Scala code, and click Run cell at each step to train, evaluate, and save the model:
      1. Import z/OS Spark and WML for z/OS packages:
        
        import org.apache.spark.ml.feature.{StringIndexer, IndexToString,
            VectorIndexer, VectorAssembler}
        import org.apache.spark.ml.classification.LogisticRegression
        import org.apache.spark.ml.{Pipeline, PipelineStage}
        import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
        import com.ibm.analytics.ngp.repository_v3.{MetaNames, _}
      2. Read data from the MLZ.TENTDATA table in your Db2 for z/OS subsystem, split the data into groups for training, testing, and validation, and list the first five rows from the training group as a preview:
        // Read the MLZ.TENTDATA table over JDBC
        val df = spark.read.format("jdbc").options(Map(
            "driver" -> "com.ibm.db2.jcc.DB2Driver",
            "url" -> "jdbc:db2://<url>:<port>/<location>",
            "user" -> "<userid>", "password" -> "<password>",
            "dbtable" -> "MLZ.TENTDATA")).load()
        
        // Split the data 80/10/10 into training, test, and validation sets
        val train = 80
        val test = 10
        val validate = 10
        
        val splits = df.randomSplit(
            Array(train / 100.0, test / 100.0, validate / 100.0))
        
        val trainDF = splits(0)
        val testDF = splits(1)
        val validateDF = splits(2)
        
        trainDF.cache()
        trainDF.show(5)
        

        Where <url> and <port> are the IP address and port number of your Db2 host system, <location> is the location name of your Db2 installation, and <userid> and <password> are your Db2 authorization ID and password.

      3. Transform data, construct the feature vector, and then train the model using logistic regression:
        
        // Index the categorical columns
        val genderIndexer = new StringIndexer().
            setInputCol("GENDER").setOutputCol("GENDER_INDEX")
        val maritalStatusIndexer = new StringIndexer().
            setInputCol("MARITAL_STATUS").setOutputCol("MARITAL_STATUS_INDEX")
        val professionIndexer = new StringIndexer().
            setInputCol("PROFESSION").setOutputCol("PROFESSION_INDEX")
        // Assemble the indexed columns and AGE into a feature vector
        val assembler = new VectorAssembler().
            setInputCols(Array("GENDER_INDEX", "MARITAL_STATUS_INDEX",
            "PROFESSION_INDEX", "AGE")).setOutputCol("features")
        val lr = new LogisticRegression().setMaxIter(500).
            setLabelCol("TENT_LABEL")
        
        val pipeline = new Pipeline().setStages(Array(genderIndexer,
            maritalStatusIndexer, professionIndexer, assembler, lr))
        val tentModel = pipeline.fit(trainDF)
        
        println(tentModel)
      4. Evaluate the model:
        
        // The evaluator must use the same label column the model was trained with
        val evaluator = new BinaryClassificationEvaluator().
            setLabelCol("TENT_LABEL")
        val metrics = evaluator.evaluate(tentModel.transform(testDF))
        println("BinaryClassifier Evaluator: " + metrics)
      5. Save the model.

        You can specify the MetaNames.SCOPE parameter to indicate where you want to save the model. Set the parameter to "system" (default) to save the model directly to the repository service. Otherwise, set it to "project" to save the model to your local project. You can publish the model to the repository service later.

         
        val client = MLRepositoryClient(metaService)
        client.authorize(authToken)
        val mlRepositoryArtifact = MLRepositoryArtifact(tentModel,trainDF,
            "tentModel",
            MetaNames.DESCRIPTION -> "Tent Model", 
            MetaNames.LABEL_FIELD -> "TENT_LABEL",
            MetaNames.MODEL_META_PROJECT_ID -> projectName,
            MetaNames.MODEL_META_ORIGIN_ID -> notebookName,
            MetaNames.MODEL_META_ORIGIN_TYPE -> "notebook",
            MetaNames.SCOPE -> "system")
        client.models.save(mlRepositoryArtifact)
        println("model saved successfully")
    5. Verify that the tentModel model shows up under the Models tab on the Model Management page.
  7. Add a new notebook that uses the Scikit-learn library.
    1. In the Notebooks section, click add notebooks to add a new notebook.
    2. Enter a name for the notebook and select Python for Language. Click Create. The notebook is saved and opens in the Notebook Editor.
    3. Select the new notebook and then Insert project context from the ACTIONS menu.
    4. Enter the required Python code in the notebook, as shown in the following example, and click Run cell at each step in the following sequence.
      1. Import PySpark and WML for z/OS packages:
        
        import pandas
        from pyspark import SparkContext
        from pyspark.sql import SQLContext
        
        from sklearn import preprocessing
        from sklearn.model_selection import train_test_split
        from sklearn.pipeline import Pipeline
        from sklearn.preprocessing import LabelEncoder
        from sklearn.tree import DecisionTreeClassifier
        from sklearn import metrics
        
        from repository_v3.mlrepository import MetaNames
        from repository_v3.mlrepository import MetaProps
        from repository_v3.mlrepositoryclient import MLRepositoryClient
        from repository_v3.mlrepositoryartifact import MLRepositoryArtifact
      2. Read data from the MLZ.TENTDATA table in your Db2 subsystem and list the first five rows from the DataFrame as a preview:
        
        sc = SparkContext.getOrCreate()
        # Initialize SparkSQL Context
        sqlContext = SQLContext(sc)
        
        df = sqlContext.read.format("jdbc").options(
            driver='com.ibm.db2.jcc.DB2Driver',
            url='jdbc:db2://<url>:<port>/<location>',
            user='<userid>', password='<password>',
            dbtable='MLZ.TENTDATA').load().toPandas()
        
        print(df.head(5))
        

        Set <url>, <port>, <location>, <userid> and <password> to appropriate values based on the Db2 installation in your environment.
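
        For example, with hypothetical connection values (Db2 host 203.0.113.10 on port 446, location DALLASDB, and user MLZUSER), the read options might look like this:

        # Hypothetical values shown for illustration only; use the
        # connection details of your own Db2 subsystem
        df = sqlContext.read.format("jdbc").options(
            driver='com.ibm.db2.jcc.DB2Driver',
            url='jdbc:db2://203.0.113.10:446/DALLASDB',
            user='MLZUSER', password='<password>',
            dbtable='MLZ.TENTDATA').load().toPandas()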

      3. Transform data, split the data into training and test subsets, construct the feature vector, and then use DecisionTreeClassifier to train a model:
        
        df['GENDER_INDEX'] = LabelEncoder().fit_transform(df['GENDER'])
        df['AGE_INDEX'] = LabelEncoder().fit_transform(df['AGE'])
        df['MARITAL_STATUS_INDEX'] = LabelEncoder().fit_transform(df['MARITAL_STATUS'])
        df['PROFESSION_INDEX'] = LabelEncoder().fit_transform(df['PROFESSION'])
        df['COUNTRY_INDEX'] = LabelEncoder().fit_transform(df['COUNTRY'])
        
        X, y = df[["GENDER_INDEX", "AGE_INDEX", "MARITAL_STATUS_INDEX", "PROFESSION_INDEX", 
              "COUNTRY_INDEX"]], df["TENT_LABEL"]
        X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
        print ("Training set has %d rows, test set has %d rows." %(X_train.shape[0], 
              X_test.shape[0]))
        
        pipeline = Pipeline([('clf',DecisionTreeClassifier())])
        tentModelDT = pipeline.fit(X_train, y_train)
      4. Evaluate the model:
        
        # make predictions
        expected = y_test
        predicted = tentModelDT.predict(X_test)
        # summarize the fit of the model
        print(metrics.classification_report(expected, predicted))
        print(metrics.confusion_matrix(expected, predicted))
      5. Save the model. As shown in the following example, specify the training_data and training_target parameters to save the Scikit-learn model schema. If the training data is a DataFrame, WML for z/OS uses the column names of the DataFrame. Otherwise, it generates and uses default column names for the data. If you want to set the column names yourself, specify the feature_names and label_column_names parameters when creating the MLRepositoryArtifact object, as shown in the sketch after the example.
        
        client = MLRepositoryClient(metaService)
        client.authorize_with_token(authToken)
        props1 = MetaProps({MetaNames.AUTHOR_NAME: "author",
              MetaNames.AUTHOR_EMAIL: "author@example.com",
              MetaNames.MODEL_META_PROJECT_ID: projectName,
              MetaNames.MODEL_META_ORIGIN_TYPE: "notebook",
              MetaNames.MODEL_META_ORIGIN_ID: notebookName,
              MetaNames.SCOPE: "system" })
        input_artifact = MLRepositoryArtifact(tentModelDT, 
              name="tentModelDT", meta_props=props1, 
              training_data = X_train, training_target = y_train)
        client.models.save(artifact=input_artifact)
        print("model saved successfully")

        You can specify the MetaNames.SCOPE parameter to indicate where you want to save the model. Set the parameter to "system" (default) to save the model directly to the repository service. Otherwise, set it to "project" to save the model to your local project. You can publish the model to the repository service later.

        By default, the model is saved as a Scikit-learn model. Optionally, specify the MetaNames.SAVE_TYPE parameter and set it to "PMML" if you want to convert the model to PMML format. The model is saved as a PMML model if the conversion is successful. If the conversion fails due to a memory error, resolve the error and try again. See Resolving the out of memory error when converting a model from Python to PMML for instructions.
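
        For example, a MetaProps variant that saves the model to your local project and requests conversion to PMML might look like the following sketch; only the SCOPE and SAVE_TYPE settings differ from the save example above:

        # Sketch: save to the local project and request PMML conversion
        props2 = MetaProps({MetaNames.AUTHOR_NAME: "author",
              MetaNames.AUTHOR_EMAIL: "author@example.com",
              MetaNames.MODEL_META_PROJECT_ID: projectName,
              MetaNames.MODEL_META_ORIGIN_TYPE: "notebook",
              MetaNames.MODEL_META_ORIGIN_ID: notebookName,
              MetaNames.SCOPE: "project",
              MetaNames.SAVE_TYPE: "PMML"})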

    5. Verify that the tentModelDT model shows up under the Models tab on the Model Management page.
  8. Add a new notebook that uses the XGBoost library.
    1. In the Notebooks section, click add notebooks to add a new notebook.
    2. Enter a name for the notebook and select Python for Language. Click Create to save the notebook and open it in the Notebook Editor.
    3. Select the new notebook and then Insert project context from the ACTIONS menu.
    4. Enter the required Python code in the notebook, as shown in the following example, and click Run cell at each step in the following sequence.
      1. Import PySpark and WML for z/OS packages:
        
        import pandas
        from pyspark import SparkContext
        from pyspark.sql import SQLContext
        
        from sklearn.model_selection import train_test_split
        from sklearn.preprocessing import LabelEncoder
        import xgboost as xgb
        from sklearn.metrics import mean_squared_error
        
        from repository_v3.mlrepository import MetaNames
        from repository_v3.mlrepository import MetaProps
        from repository_v3.mlrepositoryclient import MLRepositoryClient
        from repository_v3.mlrepositoryartifact import MLRepositoryArtifact
        
      2. Read data from the MLZ.TENTDATA table in your Db2 subsystem and list the first five rows from the DataFrame as a preview:
        
        sc = SparkContext.getOrCreate()
        # Initialize SparkSQL Context
        sqlContext = SQLContext(sc)
        
        df = sqlContext.read.format("jdbc").options(
            driver='com.ibm.db2.jcc.DB2Driver',
            url='jdbc:db2://<url>:<port>/<location>',
            user='<userid>', password='<password>',
            dbtable='MLZ.TENTDATA').load().toPandas()
        
        print(df.head(5))
        

        Set <url>, <port>, <location>, <userid> and <password> to appropriate values based on the Db2 installation in your environment.

      3. Transform data, split the data into subsets for training and testing, construct the DMatrix, and then use XGBoost to train the model:
        
        df['GENDER_INDEX'] = LabelEncoder().fit_transform(df['GENDER'])
        df['AGE_INDEX'] = LabelEncoder().fit_transform(df['AGE'])
        df['MARITAL_STATUS_INDEX'] = LabelEncoder().fit_transform(df['MARITAL_STATUS'])
        df['PROFESSION_INDEX'] = LabelEncoder().fit_transform(df['PROFESSION'])
        df['COUNTRY_INDEX'] = LabelEncoder().fit_transform(df['COUNTRY'])
        
        X, y = df[["GENDER_INDEX", "AGE_INDEX", "MARITAL_STATUS_INDEX", "PROFESSION_INDEX", 
              "COUNTRY_INDEX"]], df["TENT_LABEL"]
        X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
        
        # Load dataset into DMatrix for native XGBoost model
        dtrain = xgb.DMatrix(X_train, label=y_train)
        dtest = xgb.DMatrix(X_test, label=y_test)
        print ("Training set has %d rows, test set has %d rows." %(X_train.shape[0], 
              X_test.shape[0]))
        
        # Specify Booster parameters
        param = {'max_depth': 2, 'eta': 1, 'silent': 1, 'objective': 'reg:linear'}
        
        # Train native XGBoost model
        plst = list(param.items())
        model = xgb.train(plst, dtrain)
        
      4. Evaluate the model:
        
        # make predictions
        expected = y_test
        predicted = model.predict(dtest)
        # summarize the fit of the model
        print('mse is: ' + str(mean_squared_error(expected, predicted)))
        
      5. Save the model. As shown in the following example, specify the dmatrix parameter to save the native XGBoost model schema.
        
        client = MLRepositoryClient(metaService)
        client.authorize_with_token(authToken)
        props1 = MetaProps({MetaNames.AUTHOR_NAME: "author",
              MetaNames.AUTHOR_EMAIL: "author@example.com",
              MetaNames.MODEL_META_PROJECT_ID: projectName,
              MetaNames.MODEL_META_ORIGIN_TYPE: "notebook",
              MetaNames.MODEL_META_ORIGIN_ID: notebookName,
              MetaNames.SCOPE: "system" })
        input_artifact = MLRepositoryArtifact(model, name="XGBModel2", 
              meta_props=props1, dmatrix=dtrain)
        client.models.save(artifact=input_artifact)
        print("model saved successfully")
        
        

        You can specify the MetaNames.SCOPE parameter to indicate where you want to save the model. Set the parameter to "system" (default) to save the model directly to the repository service. Otherwise, set it to "project" to save the model to your local project. You can publish the model to the repository service later.

        By default, the model is saved as an XGBoost model. Optionally, specify the MetaNames.SAVE_TYPE parameter and set it to "PMML" if you want to convert the model to PMML format. The model is saved as a PMML model if the conversion is successful. If the conversion fails due to a memory error, resolve the error and try again. See Resolving the out of memory error when converting a model from Python to PMML for instructions.

    5. Verify that the XGBModel2 model shows up under the Models tab on the Model Management page.
  9. Add a new notebook that uses the PySpark library.
    1. In the Notebooks section, click add notebooks to add a new notebook.
    2. Enter a name for the notebook and select Python for Language. Click Create. The notebook is saved and opens in the Notebook Editor.
    3. Select the new notebook and then Insert project context from the ACTIONS menu.
    4. Enter the required Python code in the notebook, as shown in the following example, and click Run cell at each step in the following sequence.
      1. Import PySpark and WML for z/OS packages:
        
        from pyspark.ml import Pipeline
        from pyspark.ml.classification import LogisticRegression
        from pyspark.ml.feature import (HashingTF, Tokenizer, StringIndexer,
             VectorAssembler)
        from pyspark.ml.evaluation import RegressionEvaluator
        from pyspark.sql import SparkSession
        from repository_v3.mlrepository import MetaNames
        from repository_v3.mlrepository import MetaProps
        from repository_v3.mlrepositoryclient import MLRepositoryClient
        from repository_v3.mlrepositoryartifact import MLRepositoryArtifact
      2. Create the Spark session object and specify the URL of the Spark master:
        
        spark = SparkSession.builder.appName(
            "Python Spark SQL basic example").master(
            "<spark-master-URL>").config("spark.some.config.option",
            "<some-value>").getOrCreate()

        Set <spark-master-URL> and <some-value> to appropriate values for the Spark installation in your environment.
        
      3. Read data from the MLZ.TENTDATA table in your Db2 subsystem. List the first five rows from the DataFrame as a preview:
        
        df = spark.read.format("jdbc").options(
            driver='com.ibm.db2.jcc.DB2Driver',
            url='jdbc:db2://<url>:<port>/<location>',
            user='<userid>', password='<password>',
            dbtable='MLZ.TENTDATA'
            ).load()
        df.show(5)

        Set <url>, <port>, <location>, <userid> and <password> to appropriate values based on the Db2 installation in your environment.

      4. Transform data, split the data into training and validation datasets, construct the feature vector, and then use LogisticRegression to train a model:
        
        [trainingDF, validationDF] = df.randomSplit([0.7, 0.3])
        genderIndexer = StringIndexer(inputCol="GENDER", 
            outputCol="GENDER_Index")
        maritalStatusIndexer = StringIndexer(inputCol="MARITAL_STATUS", 
            outputCol="MARITAL_STATUS_Index")
        professionIndexer = StringIndexer(inputCol="PROFESSION", 
            outputCol="PROFESSION_Index")
        assembler = VectorAssembler(inputCols=["GENDER_Index", 
            "MARITAL_STATUS_Index", "PROFESSION_Index", "AGE"],
            outputCol="features")
        lr = LogisticRegression(maxIter=10, regParam=0.3, 
            elasticNetParam=0.8, labelCol="TENT_LABEL")
        pipeline = Pipeline(stages=[genderIndexer, 
            maritalStatusIndexer, professionIndexer, assembler, lr])
        tentModelPY = pipeline.fit(trainingDF)
        
      5. Evaluate the model:
        
        predictions = tentModelPY.transform(validationDF)
        evaluator = RegressionEvaluator(metricName="rmse", 
            labelCol="TENT_LABEL", predictionCol="prediction")
        
        rmse = evaluator.evaluate(predictions)
        print("Root-mean-square error = " + str(rmse))
      6. Save the model. As shown in the following example, specify the training_data parameter to save the PySpark model schema. Because the training data is a DataFrame, WML for z/OS uses the column names of the DataFrame.
        
        client = MLRepositoryClient(metaService)
        client.authorize_with_token(authToken)
        props1 = MetaProps({MetaNames.AUTHOR_NAME: "author",
              MetaNames.AUTHOR_EMAIL: "author@example.com",
              MetaNames.MODEL_META_PROJECT_ID: projectName,
              MetaNames.MODEL_META_ORIGIN_TYPE: "notebook",
              MetaNames.MODEL_META_ORIGIN_ID: notebookName,
              MetaNames.SCOPE: "system"})
        input_artifact = MLRepositoryArtifact(tentModelPY, 
              name="tentModelPY", meta_props=props1, 
              training_data = trainingDF)
        client.models.save(artifact=input_artifact)
        print("model saved successfully")
        

        You can specify the MetaNames.SCOPE parameter to indicate where you want to save the model. Set the parameter to "system" (default) to save the model directly to the repository service. Otherwise, set it to "project" to save the model to your local project. You can publish the model to the repository service later.

    5. Verify that the tentModelPY model shows up under the Models tab on the Model Management page.