Developing a model in the integrated Notebook Editor
Jupyter Notebook is a popular open source application for writing and executing code for data exploration and machine learning modeling. WML for z/OS integrates Jupyter Notebook and enhances its easy-to-use interface, with which you can develop, train, and evaluate a model.
Before you begin
- Create the TENTDATA table and load the test data as described in Preparing data for a model in Db2 for z/OS.
- Locate the following information:
- User name and password for the WML for z/OS web user interface.
- JDBC connection information, authorization ID, and password for the Db2 subsystem where the TENTDATA sample table is created.
- IP address of the host system where your WML for z/OS metadata service runs.
Procedure
- Sign in to the WML for z/OS web user interface with your user name and password.
- From the sidebar, navigate to Projects > View all Projects.
- If the Tent-Example-Project project does not already exist, click Create Project, enter Tent-Example-Project as the project name, and click Create.
The new project opens to the overview page with links to Assets, Environments, Data Sources, and Collaborators.
- Click Assets (the link shows the number of assets) to open the All view of the assets. The assets are grouped by type, such as Notebooks, RStudios, Models, SPSS Modeler Flows, and Data Sets, each with its own section and tab. The same user actions for an asset type are available in the section and on the tab.
- In the Data Sets section, click add data set to create a data source for the new project. Make sure that the new data set shows up on the list.
- Add a new notebook that uses the Scala library. You can create a new notebook from scratch or by importing an existing notebook file.
- In the Notebooks section, click add notebooks to create a new notebook.
- Enter a name for the notebook, such as Tent-Notebook. Select Scala for Language. Click Create. The notebook is saved and opens in the Notebook Editor.
- Select the new notebook and then select Insert project context from the ACTIONS menu. The inserted project context cell defines variables such as metaService, authToken, projectName, and notebookName, which the save step below uses.
- In the notebook, enter the following sample Scala code, and click Run cell at each step to train, evaluate, and save the model:
- Import z/OS Spark and WML for z/OS packages:
    import org.apache.spark.ml.feature.{StringIndexer, IndexToString, VectorIndexer, VectorAssembler}
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.{Pipeline, PipelineStage}
    import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
    import com.ibm.analytics.ngp.repository_v3.{MetaNames, _}
- Read data from the MLZ.TENTDATA table in your Db2 for z/OS subsystem, split the data into groups for training, testing, and validation, and list the first five rows from the training group as a preview:
    val df = spark.read.format("jdbc").options(Map(
      "driver" -> "com.ibm.db2.jcc.DB2Driver",
      "url" -> "jdbc:db2://<url>:<port>/<location>",
      "user" -> "<userid>",
      "password" -> "<password>",
      "dbtable" -> "MLZ.TENTDATA")).load()

    val train = 80
    val test = 10
    val validate = 10
    val splits = df.randomSplit(Array(train / 100.0, test / 100.0, validate / 100.0))
    val trainDF = splits(0)
    val testDF = splits(1)
    val validateDF = splits(2)
    trainDF.cache()
    trainDF.show(5)
Here, <url> and <port> are the IP address and port number of your Db2 host system, <location> is the location name of your Db2 subsystem, and <userid> and <password> are your Db2 authorization ID and password.
- Transform data, construct the feature vector, and then train the model using logistic regression:
    val genderIndexer = new StringIndexer().setInputCol("GENDER").setOutputCol("GENDER_INDEX")
    val maritalStatusIndexer = new StringIndexer().setInputCol("MARITAL_STATUS").setOutputCol("MARITAL_STATUS_INDEX")
    val professionIndexer = new StringIndexer().setInputCol("PROFESSION").setOutputCol("PROFESSION_INDEX")
    val assembler = new VectorAssembler().setInputCols(Array("GENDER_INDEX", "MARITAL_STATUS_INDEX", "PROFESSION_INDEX", "AGE")).setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(500).setLabelCol("TENT_LABEL")
    val pipeline = new Pipeline().setStages(Array(genderIndexer, maritalStatusIndexer, professionIndexer, assembler, lr))
    val tentModel = pipeline.fit(trainDF)
    print(tentModel)
- Evaluate the model:
    val evaluator = new BinaryClassificationEvaluator().setLabelCol("TENT_LABEL")
    val metrics = evaluator.evaluate(tentModel.transform(testDF))
    println("BinaryClassifier Evaluator: " + metrics)
- Save the model.
You can specify the MetaNames.SCOPE parameter to indicate where you want to save the model. Set the parameter to "system" (default) to save the model directly to the repository service. Otherwise, set it to "project" to save the model to your local project; you can publish the model to the repository service later.

    val client = MLRepositoryClient(metaService)
    client.authorize(authToken)
    val mlRepositoryArtifact = MLRepositoryArtifact(tentModel, trainDF, "tentModel",
      MetaNames.DESCRIPTION -> "Tent Model",
      MetaNames.LABEL_FIELD -> "TENT_LABEL",
      MetaNames.MODEL_META_PROJECT_ID -> projectName,
      MetaNames.MODEL_META_ORIGIN_ID -> notebookName,
      MetaNames.MODEL_META_ORIGIN_TYPE -> "notebook",
      MetaNames.SCOPE -> "system")
    client.models.save(mlRepositoryArtifact)
    println("model saved successfully")
- Verify that the tentModel model shows up under the Models tab on the Model Management page.
- Add a new notebook that uses the Scikit-learn library.
- In the Notebooks section, click add notebooks to add a new notebook.
- Enter a name for the notebook and select Python for Language. Click Create. The notebook is saved and opens in the Notebook Editor.
- Select the new notebook and then select Insert project context from the ACTIONS menu.
- In the notebook, enter the required Python code as shown in the following example, and click Run cell at each step in the following sequence:
- Import PySpark and WML for z/OS packages:
    import pandas
    from pyspark import SparkContext
    from pyspark.sql import SQLContext
    from sklearn import preprocessing
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import LabelEncoder
    from sklearn.tree import DecisionTreeClassifier
    from sklearn import metrics
    from repository_v3.mlrepository import MetaNames
    from repository_v3.mlrepository import MetaProps
    from repository_v3.mlrepositoryclient import MLRepositoryClient
    from repository_v3.mlrepositoryartifact import MLRepositoryArtifact
- Read data from the MLZ.TENTDATA table in your Db2 subsystem and list the first five rows from the DataFrame as a preview:

    sc = SparkContext.getOrCreate()
    # Initialize SparkSQL Context
    sqlContext = SQLContext(sc)
    df = sqlContext.read.format("jdbc").options(
        driver='com.ibm.db2.jcc.DB2Driver',
        url='jdbc:db2://<url>:<port>/<location>',
        user='<userid>',
        password='<password>',
        dbtable='MLZ.TENTDATA').load().toPandas()
    print(df.head(5))
Set <url>, <port>, <location>, <userid> and <password> to appropriate values based on the Db2 installation in your environment.
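For example, with purely hypothetical connection values (the host name, port, location, user ID, and password below are invented for illustration), the read might look like this:

    # Hypothetical connection values for illustration only; substitute the
    # values for your own Db2 subsystem.
    df = sqlContext.read.format("jdbc").options(
        driver='com.ibm.db2.jcc.DB2Driver',
        url='jdbc:db2://db2host.example.com:5035/DALLASDB',  # <url>:<port>/<location>
        user='MLZUSER',          # Db2 authorization ID
        password='mypassword',   # Db2 password
        dbtable='MLZ.TENTDATA').load().toPandas()
    print(df.head(5))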
- Transform the data, split it into training and test subsets, construct the feature vector, and then use DecisionTreeClassifier to train a model:

    df['GENDER_INDEX'] = LabelEncoder().fit_transform(df['GENDER'])
    df['AGE_INDEX'] = LabelEncoder().fit_transform(df['AGE'])
    df['MARITAL_STATUS_INDEX'] = LabelEncoder().fit_transform(df['MARITAL_STATUS'])
    df['PROFESSION_INDEX'] = LabelEncoder().fit_transform(df['PROFESSION'])
    df['COUNTRY_INDEX'] = LabelEncoder().fit_transform(df['COUNTRY'])
    X, y = df[["GENDER_INDEX", "AGE_INDEX", "MARITAL_STATUS_INDEX", "PROFESSION_INDEX", "COUNTRY_INDEX"]], df["TENT_LABEL"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    print("Training set has %d rows, test set has %d rows." % (X_train.shape[0], X_test.shape[0]))
    pipeline = Pipeline([('clf', DecisionTreeClassifier())])
    tentModelDT = pipeline.fit(X_train, y_train)
- Evaluate the model:
    # make predictions
    expected = y_test
    predicted = tentModelDT.predict(X_test)
    # summarize the fit of the model
    print(metrics.classification_report(expected, predicted))
    print(metrics.confusion_matrix(expected, predicted))
- Save the model. As shown in the following example, specify the training_data and training_target parameters to save the Scikit-learn model schema. If the saved data type is DataFrame, WML for z/OS uses the column names of the DataFrame. Otherwise, it generates and uses default column names for the data. If you want to save the column names yourself, specify the feature_names and label_column_names parameters when creating the MLRepositoryArtifact object, as shown in the sketch after this step.

    client = MLRepositoryClient(metaService)
    client.authorize_with_token(authToken)
    props1 = MetaProps({MetaNames.AUTHOR_NAME: "author",
        MetaNames.AUTHOR_EMAIL: "author@example.com",
        MetaNames.MODEL_META_PROJECT_ID: projectName,
        MetaNames.MODEL_META_ORIGIN_TYPE: "notebook",
        MetaNames.MODEL_META_ORIGIN_ID: notebookName,
        MetaNames.SCOPE: "system"})
    input_artifact = MLRepositoryArtifact(tentModelDT, name="tentModelDT",
        meta_props=props1,
        training_data=X_train, training_target=y_train)
    client.models.save(artifact=input_artifact)
    print("model saved successfully")
You can specify the MetaNames.SCOPE parameter to indicate where you want to save the model. Set the parameter to "system" (default) to save the model directly to the repository service. Otherwise, set it to "project" to save the model to your local project; you can publish the model to the repository service later.

By default, the model is saved as a Scikit-learn model. Optionally, specify the MetaNames.SAVE_TYPE parameter and set it to "PMML" if you want to convert the model to PMML format. The model is saved as a PMML model if the conversion is successful. If the conversion fails due to a memory error, resolve the error and try again. See Resolving the out of memory error when converting a model from Python to PMML for instructions.
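For illustration, the following sketch combines these options: it saves the same pipeline to the local project instead of the repository service and supplies explicit column names. The feature_names and label_column_names parameter names come from the description above; passing them as lists of column names is an assumption, not verified API usage.

    # Hedged sketch: save to the local project ("project" scope) with explicit
    # column names. The list form of feature_names/label_column_names is assumed.
    props2 = MetaProps({MetaNames.AUTHOR_NAME: "author",
        MetaNames.AUTHOR_EMAIL: "author@example.com",
        MetaNames.MODEL_META_PROJECT_ID: projectName,
        MetaNames.MODEL_META_ORIGIN_TYPE: "notebook",
        MetaNames.MODEL_META_ORIGIN_ID: notebookName,
        MetaNames.SCOPE: "project"})    # keep the model in the local project for now
    named_artifact = MLRepositoryArtifact(tentModelDT, name="tentModelDT",
        meta_props=props2,
        training_data=X_train, training_target=y_train,
        feature_names=["GENDER_INDEX", "AGE_INDEX", "MARITAL_STATUS_INDEX",
                       "PROFESSION_INDEX", "COUNTRY_INDEX"],
        label_column_names=["TENT_LABEL"])
    client.models.save(artifact=named_artifact)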
- Verify that the tentModelDT model shows up under the Models tab on the Model Management page.
- Add a new notebook that uses the XGBoost library.
- In the Notebooks section, click add notebooks to add a new notebook.
- Enter a name for the notebook and select Python for Language. Click Create to save the notebook and open it in the Notebook Editor.
- Select the new notebook and then select Insert project context from the ACTIONS menu.
- In the notebook, enter the required Python code as shown in the following example, and click Run cell at each step in the following sequence:
- Import PySpark and WML for z/OS packages:
    import pandas
    from pyspark import SparkContext
    from pyspark.sql import SQLContext
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import LabelEncoder
    import xgboost as xgb
    from sklearn.metrics import mean_squared_error
    from repository_v3.mlrepository import MetaNames
    from repository_v3.mlrepository import MetaProps
    from repository_v3.mlrepositoryclient import MLRepositoryClient
    from repository_v3.mlrepositoryartifact import MLRepositoryArtifact
- Read data from the MLZ.TENTDATA table in your Db2 subsystem and list the first five rows from the DataFrame as a preview:

    sc = SparkContext.getOrCreate()
    # Initialize SparkSQL Context
    sqlContext = SQLContext(sc)
    df = sqlContext.read.format("jdbc").options(
        driver='com.ibm.db2.jcc.DB2Driver',
        url='jdbc:db2://<url>:<port>/<location>',
        user='<userid>',
        password='<password>',
        dbtable='MLZ.TENTDATA').load().toPandas()
    print(df.head(5))
Set <url>, <port>, <location>, <userid> and <password> to appropriate values based on the Db2 installation in your environment.
- Transform the data, split it into subsets for training and testing, construct the DMatrix, and then use XGBoost to train the model:

    df['GENDER_INDEX'] = LabelEncoder().fit_transform(df['GENDER'])
    df['AGE_INDEX'] = LabelEncoder().fit_transform(df['AGE'])
    df['MARITAL_STATUS_INDEX'] = LabelEncoder().fit_transform(df['MARITAL_STATUS'])
    df['PROFESSION_INDEX'] = LabelEncoder().fit_transform(df['PROFESSION'])
    df['COUNTRY_INDEX'] = LabelEncoder().fit_transform(df['COUNTRY'])
    X, y = df[["GENDER_INDEX", "AGE_INDEX", "MARITAL_STATUS_INDEX", "PROFESSION_INDEX", "COUNTRY_INDEX"]], df["TENT_LABEL"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    # Load dataset into DMatrix for native XGBoost model
    dtrain = xgb.DMatrix(X_train, label=y_train)
    dtest = xgb.DMatrix(X_test, label=y_test)
    print("Training set has %d rows, test set has %d rows." % (X_train.shape[0], X_test.shape[0]))
    # Specify Booster parameters
    param = {'max_depth': 2, 'eta': 1, 'silent': 1, 'objective': 'reg:linear'}
    # Train native XGBoost model
    plst = list(param.items())
    model = xgb.train(plst, dtrain)
- Evaluate the model:
    # make predictions
    expected = y_test
    predicted = model.predict(dtest)
    # summarize the fit of the model
    print('mse is: ' + str(mean_squared_error(expected, predicted)))
- Save the model. As shown in the following example, specify the dmatrix parameter to save the native XGBoost model schema:

    client = MLRepositoryClient(metaService)
    client.authorize_with_token(authToken)
    props1 = MetaProps({MetaNames.AUTHOR_NAME: "author",
        MetaNames.AUTHOR_EMAIL: "author@example.com",
        MetaNames.MODEL_META_PROJECT_ID: projectName,
        MetaNames.MODEL_META_ORIGIN_TYPE: "notebook",
        MetaNames.MODEL_META_ORIGIN_ID: notebookName,
        MetaNames.SCOPE: "system"})
    input_artifact = MLRepositoryArtifact(model, name="XGBModel2",
        meta_props=props1, dmatrix=dtrain)
    client.models.save(artifact=input_artifact)
    print("model saved successfully")
You can specify the MetaNames.SCOPE parameter to indicate where you want to save the model. Set the parameter to "system" (default) to save the model directly to the repository service. Otherwise, set it to "project" to save the model to your local project; you can publish the model to the repository service later.

By default, the model is saved as an XGBoost model. Optionally, specify the MetaNames.SAVE_TYPE parameter and set it to "PMML" if you want to convert the model to PMML format. The model is saved as a PMML model if the conversion is successful. If the conversion fails due to a memory error, resolve the error and try again. See Resolving the out of memory error when converting a model from Python to PMML for instructions.
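As a hedged sketch of the PMML option for this step (reusing the client, project context variables, model, and dtrain from the previous cells; whether a given native XGBoost model converts successfully depends on your environment):

    # Hedged sketch: request PMML conversion when saving the native XGBoost model.
    props_pmml = MetaProps({MetaNames.AUTHOR_NAME: "author",
        MetaNames.AUTHOR_EMAIL: "author@example.com",
        MetaNames.MODEL_META_PROJECT_ID: projectName,
        MetaNames.MODEL_META_ORIGIN_TYPE: "notebook",
        MetaNames.MODEL_META_ORIGIN_ID: notebookName,
        MetaNames.SCOPE: "system",
        MetaNames.SAVE_TYPE: "PMML"})   # saved as PMML only if the conversion succeeds
    pmml_artifact = MLRepositoryArtifact(model, name="XGBModel2",
        meta_props=props_pmml, dmatrix=dtrain)
    client.models.save(artifact=pmml_artifact)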
- Verify that the XGBModel2 model shows up under the Models tab on the Model Management page.
- Add a new notebook that uses the PySpark library.
- In the Notebooks section, click add notebooks to add a new notebook.
- Enter a name for the notebook and select Python for Language. Click Create. The notebook is saved and opens in the Notebook Editor.
- Select the new notebook and then select Insert project context from the ACTIONS menu.
- In the notebook, enter the required Python code as shown in the following example, and click Run cell at each step in the following sequence:
- Import PySpark and WML for z/OS packages:
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import HashingTF, Tokenizer, StringIndexer, VectorAssembler
    from pyspark.ml.evaluation import RegressionEvaluator
    from pyspark.sql import SparkSession
    from repository_v3.mlrepository import MetaNames
    from repository_v3.mlrepository import MetaProps
    from repository_v3.mlrepositoryclient import MLRepositoryClient
    from repository_v3.mlrepositoryartifact import MLRepositoryArtifact
- Create the Spark session object and specify the URL of the Spark master:

    spark = SparkSession.builder \
        .appName("Python Spark SQL basic example") \
        .master("<spark-master-URL>") \
        .config("spark.some.config.option", "<some-value>") \
        .getOrCreate()

Set <spark-master-URL> to the URL of the Spark master in your environment.
- Read data from the MLZ.TENTDATA table in your Db2 subsystem. List the first five rows from the DataFrame as a preview:

    df = spark.read.format("jdbc").options(
        driver='com.ibm.db2.jcc.DB2Driver',
        url='jdbc:db2://<url>:<port>/<location>',
        user='<userid>',
        password='<password>',
        dbtable='MLZ.TENTDATA').load()
    df.show(5)
Set <url>, <port>, <location>, <userid> and <password> to appropriate values based on the Db2 installation in your environment.
- Transform the data, split it into a training dataset and a validation dataset, construct the feature vector, and then use LogisticRegression to train a model:

    [trainingDF, validationDF] = df.randomSplit([0.7, 0.3])
    genderIndexer = StringIndexer(inputCol="GENDER", outputCol="GENDER_Index")
    maritalStatusIndexer = StringIndexer(inputCol="MARITAL_STATUS", outputCol="MARITAL_STATUS_Index")
    professionIndexer = StringIndexer(inputCol="PROFESSION", outputCol="PROFESSION_Index")
    assembler = VectorAssembler(inputCols=["GENDER_Index", "MARITAL_STATUS_Index", "PROFESSION_Index", "AGE"], outputCol="features")
    lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8, labelCol="TENT_LABEL")
    pipeline = Pipeline(stages=[genderIndexer, maritalStatusIndexer, professionIndexer, assembler, lr])
    tentModelPY = pipeline.fit(trainingDF)
- Evaluate the model:
    predictions = tentModelPY.transform(validationDF)
    evaluator = RegressionEvaluator(metricName="rmse", labelCol="TENT_LABEL", predictionCol="prediction")
    rmse = evaluator.evaluate(predictions)
    print("Root-mean-square error = " + str(rmse))
- Save the model. As shown in the following example, specify the training_data parameter to save the model schema. If the saved data type is DataFrame, WML for z/OS uses the column names of the DataFrame. Otherwise, it generates and uses default column names for the data. If you want to save the column names yourself, specify the feature_names and label_column_names parameters when creating the MLRepositoryArtifact object.

    client = MLRepositoryClient(metaService)
    client.authorize_with_token(authToken)
    props1 = MetaProps({MetaNames.AUTHOR_NAME: "author",
        MetaNames.AUTHOR_EMAIL: "author@example.com",
        MetaNames.MODEL_META_PROJECT_ID: projectName,
        MetaNames.MODEL_META_ORIGIN_TYPE: "notebook",
        MetaNames.MODEL_META_ORIGIN_ID: notebookName,
        MetaNames.SCOPE: "system"})
    input_artifact = MLRepositoryArtifact(tentModelPY, name="tentModelPY",
        meta_props=props1, training_data=trainingDF)
    client.models.save(artifact=input_artifact)
    print("model saved successfully")
You can specify the MetaNames.SCOPE parameter to indicate where you want to save the model. Set the parameter to "system" (default) to save the model directly to the repository service. Otherwise, set it to "project" to save the model to your local project; you can publish the model to the repository service later.
- Verify that the tentModelPY model shows up under the Models tab on the Model Management page.