Supported algorithms, data sources, data types, and model types
IBM Watson® Machine Learning for z/OS supports various machine learning algorithms, data sources, data types, and model types that you can use to create, train, and deploy models.
Algorithms
- All classification and regression algorithms that Apache Spark MLlib supports. See z/OS Spark MLLib – Classification and regression 3.2 for a list of the supported classification and regression algorithms.
- All PySpark classification and regression algorithms that Apache Spark MLlib supports.
- All clustering algorithms that Apache Spark MLlib supports. See z/OS Spark MLLib – Clustering for a list of the supported clustering algorithms.
- All PySpark clustering algorithms that Apache Spark MLlib supports.
- All Scikit-learn machine learning algorithms. See Scikit-learn machine learning algorithms for the list of supported algorithms.
- All machine learning algorithms that the XGBoost Python API supports, with the exception of GPU algorithms. See XGBoost Python Package for details about the supported XGBoost algorithms.
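For example, training any of these algorithms in a notebook follows the standard API of its library. The following minimal PySpark sketch, with made-up in-line data, trains an MLlib logistic regression model; in practice you would load training data from one of the sources described below:

    # Minimal sketch: train a Spark MLlib classifier in PySpark.
    # The in-line data values are illustrative only.
    from pyspark.sql import SparkSession
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import VectorAssembler

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(35, 1, 1.0), (22, 0, 0.0), (41, 1, 1.0), (19, 0, 0.0)],
        ["age", "member", "label"])

    # Assemble the raw columns into the single vector column MLlib expects
    assembler = VectorAssembler(inputCols=["age", "member"], outputCol="features")
    model = LogisticRegression(maxIter=10).fit(assembler.transform(df))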
Data sources
Data source support in WML for z/OS is determined by whether you use JDBC or IBM® Data Virtualization Manager for z/OS (DVM) as the data access method:
- With JDBC, WML for z/OS supports access to the following data source in Scala and Python:
- Db2® for z/OS
For example, you can use the following Scala code in the Notebook Editor to connect through JDBC to the TENTDATA table:
    val df = spark.read.format("jdbc").options(Map(
      "driver"   -> "com.ibm.db2.jcc.DB2Driver",
      "url"      -> "jdbc:db2://<url>:<port>/<location>",
      "user"     -> "<userid>",
      "password" -> "<password>",
      "dbtable"  -> "MLZ.TENTDATA")).load()
Where <url> and <port> are the IP address and port number of your Db2 host system, <location> is the location name of your Db2 installation, and <userid> and <password> are your Db2 authorization ID and password.
You can also use the following Python code to connect through PySpark to the TENTDATA table:
    # Import libraries required for reading data files
    import pandas
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    # Initialize SparkSQL Context
    sc = SparkContext.getOrCreate()
    sqlContext = SQLContext(sc)

    df = sqlContext.read.format("jdbc").options(
        driver='com.ibm.db2.jcc.DB2Driver',
        url='jdbc:db2://<url>:<port>/<location>',
        user='<userid>',
        password='<password>',
        dbtable='MLZ.TENTDATA').load().toPandas()
    print(df.head(5))
Set <url>, <port>, <location>, <userid> and <password> to appropriate values based on the Db2 installation in your environment.
- With DVM, WML for z/OS supports access to the following data sources in Scala and Python:
- Db2 for z/OS
- IMS
- SMF
- VSAM data sets
For example, you can use the following Scala code in the Notebook Editor to connect through DVM to the TENTDATA table:
    val data = spark.read.format("jdbc").options(Map(
      "driver"   -> "com.rs.jdbc.dv.DvDriver",
      "url"      -> "jdbc:rs:dv://<url>:<port>",
      "user"     -> "<userid>",
      "password" -> "<password>",
      "dbty"     -> "DVS",
      "dbtable"  -> "MLZ.TENTDATA")).load()
You can also use the following Python code to connect through PySpark to the DVM data source:
    # Import libraries required for reading data files
    import pandas
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    # Initialize SparkSQL Context
    sc = SparkContext.getOrCreate()
    sqlContext = SQLContext(sc)

    df = sqlContext.read.format("jdbc").options(
        driver='com.rs.jdbc.dv.DvDriver',
        url='jdbc:rs:dv://<url>:<port>',
        user='<userid>',
        password='<password>',
        dbty='DVS',
        dbtable='MLZ.TENTDATA').load().toPandas()
    print(df.head(5))
Finally, you can ingest data from Db2 for z/OS and DVM sources by adding data asset connections to your notebook.
See IBM Data Virtualization Manager for z/OS for more information about working with data sources through DVM.
Data types
WML for z/OS supports all the data types that IBM z/OS Platform for Apache Spark supports.
Model types
The type of a machine learning model is determined by the scoring engine used for processing the model. WML for z/OS supports the following model types:
- SparkML
- MLeap
- PMML
- Scikit-learn
- XGBoost
- ARIMA or Seasonal ARIMA
- ONNX
You can train, save, and deploy MLeap, SparkML, Scikit-learn, XGBoost, ARIMA, and Seasonal ARIMA models in WML for z/OS:
- If you create a model by using the Scala Notebook Editor, you can save the model as a SparkML, MLeap, or SparkML/MLeap model type. You can deploy the model as a SparkML or MLeap model.
- If you create a model by using the Python Notebook Editor, you can save and deploy the model as a Scikit-learn or XGBoost model. While a native XGBoost model is saved and deployed as an XGBoost model, an XGBoost Scikit-learn wrapper model is saved and deployed as a Scikit-learn model, as illustrated in the sketch after this list.
- If you create a time series model by using the Python Notebook Editor, you can save and deploy it as an ARIMA or Seasonal ARIMA model type. Make sure that you use the repository SAVE API to save the model and specify the corresponding time series scoring engine (Time Series-Arima or Time Series-Seasonal Arima) for deployment.
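To illustrate the native-versus-wrapper distinction from the second bullet above, the following sketch (synthetic data, arbitrary hyperparameters) trains the same task through both XGBoost APIs. Under the behavior described above, the first model would be saved and deployed as an XGBoost model type and the second as a Scikit-learn model type:

    # Illustrative sketch with synthetic data: one task, two XGBoost APIs.
    import numpy as np
    import xgboost as xgb

    X = np.random.rand(100, 4)
    y = (X[:, 0] > 0.5).astype(int)

    # Native XGBoost API: produces a Booster, handled as an XGBoost model type
    dtrain = xgb.DMatrix(X, label=y)
    booster = xgb.train({"objective": "binary:logistic"}, dtrain, num_boost_round=10)

    # Scikit-learn wrapper API: handled as a Scikit-learn model type
    clf = xgb.XGBClassifier(n_estimators=10, objective="binary:logistic")
    clf.fit(X, y)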
You can also use WML for z/OS to import a PMML, ONNX, or Watson Core Time Series model that is developed on another platform. You can save and deploy the imported model as a PMML, ONNX, or watfore model type.
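For context, such a model might be produced off-platform as follows. This is a minimal sketch that assumes scikit-learn and the skl2onnx converter as the external toolchain; the model choice and the model.onnx file name are placeholders:

    # Hypothetical off-platform export: convert a scikit-learn classifier
    # to ONNX so that it can be imported into WML for z/OS.
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from skl2onnx import convert_sklearn
    from skl2onnx.common.data_types import FloatTensorType

    X, y = load_iris(return_X_y=True)
    clf = LogisticRegression(max_iter=200).fit(X, y)

    # Declare the input signature (4 float features, variable batch size)
    onnx_model = convert_sklearn(
        clf, initial_types=[("input", FloatTensorType([None, 4]))])
    with open("model.onnx", "wb") as f:
        f.write(onnx_model.SerializeToString())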
The WML for z/OS support of the model types has certain limitations:
- Limitations in the models used for CICS and WOLA scoring:
- The field names in the model’s input and output must not exceed 16 characters in length because of limitations in COBOL and the COPYBOOK generator.
- Limitations in the MLeap and SparkML engine:
- The MLeap engine does not support the following z/OS Spark transformers or estimators:
- org.apache.spark.ml.feature.PolynomialExpansion
- org.apache.spark.ml.feature.OneHotEncoderModel
- org.apache.spark.ml.feature.OneHotEncoder (for Spark 2.3.0 or later)
- org.apache.spark.ml.feature.Normalizer
- org.apache.spark.ml.feature.RobustScalerModel
- org.apache.spark.ml.feature.SQLTransformer
- org.apache.spark.ml.feature.VectorSizeHint
- org.apache.spark.ml.feature.ImputerModel
- org.apache.spark.ml.feature.RFormulaModel
- org.apache.spark.ml.feature.UnivariateFeatureSelectorModel
- org.apache.spark.ml.feature.VarianceThresholdSelectorModel
- org.apache.spark.ml.feature.BucketedRandomProjectionLSHModel
- org.apache.spark.ml.feature.MinHashLSHModel
- org.apache.spark.ml.classification.FMClassificationModel
- org.apache.spark.ml.regression.FMRegressionModel
- org.apache.spark.ml.clustering.LDAModel
- org.apache.spark.ml.clustering.PowerIterationClustering
- org.apache.spark.ml.classification.MultilayerPerceptronClassificationModel
- org.apache.spark.ml.tuning.CrossValidatorModel
- org.apache.spark.ml.tuning.TrainValidationSplitModel
If a pipeline contains any of these transformers or estimators, the pipeline will not be retained in the MLeap bundle, and the model will be saved as a SparkML type only.
- The MLeap engine does not support any customized z/OS Spark transformers or estimators except the CADS (Cognitive Assistant for Data Scientists) estimator.
- The MLeap engine does not support any model that has array, vector, map, or struct as a column data type; such a model will be saved as a SparkML type only.
- You cannot create a PMML model in WML for z/OS, but you can import and deploy a PMML model that you've already created elsewhere.
- Limitations in the Scikit-learn engine:
- The Scikit-learn engine does not support the feedback evaluation, retraining, and batch scoring of a model.
- The Scikit-learn engine does not support a model that does not contain a predict method, such as SpectralBiclustering and AgglomerativeClustering.
- Limitations in the XGBoost engine:
- The XGBoost engine does not support feedback evaluation and batch scoring of a model.
- Using an XGBoost model with the count:poisson objective for prediction may cause errors.
- WML for z/OS supports the conversion of an XGBoost model to PMML only if the XGBoost model has the following parameters:
  - booster
    - gbtree
  - objective
    - binary:logistic
    - count:poisson
    - multi:softmax
    - multi:softprob
    - reg:gamma
    - reg:logistic
    - reg:linear
    - reg:tweedie
- You can use the XGBoost Scikit-learn wrapper API within a Scikit-learn pipeline to train a model. If you want to convert the pipeline to PMML, all the operations in the pipeline must be supported by JPMML-SkLearn. See JPMML-SkLearn for details.
- Limitations in the ONNX engine:
- You cannot create an ONNX model in WML for z/OS.
- The ONNX engine does not support feedback evaluation and retraining of an ONNX model.
- The ONNX engine does not support batch scoring of an ONNX model that reads a batch of records from a database and writes a batch of predictions back to the database.
The ONNX engine supports ONNX version 1.13.1 for operations targeting up to opset 18. See Supported ONNX Operation for Target cpu and Supported ONNX Operation for Target NNPA for more information about the supported operators and limitations. A way to check the opsets that an ONNX file declares is sketched at the end of this section.
- Limitations in the watfore engine:
- Only Watson Core Time Series v2.14.2 is supported.
- The supported algorithms within the Watson Core Time Series forecasting model are limited to those that can be constructed using the ForecastingModel.
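Given the opset ceiling noted in the ONNX limitations above, it can be worth inspecting a candidate file before importing it. A minimal sketch using the onnx Python package follows; model.onnx is a placeholder path:

    # Sketch: list the operator-set versions an ONNX file declares.
    # "model.onnx" is a placeholder for your exported model file.
    import onnx

    model = onnx.load("model.onnx")
    for opset in model.opset_import:
        # An empty domain string denotes the default ai.onnx operator set,
        # which the ONNX engine accepts up to opset 18
        print(opset.domain or "ai.onnx", opset.version)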