Machine learning models
The Watson Studio Local client provides tools to help you create and train machine learning models that can analyze data assets and extract value from them. Users can also deploy their models to make them available to a wider audience.
Tasks you can perform:
- Create a model with APIs
- Create a model from a file
- Create a model with the model builder
- Test a model online
- Batch score a model
- Evaluate a model
Watson Studio Local supports the following machine learning model types:
- Spark ML
- PMML with online scoring
- Custom models with batch scoring
- scikit-learn 0.19.1 (Python 2.7 and Python 3.5) - 0.19.1 (GPU-Python 3.5) with pickle or joblib format
- XGBoost 0.7.post3 (Python 2.7 and 3.5) - 0.71 (GPU-Python 3.5)
- Keras 2.1.3 (Python 2.7 and Python 3.5) - 2.1.5 (GPU-Python 3.5)
- TensorFlow 1.5.0 (Python 2.7 and Python 3.5) - 1.4.1 (GPU-Python 3.5)
- WML
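Scikit-learn models, for example, are imported in the pickle or joblib serialization formats listed above. As a minimal sketch of what that round-trip looks like, using only Python's standard pickle module and a stand-in object in place of a trained estimator:

```python
import pickle

# Stand-in for a trained scikit-learn estimator; a real model object
# (e.g. a fitted LogisticRegression) serializes the same way.
model = {"coef": [0.4, -1.2], "intercept": 0.1}

# Serialize the model to the file you would import into Watson Studio Local.
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)

# Deserialize to verify the artifact round-trips cleanly.
with open("model.pkl", "rb") as f:
    restored = pickle.load(f)

print(restored == model)  # True when the round-trip is lossless
```

The same pattern applies with `joblib.dump` and `joblib.load`, which are often preferred for estimators that hold large NumPy arrays.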
Create a model with APIs
Watson Studio Local provides sample notebooks to help users create their own custom applications that can be powered by machine learning.
To learn more about the machine learning repository client API commands and syntax that you can use inside a notebook, see Save a model in Python and Create a model in HDP.
If you use these commands and syntax, be aware that for Watson Studio Local, the repository URL is static, and you do not need to authenticate to the repository.
For each scoring iteration that you run from a notebook, Watson Studio Local automatically increments the model version. Later, on the model details page, you can compare the accuracies of the versions that you ran.

Create a model from a file
You can import three types of models:
- PMML
- An XML file written in the Predictive Model Markup Language. The PMML is scored using the JPMML Evaluator (make sure you review its "Supported model types", "Not yet supported model types", and "Known Limitations" sections).
- Custom Batch
- A third-party vendor model in a compressed .gz format that performs batch scoring. If you are using the Carolina tool, see the Carolina for Hadoop or Carolina Standalone product page for more details. Requirement: You must package all scripts and dependent models into a single .gz file before you import it. You can use the utility script provided at /user-home/.scripts/common-helpers/batch/custom/createcustombatch.sh to zip the files.
- Custom Online
- A third-party vendor model in a .jar format that performs online scoring. Requirement: If you are using the Carolina tool, run your scripts through it to generate the .jar file. See the Carolina for Integration product page for more details. To perform the online scoring, place all third-party JAR files and the license file in the /user-home/_global_/libs/ml/mlscoring/thirdparty/carolina/ folder of Watson Studio Local.
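For the Custom Batch type, the createcustombatch.sh utility packages your files for you; the same result can be approximated with Python's standard tarfile module, as sketched below (the file names are illustrative stand-ins for your scoring scripts and dependent models):

```python
import tarfile

# Illustrative placeholder files; in practice these are your scoring
# scripts and any dependent model files.
files = ["score.py", "model.bin"]
for name in files:
    with open(name, "w") as f:
        f.write("placeholder\n")

# Package everything into a single gzip-compressed archive, which is
# the single .gz file the Custom Batch import expects.
with tarfile.open("custom-batch.tar.gz", "w:gz") as tar:
    for name in files:
        tar.add(name)

with tarfile.open("custom-batch.tar.gz") as tar:
    print(sorted(tar.getnames()))  # ['model.bin', 'score.py']
```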
To import a model into your project from a file, complete the following steps:
- In your project, go to the Models page and click add models.
- In the Create Model window, click the From File tab. Specify the name and description.
- In the Type field, select the kind of model you are importing.
- Browse to the file or drag it into the box.
- Click the Create button.
Create a model with the model builder
To create a new model in your project, complete the following steps:
- In your project, go to the Models page and click add
models.

- In the Add Model window, click the Blank tab.
Specify the name and description. Select Machine Learning. Select whether you
want to create the model automatically or manually and click Create to create
an untrained model.
If you opt to create the model manually, a Prepare window opens where you can add and configure more transformers than just the default. A transformer acts on the data, usually by appending new columns and mapping existing data to the new column.

- On the Select Data step, click your newly created model and select the data asset to run the
model on. Ensure that none of the columns use boolean data types. You can also add new data assets.
Click Next to load the data.

- On the Prepare step, if you selected Manual when you created the model,
then add and configure each transformer accordingly. Click Next.

- On the Train step, select the column value to predict and the technique to train it with
(Watson Studio Local will suggest the best one). You can add estimators to
train on the data and produce a model for each one; then you can select the best trained model to
deploy and use for predictions. You can also adjust the validation split to experiment with how much
of the data to train, test, and hold out. Click Next to train and evaluate
the model.

- On the Evaluate step, select which trained model you want to keep and click
Save to save it. Each time you save the model, its version is incremented.
Later in the model details page, you can compare the accuracies of each version you ran.

You can also select the best version to publish.
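The validation split on the Train step divides the data into train, test, and holdout partitions. A rough sketch of that kind of split using only the Python standard library (the 60/20/20 proportions are illustrative, not Watson Studio Local defaults):

```python
import random

rows = list(range(100))  # stand-in for the rows of a data asset
random.seed(42)          # fixed seed so the split is reproducible
random.shuffle(rows)

# Illustrative 60/20/20 split into train, test, and holdout partitions.
n = len(rows)
train = rows[: int(n * 0.6)]
test = rows[int(n * 0.6): int(n * 0.8)]
holdout = rows[int(n * 0.8):]

print(len(train), len(test), len(holdout))  # 60 20 20
```

Adjusting the split trades off how much data each estimator trains on against how much is reserved for testing and final validation.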
Test a model online
In the Models page of your project, click Real-time score next to the model to input data and simulate predictions on it as a pie chart or bar graph.

Batch score a model
To run batch prediction jobs that read in a data set, score the data, and output the predictions in a CSV file, complete the following steps:
- In the Models page of your project, click Batch Score next to the model.
- Select the execution type, input data set, and output data set CSV file. Restriction: WML models created in the visual model builder can only use a remote data set as the input data asset. Other types of models can use a local CSV file for the input.

- Click the Generate Batch Script button. Watson Studio Local automatically generates a Python script that you can edit
directly in the Result view. Tip: You can customize this script to pre-process your data, for example, to ensure that the case of the dataframe headers is suitable for ML models.
- Click the Run now button to immediately create and run a job for the
script. Alternatively, you can click Advanced settings to save the script as
either a .py script or a .ipynb notebook in your project;
then later from the Jobs page of your project, you can create a scheduled job
for the script or notebook you saved with Type set to Batch
scoring. Restriction: If you select a GPU worker for the job, you can only batch score Keras models.
Requirement: If you are scoring a PMML model in Python 3.5 or later, you must set the environment variable SPARK_VERSION=2.1.

From the job details page, you can click on each run to view results and logs. You can also view a batch scoring history from the model details.
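As an illustration of the header-case pre-processing suggested in the tip above, the idea can be sketched with the standard csv module (the file names and columns are made up; the generated script itself typically operates on a dataframe instead):

```python
import csv

# Write a small input file with mixed-case headers for illustration.
with open("scores_in.csv", "w", newline="") as f:
    csv.writer(f).writerows([["Age", "Income"], ["34", "52000"]])

# Pre-processing step: lower-case the header row so the column names
# match what the model was trained with.
with open("scores_in.csv", newline="") as f:
    rows = list(csv.reader(f))
rows[0] = [name.lower() for name in rows[0]]

with open("scores_clean.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)

print(rows[0])  # ['age', 'income']
```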
Evaluate a model
To evaluate the performance of a model, complete the following steps:
- In the Models page of your project, click Evaluate next to the model.
- Select an input data set that contains the prediction column. For each evaluator, you can opt to
customize your own threshold metric and specify what fraction of the overall data must be relevant
for the model to be considered healthy. For Spark 2.1 model evaluations, the output data set field
is ignored.

- Click the Generate Evaluation Script button. Watson Studio Local automatically generates a Python script that you can edit
directly in the Result view. Tip: You can customize this script to pre-process your data, for example, to ensure that the case of the data frame headers is suitable for ML models.
- Click the Run now button to immediately create and run a job for the
script. Alternatively, you can click Advanced settings to save the script as
either a .py script or a .ipynb notebook in your project; then later from the
Jobs page of your project, you can create a scheduled job for the script or
notebook you saved with Type set to Model evaluation.
Requirement: If you are evaluating a PMML model in Python 3.5 or later, you must set the environment variable SPARK_VERSION=2.1.
From the job details page, you can click on each run to view results and logs. Go to the model details page to view the evaluation history.
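The evaluator's threshold check described above amounts to computing a metric over the prediction column and comparing it against a configured floor. A minimal sketch of that idea (the metric, threshold value, and data are illustrative, not the actual generated evaluation script):

```python
# Labels and the prediction column produced by a scoring run (illustrative).
labels =      [1, 0, 1, 1, 0, 1, 0, 0]
predictions = [1, 0, 1, 0, 0, 1, 0, 1]

# Accuracy as the evaluation metric: fraction of matching predictions.
accuracy = sum(y == p for y, p in zip(labels, predictions)) / len(labels)

# The model is considered healthy when the metric clears the threshold.
threshold = 0.7
healthy = accuracy >= threshold

print(accuracy, healthy)  # 0.75 True
```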