Data Science and MLOps use case
To operationalize data analysis and model creation, your enterprise needs integrated systems and processes. Cloud Pak for Data provides the processes and technologies to enable your enterprise to develop and deploy machine learning models and other data science applications.
Watch this video to see the data fabric use case for implementing a Data Science and MLOps solution in Cloud Pak for Data.
This video provides a visual method as an alternative to following the written steps in this documentation.
Challenges
Establishing Data Science and MLOps solutions for enterprises involves tackling these challenges:
- Accessing high-quality data
- Organizations need to provide easy access to high quality, governed data for data science teams who use the data to build models.
- Operationalizing model building and deploying
- Organizations need to implement repeatable processes to quickly and efficiently build and deploy models to production environments.
- Monitoring and retraining models
- Organizations need to automate the monitoring and retraining of models based on production feedback.
You can solve these challenges by implementing a data fabric on Cloud Pak for Data.
Example: Golden Bank's challenges
Follow the story of Golden Bank as it implements a Data Science and MLOps process to expand its business by offering low-rate mortgage renewals for online applications. Data scientists at Golden Bank need to create a mortgage approval model that avoids risk and treats all applicants fairly. They must also automate the model retraining to optimize model performance.
Process
To implement Data Science and MLOps for your enterprise, your organization can follow this process:
- Prepare and share the data
- Build and train models
- Deploy models
- Monitor models
- Automate the AI lifecycle
The Watson Studio, Watson Machine Learning, Watson OpenScale, Watson Knowledge Catalog, IBM Watson Pipelines, and AI Factsheets services in Cloud Pak for Data provide all of the tools and processes that your organization needs to implement a Data Science and MLOps solution.
2. Build and train models
To get predictive insights based on your data, data scientists, business analysts, and machine learning engineers can build and train models. Data scientists use Cloud Pak for Data services to build the AI models, ensuring that the right algorithms and optimizations are used to make predictions that help to solve business problems.
What you can use | What you can do | Best to use when |
---|---|---|
AutoAI | Use AutoAI in Watson Studio to automatically select algorithms, engineer features, generate pipeline candidates, and train model pipeline candidates. Then, evaluate the ranked pipelines and save the best as models. Deploy the trained models to a space, or export the model training pipeline that you like from AutoAI into a notebook to refine it. | You want an advanced and automated way to build a good set of training pipelines and models quickly. You want to be able to export the generated pipelines to refine them. |
Notebooks and scripts | Use notebooks and scripts in Watson Studio to write your own feature engineering, model training, and evaluation code in Python or R. Use training data sets that are available in the project, or connections to data sources such as databases, data lakes, or object storage. Use your favorite open source frameworks and libraries. | You want to use Python or R coding skills to have full control over the code that is used to create, train, and evaluate the models. |
SPSS Modeler flows | Use SPSS Modeler flows in Watson Studio to create your own model training, evaluation, and scoring flows. Use training data sets that are available in the project, or connections to data sources such as databases, data lakes, or object storage. | You want to visually code on a graphical builder. You want to create repeatable flows to explore data and define model training, evaluation, and scoring. |
RStudio Server with R 3.6 | Analyze data and build and test models by working with R in an RStudio Server development environment. | You want to use a development environment to work in R. |
JupyterLab IDE | Analyze data and build and test models by working with the JupyterLab development environment. | You want to use a development environment to work in Python. |
Visual Studio Code editor | Use the Watson Studio extension to connect to a Cloud Pak for Data cluster directly from Visual Studio Code. Using the extension, you can start and stop your runtimes, securely connect to your runtimes on the cluster through SSH, and edit the files inside your Watson Studio Git-based project through SSH. | You want to edit and run code in Visual Studio Code. |
Watson Machine Learning Accelerator | Train neural networks by using a deep learning experiment builder. | You want to train thousands of models, train deeper neural networks, and explore more complicated hyperparameter spaces. |
Decision Optimization | Prepare data, import models, solve problems and compare scenarios, visualize data, find solutions, produce reports, and save models to deploy with Watson Machine Learning. | You need to evaluate millions of possibilities to find the best solution to a prescriptive analytics problem. |
Analytics Engine powered by Apache Spark | Run Jupyter notebooks and jobs from other tools in Watson Studio projects by selecting a Spark environment runtime. Run Spark SQL or jobs for data transformation, data science, or machine learning by using Spark job APIs. | You have a Spark cluster for running distributed jobs. |
Federated learning | Train a common model that uses distributed data. | You need to train a model without moving, combining, or sharing data that is distributed across multiple locations. |
Example: Golden Bank's model building and training
Data scientists at Golden Bank create a model, "Mortgage Approval Model," that avoids unanticipated risk and treats all applicants fairly. They want to track the history and performance of the model from the beginning, so they add a model use case to the "Mortgage Approval Catalog". They run a notebook to build the model and predict which applicants qualify for mortgages. The details of the model training are automatically captured as metadata in the model use case.
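The notebook code itself is not shown in this documentation, but the core of such a model-training notebook can be sketched in plain Python. Everything below is illustrative: the feature names, thresholds, and synthetic data are hypothetical, and a real project would load governed data from the catalog and typically use an open source framework such as scikit-learn rather than hand-rolled gradient descent.

```python
import math
import random

random.seed(42)

# Hypothetical applicant features and labeling rule, used only to generate
# synthetic training data for this sketch.
def make_applicant():
    credit = random.uniform(300, 850)
    debt_ratio = random.uniform(0.0, 0.6)
    approved = 1 if (credit > 620 and debt_ratio < 0.4) else 0
    # Scale the credit score to [0, 1] so gradient descent behaves well.
    return [(credit - 300) / 550, debt_ratio], approved

data = [make_applicant() for _ in range(500)]

# Logistic regression trained with stochastic gradient descent.
w, b, lr = [0.0, 0.0], 0.0, 0.1

def predict(x):
    z = max(-30.0, min(30.0, w[0] * x[0] + w[1] * x[1] + b))
    return 1.0 / (1.0 + math.exp(-z))

for _ in range(1000):
    for x, y in data:
        err = predict(x) - y  # gradient of the log loss w.r.t. z
        w[0] -= lr * err * x[0]
        w[1] -= lr * err * x[1]
        b -= lr * err

accuracy = sum((predict(x) > 0.5) == bool(y) for x, y in data) / len(data)
print(f"training accuracy: {accuracy:.2f}")
```

In a real notebook, the evaluation metrics and training details like these are what get captured as metadata in the model use case.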
3. Deploy models
When operations team members deploy your AI models, the models become available for applications to use for scoring and predictions to help drive actions.
What you can use | What you can do | Best to use when |
---|---|---|
Spaces user interface (UI) | Use the Spaces UI to deploy models and other assets from projects to spaces. | You want to deploy models and view deployment information in a collaborative workspace. |
Command-line tool (cpdctl) | Use the cpdctl command-line tool in Watson Machine Learning to manage the lifecycle of models and to automate an end-to-end flow that includes training the model, saving it, creating a deployment space, and deploying the model. | You want to deploy and manage models to test or production environments from a command-line. |
Example: Golden Bank's model deployment
The operations team members at Golden Bank promote the "Mortgage Approval Model" from the project to a deployment space and then create an online model deployment.
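Once an online deployment exists, applications score against it over REST. As a minimal sketch, the code below only builds a request body in the input format that Watson Machine Learning online deployments accept (field names plus rows of values); the endpoint path, host, and field names are hypothetical placeholders, and no cluster is actually called.

```python
import json

# Placeholder endpoint; the real URL comes from the deployment's details
# page in the deployment space.
SCORING_URL = "https://<cluster-host>/ml/v4/deployments/<deployment-id>/predictions"

# Hypothetical feature fields for the mortgage model; a single request can
# score one or more rows of values.
payload = {
    "input_data": [{
        "fields": ["CREDIT_SCORE", "DEBT_RATIO", "INCOME"],
        "values": [[720, 0.31, 85000], [580, 0.45, 42000]],
    }]
}

body = json.dumps(payload)
print(body)
```

An application would POST this body to the scoring URL with its authentication token and read the predictions from the response.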
4. Monitor deployed models
After models are deployed, it is important to monitor them to make sure that they are performing well. Data scientists must watch for model performance and data consistency issues.
What you can use | What you can do | Best to use when |
---|---|---|
Watson OpenScale | Monitor model fairness issues across multiple features. Monitor model performance and data consistency over time. Explain how the model arrived at certain predictions with weighted factors. Maintain and report on model governance and lifecycle across your organization. | You have features that are protected or that might contribute to prediction fairness. You want to trace model performance and data consistencies over time. You want to know why the model gives certain predictions. |
Example: Golden Bank's model monitoring
Data scientists at Golden Bank use Watson OpenScale to monitor the deployed "Mortgage Approval Model" to make sure that it is accurate and treating all Golden Bank mortgage applicants fairly. They run a notebook to set up monitors for the model and then tweak the configuration by using the Watson OpenScale user interface. Using metrics from the Watson OpenScale quality monitor and fairness monitor, the data scientists determine how well the model predicts outcomes and if it produces any biased outcomes. They also get insights for how the model comes to decisions so that the decisions can be explained to the mortgage applicants.
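One fairness metric that such a monitor reports is disparate impact: the rate of favorable outcomes for a monitored group divided by the rate for a reference group, where values below roughly 0.8 are commonly flagged as potential bias. The sketch below computes it in plain Python on made-up records; it illustrates the metric itself, not the Watson OpenScale API.

```python
# Disparate impact ratio between a monitored group and a reference group.
def disparate_impact(records, group_key, monitored, reference, favorable="approved"):
    def favorable_rate(group):
        rows = [r for r in records if r[group_key] == group]
        return sum(r["outcome"] == favorable for r in rows) / len(rows)
    return favorable_rate(monitored) / favorable_rate(reference)

# Made-up scoring records for illustration only.
records = [
    {"gender": "F", "outcome": "approved"},
    {"gender": "F", "outcome": "denied"},
    {"gender": "M", "outcome": "approved"},
    {"gender": "M", "outcome": "approved"},
]

ratio = disparate_impact(records, "gender", monitored="F", reference="M")
print(f"disparate impact: {ratio:.2f}")  # 0.5 approval rate vs 1.0 -> 0.50
```

A ratio of 0.50, as in this toy data, would fall well below the common 0.8 threshold and trigger the fairness monitor.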
5. Automate the AI lifecycle
Your team can automate and simplify the MLOps and AI lifecycle with Watson Pipelines.
What you can use | What you can do | Best to use when |
---|---|---|
Watson Pipelines | Use pipelines to create repeatable and scheduled flows that automate notebook, Data Refinery, and machine learning pipelines, from data ingestion to model training, testing, and deployment. | You want to automate some or all of the steps in an MLOps flow. |
Example: Golden Bank's automated ML lifecycle
The data scientists at Golden Bank can use pipelines to automate their complete Data Science and MLOps lifecycle, simplifying the model retraining process.
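Conceptually, such a pipeline chains data ingestion, model training, evaluation, and a quality-gated deployment. The sketch below models those stages as plain Python functions with toy logic; in Watson Pipelines each stage would be a node on the drag-and-drop canvas, and every name, rule, and threshold here is illustrative.

```python
# Conceptual stages of an automated retraining pipeline.
def ingest():
    # Stand-in for a data ingestion / Data Refinery step.
    return [{"credit": 700, "approved": 1}, {"credit": 500, "approved": 0}]

def train(rows):
    # Toy "model": approve at or above the mean credit score of approved rows.
    approved = [r["credit"] for r in rows if r["approved"]]
    return {"threshold": sum(approved) / len(approved)}

def evaluate(model, rows):
    correct = sum((r["credit"] >= model["threshold"]) == bool(r["approved"])
                  for r in rows)
    return correct / len(rows)

def deploy(model):
    return f"deployed model with threshold {model['threshold']:.0f}"

rows = ingest()
model = train(rows)
quality = evaluate(model, rows)
# Gate deployment on model quality, as a pipeline condition node would.
status = deploy(model) if quality >= 0.8 else "retrain required"
print(status)
```

Scheduling this flow to rerun on fresh production data is what automates the retraining that Golden Bank needs.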
Tutorials for Data Science and MLOps
Tutorial | Description | Expertise for tutorial |
---|---|---|
Orchestrate an AI pipeline with model monitoring | Train a model, promote it to a deployment space, and deploy the model. | Run a notebook. |
Orchestrate an AI pipeline with data integration | Create an end-to-end pipeline that prepares data and trains a model. | Use the Watson Pipelines drag and drop interface to create a pipeline. |
Learn more
- Data fabric tutorials
- Watson Studio overview
- Watson Machine Learning overview
- Watson OpenScale overview
- Watson Knowledge Catalog overview
- Watson Pipelines
- Videos
Parent topic: Data fabric solution overview