As enterprises move from experimenting with artificial intelligence (AI) to adopting it in production, AI model lifecycle management is quickly becoming the next frontier in development and research.
The “AI Model Lifecycle Management: Overview” blog post outlines the need for enterprises to follow a well-defined and robust methodology for developing, deploying, and managing artificial intelligence (AI) models. Establishing this methodology can increase the overall “throughput” of data science activities by streamlining end-to-end tasks, making the best use of valuable data scientists’ time, facilitating the use of talent pools, reducing time spent on mundane data tasks, and increasing productivity and collaboration.
The AI Ladder offers an intuitive framework for organizations to define their AI strategy and better understand how to address the challenges in developing, deploying, monitoring, and managing AI models. In this post, we focus on the tasks, roles, and tools associated with the Build AI Models phase, a major component of the Analyze rung of the AI Ladder as highlighted in Figure 1:
Traditional data science activities — which are well captured by the CRISP-DM methodology illustrated in Figure 2 below — include executing the following steps in an iterative manner until project goals are achieved:
- Business Understanding
- Data Understanding
- Data Preparation
- Model Training (or Modeling)
- Model Evaluation
- Model Deployment
However, improving enterprise data science throughput requires expanding AI Model Lifecycle Management beyond the CRISP-DM steps to include data collection and data governance on the front end (before training AI models) and monitoring deployed AI models for quality, fairness, and explainability on the back end (after AI models are deployed):
As enterprises embed AI models in various business processes and customer interactions, there is an increased focus on speeding up time to value from AI models while monitoring models deployed in production for quality, fairness, and explainability to deliver trust and transparency. To achieve faster time to value, enterprises are adopting agile practices in AI model development, similar to agile practices in software development. This has led to the growth of data science and AI platforms that support agile practices and collaboration, because data science is a team sport involving multiple roles that work together to develop and deliver AI models. To further support agile practice, enterprise data science teams are increasingly adopting open source libraries and frameworks for developing AI models as part of a unified environment.
Additionally, the increased adoption of AI by enterprises and the scarcity of expert data science skills have led to the emergence of the role of citizen data scientists and the development of new tools for automatically building and training machine learning and AI models, referred to as AutoAI or AutoML. Supporting AutoAI capabilities is another key requirement in data science and AI platforms.
Lastly, for AI models to succeed, business leaders need to be able to trust the predictions of these models, which requires methods for monitoring AI models in production against important performance metrics, including drift, quality, fairness, and explainability. These monitoring tools surface alerts and results in easily consumable dashboards to help business and operations teams track and continually improve the performance of deployed models.
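As a concrete illustration of one such monitoring check, the sketch below flags data drift by comparing a production feature sample against its training-time distribution with a two-sample Kolmogorov–Smirnov test. This is a simplified, hypothetical example on synthetic data; the monitoring tools described here also track quality, fairness, and explainability, not just drift.

```python
# Minimal drift check (illustrative): has the distribution of one input
# feature shifted between training time and production?
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_feature = rng.normal(loc=0.0, scale=1.0, size=1000)  # training-time sample
prod_feature = rng.normal(loc=0.5, scale=1.0, size=1000)   # shifted production sample

# Two-sample KS test: a small p-value means the two samples are unlikely
# to come from the same distribution, i.e., drift is suspected.
stat, p_value = ks_2samp(train_feature, prod_feature)
drift_detected = p_value < 0.01
```

A production monitor would run checks like this on a schedule per feature and raise a dashboard alert when drift is detected, rather than printing a flag.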
To summarize, a modern data science and AI platform must support collaboration, agile AI model development practices, open source frameworks, AutoAI capabilities, and monitoring tools for trust and transparency.
The Build AI Models phase of AI Model Lifecycle Management usually consists of the following five steps:
- Business understanding: Data scientists work closely with stakeholders and subject matter experts (SMEs) to identify the business problem to be solved and the criteria for success. These criteria should include both model performance metrics and the business KPIs to be improved by leveraging AI models.
- Data acquisition: Data scientists start exploring and understanding available data. They can leverage data stored in catalogs or create requests for data engineers to provision new data.
- Exploratory data analysis and data cleaning: Data scientists experiment with various data visualizations and summarizations to gain a sound understanding of the available data. Real-world data is often noisy and incomplete, and may contain wrong or missing values; using such data as-is can lead to poor models and wrong predictions. This step requires close contact with SMEs to formulate the right hypotheses about the data, identify and fix wrong or missing values, and remove discrepancies between different data sets. Moreover, visualizations are required to identify patterns in the data and select an appropriate predictive model. In some cases, the available data is not informative enough and predictive models cannot be built (or, if built, would perform poorly) due to a lack of useful patterns in the data. Data scientists then have to work with data engineers and SMEs to find new data sources and collect more accurate and relevant data. A rich set of open source frameworks exists for data visualization, including matplotlib, seaborn, ggplot, plotly, bokeh, and more.
- Feature engineering: Data scientists commonly apply “feature engineering,” which is the task of defining and deriving new features from raw (original) data features to train better performing AI models. The feature engineering step includes aggregation and transformation of raw variables to create the features used in the analysis and prediction. Features in the original data may not have sufficient predictive influence and by deriving new features, data scientists train AI models that deliver better performance.
- Modeling: Data scientists train machine learning models using the cleansed data prepared in the previous steps. They train several machine learning models, evaluate them using a holdout data set (data not used at training time), and select the best model or multiple models (an ensemble) to be deployed in the next phase. Model building usually also includes a hyperparameter optimization step, which aims to select the best set of model hyperparameters (i.e., settings of the model that are fixed before training starts) to further increase model performance.
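The modeling step above can be sketched in a few lines of scikit-learn. This is a minimal, hypothetical illustration — the data set, candidate models, and hyperparameter grids are stand-ins chosen for brevity:

```python
# Minimal sketch: train several candidate models with hyperparameter
# search, evaluate each on a holdout set, and keep the best one.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Each candidate wraps a model in a small hyperparameter search (inner CV).
candidates = {
    "logreg": GridSearchCV(LogisticRegression(max_iter=5000),
                           {"C": [0.1, 1.0, 10.0]}, cv=3),
    "forest": GridSearchCV(RandomForestClassifier(random_state=42),
                           {"n_estimators": [50, 100]}, cv=3),
}

# Evaluate every candidate on the holdout data not seen at training time.
scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_hold, model.predict(X_hold))

best = max(scores, key=scores.get)
```

In practice, teams track many more candidates and metrics than this, but the loop — fit, score on held-out data, select — is the same.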
Many state-of-the-art models — including bagging, boosting, and neural networks — are quite complex, and decisions made by such models cannot be easily explained by looking at model parameters (as they can in linear regression). Many algorithms have been developed to provide insight into model behavior, such as LIME and SHAP. They are usually applied at model evaluation time to understand what influenced a prediction made by the model.
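LIME and SHAP are dedicated libraries; as a dependency-light stand-in, the sketch below uses permutation importance, a simpler model-agnostic technique available in scikit-learn that likewise asks which features most influenced the model:

```python
# Permutation importance (illustrative, not LIME/SHAP): shuffle each
# feature in turn and measure how much the model's score degrades.
from sklearn.datasets import load_wine
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance

data = load_wine()
model = GradientBoostingClassifier(random_state=0).fit(data.data, data.target)

result = permutation_importance(model, data.data, data.target,
                                n_repeats=5, random_state=0)
# Rank features by mean importance, most influential first.
ranked = sorted(zip(data.feature_names, result.importances_mean),
                key=lambda t: -t[1])
```

LIME and SHAP go further by explaining individual predictions rather than global feature influence, which is why they are the usual choice at model evaluation time.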
Depending on the skill level of the data scientists involved in the project, the tooling for model development can vary. Experienced data scientists with strong coding skills usually prefer the Python or R languages because many frameworks implement the most popular models, such as scikit-learn, TensorFlow, PyTorch, and Keras. For processing large amounts of data, Apache Spark, an open source distributed cluster-computing framework, is preferred. As an integrated development environment (IDE), data scientists usually use the open source Jupyter Notebook/JupyterLab for Python and RStudio for R. To collaborate and share code, data science teams use version control systems like Git, or Git-based version control systems for machine learning like DVC.
Data scientists who prefer low-code environments can also leverage drag and drop functionality to develop pipelines for data pre-processing, feature engineering, and model development.
With the increased adoption of AI by enterprises and the emergence of the role of citizen data scientists, new tools are being developed for automatically training machine learning and deep learning models, referred to as AutoAI or AutoML. Citizen and expert data scientists can significantly speed up their exploration with AutoAI/AutoML functionalities that automate several aspects of the AI pipeline, including feature transformation, feature engineering, algorithm selection, and hyperparameter optimization.
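A toy version of what AutoAI/AutoML tools automate can be expressed as one search over both preprocessing choices and algorithms. Real AutoML systems explore far larger spaces with smarter search strategies; the pipeline and grids below are illustrative only:

```python
# Toy AutoML sketch: one grid search jointly selects the scaler, the
# algorithm, and its hyperparameters.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
pipe = Pipeline([("scale", StandardScaler()),
                 ("clf", LogisticRegression(max_iter=1000))])

# Each dict is one branch of the search: scaler x algorithm x hyperparameters.
search_space = [
    {"scale": [StandardScaler(), MinMaxScaler()],
     "clf": [LogisticRegression(max_iter=1000)],
     "clf__C": [0.1, 1.0, 10.0]},
    {"scale": [StandardScaler(), MinMaxScaler()],
     "clf": [DecisionTreeClassifier(random_state=0)],
     "clf__max_depth": [2, 4, 8]},
]

search = GridSearchCV(pipe, search_space, cv=5).fit(X, y)
```

`search.best_estimator_` then holds the winning pipeline, analogous to the ranked pipelines an AutoAI experiment produces.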
Watson Studio and IBM Cloud Pak® for Data
- A collaborative environment with Git integration support, which is important for concurrent work, asset collaboration, and version control across the lifecycle.
- Out-of-the-box support for most popular open source tools, libraries, and frameworks, with the flexibility to customize environments as needed.
- AutoAI capabilities to enable citizen data scientists and improve the productivity of expert data scientists, delivering gains in data science throughput for the organization.
Additionally, Watson Studio supports no-code tools and popular IBM proprietary data science tools like SPSS Modeler. This facilitates a frictionless upgrade from SPSS to Watson Studio: SPSS users can continue to take advantage of SPSS Modeler for training models while expanding their usage to leverage all of Watson Studio’s capabilities, including AutoAI and support for open source frameworks.
In the Build AI Models phase of the AI Model Lifecycle, data scientists leverage Watson Studio to collaborate with other data scientists and data engineers to build and train AI models. It is a recommended best practice for data scientists to work with data assets from the enterprise data catalog that have been collected, curated, and governed in the Collect and Organize phases as outlined in Figure 1 above.
Additionally, Watson Studio supports more than 40 connectors to popular data sources that data scientists leverage to access relevant data sets. Data scientists then explore and experiment with training different models and evaluating them to identify the best model for a given use case. They do so in the context of a project, which provides a mechanism for organizing and isolating resources like data sets, notebooks, models, and experiments.
For a given business use case, data scientists don’t train just one AI model, but rather tens or hundreds of models before they can identify the best one. To support this and maximize the productivity of the data science team, Watson Studio supports multiple modes of collaboration, as explained in the Watson Studio 2.1 lab instructions:
- Option 1: Local collaboration (no Git)
- Option 2: Collaboration via Git for all assets with the exception of JupyterLab
- Option 3: JupyterLab collaboration with Git
Option 1: Local collaboration (no Git)
When a team works in local collaboration mode (illustrated in Figure 3), all collaborators work on one copy of assets in the project. When a user works on an asset, for example, a notebook, it becomes locked until that user or an administrator unlocks it. Since only one version of the asset exists, changes are immediately available to all collaborators.
Option 2: Collaboration via Git for all assets with the exception of JupyterLab
The second option is collaboration via Git for all assets with the exception of JupyterLab, illustrated in Figure 4 (JupyterLab has its own Git integration). This option is enabled when the project is connected to a Git repo. Once the project is connected, users can use the commit, push, and pull operations to synchronize the project with the repo and make their work visible to other collaborators. Changes are tracked in the Git commit history.
Option 3: JupyterLab collaboration with Git
The third option is collaboration with JupyterLab as illustrated in Figure 5. JupyterLab is an IDE that’s used for editing notebooks, and Git integration is provided by the JupyterLab Git extension.
While Git integration in JupyterLab is conceptually similar to Option 2, JupyterLab includes a file management component, which means notebooks and data files can be stored in the JupyterLab IDE; these files are referred to as JupyterLab files. All other files in the project are called “project assets.”
As explained in the general process description, it is critical for data scientists to understand their data and the relationships between data features before modeling. Inside JupyterLab or RStudio in Watson Studio, they can leverage all the open source frameworks for data visualization. Citizen data scientists with no coding skills can use Modeler flow (IBM SPSS Modeler) for exploratory data analysis; it contains many different plot nodes, which require no coding and produce plots automatically.
Watson Studio provides additional capabilities in this area to speed up the time from data discovery to insight. Data included in a project is profiled, showing the data scientist the feature distribution and high-level statistical information about each field. Data scientists can also leverage Embedded Dashboards — an integrated, interactive drag-and-drop visualization interface available in Watson Studio — to quickly view their data in a wide array of graphical outputs. This is achieved with no coding and is powered by embedded AI capabilities that help them choose the right visualizations and find outliers faster. Furthermore, these users can interact with the data, enabling them to focus their investigations on the areas of importance.
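For readers following the open source route, a first-pass profile of the kind described above can be approximated in a few lines of pandas; the data set and the three-sigma outlier threshold here are illustrative choices, not a Watson Studio API:

```python
# Quick data profile: per-feature summary statistics plus a naive count
# of values lying more than three standard deviations from the mean.
import pandas as pd
from sklearn.datasets import load_wine

data = load_wine()
df = pd.DataFrame(data.data, columns=data.feature_names)

profile = df.describe()  # count, mean, std, min, quartiles, max per feature
upper = profile.loc["mean"] + 3 * profile.loc["std"]
lower = profile.loc["mean"] - 3 * profile.loc["std"]
n_outliers = int(((df > upper) | (df < lower)).sum().sum())
```

Platform profilers add interactivity and visualization on top of exactly this kind of summary, which is what makes them accessible to citizen data scientists.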
As discussed in the general process section, open source frameworks implement most popular algorithms to build machine learning models. This means that support for such frameworks has become a critical requirement for data science and AI platforms to empower the teams to efficiently leverage the latest innovations being pioneered in open source.
Data scientists build machine learning and deep learning models using a large number of popular open source AI and ML tools, libraries, and frameworks supported by Watson Studio out of the box, including Spark MLlib, scikit-learn, XGBoost, TensorFlow, Keras, PyTorch, Caffe, and others. They can also customize the environment to load other open source libraries that are needed for a specific project.
Citizen and expert data scientists speed up their exploration significantly with Watson Studio’s AutoAI functionality, which automates several aspects of the AI pipeline, including feature transformation, feature engineering, algorithm selection, and hyperparameter optimization (Figure 6). AutoAI capabilities have become table stakes for data and AI platform providers and a major driver in helping organizations fulfill the promise of AI.
AutoAI helps empower citizen data scientists and significantly improve the productivity of expert data scientists by speeding up experimentation. Additionally, data scientists can further customize the results generated by AutoAI, which may be necessary for complex use cases. To support that, AutoAI pipelines can be exported into notebooks for further customization and tuning:
The value of IBM Watson Studio and IBM Cloud Pak for Data
IBM Watson Studio offers a collaborative environment for data scientists, analysts, application developers, and subject matter experts. It provides powerful tools and technologies to bring predictions into workflows with data preparation and modeling. IBM Watson Studio speeds the time to value from AI/ML investments by providing out-of-the-box support for popular open source frameworks, enabling data scientists to manage the AI lifecycle with ease. It also makes it easy to get started by providing visual data science tools and AutoAI capabilities.
As organizations scale AI applications, it is important to have a complete end-to-end view of all steps involved in developing, deploying, and monitoring AI models. IBM Cloud Pak for Data is designed to support the overall AI lifecycle with end-to-end tools for enterprise-grade ModelOps.
In this blog post, we focused on one component of the overall lifecycle — the Build AI Models phase — and explained the background, business need, and general process while focusing on the who (roles), the what (tasks), and the how (tools). Lastly, we illustrated how IBM Watson Studio in Cloud Pak for Data supports data scientists’ needs for a collaborative platform for agile model development with out-of-the-box support for popular open source AI frameworks. Additionally, Watson Studio supports a no-code visual framework for model development and AutoAI capabilities that cater to the preferences of all data scientists and improve their productivity.
For further details on the other components of the AI Model Lifecycle Management, please check out the other blogs in this series or the complete AI Model Lifecycle Management white paper:
- AI Model Lifecycle Management: Overview
- AI Model Lifecycle Management: Collect Phase
- AI Model Lifecycle Management: Organize Phase
- AI Model Lifecycle Management: Deploy Phase
- AI Model Lifecycle Management: Monitoring Phase (Technical Perspective)
- AI Model Lifecycle Management: Monitoring Phase (Customer Perspective)
Thank you to Dimitriy Rybalko, Ivan Portilla, Kazuaki Ishizaki, Kevin Hall, Neeraj Madan, Manish Bhide, Thomas Schack, Sourav Mazumder, John Thomas, Matt Walli, Rohan Vaidyanathan, and other IBMers who are collaborating with me on this topic.