AI Model Lifecycle Management: Organize Phase

The AI Ladder and how to create a business-ready analytics foundation.

Figure 1: The AI Ladder [1].

As mentioned in earlier posts in this series, data is the foundation of every data science project. However, raw data can be low quality, incorrect, irrelevant, or intentionally misleading. In fact, most companies do not know what data they actually have, where it resides, what processes use it, or how to stay compliant with current data-related laws and regulations.

To gain business value from data, enterprises need data to be curated, governed, and trusted. Organizations must consider these three characteristics during the Organize step in the AI Ladder:

  1. Curated data: Companies must address data quality and determine whether their data is business ready. Has the data been cleansed? Is it complete and compliant? If so, then it can be used to build artificial intelligence (AI) models. Curated data has been reviewed and transformed to improve its quality.
  2. Governed data: Organizations need to have a catalog of their data to provide information about its source, ownership, and metadata associated with its business context. If data is not properly catalogued with an up-to-date inventory, it is hard to manage it, especially when new data is available for users from integrated or replicated data sources. The data becomes difficult (or impossible) to know, trust, and use. Governed data conforms to the business rules and policies of the enterprise.
  3. Trusted data: Organizations must ensure that only entitled users have access to data and govern the data itself. Leaders need to be confident in their AI implementations by ensuring that the data fed to them is trustworthy, complete, and consistent. Trusted data has been verified for quality and relevance.
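
The "Is it complete?" question can be made concrete with a simple profiling pass. The following is a minimal sketch in plain Python, not a feature of any product mentioned in this post. The sample records, the `completeness` and `columns_below` helpers, and the 0.9 threshold are all assumptions made for illustration.

```python
from typing import Any

# Hypothetical sample records; None or "" marks a missing value.
records = [
    {"customer_id": 1, "email": "a@example.com", "country": "DE"},
    {"customer_id": 2, "email": None, "country": "US"},
    {"customer_id": 3, "email": "c@example.com", "country": ""},
    {"customer_id": 4, "email": "d@example.com", "country": "FR"},
]

def completeness(rows: list[dict[str, Any]]) -> dict[str, float]:
    """Return the fraction of non-missing values for each column."""
    columns = {key for row in rows for key in row}
    result = {}
    for col in sorted(columns):
        present = sum(1 for row in rows if row.get(col) not in (None, ""))
        result[col] = present / len(rows)
    return result

def columns_below(scores: dict[str, float], threshold: float) -> list[str]:
    """List the columns whose completeness falls below the threshold."""
    return [col for col, score in scores.items() if score < threshold]

scores = completeness(records)
flagged = columns_below(scores, threshold=0.9)  # assumed threshold
print(scores)
print(flagged)
```

Columns flagged this way are candidates for cleansing before the data is treated as business ready. A real curation process would also check validity, consistency, and compliance, which this sketch does not attempt.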

Business context

Enterprises are increasingly exploring new data sources to maintain their competitive advantage, but not all raw data can be trusted. It is critical to use tools, processes, and methodologies that ensure raw data is reviewed, curated, transformed, governed, and trusted.

Additionally, according to several analyst reports [2, 3], most data scientists spend 80% of their time finding and manipulating data.

Here are some examples detailing how companies can work to organize their data into a trusted, business-ready analytics foundation:

  • Focusing on data preparation and quality (DataOps and AI): DataOps, or Data Operations, is a methodology that orchestrates people, process, and technology to deliver a continuous business-ready data pipeline to data consumers at the speed of the business. This enables collaboration among data consumers and data providers, creates a self-service data culture, and removes bottlenecks in the data pipeline to drive agility and new initiatives at scale.
  • Governing the data lake: A properly designed and governed data lake eliminates data inconsistencies, resolves duplicates, and creates a single version of the truth for users to access. By managing and mastering data in the data lake, organizations can find and trust the data they work with, which makes it easier to build AI models and extract actionable insights.
  • Modernizing applications: Companies want their data to be secure and meet compliance regulations to protect customers and users. Companies also want to be able to deploy higher-quality applications in less time, and at a lower cost, by provisioning and refreshing test data environments on-premises or in the cloud.
  • Ensuring data privacy and regulatory compliance: Organizations face two competing business goals: running a profitable business and conforming to ever-evolving laws and regulations. It is essential to meet privacy obligations and protect personal data, which requires the discovery and classification of different types of data across the business.
  • Providing 360-degree, information-driven insights: With multiple sources of competing customer and product data across an enterprise, organizations need a data management solution to establish a single, trusted view of information, and they need to use real-time data as a critical asset.
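
To illustrate what "resolving duplicates" into a single, trusted view can look like, here is a minimal sketch in plain Python. It assumes that records from different systems can be matched on a normalized email address and that the newest non-empty value should win for each field. Both rules, along with the sample records and the `match_key` and `merge` helpers, are hypothetical; real master data management uses far more robust matching.

```python
from collections import defaultdict

# Hypothetical customer records from two source systems.
records = [
    {"source": "crm", "email": "Ann@Example.com ", "name": "Ann Lee",
     "phone": "", "updated": "2021-03-01"},
    {"source": "billing", "email": "ann@example.com", "name": "A. Lee",
     "phone": "555-0100", "updated": "2021-05-12"},
    {"source": "crm", "email": "bob@example.com", "name": "Bob Roy",
     "phone": "555-0111", "updated": "2021-01-20"},
]

def match_key(record):
    """Normalize the email so trivially different spellings match."""
    return record["email"].strip().lower()

def merge(group):
    """Build one record, preferring the newest non-empty value per field."""
    # ISO dates sort correctly as strings, so newest comes first.
    newest_first = sorted(group, key=lambda r: r["updated"], reverse=True)
    golden = {"email": match_key(newest_first[0])}
    for field in ("name", "phone"):
        golden[field] = next((r[field] for r in newest_first if r[field]), "")
    golden["sources"] = sorted(r["source"] for r in group)
    return golden

# Group records that share a match key, then merge each group.
groups = defaultdict(list)
for record in records:
    groups[match_key(record)].append(record)

golden_records = [merge(group) for group in groups.values()]
for rec in golden_records:
    print(rec)
```

Keeping the list of contributing sources on each merged record preserves lineage, which helps users judge whether the consolidated view can be trusted.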

A typical analytics project consists of several iterations between the Collect and Organize phases. Data consumers request data assets; data providers virtualize potentially relevant data assets; data engineers analyze virtualized data assets and apply ETL transformations to them; and data stewards apply governance rules before publishing such assets to the enterprise data catalog.

Catalogued data is applied in analytics projects, and this process repeats through several iterations until relevant data assets are discovered, curated, catalogued, and applied to train AI models that satisfy project goals.

The overall process consists of the following three steps:

  1. Data stewards and data engineers process raw data and transform it into curated, governed, and trusted data. As explained in an earlier post, data providers make data available on the platform by creating connections to data sources or virtualizing data assets. Data engineers and data stewards then run discovery on these connections and virtualized assets to gain insight into the quality and business content of the data.
  2. Data engineers then shape and curate data assets with potential business value by defining and executing ETL (Extract, Transform, Load) and data transformation flows/pipelines, including operations such as filters, joins, aggregations, substitutions, and other calculations. Data stewards apply governance rules and policies to make the data available to be consumed by analytics projects while enforcing enterprise compliance requirements.
  3. Once data is transformed and governed, it is catalogued and made discoverable in the enterprise data catalog to be consumed by various analytics projects. Additionally, data scientists publish other analytics assets like notebooks, experiments, and models to the catalog where they are governed and made discoverable for use by other analytics projects in the enterprise.
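
The transformation operations named in step 2 (filters, joins, aggregations, and substitutions) can be sketched in a few lines of plain Python. This is only an illustration of the concepts, not how any particular ETL tool implements them. The `orders` and `customers` data, the `transform` and `load` functions, and the in-memory `warehouse` target are all hypothetical.

```python
from collections import defaultdict

# Hypothetical extracted data: orders and a customer lookup table.
orders = [
    {"order_id": 1, "customer_id": "c1", "amount": 120.0, "status": "complete"},
    {"order_id": 2, "customer_id": "c2", "amount": 80.0, "status": "cancelled"},
    {"order_id": 3, "customer_id": "c1", "amount": 30.0, "status": "complete"},
    {"order_id": 4, "customer_id": "c3", "amount": 55.0, "status": "complete"},
]
customers = {
    "c1": {"region": "EMEA"},
    "c2": {"region": "AMER"},
    "c3": {"region": "N/A"},
}

def transform(orders, customers):
    """Apply filter, join, substitution, and aggregation steps."""
    totals = defaultdict(float)
    for order in orders:
        # Filter: keep only completed orders.
        if order["status"] != "complete":
            continue
        # Join: look up the customer's region by key.
        region = customers[order["customer_id"]]["region"]
        # Substitution: replace a placeholder with a standard label.
        if region == "N/A":
            region = "UNKNOWN"
        # Aggregation: sum order amounts per region.
        totals[region] += order["amount"]
    return dict(totals)

def load(table, target):
    """Load step: write the curated rows into a target store (a dict here)."""
    target.update(table)

warehouse = {}
load(transform(orders, customers), warehouse)
print(warehouse)
```

In practice, each of these steps would be a node in a flow or pipeline, and the curated output would be governed and published to the catalog rather than kept in memory.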

Watson Knowledge Catalog in IBM Cloud Pak® for Data

Using IBM Watson Knowledge Catalog (WKC) in IBM Cloud Pak for Data, data engineers and data stewards discover assets from various data connections to assess the quality and business content of such data assets. WKC is an intelligent enterprise data catalog solution powering self-service activation and discovery of data via the following capabilities:

  • Catalogs that contain curated assets visible to consumers based on access controls and governance policies.
  • Projects that are work areas for data source auto-discovery, curation, refinement, exploration, and collaboration.

As illustrated in Figure 2, WKC supports several automated metadata curation services for data assets, including automated discovery, automated classification, automated detection of sensitive information, automated analysis, and automated assignment of business terms.

Figure 2: IBM Watson Knowledge Catalog.
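
To give a feel for what automated classification and detection of sensitive information involve, here is a deliberately simple sketch in plain Python. It is not WKC's implementation or API. The two regular expressions, the `classify_column` helper, and the 0.8 match threshold are assumptions chosen for illustration, and production classifiers rely on much richer rules and reference data.

```python
import re

# Illustrative patterns only; real classifiers use far richer rules.
PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "us_ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
}

def classify_column(values, min_match=0.8):
    """Return the first data class matched by at least min_match
    of the non-empty values, or None if no class qualifies."""
    non_empty = [v for v in values if v]
    if not non_empty:
        return None
    for data_class, pattern in PATTERNS.items():
        hits = sum(1 for v in non_empty if pattern.match(v))
        if hits / len(non_empty) >= min_match:
            return data_class
    return None

# Hypothetical table, stored column by column.
table = {
    "contact": ["ann@example.com", "bob@example.com", ""],
    "tax_id": ["123-45-6789", "987-65-4321", "000-12-3456"],
    "city": ["Berlin", "Austin", "Lyon"],
}
classes = {column: classify_column(values) for column, values in table.items()}
print(classes)
```

Classifying on sampled values rather than column names is the key design choice here, since column names in raw sources are often cryptic or misleading.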

Data consumers can leverage WKC to speed up the discovery of data assets accessible to them and relevant to the data science project based on business glossary terms.
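
Conceptually, this kind of discovery is a lookup over catalog metadata that respects access controls. The sketch below models that idea in plain Python and does not use or represent the WKC API. The `CatalogAsset` class, its fields, the sample catalog entries, and the `find_assets` function are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class CatalogAsset:
    """Hypothetical catalog entry with governance metadata."""
    name: str
    source: str
    owner: str
    business_terms: set = field(default_factory=set)
    allowed_groups: set = field(default_factory=set)

catalog = [
    CatalogAsset("crm_customers", "CRM database", "sales-ops",
                 {"Customer", "Churn"}, {"data-science", "sales"}),
    CatalogAsset("billing_invoices", "Billing system", "finance",
                 {"Revenue", "Customer"}, {"finance"}),
    CatalogAsset("web_clickstream", "Data lake", "marketing",
                 {"Engagement"}, {"data-science"}),
]

def find_assets(catalog, term, user_groups):
    """Return names of assets tagged with the term that the user may see."""
    wanted = term.strip().lower()
    return [
        asset.name
        for asset in catalog
        if wanted in {t.lower() for t in asset.business_terms}
        and asset.allowed_groups & user_groups
    ]

results = find_assets(catalog, "customer", {"data-science"})
print(results)
```

Here a data scientist searching for "customer" sees only the CRM asset, because the billing asset carries the same business term but is restricted to the finance group.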

In the Organize phase, data engineers can also leverage Data Refinery in Cloud Pak for Data to define data transformation flows in a visual UI, drawing on a rich set of common data transformations.

In addition, data stewards can apply rules to mask sensitive data and personally identifiable information (PII) using WKC’s support for tagging and automatic classification of data assets.
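
The idea of tag-driven masking can be sketched as follows. This is a minimal illustration in plain Python under stated assumptions, not WKC's masking mechanism. The `column_tags` mapping, the `"PII"` tag name, the fixed `MASK` string, and the `mask_row` function are all hypothetical.

```python
MASK = "XXXXXXXX"

# Hypothetical tags, as a steward or an automatic classifier might assign them.
column_tags = {
    "name": {"PII"},
    "email": {"PII"},
    "country": set(),
    "order_total": set(),
}

def mask_row(row, column_tags, masked_tags=frozenset({"PII"})):
    """Replace the value of every column carrying a masked tag."""
    return {
        column: MASK if column_tags.get(column, set()) & masked_tags else value
        for column, value in row.items()
    }

row = {"name": "Ann Lee", "email": "ann@example.com",
       "country": "DE", "order_total": 120.0}
masked = mask_row(row, column_tags)
print(masked)
```

Because the rule is attached to the tag rather than to individual columns, any newly classified column inherits the masking behavior without further changes.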

Similarly, data stewards can leverage WKC to define policies and rules to make sure the correct data is accessible by the right teams and individuals so that the enterprise’s governance and compliance requirements are maintained.
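
One common way to reason about such policies is as a list of allow and deny rules evaluated per request. The sketch below shows that pattern in plain Python. It is an assumption about how such rules could be modeled, not a description of how WKC evaluates policies. The `Rule` class, the role and tag names, and the "deny wins, default deny" semantics are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    tag: str     # asset tag the rule applies to
    role: str    # role the rule applies to
    effect: str  # "allow" or "deny"

# Hypothetical rules a data steward might define.
RULES = [
    Rule(tag="finance", role="finance_analyst", effect="allow"),
    Rule(tag="PII", role="finance_analyst", effect="deny"),
    Rule(tag="PII", role="data_steward", effect="allow"),
]

def can_access(asset_tags, role, rules=RULES):
    """Deny wins over allow; no matching rule means no access."""
    effects = {r.effect for r in rules if r.role == role and r.tag in asset_tags}
    if "deny" in effects:
        return False
    return "allow" in effects

print(can_access({"finance"}, "finance_analyst"))
print(can_access({"finance", "PII"}, "finance_analyst"))
```

Defaulting to no access when no rule matches is the conservative choice, since an asset that has not yet been tagged or reviewed stays hidden until someone explicitly grants access.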

In summary, organizations need their data to be cleansed, organized, catalogued, and governed so that only the people who should be able to access it can do so. In a world of rapid prototyping and enormous sources of data, IBM Watson Knowledge Catalog lets users discover many different data assets from a data source and identify the most relevant ones within seconds, greatly reducing the time needed to find and access data. Once data is curated, governed, and trusted, it can be catalogued and made discoverable so that it can be leveraged in analytics and AI projects.

Further reading

This post is part of a series on AI Model Lifecycle Management. See the links below for the posts that have already been published, and check back every few days as we update with newly published parts:

For a deeper dive into the subject, see our white paper.


[1] The AI Ladder

[2] 2016 Data Science Report by CrowdFlower

[3] Breaking the 80/20 rule: How data catalogs transform data scientists’ productivity
