AI Model Lifecycle Management: Collect Phase

This is Part 2 of a blog series on AI Model Lifecycle Management, focusing on the collection of data.

Data [dey-tuh, dat-uh, dah-tuh]: First recorded in 1640–50; from Latin, plural of datum. Data in its raw form is not usable, yet the success of any data science initiative relies heavily on a strong foundation of data. For that reason, the first and most critical step in AI Model Lifecycle Management is the Collect phase.

Enterprises deal with the “5 V’s of data” when converting raw data into valuable information:

  1. Volume: How big is the data? (E.g., kilobyte (KB), megabyte (MB), gigabyte (GB), etc.)
  2. Velocity: How often is the data updated? (E.g., hourly, weekly, monthly, etc.)
  3. Variety: What is the type of data? (E.g., structured, semi-structured, or unstructured)
  4. Veracity: How accurate and trustworthy is the data? (E.g., quality of the dataset)
  5. Value: What outcome is going to be driven through this data? (E.g., disease detection, customer satisfaction, reduced costs, etc.)
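Some of these dimensions can be estimated directly from a sample of the data. The following is a minimal sketch, not part of any IBM tooling, that reports rough Volume and Veracity indicators for a small CSV extract. The function name `profile_csv`, the sample columns, and the choice of "share of empty cells" as a quality proxy are all assumptions made for illustration.

```python
import csv
import io


def profile_csv(text: str) -> dict:
    """Return rough Volume and Veracity indicators for CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    columns = list(reader.fieldnames or [])
    total_cells = len(rows) * len(columns)
    # Veracity proxy (an assumption): the share of cells that are empty.
    missing = sum(
        1 for row in rows for col in columns if not (row[col] or "").strip()
    )
    return {
        "volume_bytes": len(text.encode("utf-8")),  # Volume: raw size
        "row_count": len(rows),                     # Volume: record count
        "columns": columns,
        "missing_ratio": missing / total_cells if total_cells else 0.0,
    }


# Hypothetical sample with two empty cells out of nine.
sample = "patient_id,age,diagnosis\n1,54,flu\n2,,cold\n3,41,\n"
profile = profile_csv(sample)
print(profile)
```

Velocity and Value are harder to derive from a single extract, since they depend on how often the source changes and on the business outcome being pursued.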


The world population is growing significantly year over year. As of April 2020, the internet reached 59% of the world’s population, or 4.57 billion people. That’s a lot of data!

Data collected from multiple sources is stored and governed in public cloud, private cloud, or on-premises environments. The exponential growth of data and the variety of storage media add to the complexity, cost, time, and risk of error in processing and analyzing data.

Solution (IBM tooling)

IBM Cloud Pak® for Data helps to minimize data movement expenses by reducing ETL (extract, transform, and load) requests by 25% to 65%. It also reduces storage cost, saving up to 95% of storage capacity by using snapshots instead of full copies of functional TestDev.

The Collect phase of AI Model Lifecycle Management requires contributions from Data Consumers, Data Providers, Data Engineers, and Data Stewards. Depending on the scale of the organization and of the artificial intelligence (AI) initiative, these roles may overlap or remain stand-alone.

  1. Data Consumers submit a data request explaining their data needs and provide keywords/column names to specify what data is useful for their project.
  2. Data Providers collect the data and make it available, via connectors on IBM Cloud Pak for Data, for consumption by data science projects. They make the data available using different approaches:
    • Copying the data to a data warehouse or data lake
    • Setting up connections to data sources
    • Virtualizing data assets
  3. Data Engineers build data transformation code, ETL flows, and/or pipelines for data set integration.
  4. Data Stewards apply data governance rules and business policies, approve data assets access, and expose the data to data consumers using the enterprise data catalog.
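The hand-off between these roles can be sketched in a few lines of code. This is an illustrative model only, not the IBM Cloud Pak for Data API: the `DataRequest` class, the `match_assets` and `approve` functions, and the catalog contents are all hypothetical names invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class DataRequest:
    """A Data Consumer's request, described by keywords or column names."""
    requester: str
    keywords: list[str]
    matched_assets: list[str] = field(default_factory=list)
    approved: bool = False


def match_assets(request: DataRequest, catalog: dict[str, list[str]]) -> DataRequest:
    """Data Provider step: find assets with a column containing any keyword."""
    wanted = [k.lower() for k in request.keywords]
    request.matched_assets = sorted(
        name
        for name, columns in catalog.items()
        if any(k in col.lower() for k in wanted for col in columns)
    )
    return request


def approve(request: DataRequest, governed_assets: set[str]) -> DataRequest:
    """Data Steward step: expose only assets that governance rules allow."""
    request.matched_assets = [
        a for a in request.matched_assets if a in governed_assets
    ]
    request.approved = bool(request.matched_assets)
    return request


# Hypothetical catalog: asset name -> column names.
catalog = {
    "claims_2020": ["claim_id", "customer_id", "amount"],
    "web_logs": ["session_id", "url"],
    "crm_contacts": ["customer_id", "email"],
}

request = DataRequest(requester="analyst", keywords=["customer"])
matched = list(match_assets(request, catalog).matched_assets)
approve(request, governed_assets={"crm_contacts"})
print(matched, request.matched_assets, request.approved)
```

The Data Engineer step (transformation and integration) is left out here to keep the sketch short. In practice it would sit between matching and approval.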

These steps are cyclical in nature. For more details, please refer to the data collection flow (Figure 1):

Figure 1: Data collection flow.

Data virtualization in the IBM Cloud Pak for Data is a unique new technology that connects all these data sources into a single, self-balancing collection of data sources or databases, referred to as a constellation. It’s a three-step process:

  1. Connect: Set up connection with multiple data sources.
  2. Join and Create Views: Perform ETL (extract, transform, and load).
  3. Consume: Use the data pipelines to drive the desired business outcomes.
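The three steps can be illustrated on a small scale with Python's built-in `sqlite3` module. This is an analogy under stated assumptions, not the IBM data virtualization engine: two attached in-memory SQLite databases stand in for separate data sources, and the schema names (`crm`, `billing`), tables, and values are hypothetical.

```python
import sqlite3

# Step 1 - Connect: attach two independent databases that stand in for
# separate enterprise data sources (hypothetical names and schemas).
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS crm")
conn.execute("ATTACH DATABASE ':memory:' AS billing")

conn.execute("CREATE TABLE crm.customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE billing.invoices (customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO crm.customers VALUES (?, ?)", [(1, "Acme"), (2, "Globex")]
)
conn.executemany(
    "INSERT INTO billing.invoices VALUES (?, ?)",
    [(1, 120.0), (1, 80.0), (2, 50.0)],
)

# Step 2 - Join and create views: define one logical view across both
# sources without copying rows. SQLite only lets TEMP views span
# attached databases, hence CREATE TEMP VIEW.
conn.execute(
    """
    CREATE TEMP VIEW customer_totals AS
    SELECT c.name AS name, SUM(i.amount) AS total
    FROM crm.customers AS c
    JOIN billing.invoices AS i ON i.customer_id = c.id
    GROUP BY c.name
    """
)

# Step 3 - Consume: downstream code queries the view as if it were one table.
totals = dict(conn.execute("SELECT name, total FROM customer_totals"))
print(totals)
conn.close()
```

The consumer in step 3 never needs to know that customers and invoices live in different databases, which is the same idea the constellation applies across many sources.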

For more details, please refer to the data virtualization process (Figure 2):

Figure 2: Data virtualization process.

The design and architecture of the peer-to-peer computational mesh give it a significant advantage over a traditional federation architecture. Using advancements from IBM Research, the data virtualization engine rapidly delivers query results from multiple data sources by leveraging advanced parallel processing and optimizations.

The key benefits are as follows:

  • Query across multiple databases and big data repositories, individually or collectively.
  • Centralize access control and governance.
  • Make many databases — even globally distributed — appear as one to an application.
  • Simplify data analytics with a scalable and powerful platform.

Collaborative, highly parallel compute models provide superior query performance compared to federation: up to 430% faster against 100TB datasets.
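One general reason parallel models can help is that each source computes a small partial result locally, so only those partial results travel to the coordinator. The sketch below shows this generic scatter-gather pattern with Python's `concurrent.futures`. It does not represent how the IBM engine works internally, and the source names and values are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data sources; in practice each would be a remote database.
SOURCES = {
    "warehouse_eu": [120.0, 80.0, 45.5],
    "warehouse_us": [300.0, 10.0],
    "lake_apac": [75.0],
}


def partial_sum_and_count(source_name: str) -> tuple[float, int]:
    """Runs 'at' the source: return a small summary, not the raw rows."""
    values = SOURCES[source_name]
    return sum(values), len(values)


def distributed_average(source_names: list[str]) -> float:
    """Query all sources in parallel and combine their partial results."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(partial_sum_and_count, source_names))
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


average = distributed_average(list(SOURCES))
print(round(average, 2))
```

Only one sum and one count per source cross the network in this pattern, regardless of how many rows each source holds.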


The AI ecosystem is growing at a rapid pace and managing the lifecycle of AI models is a necessity. The data, on its own, is not usable. Thus, a systematic collection of data using tooling (IBM Cloud Pak for Data) and methodology enables us to convert data to information. This, in turn, helps to better prepare for the Organize phase of AI Model Lifecycle Management.

To learn about the other phases of AI Model Lifecycle Management, please check out the blog series linked below or see the detailed white paper.


Thank you to Dimitriy Rybalko, Ivan Portilla, Kazuaki Ishizaki, Kevin Hall, Manish Bhide, Thomas Schack, Sourav Mazumder, John Thomas, Matt Walli, Rohan Vaidyanathan, and other IBMers who have collaborated with me on this topic.
