November 11, 2020 By Neeraj Madan 4 min read

This is Part 2 of a blog series on AI Model Lifecycle Management. It focuses on the collection of data.

Data [dey-tuh, dat-uh, dah-tuh]: First recorded in 1640–50; from Latin, plural of datum. Data in its raw form is not usable, yet the success of any data science initiative relies heavily on a strong foundation of data. For that reason, the first and most critical step in AI Model Lifecycle Management is the Collect phase.

Enterprises deal with the “5 V’s of data” while converting raw data to valuable information:

  1. Volume: How big is the data? (E.g., kilobyte (KB), megabyte (MB), gigabyte (GB), etc.)
  2. Velocity: How often is the data updated? (E.g., hourly, weekly, monthly, etc.)
  3. Variety: What is the type of data? (E.g., structured, semi-structured, or unstructured)
  4. Veracity: How accurate and trustworthy is the data? (E.g., quality of the dataset)
  5. Value: What outcome is going to be driven through this data? (E.g., disease detection, customer satisfaction, reduced costs, etc.)
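
As a rough illustration, some of the V's can be measured directly from a dataset snapshot. The sketch below is not part of any IBM tooling: the `DataProfile` type, the `profile_csv` function, and the sample patient data are all hypothetical. It estimates Volume (payload size), Variety (here just the column layout of a structured file), and a simple Veracity proxy (share of non-empty cells). Velocity and Value depend on refresh schedules and business goals, so a single snapshot cannot capture them.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class DataProfile:
    volume_bytes: int    # Volume: size of the raw payload in bytes
    rows: int            # number of data rows (header excluded)
    columns: list        # Variety: column names of this structured file
    completeness: float  # Veracity proxy: share of non-empty cells (0.0 to 1.0)


def profile_csv(text: str) -> DataProfile:
    """Profile a CSV payload held in memory (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    columns = list(reader.fieldnames or [])
    cells = len(rows) * len(columns)
    # Count cells that contain something other than whitespace.
    filled = sum(
        1 for row in rows for col in columns if (row.get(col) or "").strip()
    )
    completeness = filled / cells if cells else 0.0
    return DataProfile(len(text.encode("utf-8")), len(rows), columns, completeness)


# Made-up sample: two of the nine cells are missing.
SAMPLE = "patient_id,age,diagnosis\n1,54,flu\n2,,cold\n3,41,\n"

if __name__ == "__main__":
    print(profile_csv(SAMPLE))
```

A low completeness score is only a starting signal for Veracity; real quality checks would also look at ranges, duplicates, and consistency across sources.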


The world population is growing significantly year over year. As of April 2020, the internet reaches 59% of the world’s population, or 4.57 billion people. That’s a lot of data!

The data collected from multiple sources is stored and governed in public, private, or on-premises cloud. The exponential growth of data and the multiple storage mediums add to the complexity, cost, time, and risk of error in processing and analyzing data.

Solution (IBM tooling)

IBM Cloud Pak® for Data helps to minimize data movement expenses by reducing ETL (extract, transform, and load) requests by 25% to 65%. It also reduces storage cost, saving up to 95% of storage capacity by using snapshots instead of full copies of functional TestDev.

The Collect phase of AI Model Lifecycle Management requires contributions from Data Consumers, Data Providers, Data Engineers, and Data Stewards. Depending on the scale of the organization and of the artificial intelligence (AI) initiative, these roles may overlap or remain stand-alone.

  1. Data Consumers submit a data request explaining their data needs and provide keywords/column names to specify what data is useful for their project.
  2. Data Providers collect the data and make it available through connectors on IBM Cloud Pak for Data so that data science projects can consume it. They use different approaches:
    • Copying the data to a data warehouse or data lake
    • Setting up connections to data sources
    • Virtualizing data assets
  3. Data Engineers build data transformation code, ETL flows, and/or pipelines for data set integration.
  4. Data Stewards apply data governance rules and business policies, approve data assets access, and expose the data to data consumers using the enterprise data catalog.

These steps are cyclical in nature. For more details, please refer to the data collection flow (Figure 1):

Figure 1: Data collection flow.
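
The hand-off between Data Consumers and Data Stewards described above can be sketched in a few lines. This is a minimal, hypothetical model and not the IBM Cloud Pak for Data catalog API: `DataAsset`, `DataRequest`, `find_assets`, and the sample catalog are invented for illustration. A consumer submits keywords or column names, and the catalog returns only matching assets that a steward has approved for that consumer.

```python
from dataclasses import dataclass, field


@dataclass
class DataAsset:
    name: str
    columns: list
    approved_for: set = field(default_factory=set)  # consumers a steward approved


@dataclass
class DataRequest:
    consumer: str
    keywords: list  # keywords or column names the consumer is looking for


def find_assets(catalog: list, request: DataRequest) -> list:
    """Return names of approved assets whose name or columns match a keyword."""
    wanted = {k.lower() for k in request.keywords}
    hits = []
    for asset in catalog:
        searchable = {asset.name.lower(), *(c.lower() for c in asset.columns)}
        matches = bool(wanted & searchable)
        approved = request.consumer in asset.approved_for
        if matches and approved:
            hits.append(asset.name)
    return hits


# Hypothetical catalog entries; payroll is governed and approved for nobody.
CATALOG = [
    DataAsset("claims", ["claim_id", "customer_id", "amount"], {"alice"}),
    DataAsset("customers", ["customer_id", "churn_flag"], {"alice", "bob"}),
    DataAsset("payroll", ["employee_id", "salary"], set()),
]

if __name__ == "__main__":
    print(find_assets(CATALOG, DataRequest("alice", ["customer_id", "Salary"])))
```

Keeping the approval check in the same lookup as the keyword match means a consumer never sees an asset that governance has not cleared, which mirrors the steward's role in the flow.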

Data virtualization in the IBM Cloud Pak for Data is a unique new technology that connects all these data sources into a single, self-balancing collection of data sources or databases, referred to as a constellation. It’s a three-step process:

  1. Connect: Set up connections to multiple data sources.
  2. Join and Create Views: Perform ETL (extract, transform, and load).
  3. Consume: Use the data pipelines to drive the desired business outcomes.

For more details, please refer to the data virtualization process (Figure 2):

Figure 2: Data virtualization process.
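
The Connect, Join and Create Views, and Consume steps can be imitated on a small scale with Python's built-in `sqlite3` module. This is only a stand-in under stated assumptions: SQLite's `ATTACH DATABASE` is not the IBM data virtualization engine, and the `crm` and `billing` databases, their tables, and the sample rows are hypothetical. The point is the shape of the workflow: the data stays in separate databases, and a view presents it as one result set.

```python
import sqlite3


def build_virtual_view() -> sqlite3.Connection:
    # 1. Connect: attach two separate in-memory databases as "sources".
    conn = sqlite3.connect(":memory:")
    conn.execute("ATTACH DATABASE ':memory:' AS crm")
    conn.execute("ATTACH DATABASE ':memory:' AS billing")

    # Hypothetical source tables and rows, for illustration only.
    conn.execute("CREATE TABLE crm.customers (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE billing.invoices (customer_id INTEGER, amount REAL)")
    conn.executemany(
        "INSERT INTO crm.customers VALUES (?, ?)", [(1, "Ada"), (2, "Grace")]
    )
    conn.executemany(
        "INSERT INTO billing.invoices VALUES (?, ?)",
        [(1, 120.0), (1, 30.0), (2, 75.5)],
    )

    # 2. Join and create views: a TEMP view may span attached databases,
    #    so no rows are copied out of either source.
    conn.execute(
        """
        CREATE TEMP VIEW customer_spend AS
        SELECT c.name AS name, SUM(i.amount) AS total
        FROM crm.customers AS c
        JOIN billing.invoices AS i ON i.customer_id = c.id
        GROUP BY c.id, c.name
        """
    )
    return conn


def consume(conn: sqlite3.Connection) -> list:
    # 3. Consume: downstream code queries the view like a single table.
    return conn.execute(
        "SELECT name, total FROM customer_spend ORDER BY name"
    ).fetchall()


if __name__ == "__main__":
    print(consume(build_virtual_view()))
```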

The design and architecture of the peer-to-peer computational mesh lend a significant advantage over traditional federation architecture. Using advancements from IBM Research, the data virtualization engine rapidly delivers query results from multiple data sources by leveraging advanced parallel processing and optimizations.

The key benefits are as follows:

  • Query across multiple databases and big data repositories, individually or collectively.
  • Centralize access control and governance.
  • Make many databases — even globally distributed — appear as one to an application.
  • Simplify data analytics with a scalable and powerful platform.

Collaborative, highly parallel compute models provide superior query performance compared to federation: up to 430% faster against 100TB datasets.
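
The general idea of answering one query from several sources in parallel can be sketched as follows. This is a toy illustration of fan-out and merge, not the IBM engine or its peer-to-peer mesh, and it makes no claim about the performance figures above. The source names, the sample rows, and the `partial_totals` and `federated_totals` functions are hypothetical. Each source computes its own partial aggregate, and only the small partial results are merged.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sources, each holding (region, units) rows.
SOURCES = {
    "warehouse": [("emea", 10), ("apac", 5)],
    "lake": [("emea", 7), ("amer", 3)],
    "edge": [("apac", 1)],
}


def partial_totals(rows: list) -> dict:
    """Aggregate inside one source so only a small summary leaves it."""
    totals = {}
    for region, units in rows:
        totals[region] = totals.get(region, 0) + units
    return totals


def federated_totals(sources: dict) -> dict:
    """Run the per-source aggregation in parallel, then merge the partials."""
    merged = {}
    with ThreadPoolExecutor(max_workers=len(sources)) as pool:
        for part in pool.map(partial_totals, sources.values()):
            for region, units in part.items():
                merged[region] = merged.get(region, 0) + units
    return merged


if __name__ == "__main__":
    print(federated_totals(SOURCES))
```

Pushing the aggregation down to each source keeps data movement small, which is the same motivation the article gives for virtualizing data instead of copying it.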


The AI ecosystem is growing at a rapid pace and managing the lifecycle of AI models is a necessity. The data, on its own, is not usable. Thus, a systematic collection of data using tooling (IBM Cloud Pak for Data) and methodology enables us to convert data to information. This, in turn, helps to better prepare for the Organize phase of AI Model Lifecycle Management.

To learn about the other phases of AI Model Lifecycle Management, please check out the blog series linked below or see the detailed white paper.


Thank you to Dimitriy Rybalko, Ivan Portilla, Kazuaki Ishizaki, Kevin Hall, Manish Bhide, Thomas Schack, Sourav Mazumder, John Thomas, Matt Walli, Rohan Vaidyanathan, and other IBMers who have collaborated with me on this topic.
