November 11, 2020 By Neeraj Madan 4 min read

This post, Part 2 of a blog series on AI Model Lifecycle Management, focuses on the collection of data.

Data [dey-tuh, dat-uh, dah-tuh]: First recorded in 1640–50; from Latin, plural of datum. Data in its raw form is not usable, yet the success of any data science initiative relies heavily on a strong foundation of data. For that reason, the first and most critical step in AI Model Lifecycle Management is the Collect phase.

Enterprises deal with the “5 V’s of data” when converting raw data into valuable information (a rough profiling sketch follows the list):

  1. Volume: How big is the data? (E.g., kilobyte (KB), megabyte (MB), gigabyte (GB), etc.)
  2. Velocity: How often is the data updated? (E.g., hourly, weekly, monthly, etc.)
  3. Variety: What is the type of data? (E.g., structured, semi-structured, or unstructured)
  4. Veracity: How accurate and trustworthy is the data? (E.g., quality of the dataset)
  5. Value: What business outcome will this data drive? (E.g., disease detection, customer satisfaction, reduced costs, etc.)
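To make the framework concrete, here is a minimal profiling sketch, assuming pandas and a tabular dataset; all column names are hypothetical. It approximates the first four V’s; Value is a business question, so it cannot be computed from the data alone.

    # Rough 5 V's profiling sketch (pandas assumed; column names hypothetical)
    import pandas as pd

    def profile_five_vs(df, timestamp_col=None):
        profile = {
            # Volume: in-memory footprint of the dataset, in megabytes
            "volume_mb": df.memory_usage(deep=True).sum() / 1e6,
            # Variety: mix of column types (numeric, text, datetime, ...)
            "variety": df.dtypes.astype(str).value_counts().to_dict(),
            # Veracity: simple trust proxies - missing values and duplicate rows
            "missing_ratio": float(df.isna().mean().mean()),
            "duplicate_rows": int(df.duplicated().sum()),
        }
        if timestamp_col is not None:
            # Velocity: median gap between consecutive records
            timestamps = pd.to_datetime(df[timestamp_col]).sort_values()
            profile["median_update_gap"] = timestamps.diff().median()
        return profile

    # Example with a hypothetical sensor file:
    # print(profile_five_vs(pd.read_csv("readings.csv"), timestamp_col="recorded_at"))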

Introduction

The world population is growing significantly year over year. As of April 2020, the internet reached 59% of the world’s population, or about 4.57 billion people. That’s a lot of data!

The data collected from multiple sources is stored and governed in public, private, or on-premises clouds. The exponential growth of data and the multiplicity of storage mediums add complexity, cost, time, and risk of error to processing and analyzing data.

Solution (IBM tooling)

IBM Cloud Pak® for Data helps to minimize data movement expenses by reducing ETL (extract, transform, and load) requests by 25% to 65%. It also reduces storage costs, saving up to 95% of storage capacity by using snapshots instead of full copies of functional test/dev data.

The Collect phase of AI Model Lifecycle Management requires contributions from Data Consumers, Data Providers, Data Engineers, and Data Stewards. Depending on the scale of the organization and the artificial intelligence (AI) initiative, these roles may overlap or remain stand-alone.

  1. Data Consumers submit a data request explaining their data needs and provide keywords/column names to specify what data is useful for their project.
  2. Data Providers collect the data and make it available via connectors in IBM Cloud Pak for Data for consumption by data science projects. They make the data available using different approaches:
    • Copying the data to a data warehouse or data lake
    • Setting up connections to data sources
    • Virtualizing data assets
  3. Data Engineers build data transformation code, ETL flows, and/or pipelines for data set integration (a minimal sketch follows this list).
  4. Data Stewards apply data governance rules and business policies, approve data assets access, and expose the data to data consumers using the enterprise data catalog.
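As an illustration of the Data Engineer’s step, the sketch below implements a small extract-transform-load flow with pandas and SQLAlchemy. The connection URLs, the orders table, and the aggregation rule are all hypothetical; in IBM Cloud Pak for Data, this logic would typically live in a pipeline or notebook attached to a project.

    # Minimal illustrative ETL flow (pandas + SQLAlchemy; all names hypothetical)
    import pandas as pd
    from sqlalchemy import create_engine

    SOURCE_URL = "postgresql://user:pass@source-db:5432/sales"    # hypothetical
    TARGET_URL = "postgresql://user:pass@warehouse:5432/curated"  # hypothetical

    def run_etl():
        source = create_engine(SOURCE_URL)
        target = create_engine(TARGET_URL)

        # Extract: pull the raw records requested by the Data Consumer
        raw = pd.read_sql("SELECT order_id, amount, region, order_ts FROM orders", source)

        # Transform: clean and aggregate according to the data request
        clean = raw.dropna(subset=["amount"])
        clean["order_date"] = pd.to_datetime(clean["order_ts"]).dt.date
        daily = clean.groupby(["region", "order_date"], as_index=False)["amount"].sum()

        # Load: write the curated data set, ready for catalog registration
        daily.to_sql("daily_sales_by_region", target, if_exists="replace", index=False)

    run_etl()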

These steps are cyclical in nature. For more details, please refer to the data collection flow (Figure 1):

Figure 1: Data collection flow.

Data virtualization in the IBM Cloud Pak for Data is a unique new technology that connects all these data sources into a single, self-balancing collection of data sources or databases, referred to as a constellation. It’s a three-step process:

  1. Connect: Set up connection with multiple data sources.
  2. Join and Create Views: Combine tables from the connected sources into joined views, performing ETL-style transformations at query time without copying the data (see the sketch after this list).
  3. Consume: Use the data pipelines to drive the desired business outcomes.
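Because the virtualization layer exposes every connected source through a single SQL endpoint, the join-and-create-views step is ordinary SQL. The sketch below assumes a Db2-compatible endpoint reachable through SQLAlchemy’s ibm_db_sa dialect; the host, schemas, and table names are hypothetical, and in practice connections and views are usually set up through the Data Virtualization interface.

    # Illustrative cross-source join through a single virtualization endpoint
    # (ibm_db_sa dialect assumed; host, schemas, and tables are hypothetical)
    from sqlalchemy import create_engine, text

    # One connection to the virtualization engine, not one per source
    engine = create_engine("db2+ibm_db://user:pass@dv-host:50000/BLUDB")

    CREATE_VIEW = text("""
        CREATE VIEW analytics.customer_orders AS
        SELECT c.customer_id, c.segment, o.order_id, o.amount
        FROM crm.customers AS c    -- virtualized from, e.g., a CRM database
        JOIN sales.orders AS o     -- virtualized from, e.g., a data lake
          ON o.customer_id = c.customer_id
    """)

    with engine.begin() as conn:
        conn.execute(CREATE_VIEW)
        # Consume: downstream projects query the view as if it were one database
        for row in conn.execute(text(
                "SELECT * FROM analytics.customer_orders FETCH FIRST 5 ROWS ONLY")):
            print(row)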

For more details, please refer to the data virtualization process (Figure 2):

Figure 2: Data virtualization process.

The design and architecture of the peer-to-peer computational mesh lend a significant advantage over traditional federation architectures. Using advancements from IBM Research, the data virtualization engine rapidly delivers query results from multiple data sources by leveraging advanced parallel processing and optimizations.

The key benefits are as follows:

  • Query across multiple databases and big data repositories, individually or collectively.
  • Centralize access control and governance.
  • Make many databases — even globally distributed — appear as one to an application.
  • Simplify data analytics with a scalable and powerful platform.

Collaborative, highly parallel compute models provide superior query performance compared to federation: up to 430% faster against 100 TB datasets.
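To build intuition for the fan-out idea, the toy sketch below (in no way IBM’s engine) pushes a partial aggregation down to several hypothetical sources concurrently, so only small partial results travel over the network before being merged centrally.

    # Toy parallel fan-out: each source aggregates locally and in parallel,
    # and only the partial results are combined (URLs and query hypothetical)
    from concurrent.futures import ThreadPoolExecutor
    import pandas as pd
    from sqlalchemy import create_engine

    SOURCES = [
        "postgresql://user:pass@region-us:5432/sales",    # hypothetical
        "postgresql://user:pass@region-eu:5432/sales",    # hypothetical
        "mysql+pymysql://user:pass@region-ap:3306/sales", # hypothetical
    ]

    # The aggregation is pushed down, so each source returns a few rows, not terabytes
    PARTIAL_SQL = "SELECT region, SUM(amount) AS amount FROM orders GROUP BY region"

    def query_source(url):
        return pd.read_sql(PARTIAL_SQL, create_engine(url))

    with ThreadPoolExecutor(max_workers=len(SOURCES)) as pool:
        partials = list(pool.map(query_source, SOURCES))

    # Merge the partial aggregates into one global answer
    result = pd.concat(partials).groupby("region", as_index=False)["amount"].sum()
    print(result)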

Summary

The AI ecosystem is growing at a rapid pace, and managing the lifecycle of AI models is a necessity. Data, on its own, is not usable. A systematic collection of data, using tooling (IBM Cloud Pak for Data) and methodology, enables us to convert data into information. This, in turn, helps us better prepare for the Organize phase of AI Model Lifecycle Management.

To learn about the other phases of AI Model Lifecycle Management, please check out the blog series linked below or see the detailed white paper.

Acknowledgments

Thank you to Dimitriy Rybalko, Ivan Portilla, Kazuaki Ishizaki, Kevin Hall, Manish Bhide, Thomas Schack, Sourav Mazumder, John Thomas, Matt Walli, Rohan Vaidyanathan, and other IBMers who have collaborated with me on this topic.
