The adoption of big data is causing a paradigm shift in the IT industry that rivals the arrival of relational databases and SQL in the early 1980s.
We are seeing an unprecedented explosion in data volume, driven by the myriad new data sources created within the last 10 years. Machine sensors that collect data from everything from your car to your blender, medical devices, RFID readers, web logs, and (especially) social media are generating terabytes of data every day. This new “smart data” can provide tremendous business value if it can be mined and analyzed.
The problem with all this new data is that the majority of it is unstructured (to learn more about unstructured data, see “Structured vs. Unstructured Data: What’s the Difference?”). Storing and analyzing it has far outstripped the capacity of traditional relational database management systems (RDBMS).
For businesses, the challenge has been finding a way to combine these unstructured sources with their traditional business data, such as customer and sales information. Doing so would provide a 360-degree view of their customers’ buying habits and help a company make more targeted, strategic decisions about how to grow the business.
This dilemma produced the concept of the data lake. A data lake is, essentially, a large holding area for raw data: low cost, highly scalable, able to support extremely large data volumes, and able to accept data in its native raw format from a wide variety of sources.
The repository of choice has primarily been Hadoop, which lets you store combinations of both structured and unstructured data. Hadoop is essentially a massively parallel file system that allows you to process large amounts of data in a timely fashion. The data can be analyzed via different methods, such as MapReduce, Hive (SQL), and, more recently, Apache Spark.
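To make that concrete, here is a minimal PySpark sketch of querying raw files sitting in a Hadoop-style lake with plain SQL. The HDFS path and column names are purely illustrative assumptions, not anything prescribed by a real deployment:

```python
# Illustrative only: the path and column names below are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("data-lake-query").getOrCreate()

# Read raw JSON web logs straight out of the lake; no schema was defined
# up front -- Spark infers one at read time.
events = spark.read.json("hdfs:///lake/raw/web_logs/2024/*.json")

# Register the data as a temporary view and analyze it with plain SQL,
# the same pattern Hive exposes over files in the lake.
events.createOrReplaceTempView("web_logs")
top_pages = spark.sql("""
    SELECT page_url, COUNT(*) AS visits
    FROM web_logs
    GROUP BY page_url
    ORDER BY visits DESC
    LIMIT 10
""")
top_pages.show()
```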
In the video “What is a Data Lake?”, Adam Kocoloski (IBM Fellow & VP, Cloud Databases) gives an overview of data lakes, their architecture, and how they can help you drive insights and optimizations across your organization.
It is important to understand the difference between data lakes and data warehouses. A data warehouse is highly structured: much effort goes into developing schemas and hierarchies before any data is loaded into the warehouse.
Data lakes, on the other hand, impose no hierarchy or structure on the way data is stored; the structure is applied afterwards, at read time. Multiple schemas can be applied to the same data in a data lake.
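As a rough sketch of that “schema-on-read” idea, the same raw files can be read with different schemas depending on who is asking. The path and field names below are assumptions made for this example:

```python
# Illustrative only: the file path and field names are assumed for this sketch.
from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField, StringType,
                               TimestampType, DoubleType)

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

raw_path = "hdfs:///lake/raw/sensor_readings/"

# A marketing analyst may only care about which device reported and when...
marketing_schema = StructType([
    StructField("device_id", StringType()),
    StructField("read_at", TimestampType()),
])

# ...while an engineering team applies a richer schema to the very same files.
engineering_schema = StructType([
    StructField("device_id", StringType()),
    StructField("read_at", TimestampType()),
    StructField("temperature_c", DoubleType()),
    StructField("vibration_hz", DoubleType()),
])

marketing_view = spark.read.schema(marketing_schema).json(raw_path)
engineering_view = spark.read.schema(engineering_schema).json(raw_path)
```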
Typical data lake architecture
This concept of merging all your data sources into a common repository has caused challenges for many organizations, in particular the need for constant data replication. The main repository has to be kept in sync with the local data sources, which typically means running numerous ETL processes and creates a high potential for data inconsistencies. The data is only as current as the last sync point.
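The “only as current as the last sync point” problem falls out of how incremental ETL is usually written. A toy sketch (using SQLite as a stand-in source and a hypothetical watermark file) shows why: anything that changes in the source after a run completes stays invisible in the lake until the next scheduled run.

```python
# Illustrative only: the source database, table names, and watermark file are hypothetical.
import json
import sqlite3  # stand-in for any source RDBMS driver
from datetime import datetime, timezone

WATERMARK_FILE = "last_sync.json"

def load_watermark() -> str:
    """Return the timestamp of the last successful sync (the 'sync point')."""
    try:
        with open(WATERMARK_FILE) as f:
            return json.load(f)["last_sync"]
    except FileNotFoundError:
        return "1970-01-01T00:00:00+00:00"

def extract_changes(conn: sqlite3.Connection, since: str):
    """Pull only the rows changed since the last sync point."""
    cur = conn.execute(
        "SELECT id, customer_id, amount, updated_at FROM sales WHERE updated_at > ?",
        (since,),
    )
    return cur.fetchall()

def save_watermark(ts: str) -> None:
    with open(WATERMARK_FILE, "w") as f:
        json.dump({"last_sync": ts}, f)

# Each run, the lake only catches up to 'now'; anything written to the source
# after this point is invisible until the next scheduled ETL run.
source = sqlite3.connect("source_sales.db")
rows = extract_changes(source, load_watermark())
# ... load `rows` into the lake here ...
save_watermark(datetime.now(timezone.utc).isoformat())
```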
Another issue is that as your data lake grows, new groups of analysts may ask for different views of the data, which leads to unnecessary duplication of data.
The third (and possibly biggest) challenge is data security and governance, including GDPR rules that restrict where data may reside. Sensitive data cannot simply be moved into the cloud or into a centralized repository; it has to remain in its native location, which limits your ability to use that data for analytics.
Data virtualization can overcome these shortcomings of a centralized repository. Let’s start with an understanding of what exactly data virtualization is.
Data virtualization is the ability to view, access, and analyze data without needing to know its location. It can integrate data sources across multiple data types and locations into a single logical view, without any data replication or movement.
View a short demo on data virtualization: “Intro to Data Virtualization”
It is important to understand the difference between data virtualization and data federation.
Data federation is the technology that allows you to logically map remote data sources and execute distributed queries against those multiple sources from a single location.
Data virtualization, on the other hand, is a platform that provides the end-user experience: it lets users retrieve and manipulate data without needing to know any technical details about that data (e.g., how it is formatted or where it is physically located). Data virtualization gives the end user a self-service data mart over multiple data sources, which can be joined into a single customer view.
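From the user’s point of view, virtualization looks like one database. Here is a hedged sketch of that experience: a single SQL statement joins a CRM table and clickstream data that physically live in different systems. The DSN, schema names, views, and columns are all hypothetical, used only to illustrate the pattern of connecting once to the virtualization layer rather than to each source.

```python
# Illustrative only: the DSN, view names, and columns are hypothetical; a
# virtualization layer typically exposes its joined sources through a
# standard SQL interface such as ODBC/JDBC.
import pyodbc

# One connection to the virtualization layer -- not to the CRM database or
# the Hadoop cluster directly.
conn = pyodbc.connect("DSN=data_virtualization;UID=analyst;PWD=********")
cur = conn.cursor()

# A single query joins a virtual view backed by a relational CRM system with
# one backed by clickstream files in the data lake; no data is copied to make
# this work.
cur.execute("""
    SELECT c.customer_id, c.segment, COUNT(w.page_url) AS visits
    FROM crm.customers AS c
    JOIN weblogs.page_views AS w
      ON w.customer_id = c.customer_id
    GROUP BY c.customer_id, c.segment
    ORDER BY visits DESC
""")
for row in cur.fetchall():
    print(row)
```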
IBM Cloud provides data virtualization in the cloud with IBM Cloud Pak for Data (ICP4D).
The data virtualization component combines several IBM technologies: the common SQL engine, Db2 Big SQL, and IBM Queryplex. Together, they provide a unique and effective way to virtualize your data across all these different data sources without having to replicate or move any data.
To view some short videos, take a product tour, and get actual hands-on experience with IBM Cloud Pak for Data, visit our new IBM Demos site: IBM Cloud Pak for Data