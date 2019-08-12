The problem with all this new data is that the majority of it is unstructured (to learn more about unstructured data, see “Structured vs. Unstructured Data: What’s the Difference?”). Storing and analyzing it far exceeds the capacity of traditional relational database management systems (RDBMS).

For businesses, the challenge was to find a way to combine these unstructured data sources with their traditional business data, such as customer and sales information. Doing so would provide a 360-degree view of customers’ buying habits and help a company make more targeted, strategic decisions about how to grow its business.

This dilemma produced the concept of the data lake. A data lake is, essentially, a large holding area for raw data. Data lakes are low cost and highly scalable, support extremely large data volumes, and accept data in its native raw format from a wide variety of sources.

The repository of choice has primarily been Hadoop, which can store both structured and unstructured data. Hadoop is essentially a massively parallel, distributed file system that lets you process large amounts of data in a timely fashion. The data can be analyzed with different tools, such as MapReduce, Hive (SQL), and, more recently, Apache Spark.
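To make the MapReduce idea more concrete, here is a minimal sketch of the pattern in plain Python. It does not use Hadoop or any Hadoop API; the log format, the sample records, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) are all assumptions made up for illustration. On a real cluster the framework would distribute the map and reduce steps across many machines and handle the grouping step itself.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical raw records, stored as-is in their native format.
RAW_LINES = [
    "2019-08-01 customer=42 action=view item=shoes",
    "2019-08-01 customer=7 action=purchase item=hat",
    "2019-08-02 customer=42 action=purchase item=shoes",
    "malformed line without fields",
]


def map_phase(line):
    """Emit (customer_id, 1) for each purchase record; ignore everything else."""
    fields = dict(tok.split("=", 1) for tok in line.split() if "=" in tok)
    if fields.get("action") == "purchase" and "customer" in fields:
        yield fields["customer"], 1


def shuffle(pairs):
    """Group emitted values by key (a real framework does this between phases)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all values for one key into a single total."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle, and reduce in sequence on a single machine."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


purchases_per_customer = run_job(RAW_LINES)
```

Note that the raw lines are parsed only when the job runs, so a malformed record is simply skipped at analysis time instead of being rejected when it is stored.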

In the video “What is a Data Lake?”, Adam Kocoloski (IBM Fellow & VP, Cloud Databases) gives an overview of data lakes, their architecture, and how they can allow you to drive insights and optimizations across your organizations: