How data becomes knowledge, Part 2
Data lakes and data swamps
The data lake concept has been in existence for a few years now. It initially attracted some controversy and was labeled marketing hype. The term data lake wasn't part of any traditional data-storage architecture, so vendors freely used it to mean many different things.
Terminology for data storage, such as streams, pools, reservoirs, and clouds, is in widespread use in data science. Inevitably, people began drawing parallels to the natural water ecosystem, so now we have data lakes and data swamps as well.
Analogies are great for explaining concepts, but there's always the danger of carrying the analogy too far until it fails. Analogies also make the terminology confusing if you're a new entrant to the field and don't know what it all really means. As the data lake concept has slowly gained acceptance, however, there have been attempts to define an architecture to formalize the concepts.
All that said, I'm going to explain these concepts by using yet another analogy. The sidebar shows the standard definitions of the terminology; the analogy that follows explains them in conceptual terms. My analogy is based on making a sandwich (in my defense, I'm writing this before lunch, and I'm hungry). I begin the analogy at a grocery store, where most of us get our sandwich makings.
A simple analogy
A grocery store has aisles and shelves on which employees sort and neatly store the groceries by category. You can easily select and buy the groceries you want. The grocery store is analogous to a database that stores data assets in table rows and columns for easy retrieval.
The groceries the store stocks come from multiple sources and suppliers, arrive at various times, and have different sell-by dates. Similarly, data can come from multiple data sources at various times. Data can also become stale, just like groceries. Like the many ingredients from the grocery store that go into a sandwich, information is a collection of cataloged data in a specific context. In other words, the sandwich is analogous to information.
The whole vegetables and greens are analogous to unstructured data; the sliced and diced vegetables and greens are analogous to structured data. (To make this analogy work, I assume that the whole veggies are unstructured.)
Now, assume that your local sandwich shop selects and buys groceries from this grocery store, cleans and washes the groceries, cuts them for use in sandwiches, and bins them separately — just like cleaning, structuring, and normalizing data before using it for analysis.
When you want to eat a sandwich, you head to the sandwich shop. The sandwich shop could also have different counters where you can get a sandwich, wraps, or salads — analogous to data marts and data warehouses. Just like a counter is a subset of the sandwich shop, the data mart is a subset of the data warehouse. A data mart corresponds to an individual department, while a data warehouse corresponds to the entire enterprise.
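The mart-as-subset relationship can be illustrated with a minimal Python sketch. The warehouse rows, the field names, and the build_data_mart helper are all hypothetical and exist only for illustration; a real data mart would normally be a set of tables or views rather than an in-memory list.

```python
# Hypothetical enterprise-wide warehouse rows; field names are illustrative only.
WAREHOUSE = [
    {"department": "sales", "metric": "revenue", "value": 1200},
    {"department": "sales", "metric": "orders", "value": 35},
    {"department": "hr", "metric": "headcount", "value": 48},
]


def build_data_mart(warehouse, department):
    """Return the subset of warehouse rows that belong to one department."""
    return [row for row in warehouse if row["department"] == department]


# The sales "counter" sees only its own slice of the enterprise data.
sales_mart = build_data_mart(WAREHOUSE, "sales")
```

Every row in sales_mart also exists in WAREHOUSE, which is the sense in which the mart is a subset of the warehouse.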
At the sandwich shop, you look at the menu and decide what kind of sandwich you want; then, you order it. The sandwich maker uses the same repetitive process to make each sandwich; indeed, you can find some sandwiches already made and wrapped for immediate consumption. The sandwich shop's menu is analogous to the menu of the business intelligence (BI) tools integrated with the data warehouse. The analytics tools also use repetitive processes to generate reports and provide users with some canned reports for immediate consumption.
Most people prefer to customize their sandwich, asking for changes in the quantities of the ingredients, changing the garnishing, or omitting some of the ingredients. Likewise, with BI tools, you can customize reports by selecting specific data. Just like you can create your own sandwich by specifying the ingredients to the sandwich maker, you can also create custom analytics reports by specifying the data and algorithms in the BI menu.
Now, imagine that you're a food inspector and want to ensure that none of the groceries used to prepare the sandwiches was contaminated. Also you want to ensure that the process used for food preparation, including washing, cleaning, and dicing, was consistent and done under sanitary conditions. In such a case, you would need to audit the processes used for food preparation and periodically inspect the food preparation area.
Similarly, auditors need to access the raw data to verify that there has been no contamination of the data in the data preparation process because of transcription, cleaning, formatting, and normalizing. Unlike in the case of the groceries in the sandwich shop, you can copy and clone data. So, for compliance and auditing, storage of the raw data is possible.
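One way to picture this kind of audit is a fingerprint recorded when the raw data arrives, which an auditor can later compare against the archived copy. The following is a minimal sketch under that assumption; the record contents and the fingerprint, prepare, and audit helpers are hypothetical, and SHA-256 is simply one reasonable choice of hash.

```python
import hashlib


def fingerprint(raw_bytes):
    """Return a SHA-256 hex digest that identifies the raw bytes exactly."""
    return hashlib.sha256(raw_bytes).hexdigest()


def prepare(raw_bytes):
    """Hypothetical preparation step: decode, trim whitespace, lowercase."""
    return raw_bytes.decode("utf-8").strip().lower()


def audit(archived_bytes, expected_fingerprint):
    """True when the archived raw copy still matches its ingestion fingerprint."""
    return fingerprint(archived_bytes) == expected_fingerprint


raw_record = b"  Customer-42, ACTIVE \n"      # hypothetical raw input
stored_fingerprint = fingerprint(raw_record)  # recorded at ingestion time
raw_archive = raw_record                      # raw copy kept for auditors (bytes are immutable)
prepared = prepare(raw_record)                # what downstream reports would see
```

Because the prepared value is derived from the raw copy rather than written over it, audit(raw_archive, stored_fingerprint) keeps returning True no matter how much cleaning the preparation step performs.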
Originally, data lake referred to the data reservoir holding raw data as well as unstructured data such as text, images, audio, and video. However, as mentioned, vendors have other definitions of data lake.
Continuing the analogy, imagine a finicky consumer who's suspicious of the origins and freshness of the ingredients in the containers on the sandwich counter. The consumer might also want to put vegetables or meats not available in the sandwich shop into their sandwich. The sandwich shop is certainly not going to allow consumers to go behind the counter to prepare their own sandwich, so the consumer has no choice but to go to the grocery store to buy groceries and make their sandwich in their own kitchen.
Similarly, professional analysts and data scientists often want access to the raw data rather than to the prepared aggregate and summary data stored in the data warehouse. They would rather get the latest data from the source to ensure its validity and relevance. They might also want to see the arrival velocities of the data, which the preparation process can mask. If analysts want to see other data that the data warehouse doesn't include, they will want to access the raw databases directly. So that analysts don't have to query the source systems themselves, a data lake keeps clones of the raw databases for such access needs and for sandboxing new analytics.
Sometimes, a gourmet sandwich maker might insist on getting ingredients farm fresh from the farmer rather than from the grocery store. In that case, the gourmet sandwich maker must duplicate the functions of the grocery store's produce buyer. This is analogous to consuming real-time data directly from its source, such as an Internet of Things (IoT) device. In such a case, the data lake must also perform extract, transform, load (ETL) functions on those real-time data streams.
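As a rough illustration of ETL applied to a stream, the following Python sketch processes messages one at a time as they arrive. Everything here is an assumption made for the example: the sensor_stream generator stands in for an IoT feed, the message fields are invented, and the two lists stand in for raw and curated storage.

```python
import json


def sensor_stream():
    """Hypothetical IoT feed: yields raw JSON strings as they arrive."""
    yield '{"device": "t1", "temp_f": "68.0"}'
    yield '{"device": "t2", "temp_f": "212.0"}'


def etl(stream, raw_store, curated_store):
    """Keep each raw message, then extract, transform, and load it."""
    for message in stream:
        raw_store.append(message)                         # land the raw data first
        record = json.loads(message)                      # extract
        temp_c = (float(record["temp_f"]) - 32) * 5 / 9   # transform: Fahrenheit to Celsius
        curated_store.append(                             # load
            {"device": record["device"], "temp_c": round(temp_c, 1)}
        )


raw_store, curated_store = [], []
etl(sensor_stream(), raw_store, curated_store)
```

Storing the raw message before transforming it reflects the data lake idea of keeping the original form available, even though the curated copy is what most consumers would query.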
Finally, imagine a seedy sandwich shop. The containers at the counter don't have labels. Vegetables and meats overflow into one another willy-nilly, and even the sandwich maker is unsure exactly what type of meat is in that last container. Customers might walk out because they can't be sure what kind of sandwich they're getting. This is analogous to a data swamp, which is a poorly maintained data lake. The data is like mystery meat, and no one can confirm the origins of some of it. Good data becomes inaccessible because the data swamp doesn't document the metadata labels appropriately (or worse, documents them wrongly), or because some of the data is in a format that the integrated tools can't read or that a query can't retrieve.
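The two swamp symptoms described above, missing metadata labels and unreadable formats, can be checked for mechanically. The sketch below is a simplified illustration; the catalog entries, the list of required metadata keys, and the set of readable formats are all hypothetical choices rather than any standard.

```python
# Hypothetical rules: which labels every dataset needs and which formats the tools can read.
REQUIRED_METADATA = ("source", "format", "ingested_on")
READABLE_FORMATS = {"csv", "json", "parquet"}

# Hypothetical catalog entries, keyed by dataset name.
catalog = {
    "claims_2016": {"source": "billing", "format": "csv", "ingested_on": "2017-01-05"},
    "mystery_meat": {"format": "csv"},
    "old_scans": {"source": "archive", "format": "tiff", "ingested_on": "2015-03-01"},
}


def swamp_report(catalog):
    """Map each problem dataset name to the list of reasons it is hard to use."""
    problems = {}
    for name, metadata in catalog.items():
        reasons = [f"missing {key}" for key in REQUIRED_METADATA if not metadata.get(key)]
        if metadata.get("format") not in READABLE_FORMATS:
            reasons.append("unreadable format")
        if reasons:
            problems[name] = reasons
    return problems


report = swamp_report(catalog)
```

A check like this catches absent labels and unsupported formats, but not labels that are present and wrong, which is the harder problem the text mentions.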
Why do we really need data lakes?
You now know that we need data lakes for several reasons:
- As a raw data repository for compliance and audit purposes (for example, audio and video recordings, document scans, and text and log files)
- As a platform for data scientists and analysts to access both structured and unstructured data for validation purposes and to sandbox new analytics models
- As a platform to integrate real-time data from operational or transactional systems and, increasingly, sensor data from IoT devices
The aggregate and summary data that the data warehouse provides is enough for most BI users. The users of a data lake can be auditors, specialist analysts, and data scientists (who are in the minority). What other compelling reasons are there for an enterprise to choose to create a data lake? To answer that question, it's worthwhile to examine how the data lake differs from a data warehouse.
What's the difference between a data warehouse and a data lake?
Data warehouses are a mature and secure technology with a formal architecture. They store fully processed and structured data subject to data governance processes. Data warehouses combine data into an aggregate, summary form for use enterprise-wide and write metadata and schema definitions while performing the data Write operations. Data warehouses usually have fixed configurations; they are highly structured and therefore less flexible and agile. A cost is associated with processing all the data before storage, and large volume storage is relatively more expensive.
Data lakes, in contrast, are a newer technology and have evolving architectures. Data lakes store raw data in any form — both structured and unstructured — and in any format, including text, audio, video, and images. As defined, a data lake is not subject to data governance, but experts agree that good data management is essential to prevent a data lake from turning into a data swamp. Data lakes create schemas during data Read operations. Data lakes are less structured and more flexible; they offer better agility than data warehouses. No processing is necessary until data retrieval, and data lakes use inexpensive storage by design.
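The difference between applying a schema on write and applying it on read can be shown with a small Python sketch. The two-field schema and the Warehouse and Lake classes are hypothetical stand-ins that model only the timing of schema enforcement, not any real product.

```python
import json

SCHEMA = {"id": int, "amount": float}  # hypothetical two-column schema


def apply_schema(record):
    """Cast each schema field to its declared type; raises if a field is missing or invalid."""
    return {name: cast(record[name]) for name, cast in SCHEMA.items()}


class Warehouse:
    """Schema on write: records are validated and structured before storage."""

    def __init__(self):
        self.rows = []

    def write(self, record):
        self.rows.append(apply_schema(record))  # bad records are rejected here

    def read(self):
        return list(self.rows)


class Lake:
    """Schema on read: raw text is stored untouched and interpreted on retrieval."""

    def __init__(self):
        self.blobs = []

    def write(self, raw_text):
        self.blobs.append(raw_text)  # no processing at ingestion

    def read(self):
        return [apply_schema(json.loads(blob)) for blob in self.blobs]


warehouse, lake = Warehouse(), Lake()
warehouse.write({"id": "1", "amount": "9.50"})
lake.write('{"id": "1", "amount": "9.50", "note": "extra fields are kept"}')
```

Both stores return the same structured row on read, but the lake still holds the original text, including the extra note field that the schema ignores. The cost of interpretation simply moves from ingestion time to retrieval time.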
Despite their advantages, data lakes have some catching up to do regarding security, governance, and management. But there is an elephant in the room that serves as a compelling driver for their adoption.
Machine learning and deep learning as drivers
One of the least discussed yet probably most compelling reasons to adopt data lakes is the rising adoption of machine learning and deep learning technologies for data mining and analytics. Software auditing is a mature domain for traditional search and analytics, but it's in its infancy when it comes to machine learning and deep learning technologies used for data mining and analytics.
Speech transcription, optical character recognition, image and video recognition, and so forth now routinely use machine learning or deep learning technologies. Data scientists need access to the raw, unstructured data to train these systems, to perform systems validation, and to ensure an audit trail. Similarly, deep learning performs tasks such as data mining to find patterns and relationships between dimensional and time-series data.
Another deep learning application is to extract formerly inaccessible data that a query cannot retrieve. Such data, called dark data, is the subject of the next segment in this series. The advent of machine learning and deep learning in data mining and analytics applications is a very compelling reason to move to data lake architectures.
The benefits of data lakes
Data lakes have several benefits:
- Easy data collection and ingestion: All the data sources in an enterprise feed into the data lake. The data lake, therefore, becomes a seamless point of access to both structured and unstructured data stored in either on-premise servers or cloud servers. The entire silo-less data collection is thus easily available for ingestion by data analytics tools. Besides, the data lake can store data in multiple formats, such as text, audio, video, and images, in multiple file formats. This flexibility simplifies the integration of legacy data stores.
- Support for real-time data sources: Data lakes support ETL functions for real-time and high-velocity data streams, which allows the convergence of sensor data from IoT devices with other data sources within the data lake.
- Faster data preparation: Analysts and data scientists don't have to spend time accessing multiple sources directly and can search for, find, and access data much more easily, speeding the data-preparation and reuse process. Data lakes also track and confirm data lineage, which helps to ensure that data is trustworthy and produces prompt BI for data-driven decision making.
- Better scalability and agility: Data lakes can take advantage of distributed file systems for storage and are thus highly scalable. The use of open source technologies also reduces storage costs. Data lakes are less rigidly structured and therefore inherently offer better flexibility, which results in better agility. Data scientists can create sandboxes within the data lake to develop and test new analytics models.
- Advanced analytics with artificial intelligence: Access to raw data, the capability to create sandboxes, and the flexibility to reconfigure make data lakes a powerful platform to rapidly develop and use advanced analytics models. Data lakes are ideally suited to the use of machine learning and deep learning to perform tasks such as data mining and data analysis as well as for the extraction of unstructured data.
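The data lineage tracking mentioned under faster data preparation can be pictured as a history that travels with each dataset. The following Python sketch is one simplified way to model it; the Dataset class, the ingest and derive helpers, and the sample values are hypothetical and not drawn from any particular data lake product.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    name: str
    rows: list
    lineage: list = field(default_factory=list)  # ordered history of how this data was produced


def ingest(name, rows, source):
    """Register raw rows and note where they came from."""
    return Dataset(name, list(rows), [f"ingested from {source}"])


def derive(parent, name, step, func):
    """Create a new dataset from a parent and extend the lineage trail."""
    return Dataset(name, func(parent.rows), parent.lineage + [f"{step} (from {parent.name})"])


# Hypothetical example: raw sensor readings and a cleaned copy derived from them.
raw = ingest("raw_readings", [3, None, 7], source="sensor-gateway")
clean = derive(
    raw,
    "clean_readings",
    "dropped nulls",
    lambda rows: [r for r in rows if r is not None],
)
```

Because derive builds a new Dataset instead of modifying its parent, the raw readings stay intact, and clean.lineage records each step that separates the cleaned data from its source.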
The evolution of data lakes
The rise of data lakes is more a convergence of technologies than an evolution. Data warehouses were an evolutionary step up from their predecessors, relational databases, but we cannot say the same of data lakes relative to data warehouses.
Data lakes bring together diverse technologies, including data warehousing, real-time and high-velocity data streaming technologies, data mining, deep learning, distributed storage, and other technologies. There is a feeling, however, that data lakes have a limited user group among professional data scientists or analysts. Another common misconception is tying the data lake concept to a specific enabling technology such as Hadoop.
The data lake concept has a much greater potential than any one underlying technology, though, and it is in the process of continuous evolution as vendors add features and functionality. Potential areas of growth include:
- Architectural standardization and interoperability
- Data governance, management, and curation
- Holistic data security
As with most evolving technologies, competition among vendors and business drivers keep pushing the boundaries. It's only a matter of time before data lakes gain widespread acceptance among the pantheon of data-storage technologies.
The application of data lakes
Some features of data lakes make them well suited to certain applications. This section examines two of them.
Healthcare and the life sciences
Data lakes can help resolve electronic medical record (EMR) interoperability issues. The intention of the federal mandate for the use of EMRs was to give physicians the ability to access patient medical records across multiple systems and to ease the transition of patient care between providers. In practice, many of these records (both insurance claims and clinical data) are either not interoperable or not in the form of machine-readable data. Data lakes store records in any format until retrieval, so patient records can also include handwritten doctor's notes, medical imaging, and so on. Data lakes can also extract and store data from real-time data streams, which result from the growing use of medical device telemetry and the IoT in health care.
Banking and finance
The banking and finance industry typically deals with multiple data sources. It also deals with high-velocity transaction data, from stock markets to credit cards and other banking transactions. Banking and financial institutions routinely store legal and other documents for regulatory compliance and audit requirements. Data lakes are ideal for storing these mixed data formats and for storing legacy data digitally for easy retrieval. Data lakes also serve as an agile platform for ingesting the multiple data streams that feed the heavy use of analytics in this industry vertical.
Data lakes, when designed and implemented properly, are a powerful way to store large volumes of multiformat data without the need for silos. They cut the time and cost of data ingestion and transformation and thus make the data available promptly to users. They also allow the use of lower-cost distributed storage. Data lakes have yet to mature architecturally, and there is currently a lack of standardization between vendor offerings. Data lakes are still evolving and adding new functionality to improve features for access control, security, data management, curation, and so on. The advent of machine learning and deep learning technologies for data mining and analytics introduced the need for a platform that provides easy access to raw data to train these systems, for systems validation, and to ensure an audit trail. Data lakes are an elegant answer to that need. Deep learning also enables access to previously ingested legacy data in data lakes that is inaccessible through standard query mechanisms. This so-called "dark data" is the subject of Part 3 of this series.