Cleaning the swamp: Turn your data lake into a source of crystal-clear insight

By: Jay Limburn

When we talk to data scientists, we hear the same sad story again and again. They tell us how their organization fell in love with the idea of building a data lake as a single platform for self-service data science. How they were wooed and won by a vendor with a solution that promised much, but delivered little. How their vision of a data lake as a clear source of business insight has turned into a stagnant swamp—a dumping ground where data goes to die.

How a data lake becomes a data swamp

The problem rarely lies with the infrastructure itself. If you want to capture, manage and analyze vast quantities of highly varied data, technologies such as Apache Hadoop, Apache Spark and object storage are a good way to provide the highly scalable storage and compute resources you will need. From a pure technology perspective, there is nothing wrong with the physical architecture of the data lakes that many companies have built over the past few years.

The issue is that infrastructure alone isn’t enough. As the quantity and variety of data increases, it doesn’t just demand more storage and computing power—it also demands better organization and management.

By building data lakes with a focus on data capture, storage and processing, organizations have too often overlooked concerns such as data findability, classification and governance. This is the fundamental problem behind the data swamp phenomenon: data goes in, but there’s no safe, reliable or easy way to find what you’re looking for and get it out again.

Why data gets lost in the swamp

First, there’s a common problem that much of an organization’s data never makes it into their data lake in the first place. This is partly due to the time and cost that need to be expended on building complex ETL processes to ingest new data sources into the lake.

But more importantly, there’s also a psychological reason: it’s all about trust. In theory, if you own a dataset that could be of value to others in your organization, you should upload it into the data lake for them to use. In practice this rarely happens, because data owners are too worried about the lack of data governance. They have no way of knowing who will use their data, or how they will use it. If the data contains commercially sensitive information or there are any data privacy concerns, the risk of opening it up to potential misuse is too high—and the data owner probably won’t want to take that risk.

Second, even if a dataset does get ingested into the data lake, users generally won’t be able to find it, and even if they do, they won’t understand it or know how to use it. Without metadata to explain what the dataset is, what information it contains, how good the quality of the data is, and how other data scientists have used it, most data assets are practically worthless.

Finally, even the most comprehensive data lake can’t hold everything. Most data science involves combining proprietary data (such as a company’s daily sales figures or customer records) with external datasets (such as weather data, maps, or stock market prices). Without an easy way to integrate data from external sources with internal datasets, most data lakes force data scientists to do much of their work outside of the data lake ecosystem—which once again contributes to users bypassing the data lake and taking their datasets elsewhere.

Clearing muddy waters

These problems are all closely related and tend to reinforce each other in a vicious circle. Fortunately, however, their close relationship stems from a common root cause—and by addressing that root cause, we can solve all of the problems simultaneously.

Any attempt to manage and organize information—from a simple telephone directory up to the largest and most complex database—depends on two things: data and metadata. The data is the information itself, while the metadata describes the information’s attributes, such as what structure it is stored in, where it is stored, how to find it, who created it, where it came from, and what it can be used for. Most of today’s data lakes offer powerful capabilities for storing and processing data, but are comparatively weak in terms of managing metadata.
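To make the distinction concrete, the sketch below pairs a tiny data asset with the kind of metadata record that might describe it. The field names are invented for illustration and do not follow any real catalog’s schema.

```python
# Illustrative only: the data asset itself versus the metadata that
# describes it. Field names are made up for this example, not taken
# from any real catalog schema.

# The data: the information itself.
daily_sales = [
    {"store_id": 101, "date": "2017-11-01", "revenue": 18250.40},
    {"store_id": 102, "date": "2017-11-01", "revenue": 9714.75},
]

# The metadata: attributes that describe the information.
daily_sales_metadata = {
    "name": "daily_sales",
    "description": "Per-store daily revenue totals from the POS system",
    "location": "s3a://lake/sales/",             # where it is stored
    "format": "parquet",                         # what structure it is stored in
    "owner": "retail-analytics-team",            # who created it
    "lineage": ["pos_transactions"],             # where it came from
    "classification": "commercially-sensitive",  # governance classification
    "tags": ["sales", "retail", "daily"],        # helps users find it
}
```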

By augmenting your data lake with a metadata management platform, such as IBM Data Catalog (currently in beta), you can overcome these deficiencies and start unlocking the true value of your data. IBM Data Catalog enables you to build a comprehensive index of all your data assets, and automatically add useful metadata to help classify their content, understand their context, trace their lineage, and monitor their usage.

Users can add tags and comments to explain what information each dataset contains and why it is useful. Meanwhile, data stewards can apply governance policies to ensure that only authorized users can access sensitive resources, and can monitor for any breaches.
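As a concrete illustration of that kind of policy, here is a minimal, hypothetical sketch of a metadata-driven access check with an audit log. Real catalog products let stewards define such rules declaratively; every role, classification and function name here is invented for the example.

```python
# Hypothetical sketch of a metadata-driven access policy: sensitive
# assets are only visible to explicitly authorized roles, and every
# decision is logged so stewards can monitor for breaches.
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("governance")

# Invented policy table: which roles may see which classification.
AUTHORIZED_ROLES = {"commercially-sensitive": {"data_steward", "finance_analyst"}}

def can_access(user_role: str, asset_metadata: dict) -> bool:
    """Decide whether a role may see an asset, based on its metadata."""
    classification = asset_metadata.get("classification", "public")
    allowed = (classification == "public"
               or user_role in AUTHORIZED_ROLES.get(classification, set()))
    # Audit every decision so stewards can review potential breaches.
    log.info("access %s: role=%s asset=%s classification=%s",
             "granted" if allowed else "denied",
             user_role, asset_metadata.get("name"), classification)
    return allowed

# Example: a marketing analyst is denied access to a sensitive dataset.
can_access("marketing_analyst",
           {"name": "daily_sales", "classification": "commercially-sensitive"})
```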

As the quantity and quality of metadata attached to each asset increases, the solution’s intelligent search capability makes it easier for users to find the information they are looking for. And because IBM Data Catalog is primarily a metadata repository rather than a data store, it can index data assets both within and beyond the data lake. Users therefore get a single interface to find, explore and integrate data, regardless of whether it lives in the lake itself, is held in transactional systems, or comes from a third-party repository or service.
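That architectural point is worth a small illustration: because only metadata is indexed, a single search can span assets wherever they live. The toy records and search function below are invented to show the idea, not how any particular catalog implements it.

```python
# Toy illustration: the catalog holds only metadata records with
# pointers to where each asset actually lives, so one search can
# cover the lake, transactional systems and external services.
# All records, locations and the search logic are invented examples.
catalog = [
    {"name": "daily_sales", "tags": ["sales", "retail"],
     "location": "s3a://lake/sales/"},                 # inside the data lake
    {"name": "customers", "tags": ["crm", "retail"],
     "location": "jdbc:db2://crm-prod/CUSTOMERS"},     # transactional system
    {"name": "weather_daily", "tags": ["weather", "external"],
     "location": "https://example.com/weather"},       # third-party service
]

def search(term: str) -> list:
    """Return every asset whose name or tags mention the search term."""
    term = term.lower()
    return [asset for asset in catalog
            if term in asset["name"].lower()
            or any(term in tag.lower() for tag in asset["tags"])]

# One query spans all three systems, because only metadata is indexed here.
print([a["name"] for a in search("retail")])  # ['daily_sales', 'customers']
```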

Gain crystal-clear insight

By strengthening your data lake’s metadata management capabilities, you can solve your data findability, management and governance issues. Solutions such as IBM Data Catalog enable you to create a detailed map that helps your data scientists navigate your data lake much more easily. Using the catalog as their compass, they can plumb the depths of your datasets to obtain crystal-clear insight, explore unfamiliar waters in safety, and steer clear of hazardous data governance issues.

To take a more detailed look at how IBM Data Catalog can help you extend the value of your data lake, visit our website and sign up for updates about the beta.
