June 19, 2019 By Adam Kocoloski 6 min read

An overview of data lakes, their architecture, and how they can allow you to drive insights and optimizations across your organization.

A data lake is a centralized repository that allows you to store vast amounts of structured and unstructured data. Data-driven businesses often use this architecture to drive business value from their data assets and break down organizational silos.

In this lightboarding video, I’m going to cover data lake architecture and explain how data lakes can empower you to collaborate and analyze data in different ways that will help you make smarter and faster business decisions. I hope you enjoy!

Learn more about data lakes

Video Transcript

What are data lakes?

Hi everyone, my name’s Adam Kocoloski with IBM Cloud, and I’m here to talk to you today about data lakes—what they are, how you use one, and the kind of things you ought to be thinking about as you set one up to power your applications and create more intelligent experiences for users.

Using data lakes to navigate a world of data

So, data lakes exist because we’re all awash with data and we’ve got systems of record, we’ve got systems of engagement, we’ve got streaming data, we’ve got batch data, internal/external data. And it’s really a combination of these different kinds of data sources that leads us to get powerful insights about what our users are doing, about the way the world is working around us, and it leads us to develop more intelligent applications.

Ingestion framework

Data lakes start by collecting all those different types of data sources through a common ingestion framework, and that ingestion framework is something that typically wants to be able to support a diverse array of different types of data.

Storage repository

And it wants to kind of standardize and centralize all that stuff into a common storage repository.

That’s not always required but, typically, you don’t want to be analyzing the source data directly—you want to be able to take a copy of it so that you’ve got the flexibility to do the kind of things you need to do with that data.
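The ingestion and storage steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `LakeStore`, `read_csv_batch`, and `read_json_stream` are hypothetical names, and an in-memory list stands in for a real storage repository. The point it shows is that different kinds of sources pass through one common entry point and land as tagged copies.

```python
import csv
import io
import json
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, List


@dataclass
class LakeStore:
    """Hypothetical in-memory stand-in for the common storage repository."""
    records: List[Dict[str, Any]] = field(default_factory=list)

    def land(self, source: str, rows: Iterable[Dict[str, Any]]) -> int:
        # Store a copy of each row, tagged with where it came from,
        # so analysis never touches the source system directly.
        count = 0
        for row in rows:
            self.records.append({"source": source, "payload": dict(row)})
            count += 1
        return count


def read_csv_batch(text: str) -> List[Dict[str, Any]]:
    """Parse a batch export (for example, from a system of record)."""
    return list(csv.DictReader(io.StringIO(text)))


def read_json_stream(lines: Iterable[str]) -> List[Dict[str, Any]]:
    """Parse newline-delimited JSON events (for example, streaming data)."""
    return [json.loads(line) for line in lines if line.strip()]


store = LakeStore()
batch = "customer_id,amount\n1,19.99\n2,5.00\n"
events = ['{"customer_id": 1, "action": "click"}', ""]

landed_batch = store.land("orders_csv", read_csv_batch(batch))
landed_events = store.land("clickstream", read_json_stream(events))
```

Keeping the payload untouched at this stage is a deliberate choice in the sketch: cleansing happens later, on the copy, so the landed data still reflects what the source actually sent.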

Data cleansing, data preparation, and feature extraction

And speaking of that, the data typically doesn’t come in a form where you can use it right out of the box. There’s a lot of data cleansing and data preparation that’s required. Oftentimes there’s also a requirement to create new features, something we call feature extraction: combinations of different types of data that need to be pulled together in order to create the right bits of information to analyze.
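As a rough illustration of cleansing and feature extraction, the sketch below normalizes messy order rows and combines them into per-customer features. The field names (`customer_id`, `amount`) and the features chosen (`order_count`, `avg_order_value`) are assumptions made for the example, not anything prescribed in the talk.

```python
from collections import defaultdict
from typing import Dict, List, Optional


def clean_amount(raw: Optional[str]) -> Optional[float]:
    """Return the amount as a float, or None if the value is unusable."""
    if raw is None:
        return None
    try:
        return float(raw.strip().replace("$", ""))
    except ValueError:
        return None


def extract_features(orders: List[Dict[str, str]]) -> Dict[str, Dict[str, float]]:
    """Combine raw order rows into per-customer features."""
    totals: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for order in orders:
        amount = clean_amount(order.get("amount"))
        if amount is None:
            continue  # cleansing: drop rows with unparseable amounts
        customer = order["customer_id"].strip()  # cleansing: trim stray spaces
        totals[customer] += amount
        counts[customer] += 1
    return {
        c: {"order_count": counts[c], "avg_order_value": totals[c] / counts[c]}
        for c in totals
    }


raw_orders = [
    {"customer_id": "1", "amount": "$20.00"},
    {"customer_id": "1 ", "amount": "10.00"},
    {"customer_id": "2", "amount": "n/a"},
]
features = extract_features(raw_orders)
```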

Machine learning model training and advanced analytics

And once you sort of cleanse that data, prep the data, model the right kind of features for your analysis, then you get to the fun part which is actually going in and doing the machine learning model training and doing your advanced analytics.

Derived data sets

And each of these steps is typically creating new derived data sets that tie back to the original one.

And that relationship is a really important thing to capture, because, let’s say, there was a problem with one of your data sources—you know, there was a correction that needed to be made. You need to understand how that flows through the entire pipeline of more refined data sets and models that you’re producing so that you can go back and correct it. 
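One way to picture that relationship is as a lineage graph, where each data set points to the data sets and models derived from it. The sketch below assumes a hypothetical graph with made-up names and uses a breadth-first walk to find everything that would need to be rebuilt after a correction to one source.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical lineage graph: each data set maps to the data sets derived from it.
lineage: Dict[str, List[str]] = {
    "raw_orders": ["clean_orders"],
    "raw_clicks": ["clean_clicks"],
    "clean_orders": ["customer_features"],
    "clean_clicks": ["customer_features"],
    "customer_features": ["churn_model"],
}


def downstream(source: str, graph: Dict[str, List[str]]) -> Set[str]:
    """Return every data set or model derived, directly or indirectly, from source."""
    affected: Set[str] = set()
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for child in graph.get(current, []):
            if child not in affected:
                affected.add(child)
                queue.append(child)
    return affected


to_rebuild = downstream("raw_orders", lineage)
```

Here a fix to `raw_orders` flags the cleaned orders, the features built on them, and the model trained on those features, while the click data is left alone.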


And that’s where this governance stuff comes into play. This is something that’s really, you know, infused at every step of the journey.

It means collecting metadata (you know, data about your data): the right kinds of information about the tables in your data sets and how they relate to one another. It means being able to enforce policies so that as an organization we use the data the way it’s meant to be used, the way it’s intended to be used, the way it’s acceptable to be used to drive the business forward. That’s really something that can’t be bolted on after the fact; that’s something that has to be present throughout the entire lifecycle.
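A minimal sketch of those two governance ideas, metadata and policy enforcement, might look like the following. The `DatasetMetadata` fields, the catalog contents, and the purpose names are all hypothetical; a real deployment would rely on a dedicated catalog and policy engine rather than a dictionary.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet


@dataclass(frozen=True)
class DatasetMetadata:
    """Hypothetical catalog entry: data about the data."""
    name: str
    owner: str
    contains_pii: bool
    allowed_purposes: FrozenSet[str]


catalog: Dict[str, DatasetMetadata] = {
    "customer_features": DatasetMetadata(
        name="customer_features",
        owner="analytics",
        contains_pii=True,
        allowed_purposes=frozenset({"recommendations"}),
    ),
}


def check_access(dataset: str, purpose: str) -> bool:
    """Allow use only when the data set is cataloged and the purpose is approved."""
    meta = catalog.get(dataset)
    if meta is None:
        return False  # ungoverned data is denied by default
    return purpose in meta.allowed_purposes
```

Calling `check_access` at every pipeline step, rather than only at the end, is what keeps the policy present throughout the lifecycle instead of bolted on afterward.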

And if we stopped here, we haven’t really changed anything. It’s only by getting these insights that we’re producing in this data lake back out into the real world that we’re able to deliver on the business promise of these data lakes that we’re all investing in.

Application of the data lake

And that’s where this Apply step comes in. This can take a few different forms. You might be building, simply, dashboards that are helping business executives make smarter decisions about where to take the business forward, about new projects to invest in.

Or you might be building smarter applications that are able to make intelligent recommendations to the users of those apps based on historical purchase data.
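To make the recommendation case concrete, here is a deliberately simple sketch that suggests items bought by users with overlapping purchase histories. The co-purchase counting approach, the `recommend` function, and the sample data are assumptions for illustration only; production recommenders are typically far more sophisticated.

```python
from collections import Counter
from typing import Dict, List, Set


def recommend(history: Dict[str, Set[str]], user: str, top_n: int = 2) -> List[str]:
    """Suggest items bought by users who share at least one purchase with user."""
    owned = history.get(user, set())
    scores: Counter = Counter()
    for other, items in history.items():
        if other == user or not (owned & items):
            continue  # skip the user themselves and users with no overlap
        for item in items - owned:
            scores[item] += 1
    # Sort by score, then by name, so ties resolve deterministically.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]


# Hypothetical historical purchase data.
purchases = {
    "ana": {"tent", "stove"},
    "ben": {"tent", "lantern", "stove"},
    "cy": {"tent", "lantern", "boots"},
    "di": {"novel"},
}
suggestions = recommend(purchases, "ana")
```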

Increasingly, we’re also seeing a lot of process automation, where an intelligent model can smooth over some typically manual business processes and create a more intelligent experience end-to-end based on this rich, data-driven understanding of the problem at hand.

Iteration and a cycle of new data

And, really, this whole process iterates back, right. Those more intelligent applications, they end up generating new data and the cycle continues. 

And so that—in a nutshell, at a very high level—is what a data lake does.

The AI ladder and collecting data

Some of you may have heard us talk about the ladder to AI, the AI ladder; and when we talk about that: 

  1. We talk about collecting data. 
  2. We talk about organizing data. 
  3. We talk about analyzing. 
  4. We talk about infusing.

And really those four steps on this ladder are things that you can see represented throughout this data lake environment.

Clearly, over here, we’re doing a lot of collection of these individual sources of data.

This data preparation and feature extraction step, in a governed fashion, is absolutely what we mean by the organizing of data.

ML model training is a key example of data analysis.

And when we talk about infusing the insights from the data lake into the applications, that’s really this last step here.

The data lake is a vehicle to climb the AI ladder 

And so, there is very much a clear linkage between climbing this AI ladder and a data lake as a vehicle that can help you make that journey.
