How do data lakes in the cloud work?
In this overview video, I’m going to go through the architecture of a data lake in the cloud and discuss the many benefits it can provide an enterprise.
Data Lakes in the Cloud
Hello, this is Torsten Steinbach, an architect at IBM for Data and Analytics in the cloud, and I’m going to talk to you about data lakes in the cloud.
The center of a data lake in the cloud is the data persistency itself—so, we talk about persistency of data. The data itself in the data lake in the cloud is persisted in object storage.
But we don’t just persist the data itself, we also persist information about the data. On one side, that is indexes: we need to index the data so that we can make use of it in the cloud data lake efficiently. And we also need to store metadata about the data in a catalog.
So, this is our persistency of the data lake—now the question is how do we get this data into the data lake?
So, there are different types of data that we can ingest, so we need to talk about ingestion of data. First, you can have a situation where some of your data is already persisted in databases.
So, these can be relational databases and can also be other operational databases, NoSQL databases, and so on.
And then we get this data into our data lake.
There are actually two fundamental mechanisms.
One is basically an ETL, which stands for Extract, Transform, Load, and this is done in a batch fashion.
And a typical mechanism to do ETL is using SQL, and since we’re talking about cloud data lakes, this is SQL-as-a-service now.
But in addition, and you can combine the two, there is the mechanism of replication, which is basically more of a change feed: after you have batch-ETL’d the initial data set, all of the changes that come in after this initial batch are replicated.
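To make the combination of the two mechanisms a bit more concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: an in-memory SQLite database stands in for the operational database, a plain dictionary stands in for the copy in the data lake, and the table name, column names, and change-feed format are all made up.

```python
import sqlite3

# Hypothetical source: an in-memory SQLite database standing in for an
# operational database. Table and column names are invented for illustration.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
source.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (2, 25.5)])

# Stand-in for the data lake copy of the table, keyed by primary key.
lake_orders = {}


def batch_etl(conn, target):
    """Initial load: extract every row with SQL and load it in one batch."""
    for row_id, amount in conn.execute("SELECT id, amount FROM orders"):
        target[row_id] = {"id": row_id, "amount": amount}


def apply_change_feed(changes, target):
    """Replication: apply only the changes made after the initial batch."""
    for change in changes:
        if change["op"] == "delete":
            target.pop(change["id"], None)
        else:
            # Inserts and updates are both treated as upserts on the lake side.
            target[change["id"]] = {"id": change["id"], "amount": change["amount"]}


batch_etl(source, lake_orders)

# A hypothetical change feed, roughly as a replication tool might deliver it.
change_feed = [
    {"op": "update", "id": 2, "amount": 30.0},
    {"op": "insert", "id": 3, "amount": 7.25},
    {"op": "delete", "id": 1},
]
apply_change_feed(change_feed, lake_orders)
```

The batch step moves the whole table once, and the change feed then only carries what changed, which is why the two are typically used together.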
Next, we may have data that is not persisted yet at all, which is being generated as we speak, for instance from devices. So, we may have things like IoT devices, driving cars, and the like.
And they are actually producing a lot of IoT messages, all the time, continuously, and they also need to basically stream into the data lake. So, here we’re talking about the streaming mechanism.
In a very similar manner, we have data that originates from applications that are running in the cloud, or from services that are used by your applications. They’re all producing logs, and that’s very valuable information, especially if you’re talking about operational optimizations and getting business insights into your user behavior, these kinds of things. This is very important data that we need to get hold of.
So, logs also need a streaming mechanism to basically get streamed and stored in object storage.
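One common way to land a continuous stream in object storage is to buffer messages and write them out as one object per batch. The following is a minimal sketch of that idea, not a description of any particular streaming service: the dictionary standing in for a bucket, the `StreamWriter` class, the key layout, and the message fields are all assumptions.

```python
import json

# Stand-in for an object storage bucket: object key -> bytes.
object_store = {}


class StreamWriter:
    """Buffers incoming messages and writes one object per full batch."""

    def __init__(self, store, prefix, batch_size):
        self.store = store
        self.prefix = prefix
        self.batch_size = batch_size
        self.buffer = []
        self.batches_written = 0

    def send(self, message):
        self.buffer.append(message)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        key = f"{self.prefix}/part-{self.batches_written:05d}.json"
        # One JSON document per line, a common layout for log and IoT data.
        body = "\n".join(json.dumps(m) for m in self.buffer)
        self.store[key] = body.encode("utf-8")
        self.batches_written += 1
        self.buffer = []


writer = StreamWriter(object_store, prefix="iot/raw", batch_size=3)
for i in range(7):
    writer.send({"device": "car-1", "seq": i, "speed_kmh": 50 + i})
writer.flush()  # write the final, partially filled batch
```

Seven messages with a batch size of three end up as three objects, the last one holding the single leftover message.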
And finally, you may have a situation where you already have data sitting around on local disks. So, you may have local disks, maybe on your own machine. You may even have a local data lake, a classical data lake, not in the cloud, and typically these are Hadoop clusters that you have on-premises in your enterprise. Or, very frequently, it can be as simple as NFS shares that are used in your team and your enterprise to store certain data.
And if you want to get this data into a data lake, you also need a mechanism, and it’s basically an upload mechanism. So, a data lake needs to provide you with an efficient mechanism to upload data from ground to cloud, which means from on-premises into object storage in the cloud.
Now, the next thing we need to do when a bunch of data is here is process it.
This is especially important if you’re talking about data that hasn’t gone through an initial processing, for instance device data or application data. This is pretty raw data that has a very raw format, that is very volatile, that has very different structures and a changing schema. And sometimes it doesn’t have a real structure at all: it can be binary data, let’s say images taken by a device’s camera, and I need to extract features from it.
So, we’re talking about feature extraction from this data. But even if you already have the structure extracted, it might still need a lot of cleansing: you may have to normalize it to certain units, you may have to round it to certain time boundaries, you may have to get rid of null values, and these kinds of things. So, there are a lot of things that you need to do in terms of transformation: you need to transform the data.
Once you have transformed the data, you basically have data that you can use for analytics. But there is one additional thing that is advisable to do with this data: create an index for it, so that we know more about the data and can do efficient, performant analytics.
And finally, to leverage this data, you also need to tell the data lake about it by cataloging the data. So, there are multiple steps, and that’s why we often talk about a pipeline of data transformations that needs to be done here.
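The cleansing steps mentioned here (normalizing units, aligning timestamps to time boundaries, dropping null values) could look roughly like the following sketch. The readings, field names, the choice of Celsius as the target unit, and the five-minute window are assumptions made purely for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical raw device readings with mixed units, odd timestamps, and a null.
raw_readings = [
    {"ts": "2024-01-01T10:02:17", "temp": 68.0, "unit": "F"},
    {"ts": "2024-01-01T10:07:45", "temp": 21.5, "unit": "C"},
    {"ts": "2024-01-01T10:11:03", "temp": None, "unit": "C"},
]


def to_celsius(value, unit):
    """Normalize temperatures to one unit (Celsius)."""
    return (value - 32.0) * 5.0 / 9.0 if unit == "F" else value


def floor_to_boundary(ts, minutes=5):
    """Align a timestamp to the start of its N-minute window."""
    return ts - timedelta(
        minutes=ts.minute % minutes,
        seconds=ts.second,
        microseconds=ts.microsecond,
    )


def cleanse(readings):
    cleaned = []
    for reading in readings:
        if reading["temp"] is None:  # drop null values
            continue
        ts = floor_to_boundary(datetime.fromisoformat(reading["ts"]))
        temp_c = round(to_celsius(reading["temp"], reading["unit"]), 2)
        cleaned.append({"ts": ts.isoformat(), "temp_c": temp_c})
    return cleaned


cleaned = cleanse(raw_readings)
```

In this sketch the Fahrenheit reading becomes 20.0 degrees Celsius, both remaining timestamps snap to the start of their five-minute window, and the null reading is dropped.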
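The talk does not say which kind of index is meant. As one possible illustration, the sketch below builds a simple min/max index per object, so that a query with a range filter can skip objects that cannot contain matching rows. The object keys, the column name, and the helper functions are hypothetical.

```python
# Hypothetical objects in the lake, each holding rows with a "temp_c" column.
objects = {
    "readings/part-0.json": [{"temp_c": 18.0}, {"temp_c": 21.0}],
    "readings/part-1.json": [{"temp_c": 25.0}, {"temp_c": 31.5}],
    "readings/part-2.json": [{"temp_c": 2.0}, {"temp_c": 9.5}],
}


def build_min_max_index(objs, column):
    """Record the smallest and largest value of a column for each object."""
    index = {}
    for key, rows in objs.items():
        values = [row[column] for row in rows]
        index[key] = (min(values), max(values))
    return index


def objects_to_scan(index, lower, upper):
    """Return only the objects whose value range overlaps [lower, upper]."""
    return sorted(
        key for key, (lo, hi) in index.items() if hi >= lower and lo <= upper
    )


index = build_min_max_index(objects, "temp_c")
candidates = objects_to_scan(index, lower=20.0, upper=26.0)
```

A query for temperatures between 20 and 26 only has to read two of the three objects here, which is the kind of saving an index is meant to deliver.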
Now the question is, what do we use here?
And there are actually two processes, two mechanisms, two services, or types of services that are especially suited for this type of processing. One is function-as-a-service and the other one is SQL-as-a-service again.
So, with SQL- and function-as-a-service, you can do this whole range of things here: you can create indexes through SQL DDLs, you can also create tables through SQL DDLs, you can transform data, and you can use functions with custom libraries and custom code to do feature extraction from the format of the data that you need to process.
Once we have gone through this pipeline, the question is, what’s next? So, we have prepared and processed all of this data, and we have probably cataloged it, so we know what data we have.
Now it comes to the point where we really harvest all of this work by generating insights. Generating insights is, on one side, the whole group of business intelligence, which consists of things like doing reporting or creating dashboards, and that’s what’s typically referred to as BI. And one option that is possible now is to simply do BI directly against this data in a data lake.
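As a small illustration of creating tables and indexes through SQL DDL, the sketch below uses Python's built-in SQLite as a stand-in. This is an assumption for the sake of a runnable example: a real SQL-as-a-service engine works against data in object storage and has its own dialect, and the table, column, and index names here are made up.

```python
import sqlite3

# SQLite stands in for a SQL-as-a-service engine in this sketch.
conn = sqlite3.connect(":memory:")

# Cataloging: a CREATE TABLE statement registers the table name and schema.
conn.execute("CREATE TABLE readings (device TEXT, ts TEXT, temp_c REAL)")

# Indexing: a CREATE INDEX statement builds an index on a column.
conn.execute("CREATE INDEX idx_readings_device ON readings (device)")

conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [
        ("car-1", "2024-01-01T10:00:00", 20.0),
        ("car-2", "2024-01-01T10:00:00", 21.5),
    ],
)

# The engine's own catalog now lists both the table and the index.
catalog = conn.execute(
    "SELECT type, name FROM sqlite_master ORDER BY type, name"
).fetchall()
```

After the two DDL statements, the catalog query returns one entry for the index and one for the table, which is the information a data lake catalog needs in order to know what data exists.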
But actually, it turns out that this is especially useful for batch options, like creating reports in a batch fashion. When it comes to more interactive requirements, where you are basically sitting in front of the screen and need it to refresh in under a second, let’s say a dashboard, there is actually another very important mechanism that is very well established and is part of this whole data lake ecosystem, and this is a data warehouse.
So, a data warehouse or a database is highly optimized and has a lot of mechanisms for giving you low latency and also guaranteed response times for your queries.
So, the question is, how do we do that?
Now, we obviously need to move this data one step further after it has gone through all of the data preparation in the data lake with an ETL again.
And it happens again that SQL-as-a-service is a useful mechanism, because we already use it to ETL data into the data lake. Now we can also use it to ETL data out of this data lake into a data warehouse, so that it’s now in this, I would say, more traditional, established stack for doing BI, where it can be used by your BI tools, reporting tools, and dashboarding tools to do interactive BI with performance and response-time SLAs.
So, that’s one end-to-end flow now, but obviously, insights are more than just reporting and dashboarding. There’s a whole domain of tools and frameworks out there for more advanced types of analytics, such as machine learning, or simply data science tools and frameworks, so that you can basically also do analytics and artificial intelligence against the data that we’ve prepared and cataloged here.
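The step out of the data lake and into a warehouse can be pictured as follows. This is only a sketch: the prepared rows, the daily-average aggregation, and the table name are invented, and SQLite again stands in for the data warehouse.

```python
import sqlite3
from collections import defaultdict

# Hypothetical prepared data in the lake (already cleansed and cataloged).
lake_rows = [
    {"device": "car-1", "day": "2024-01-01", "temp_c": 20.0},
    {"device": "car-1", "day": "2024-01-01", "temp_c": 22.0},
    {"device": "car-2", "day": "2024-01-01", "temp_c": 18.0},
]

# Transform: aggregate to the shape the dashboard needs.
grouped = defaultdict(list)
for row in lake_rows:
    grouped[(row["device"], row["day"])].append(row["temp_c"])
summary = [
    (device, day, sum(values) / len(values))
    for (device, day), values in sorted(grouped.items())
]

# Load: SQLite stands in for the data warehouse.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE daily_avg_temp ("
    "device TEXT, day TEXT, avg_temp_c REAL, PRIMARY KEY (device, day))"
)
warehouse.executemany("INSERT INTO daily_avg_temp VALUES (?, ?, ?)", summary)

# A dashboard query now reads a small, pre-aggregated table.
dashboard_rows = warehouse.execute(
    "SELECT device, avg_temp_c FROM daily_avg_temp ORDER BY device"
).fetchall()
```

The dashboard no longer scans the raw rows in the lake. It reads a small table that was shaped for exactly this query, which is where the low latency comes from.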
And machine learning tools and data science tools, basically they all have very strong support for accessing data in an object storage. So that’s why this is a good fit basically—let them connect directly here to this data lake.
Now, that is the end-to-end process of basically getting from your data, with the help of a data lake, to insights.
One of the big problems today is for the people who do that to prove and explain how they got to an insight. How can you trust this insight? How can you reproduce this insight?
So, one of the key things that need to be part of this picture is data governance.
So, data governance in this context has two main things that we need to take care of. One is we need to be able to track the lineage of your data because you’ve seen the data is traveling from different sources, from preparation into some insights in the form of a report.
And you need to be able to track back—where did this report come from? Why is it looking like this? What’s the data that basically produced it?
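Tracking back from a report to its sources amounts to walking a graph of lineage records. Here is a minimal sketch of that walk. The data set names, the record layout, and the `trace_back` helper are hypothetical and not taken from any specific governance service.

```python
# Hypothetical lineage records: each data set maps to the inputs it was
# derived from and the step that produced it.
lineage = {
    "report/daily_temp": {
        "inputs": ["warehouse/daily_avg_temp"],
        "step": "BI report",
    },
    "warehouse/daily_avg_temp": {
        "inputs": ["lake/readings_clean"],
        "step": "ETL to warehouse",
    },
    "lake/readings_clean": {
        "inputs": ["lake/readings_raw"],
        "step": "cleansing",
    },
    "lake/readings_raw": {"inputs": [], "step": "streaming ingest"},
}


def trace_back(dataset, records):
    """List a data set and everything upstream of it, each entry once."""
    seen, order, stack = set(), [], [dataset]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        order.append(current)
        # Data sets without a lineage record are treated as original sources.
        stack.extend(records.get(current, {"inputs": []})["inputs"])
    return order


path = trace_back("report/daily_temp", lineage)
steps = [lineage[name]["step"] for name in path]
```

Starting from the report, the walk passes through the warehouse table and the cleansed data and ends at the raw streamed data, which answers the question of where the report came from.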
And the other thing is that a data lake needs to be able to enforce policies, governance policies.
Who is able to access what? Who is able to see personal information? Can I access it directly or only in an anonymized masked form?
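Questions like these can be expressed as rules that are applied whenever data is read. The sketch below shows the idea with a made-up policy table, made-up roles, and a simple hash-based pseudonym. It illustrates the enforcement flow only and is not a vetted anonymization technique.

```python
import hashlib

# Hypothetical policy: which roles may see which columns in clear text.
POLICY = {
    "email": {"clear_roles": {"privacy_officer"}},
    "amount": {"clear_roles": {"privacy_officer", "analyst"}},
}


def mask(value):
    """Replace a value with a short, stable pseudonym (illustration only)."""
    digest = hashlib.sha256(str(value).encode("utf-8")).hexdigest()
    return "masked-" + digest[:8]


def apply_policy(row, role):
    """Return the view of a row that the given role is allowed to see."""
    result = {}
    for column, value in row.items():
        rule = POLICY.get(column)
        if rule is None:
            continue  # columns without a rule are not exposed at all
        result[column] = value if role in rule["clear_roles"] else mask(value)
    return result


row = {"email": "jane@example.com", "amount": 42.0, "internal_note": "vip"}
analyst_view = apply_policy(row, "analyst")
officer_view = apply_policy(row, "privacy_officer")
```

In this sketch the analyst sees the amount but only a masked e-mail address, the privacy officer sees both in clear text, and the column without a rule is hidden from everyone.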
So, these are all governance rules, and there are governance services available in the cloud as well that a data lake basically needs to comply with and use in order to track all of this.
Deploying and automating the data pipeline
So, we’re almost done with this overall data lake introduction, but there is just one more thing that I want to highlight, since we’re talking about the cloud: how can I deploy my entire pipeline of traveling data on this whole infrastructure, and how can I automate that?
And here function-as-a-service basically plays a special role, because function-as-a-service has a lot of mechanisms that I can use to schedule and automate things, for instance a batch ETL step or generating a report.
So, this is the final thing that we need in our data lake in order to automate and operationalize the entire data and analytics pipeline using a data lake.
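To give a feel for scheduled functions, here is a small simulation using Python's standard `sched` module with a fake clock, so that it runs instantly. It does not use any real function-as-a-service API: the two functions, the event format, and the hourly timing are assumptions chosen for illustration.

```python
import sched


class FakeClock:
    """A simulated clock so the example runs instantly and deterministically."""

    def __init__(self):
        self.now = 0.0

    def time(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


runs = []


def batch_etl_step(event):
    """Hypothetical function body for a scheduled batch ETL step."""
    runs.append((event["time"], "batch ETL"))


def generate_report(event):
    """Hypothetical function body for a scheduled report."""
    runs.append((event["time"], "report"))


clock = FakeClock()
scheduler = sched.scheduler(clock.time, clock.sleep)

# Trigger the ETL every hour for three hours, and a report 30 minutes after each.
for hour in range(3):
    etl_time = hour * 3600
    report_time = etl_time + 1800
    scheduler.enterabs(etl_time, 1, batch_etl_step, argument=({"time": etl_time},))
    scheduler.enterabs(report_time, 1, generate_report, argument=({"time": report_time},))

scheduler.run()
```

The functions themselves stay small and stateless. The schedule decides when they run, so each report is generated after the ETL step that feeds it.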