Cutting the cord: separating data from compute in your data lake with object storage
When five nines isn’t enough
Using Hadoop-based data lakes to manage and analyze big data is a great idea in theory, but many organizations are struggling to make them work in practice. One of the biggest issues is that, when you are dealing with truly massive volumes of data, much of the received wisdom about IT availability and reliability ceases to apply.
For example, most IT teams traditionally aim to configure their systems to provide at least “five nines”, or 99.999 percent availability. However, when your data lake contains petabytes of data, data loss stops being a risk and becomes a certainty unless you proactively protect against bit rot, disk failure, and other forms of corruption.
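To see why scale changes the picture, here is a rough back-of-the-envelope sketch. It assumes disks fail independently at the same annual rate; the 2 percent figure and the function name are illustrative assumptions, not measurements from any particular system.

```python
def prob_at_least_one_failure(num_disks: int, annual_failure_rate: float) -> float:
    """Probability that at least one of num_disks fails within a year,
    assuming failures are independent and equally likely for every disk."""
    return 1.0 - (1.0 - annual_failure_rate) ** num_disks


# Hypothetical 2% annual failure rate per disk (an assumption for illustration).
ASSUMED_RATE = 0.02

for disks in (10, 100, 1000):
    p = prob_at_least_one_failure(disks, ASSUMED_RATE)
    print(f"{disks:>5} disks: {p:.1%} chance of at least one failure per year")
```

Under these assumed figures, the chance of at least one failure in a year is under one in five for 10 disks and approaches certainty for 1,000. A single failed disk is not the same as lost data, but it shows why protection has to be built in at this scale rather than treated as an exception.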
The problem is exacerbated by the fact that the Hadoop Distributed File System (HDFS) is not an ideal medium for long-term, large-scale data storage. To achieve high availability with HDFS, you need to store multiple copies of every file, and if one of the copies does degrade through bit rot, there is no proactive monitoring in place to detect and repair the problem.
Moreover, storing data in HDFS is relatively expensive, because each node of a Hadoop cluster isn’t just a storage system: it’s a full-fledged server with processors and memory resources too. The more copies of data you keep, the more compute resources you need to pay for, even though you don’t really need that extra computational horsepower to help you analyze your data.
So, what’s the alternative?
The solution to this problem is to re-architect your data lake to separate your data from your compute resources.
This may seem counterintuitive, because one of the original principles of a Hadoop-based data lake was to analyze data where it is stored, instead of having to move it into a new environment every time you want to analyze it. To explain why this apparent contradiction doesn’t hold water, we’ll need to take a quick digression into the history of big data analytics.
Before Hadoop became widely adopted, the dominant paradigm for analytics was to use a data warehouse. Data warehouses are extremely good at querying structured data, but were never designed to handle the variety of big data types that Hadoop was built to analyze.
To deal with these large data sets in a traditional data warehouse, you would have to cut them into segments, feed them into the warehouse one at a time via complex ETL processes, allow the appliance’s central analytics engine to process them, and then somehow aggregate the results. The overhead and complexity were often prohibitive, so very few organizations believed it was worth the effort.
Hadoop provided a revolutionary solution to this problem. Instead of chopping up a large data set and feeding it to a central analytics engine serially, you would build a distributed network of low-cost storage and processing nodes that would hold the entire data set at once. Rather than moving the data to the processing engine, you would distribute the analytics job across the nodes, and each node would work in parallel, analyzing the data that it held.
Large data sets vs. data lakes
Given this history, why are we arguing that we should move to an architecture that explicitly separates data from computation, where data has to be provisioned to a computation engine before it can be analyzed? On the face of it, this seems to negate the advantages of using Hadoop in the first place.
The point we need to understand is that there is a difference between a large data set and an entire data lake. Within your data lake you may have many large data sets. But you will almost never need to analyze every data set in your data lake at the same time, in the same job.
For this reason, there is no need to store all of your data sets permanently in HDFS. Instead, you should treat Hadoop as an ad hoc data processing environment. When your data scientists need to analyze a specific data set, you can simply spin up a temporary bespoke cluster with an appropriate number of nodes, load the data into it, and run whatever data processing jobs are required. (In fact, with Apache Spark, users no longer need to copy data into a cluster before running their job: they can read data directly from an object store from within the job.) Once a user’s analysis is complete, you can deprovision the cluster and return its nodes to the pool.
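As a rough illustration of reading data in place, the sketch below builds Spark settings for Hadoop’s S3A connector and uses them to open a data set straight from an object store. It assumes the store exposes an S3-compatible API and that PySpark and the hadoop-aws module are available on the cluster; the function names, endpoint, bucket, and credentials are all hypothetical placeholders, and other connectors may suit a given object store better.

```python
def s3a_spark_conf(endpoint: str, access_key: str, secret_key: str) -> dict:
    """Build Spark settings that point Hadoop's S3A connector at an
    S3-compatible object store. All argument values are placeholders."""
    return {
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
        # Path-style addressing is commonly needed for non-AWS endpoints.
        "spark.hadoop.fs.s3a.path.style.access": "true",
    }


def read_parquet_from_object_store(conf: dict, path: str):
    """Start (or reuse) a Spark session with the given settings and
    return a DataFrame that reads the data set in place."""
    from pyspark.sql import SparkSession  # imported lazily; requires PySpark

    builder = SparkSession.builder.appName("object-store-read")
    for key, value in conf.items():
        builder = builder.config(key, value)
    spark = builder.getOrCreate()
    return spark.read.parquet(path)


# Example usage (hypothetical endpoint, credentials, and bucket):
#   conf = s3a_spark_conf("https://objects.example.com", "MY_KEY", "MY_SECRET")
#   df = read_parquet_from_object_store(conf, "s3a://example-bucket/events/")
#   df.show(5)
```

Because the job reads from the object store directly, nothing has to be loaded into HDFS first, and the cluster can be torn down as soon as the job finishes.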
Meanwhile, the rest of your data should live in an environment that is specifically built for highly resilient and cost-effective long-term management of large data sets: an object storage repository, such as IBM Cloud Object Storage.
Enjoy the benefits of object storage
Like Hadoop, object storage solutions are built from clusters of nodes. The difference is that the majority of the nodes are simply commodity storage devices—they don’t require the same processor or memory resources as a Hadoop node, so an object storage cluster can scale much more cost-effectively as data volumes rise.
Because object storage solutions are designed for long-term data storage, they also typically offer much more advanced resiliency and availability features. For example, IBM Cloud Object Storage uses Information Dispersal Algorithms (IDA) that make it possible to recover data even in the event of multiple disk failures, without needing to keep multiple copies of each data set.
Depending on the configuration of the IDA, it’s possible to achieve sufficiently high reliability and availability for even the most demanding applications and massive data storage requirements. Most data lakes will not require this extreme level of protection, but the algorithm provides a high degree of flexibility, enabling storage administrators to choose an appropriate balance between resiliency and cost-efficiency.
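The trade-off between resiliency and cost can be sketched with a generic “k-of-n” model, in which an object is split into n pieces and any k of them are enough to rebuild it. This is a simplified illustration rather than a description of IBM’s actual algorithm or defaults: the class name and the 8-of-12 and 10-of-12 configurations are hypothetical, and the model assumes each piece lives on a separate disk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scheme:
    name: str
    total_pieces: int   # pieces written per object (full copies or slices)
    needed_pieces: int  # pieces required to read the object back

    @property
    def expansion_factor(self) -> float:
        """Raw bytes stored per byte of user data."""
        return self.total_pieces / self.needed_pieces

    @property
    def failures_tolerated(self) -> int:
        """Pieces that can be lost before the object becomes unreadable."""
        return self.total_pieces - self.needed_pieces


# Replication is the special case where any single piece is a full copy.
schemes = [
    Scheme("3x replication", total_pieces=3, needed_pieces=1),
    Scheme("dispersal 8-of-12 (hypothetical)", total_pieces=12, needed_pieces=8),
    Scheme("dispersal 10-of-12 (hypothetical)", total_pieces=12, needed_pieces=10),
]

for s in schemes:
    print(f"{s.name}: {s.expansion_factor:.2f}x raw storage, "
          f"survives {s.failures_tolerated} lost pieces")
```

In this model, three full copies cost 3x the raw storage and survive two losses, while the hypothetical 8-of-12 layout costs 1.5x and survives four. Raising the threshold to 10-of-12 lowers the cost to 1.2x but tolerates only two losses, which is the kind of balance an administrator would be choosing.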
Finally, offloading data storage from your Hadoop cluster to object storage can dramatically simplify maintenance tasks such as patching and upgrades. Since none of your data is held permanently in the Hadoop cluster, it is trivial to take a node offline and install new software packages. There is no risk that any important data will be lost during the upgrade, because the data no longer lives in Hadoop—it’s all held safely in the object store.
In conclusion, by decoupling your data from your processing engine, you can unlock Hadoop’s true potential. Instead of a single Hadoop cluster serving all users, you can give each user a Hadoop environment of their own, provisioning only the software and tools that each data scientist needs to do their research. At the same time, you can keep your data in a highly resilient and security-rich environment, while also storing fewer copies of each data set and keeping costs low.
If you’d like to learn more, you can read about the features and benefits of IBM Cloud Object Storage here. To see how to configure a Hadoop cluster with an object storage service, check out the video here.