Advantages and complexities of integrating Hadoop with object stores


Object storage is the ultimate solution for storing unstructured data today.

It is low cost, secure, fault tolerant, and resilient. Object stores expose a RESTful API in which each object has a unique URL, enabling users to manipulate objects easily. Major cloud vendors like IBM, Amazon, and Microsoft all provide cloud-based object storage services.

Fortunately, Hadoop is not tied to the Hadoop Distributed File System (HDFS); it integrates easily with object stores as well. The Hadoop code base contains a variety of storage connectors that offer access to different object stores.

All the connectors implement the Hadoop file system interface and can easily be integrated into various MapReduce flows. These connectors have become the standard choice for other big data engines, such as Apache Spark, that interact with object stores.

Hadoop object stores

Compared to HDFS, data locality no longer exists with this approach. But the ability to scale compute and storage independently means greatly reduced operational costs. There is no need to copy data into the HDFS cluster; Hadoop can access object stores directly. Typical use cases employ object stores to store Internet of Things (IoT) or archive data, then use the Hadoop ecosystem to run analytic flows directly on the stored data.

Hadoop file system shell operations and why object storage is not a file system

While Hadoop file system shell operations are widely used with HDFS, most of them were not built with object stores in mind, and object stores generally don’t behave as these operations expect. Treating object storage as if it were a file system is a common mistake.

Directories in HDFS are the core component used to group files into different collections. For example, the following will recursively create three directories “a,” “a/b” and “a/b/c.”
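Using the standard Hadoop shell, where the `-p` flag creates the intermediate directories along the path:

```
hadoop fs -mkdir -p a/b/c
```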

This enables an upload into “a/b/c” via:
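For example, with the shell’s `-put` command (assuming a local file named `data.txt`):

```
hadoop fs -put data.txt a/b/c
```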

Object stores, however, have different semantics. The data object URI consists of the bucket and the object name, where an object name may contain delimiters. The subsequent bucket listing may use both delimiter and prefix, and thus retrieve only relevant data.
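As a sketch against an S3-compatible endpoint (the host name here is hypothetical, and real requests would also carry authentication headers), a listing can combine a prefix with a delimiter so that only matching keys are returned:

```
# List only keys under the "a/b/" prefix, grouping deeper levels by the "/" delimiter
curl "https://s3.example.com/mybucket?list-type=2&prefix=a/b/&delimiter=/"
```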

This has a certain analogy with listing directories in file systems. However, write operation patterns in object stores are very different compared to file systems.

For example, in an object store, a single RESTful PUT request will create an object “a/b/c/data.txt” in “mybucket” without having to create “a/b/c” in advance.
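A minimal sketch of that request (hypothetical endpoint, authentication omitted):

```
# A single PUT creates "a/b/c/data.txt" in "mybucket"; no parent "directories" are needed
curl -X PUT "https://s3.example.com/mybucket/a/b/c/data.txt" --data-binary @data.txt
```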

This happens because object stores have a flat namespace: delimiters in object names merely suggest a hierarchy, so no directories need to exist.

The move command is another interesting example:
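For instance, moving a file with the Hadoop shell:

```
hadoop fs -mv a/b/c/data.txt a/b/data.txt
```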

The move command internally uses rename. In a file system, rename is an atomic operation. Normally, any new file is first written to a temporary file and, upon completion, renamed to its final name. This keeps the file system consistent and stable in the face of failures, and ensures that only complete files exist.

The rename operation is an integral part of any Hadoop write flow. On the other hand, object stores don’t provide an atomic rename. In fact, rename should be avoided in object storage altogether, since it consists of two separate operations: copy and delete.

Copy is usually mapped to a RESTful PUT request or RESTful COPY request and triggers internal data movements between storage nodes. The subsequent delete command maps to the RESTful DELETE request, but usually relies on the bucket listing operation to identify which data must be deleted. This makes a rename highly inefficient in object stores, and the lack of atomicity may leave data in a corrupted state.
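As a sketch of what a rename costs against an S3-compatible endpoint (hypothetical host, authentication omitted), the server-side copy is a PUT carrying an `x-amz-copy-source` header, followed by a DELETE of the original key:

```
# Server-side copy of the source object to its new name
curl -X PUT "https://s3.example.com/mybucket/a/b/data.txt" \
     -H "x-amz-copy-source: /mybucket/a/b/c/data.txt"

# Delete the original key; if this step fails, both copies remain
curl -X DELETE "https://s3.example.com/mybucket/a/b/c/data.txt"
```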

Hadoop file system shell operations are part of the Hadoop ecosystem, but many of them, such as directory creation and rename, are better avoided with object stores. In fact, all write flows from Hadoop shell operations should be avoided with object stores. Most object stores provide their own CLI, which is preferable to the Hadoop shell for these operations.

In my next post, I will explain the actual costs of shell operations and some of the issues addressed by the Stocator project. Stocator offers superior performance compared to other connectors, and it’s being used as part of the IBM Data Science Experience.

To hear more, attend my joint talk with Trent Gray-Donald: “Hadoop and object stores: Can we do it better?” at the next Strata Data Conference, 23 – 25 May 2017, in London. I will be also presenting with Graham Mackintosh: “Very large data files, object stores, and deep learning – lessons learned while looking for signs of extra-terrestrial life” at Spark Summit, San Francisco, 5 – 7 June 2017.
