Advantages and complexities of integrating Hadoop with object stores

Share this post:

Object storage is the ultimate solution for storing unstructured data today.

It is low cost, secure, fault tolerant, and resilient. Object stores expose a RESTful API, where each object has a unique URL, enabling users to easily manipulate object stores.  Major cloud vendors like IBM, Amazon, and Microsoft all provide cloud-based object storage services.

Fortunately, Hadoop integrates easily with object stores; not just the Hadoop Distributed File System (HDFS). The Hadoop code base contains a variety of storage connectors that offer access to different object stores.

All the connectors share the file system interface and can easily be integrated into various MapReduce flows. These connectors have become the standard choice for other big data engines, such as Apache Spark, that interact with object stores.

Hadoop object stores

Data locality no longer exists with this approach, comparing to the HDFS. But the abilities to scale compute and storage independently means greatly reduced operational costs. There is no need to copy data into the HDFS cluster, and Hadoop can access object stores directly. Typical use cases employ object stores to store Internet of Things (IoT) or archive data, then use the Hadoop ecosystem to run analytic flows directly on the stored data.

Hadoop file system shell operations and why object storage is not a file system

While Hadoop file system shell operations are widely used with HDFS, most of them are not built to work with object stores, or object stores generally don’t behave as expected. Shell operations treat object storage like a file system, which is a common mistake.

Directories in HDFS are the core component used to group files into different collections. For example, the following will recursively create three directories “a,” ”a/b” and “a/b/c.”

This enables an upload into “a/b/c” via:

Object stores, however, have different semantics. The data object URI consists of the bucket and the object name, where an object name may contain delimiters. The subsequent bucket listing may use both delimiter and prefix, and thus retrieve only relevant data.

This has a certain analogy with listing directories in file systems. However, write operation patterns in object stores are very different compared to file systems.

For example, in an object store, a single RESTful PUT request will create an object “a/b/c/data.txt” in “mybucket” without having to create “a/b/c” in advance.

This happens because object stores support hierarchical naming and operations without the need for directories.

Move command is another interesting example:

Move command internally uses rename. In a file system, rename is an atomic operation. Normally, any new file is first written into a temp file and upon completion is renamed to the final name. This allows the file system to be consistent and stable when it comes to failures, and ensures that only complete files exist.

The rename operation is an integral part of any Hadoop write flow. On the other hand, object stores don’t provide an atomic rename. In fact, rename should be avoided in object storage altogether, since it consists of two separate operations: copy and delete.

Copy is usually mapped to a RESTful PUT request or RESTful COPY request and triggers internal data movements between storage nodes. The subsequent delete command maps to the RESTful DELETE request, but usually relies on the bucket listing operation to identify which data must be deleted. This makes a rename highly inefficient in object stores, and the lack of atomicity may leave data in a corrupted state.

Hadoop file system shell operations are part of the Hadoop ecosystem, but many of the operations, such as creating directories and rename operations, are better avoided with object stores. In fact, all write flows from Hadoop shell operations should be avoided with object stores. Object stores provide a CLI interface, which is preferable to Hadoop shell operations.

In my next post, I will explain the actual costs of shell operations and some of the issues addressed by the Stocator project. Stocator offers superior performance compared to other connectors, and it’s being used as part of the IBM Data Science Experience.

To hear more, attend my joint talk with Trent Gray-Donald: “Hadoop and object stores: Can we do it better?” at the next Strata Data Conference, 23 – 25 May 2017, in London. I will be also presenting with Graham Mackintosh: “Very large data files, object stores, and deep learning – lessons learned while looking for signs of extra-terrestrial life” at Spark Summit, San Francisco, 5 – 7 June 2017.

More Storage stories

French insurer teams with IBM Services to develop fraud detection solution

Auto insurance fraud costs companies billions of dollars every year. Those losses trickle down to policyholders who absorb some of that risk in policy rate increases. Thélem assurances, a French property and casualty insurer whose motto is “Thélem innovates for you”, has launched an artificial intelligence program, prioritizing a fraud detection use case as its […]

Continue reading

Cloud innovation in real estate: Apleona and IBM rely on new technologies

Digitization does not stop at the proverbial concrete gold — real estate. In fact, the real estate industry is on the move. Companies are realizing the benefits of digital transformation and are capitalizing on the power of new technologies such as cloud, AI and blockchain. Take, for example, Apleona GmbH, one of Europe’s largest real […]

Continue reading

Innovate with Enterprise Design Thinking in the IBM Garage

We’ve all been there. You have an amazing idea that’s really exciting. Maybe it’s a home improvement project, or perhaps it’s a new business idea. You think about all the details required to make it real. But, once you get to the seventh action item, you’re not so excited anymore. Sometimes when we realize the […]

Continue reading