What frustrates data scientists working on a Hadoop project?


Recently I talked to a data scientist experimenting with various Hadoop distributions and learned that the first step in any Hadoop project, before any analytics jobs can run, is to copy the data to be analyzed into HDFS. HDFS, the Hadoop Distributed File System, is a scalable file system purpose-built for Hadoop-based analytics that runs on a cluster of storage-rich commodity servers.

But the copy process can take days, depending on the size of the data set; it's not uncommon to copy hundreds of terabytes into HDFS, because a project typically pulls together multiple data sets from different business functions. The whole point of big data analytics is to find insights by analyzing data across the various departments of an enterprise. Yet by the time such large data sets finish copying into HDFS, the data is already stale. This is what frustrates data scientists: first the time spent copying the data, then the analysis of data that is no longer current.
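To see why the copy alone can take days, a little back-of-the-envelope arithmetic helps. The data-set size and network throughput below are illustrative assumptions, not figures from any specific cluster, and the result is a best case that ignores protocol and replication overhead:

```python
# Rough best-case estimate of how long an initial copy into HDFS takes.

def copy_time_days(dataset_tb: float, throughput_gbit_s: float) -> float:
    """Ideal transfer time in days for a data set of the given size."""
    bits = dataset_tb * 1e12 * 8               # terabytes -> bits
    seconds = bits / (throughput_gbit_s * 1e9)  # bits / (bits per second)
    return seconds / 86400                      # seconds -> days

# 300 TB over a sustained 10 Gb/s ingest link:
print(round(copy_time_days(300, 10), 1))  # roughly 2.8 days, before any overhead
```

In practice, replication traffic and source-system load push the real number well past this ideal figure, which is how "hundreds of terabytes" turns into a multi-day wait.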

So how do we solve this problem? What if there were no need to copy your enterprise data into an isolated storage silo? That's exactly what IBM Spectrum Scale storage allows: you can run Hadoop analytics directly on IBM Spectrum Scale and avoid the copy-to-HDFS headache entirely. IBM Spectrum Scale is industry-proven, high-performance scalable storage based on IBM's General Parallel File System (GPFS). If you store all of your enterprise data in IBM Spectrum Scale, there is no need to copy it to HDFS to run Hadoop analytics, because Spectrum Scale supports the HDFS APIs. Eliminating the copy improves your productivity, and Spectrum Scale offers some other key efficiencies over HDFS:

Unified storage: Analytics results are immediately available to any enterprise application through industry-standard file or block sharing protocols like NFS, SMB or iSCSI, and also to modern web applications that use object interfaces like S3 or Swift.

Erasure coding: Spectrum Scale uses erasure coding for data protection and availability, whereas HDFS keeps three copies of the data, a cost that quickly adds up as the Hadoop cluster grows and a very common problem in large enterprises. With erasure coding the capacity overhead is around 20 percent, compared to 200 percent for three copies of the data.
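The arithmetic behind those percentages is straightforward. The 10+2 erasure-code geometry below is an illustrative assumption chosen to match the roughly 20 percent overhead cited above; other geometries give different ratios:

```python
# Storage overhead: 3-way replication vs. erasure coding.

def replication_overhead(copies: int) -> float:
    """Extra raw capacity beyond one copy of the data, as a fraction."""
    return float(copies - 1)

def erasure_overhead(data_strips: int, parity_strips: int) -> float:
    """Parity capacity relative to data capacity, as a fraction."""
    return parity_strips / data_strips

print(replication_overhead(3))   # 2.0 -> 200 percent extra
print(erasure_overhead(10, 2))   # 0.2 -> 20 percent extra

# Raw capacity needed to hold 1 PB of user data:
usable_pb = 1.0
print(usable_pb * (1 + replication_overhead(3)))  # 3.0 PB with 3 copies
print(usable_pb * (1 + erasure_overhead(10, 2)))  # 1.2 PB with 10+2 erasure coding
```

At petabyte scale, the difference between 3.0 PB and 1.2 PB of raw capacity for the same usable data is exactly the cost problem the post describes.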

Shared storage: HDFS is a shared-nothing (SN) architecture in which the cluster grows by adding new nodes, each contributing both compute and storage. Spectrum Scale can be deployed in shared-nothing mode or, for workloads that demand high read/write throughput, in shared-storage mode, decoupling storage from compute.

Tier to the cloud: Spectrum Scale has built-in policy-based tiering which allows your older or cold data to tier to cost-effective cloud storage automatically.
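Tiering rules of this kind are expressed in Spectrum Scale's SQL-like ILM policy language. The sketch below is illustrative only: the rule name, pool names, fill thresholds, and 90-day age cutoff are assumptions for this example, not product defaults, and a real deployment also has to define the external cloud pool separately.

```sql
/* Illustrative sketch only: names and thresholds are assumptions. */
RULE 'move_cold_to_cloud'
  MIGRATE FROM POOL 'data'
  THRESHOLD(80,60)   /* start migrating at 80% pool occupancy, stop at 60% */
  TO POOL 'cloudpool'
  WHERE (CURRENT_TIMESTAMP - ACCESS_TIME) > INTERVAL '90' DAYS
```

The point of the policy approach is that cold data drains to cheaper storage automatically, with no application changes and no manual copy jobs.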

Last but not least, Spectrum Scale will be supported by Hortonworks, a leading Hadoop distribution vendor. See the press release here.

Those are some of the key advantages of Spectrum Scale over HDFS. They not only eliminate the headaches of copying your enterprise data to isolated storage but also improve the efficiency of Hadoop storage and the productivity of your data scientists.

To learn more about how IBM Spectrum Scale storage can help improve your Hadoop jobs, join us at the IBM booth at the DataWorks Summit in Munich.
