What is HDFS (Hadoop Distributed File System)?

To understand how it’s possible to scale an Apache® Hadoop® cluster to hundreds (and even thousands) of nodes, you have to start with the Hadoop Distributed File System (HDFS). Data in a Hadoop cluster is broken down into smaller pieces (called blocks) and distributed throughout the cluster.

In this way, the map and reduce functions can be executed on smaller subsets of your larger data sets, and this provides the scalability that is needed for big data processing.
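To make the block mechanics a bit more concrete, here is a minimal sketch (not from the article) that writes a file into HDFS using Hadoop's Java FileSystem API, asking for an explicit block size and replication factor. The NameNode address, the file path and the values chosen are all hypothetical. HDFS splits the file into blocks of the requested size and scatters them, with their replicas, across the DataNodes, and each block can later serve as the input to one map task.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteToHdfs {
    public static void main(String[] args) throws Exception {
        // Point the client at the cluster's NameNode (address is illustrative).
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        FileSystem fs = FileSystem.get(conf);

        // Create a file with an explicit block size (128 MB) and replication
        // factor (3). HDFS stores the file as blocks of this size, replicated
        // across the DataNodes in the cluster.
        Path file = new Path("/data/example/clickstream.log"); // hypothetical path
        try (FSDataOutputStream out = fs.create(
                file,
                true,                  // overwrite if the file already exists
                4096,                  // client-side buffer size in bytes
                (short) 3,             // replication factor
                128L * 1024 * 1024)) { // block size in bytes
            out.writeBytes("example record\n");
        }
    }
}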

What’s the goal?

The goal of Hadoop is to use commonly available servers in a very large cluster, where each server has a set of inexpensive internal disk drives.

For higher performance, MapReduce tries to assign workloads to the servers where the data to be processed is stored. This is known as data locality, and it is because of this principle that using a storage area network (SAN) or network-attached storage (NAS) in a Hadoop environment is not recommended.
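As a rough illustration of how that locality information is exposed, the following sketch (reusing the hypothetical NameNode address and file path from the example above) asks the NameNode which DataNodes hold each block of a file; the MapReduce framework consults equivalent block metadata when deciding where to place map tasks.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // illustrative address

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/example/clickstream.log"); // hypothetical path

        // Ask the NameNode which DataNodes hold each block of the file.
        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());

        // Print the hosts that store each block; running a map task on one of
        // these hosts avoids pulling the block across the network.
        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(),
                    block.getLength(),
                    String.join(",", block.getHosts()));
        }
    }
}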

For Hadoop deployments using a SAN or NAS, the extra network communication overhead can cause performance bottlenecks, especially for larger clusters. Now take a moment and think of a 1000-machine cluster, where each machine has three internal disk drives; then consider the failure rate of a cluster composed of 3000 inexpensive drives and 1000 inexpensive servers!

We’re likely already on the same page here: the component mean time to failure (MTTF) you’re going to experience in a Hadoop cluster is likely analogous to a zipper on your kid’s jacket: it’s going to fail (and, poetically enough, zippers seem to fail only when you really need them). The cool thing about Hadoop is that the reality of the MTTF rates associated with inexpensive hardware is actually well understood (a design point, if you will), and part of the strength of Hadoop is that it has built-in fault tolerance and fault-compensation capabilities.

The same is true for HDFS: data is divided into blocks, and copies of these blocks are stored on other servers in the Hadoop cluster. In other words, an individual file is actually stored as smaller blocks that are replicated across multiple servers in the entire cluster.
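As a final sketch along the same lines (again with the hypothetical cluster address and file path used above), the replication factor can be read and changed per file through the same Java API; when a drive or server fails, the NameNode uses the surviving copies to re-create the lost block replicas elsewhere in the cluster.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AdjustReplication {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // illustrative address

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/example/clickstream.log"); // hypothetical path

        // Report how many copies of each block HDFS currently maintains.
        FileStatus status = fs.getFileStatus(file);
        System.out.println("Current replication factor: " + status.getReplication());

        // Request an extra copy of every block; the NameNode schedules the
        // additional replicas on other DataNodes in the background.
        fs.setReplication(file, (short) 4);
    }
}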

Related products or solutions


IBM Big SQL

A hybrid SQL engine for Apache Hadoop that concurrently exploits Hive, HBase and Spark using a single database connection or a single query.

Learn more

Resources

The Data Warehouse Evolved: A Foundation for Analytical Excellence

Explore a best-in-class approach to data management and how companies are prioritizing data technologies to drive growth and efficiency.

Understanding Big Data Beyond the Hype

Read this practical introduction to the next generation of data architectures, covering the role of the cloud and NoSQL technologies and discussing the practicalities of security, privacy and governance.

Building IBM InfoSphere DataStage Jobs to Process JSON Files on a Hadoop HDFS File System

In this video series, we will build a DataStage job that uses the DataStage Big Data stage to copy the JSON blog file from the Hadoop HDFS file system to the DataStage Server system.

How to...

Learn how to get started with Streaming Analytics and BigInsights on Bluemix using HDFS.