What is HDFS?

The two major components of Apache Hadoop are HDFS and YARN. HDFS is a distributed file system that handles large data sets running on commodity hardware, and it is used to scale a single cluster to hundreds (and even thousands) of nodes.

What’s the goal?

The goal of Hadoop is to use commonly available servers in a very large cluster, where each server has a set of inexpensive internal disk drives.

For higher performance, MapReduce assigns workloads to the servers where the data to be processed is stored. This is known as data locality. It’s because of this principle that using a storage area network (SAN) or network-attached storage (NAS) in a Hadoop environment is not recommended.

For Hadoop deployments using a SAN or NAS, the extra network communication overhead can cause performance bottlenecks, especially for larger clusters. Now take a moment and think of a 1,000-machine cluster, where each machine has three internal disk drives; then consider the failure rate of a cluster composed of 3,000 inexpensive drives plus 1,000 inexpensive servers!
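To put rough numbers on that thought experiment: if each drive had, say, a 3 percent annualized failure rate (a figure assumed here purely for illustration), 3,000 drives would average about 90 failures a year, roughly one every four days, before counting any server, memory, or power-supply failures.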

We’re likely already on the same page here: the component mean time to failure (MTTF) you’re going to experience in a Hadoop cluster is likely analogous to a zipper on your kid’s jacket: it’s going to fail (and poetically enough, zippers seem to fail only when you really need them). The cool thing about Hadoop is that the reality of the MTTF rates associated with inexpensive hardware is actually well understood (a design point, if you will), and part of the strength of Hadoop is that it has built-in fault tolerance and fault compensation capabilities.

HDFS is designed around this same reality: data is divided into blocks, and copies of these blocks are stored on other servers in the Hadoop cluster. That is, an individual file is actually stored as smaller blocks that are replicated across multiple servers in the cluster.
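As a rough illustration of what this looks like from a client’s point of view, the sketch below writes a file through the Hadoop Java API with an explicit block size and replication factor; the file path and the 128 MB, three-copy settings are assumptions made for this example, not required values. The client writes one ordinary stream, and HDFS splits it into blocks and places each block’s replicas on different DataNodes behind the scenes.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WriteReplicatedFile {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path file = new Path("/data/sample.txt");   // hypothetical path

            short replication = 3;                // copies kept of every block
            long blockSize = 128L * 1024 * 1024;  // split the file into 128 MB blocks

            // Write a single stream; HDFS handles the chopping into blocks and
            // the placement of each block's replicas on different servers.
            try (FSDataOutputStream out =
                     fs.create(file, true, 4096, replication, blockSize)) {
                out.writeBytes("Aaron Aardvark 555-0100\n");
            }
            fs.close();
        }
    }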

An example of HDFS

Think of a file that contains the phone numbers for everyone in the United States; the numbers for people with a last name starting with A might be stored on server 1, those starting with B on server 2, and so on.

In a Hadoop world, pieces of this phonebook would be stored across the cluster, and to reconstruct the entire phonebook, your program would need the blocks from every server in the cluster. To achieve availability as components fail, HDFS replicates these smaller pieces onto two additional servers by default. (This redundancy can be increased or decreased on a per-file basis or for a whole environment; for example, a development Hadoop cluster typically doesn’t need any data redundancy.) This redundancy offers multiple benefits, the most obvious being higher availability.
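If you do want to tune that redundancy, one way is through the same Java API. The sketch below is a minimal illustration in which the file paths are hypothetical; dfs.replication is the standard cluster-wide default for newly created files, and setReplication overrides it per file for data that already exists.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class TuneReplication {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Cluster-wide default for newly created files
            // (normally set once in hdfs-site.xml).
            conf.set("dfs.replication", "3");

            FileSystem fs = FileSystem.get(conf);

            // Dial an existing file down to a single copy on a dev cluster...
            fs.setReplication(new Path("/dev/phonebook.txt"), (short) 1);
            // ...or up to five copies for a heavily read production file.
            fs.setReplication(new Path("/prod/phonebook.txt"), (short) 5);

            fs.close();
        }
    }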

In addition, this redundancy allows the Hadoop cluster to break work up into smaller chunks and run those jobs on all the servers in the cluster for better scalability. Finally, you get the benefit of data locality, which is critical when working with large data sets. We detail these important benefits later in this chapter.
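To make data locality concrete before moving on, here is a minimal sketch (an illustration, not part of any product documentation) that uses the Hadoop Java client to ask the NameNode which hosts hold each block of a file; the path /data/example.txt is hypothetical, and the cluster settings are assumed to come from the usual core-site.xml and hdfs-site.xml files on the classpath. The per-block host list returned here is exactly the information a scheduler uses to run work on a machine that already stores the data it needs.

    import java.util.Arrays;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLocality {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();   // picks up *-site.xml
            FileSystem fs = FileSystem.get(conf);       // handle to HDFS
            Path file = new Path("/data/example.txt");  // hypothetical file

            FileStatus status = fs.getFileStatus(file);
            // Ask the NameNode which DataNodes hold each block of the file.
            BlockLocation[] blocks =
                    fs.getFileBlockLocations(status, 0, status.getLen());

            for (BlockLocation block : blocks) {
                System.out.println("offset " + block.getOffset()
                        + ", length " + block.getLength()
                        + ", hosts " + Arrays.toString(block.getHosts()));
            }
            fs.close();
        }
    }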

Products

IBM® Big SQL

Big SQL is a hybrid SQL engine for Hadoop that delivers easy data querying across the enterprise. Use a single database connection or query to connect to disparate sources such as HDFS, RDBMS, NoSQL databases, object stores and WebHDFS.

Apache Hadoop

Apache Hadoop provides a highly scalable, cost-effective storage platform for large data volumes with no format requirements. HDFS is a main component of Hadoop.

Resources

The Data Warehouse Evolved: A Foundation for Analytical Excellence

Explore a best-in-class approach to data management and how companies are prioritizing data technologies to drive growth and efficiency.

Understanding Big Data Beyond the Hype

Read this practical introduction to the next generation of data architectures; it introduces the role of the cloud and NoSQL technologies and discusses the practicalities of security, privacy and governance.