Scale an Apache Hadoop cluster to hundreds of nodes with the Hadoop Distributed File System (HDFS)
Book a consultation The IBM + Cloudera partnership
What is HDFS?

HDFS is a distributed file system that handles large data sets running on commodity hardware. It is used to scale a single Apache Hadoop cluster to hundreds (and even thousands) of nodes. HDFS is one of the major components of Apache Hadoop, the others being MapReduce and YARN. HDFS should not be confused with or replaced by Apache HBase, which is a column-oriented non-relational database management system that sits on top of HDFS and can better support real-time data needs with its in-memory processing engine.

The goals of HDFS
Fast recovery from hardware failures

Because one HDFS instance may consist of thousands of servers, failure of at least one server is inevitable. HDFS has been built to detect faults and automatically recover quickly.

Access to streaming data

HDFS is intended more for batch processing versus interactive use, so the emphasis in the design is for high data throughput rates, which accommodate streaming access to data sets.

Accommodation of large data sets

HDFS accommodates applications that have data sets typically gigabytes to terabytes in size. HDFS provides high aggregate data bandwidth and can scale to hundreds of nodes in a single cluster.


To facilitate adoption, HDFS is designed to be portable across multiple hardware platforms and to be compatible with a variety of underlying operating systems.

Resources The Data Warehouse Evolved: A Foundation for Analytical Excellence

Explore a best-in-class approach to data management and how companies are prioritizing data technologies to drive growth and efficiency.

Understanding big data beyond the hype

Read this practical introduction to the next generation of data architectures. It introduces the role of the cloud and NoSQL technologies and discusses the practicalities of security, privacy and governance.

Engage with an expert

Schedule a no-cost, one-on-one call with an IBM big data expert to learn how we can help you extend data science and machine learning across the Apache Hadoop ecosystem.