Updated: 14 June 2024
Contributor: Jim Holdsworth
Hadoop Distributed File System (HDFS) is a file system that manages large data sets and runs on commodity hardware. HDFS is the most popular data storage system for Hadoop and can be used to scale a single Apache Hadoop cluster to hundreds and even thousands of nodes. Because it efficiently manages big data with high throughput, HDFS can be used as a data pipeline and is ideal for supporting complex data analytics.
HDFS is built on an open source framework and is one of the major components of Apache Hadoop, the others being MapReduce and YARN. HDFS should not be confused with or replaced by Apache HBase, which is a column-oriented, non-relational database management system that sits on top of HDFS and can better support real-time data needs with its in-memory processing engine.
Because one HDFS instance might consist of thousands of servers, failure of at least one server is always a possibility. HDFS has been built to detect faults and automatically recover quickly. Data replication with multiple copies across many nodes helps protect against data loss. HDFS keeps at least one copy on a different rack from all other copies. This data storage in a large cluster across nodes increases reliability. In addition, HDFS can take storage snapshots to save point-in-time (PIT) information.
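For illustration, here is a minimal sketch of taking a point-in-time snapshot of a directory through the Hadoop FileSystem Java API. The NameNode address and the /data path are placeholders, and the directory is assumed to have already been made snapshottable by an administrator (for example, with hdfs dfsadmin -allowSnapshot).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SnapshotExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; use your cluster's fs.defaultFS value.
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

        try (FileSystem fs = FileSystem.get(conf)) {
            // Assumes an administrator has already run:
            //   hdfs dfsadmin -allowSnapshot /data
            Path dir = new Path("/data");
            Path snapshot = fs.createSnapshot(dir, "pit-2024-06-14");
            System.out.println("Snapshot created at: " + snapshot);
        }
    }
}
```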
HDFS is intended more for batch processing versus interactive use, so the emphasis in the design is for high data throughput rates, which accommodate streaming access to data sets.
HDFS accommodates applications that use data sets typically from gigabytes to terabytes in size. HDFS provides high aggregate data bandwidth and can scale to hundreds of nodes in a single cluster and help drive high-performance computing (HPC) systems. Data lakes are often stored on HDFS. Data warehouses have also used HDFS, but now less often, due to perceived complexity of operation.
Because the data is stored virtually, the costs of storing file system metadata and file system namespace data can be reduced.
To facilitate adoption, HDFS is designed to be portable across multiple hardware platforms and to be compatible with various underlying operating systems, including Linux, macOS and Windows. In addition, Hadoop data lakes can support unstructured, semistructured and structured data for maximum flexibility. While Hadoop is coded in Java, other languages (including C++, Perl, Python and Ruby) enable its use in data science.
HDFS uses a cluster architecture to help deliver high throughput. To reduce network traffic, the Hadoop file system stores data in the DataNodes where computations take place, rather than moving the data to another location for computation.
With both horizontal and vertical scalability features, HDFS can be quickly adjusted to match an organization’s data needs. A cluster might include hundreds or thousands of nodes.
HDFS has a director/worker architecture: a single NameNode acts as the director, managing the file system namespace and regulating client access to files, while the DataNodes act as the workers, storing and serving the actual data blocks.
Both the NameNode and DataNode are software designed to run on a wide variety of operating systems (OS), most often GNU/Linux. HDFS was built in Java, meaning that any machine that supports Java can run the NameNode or DataNode software.
Deployments often dedicate a single machine to running the NameNode software, while each of the other machines in the cluster runs a single instance of the DataNode software. Running more than one DataNode on a single machine is possible but rarely done.
When data is brought into HDFS, it is broken into blocks and distributed to different nodes in a cluster. With the data stored in multiple DataNodes, the blocks can be replicated to other nodes to enable parallel processing. The distributed file system (DFS) includes commands to quickly access, retrieve, move and view data. Because replicas of data blocks exist across multiple DataNodes, one copy can be removed without risking corruption of the others. The default HDFS block size is 128 MB (Hadoop 2.x), which some consider large, but the size is chosen to minimize seek times and reduce the amount of metadata needed.
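As a rough illustration of working with files that HDFS has split into blocks, the sketch below uses the Hadoop FileSystem Java API to copy a local file into HDFS and then read back its block size and replication factor. The NameNode address and file paths are placeholder values.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfoExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020"); // placeholder address

        try (FileSystem fs = FileSystem.get(conf)) {
            // Copy a local file into HDFS; it is split into blocks and
            // distributed across DataNodes automatically.
            fs.copyFromLocalFile(new Path("/tmp/sales.csv"), new Path("/data/sales.csv"));

            // Inspect how the file is stored: block size and replication factor.
            FileStatus status = fs.getFileStatus(new Path("/data/sales.csv"));
            System.out.println("Block size (bytes): " + status.getBlockSize());   // typically 128 MB
            System.out.println("Replication factor: " + status.getReplication()); // typically 3
        }
    }
}
```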
To minimize risk and speed processing, when a DataNode stops signaling the NameNode, that DataNode is removed from the cluster and operations continue without it. If that DataNode later becomes operational, it can rejoin the cluster.
HDFS provides flexible access to data through various interfaces: a native Java API ships with HDFS, a C-language wrapper is available for the Java API, and an HTTP browser can be used to browse the files of an HDFS instance.
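A minimal sketch of reading a file through the native Java API, assuming a placeholder NameNode address and file path, might look like the following.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020"); // placeholder

        try (FileSystem fs = FileSystem.get(conf);
             FSDataInputStream in = fs.open(new Path("/data/sales.csv"));
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            // Stream the file line by line; HDFS locates the blocks
            // on whichever DataNodes hold them.
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```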
HDFS is organized with a traditional file hierarchy where the user can create directories that contain multiple files. The hierarchy of the file system namespace is similar to traditional file systems, where the user creates and removes files, moves them between directories and can rename files.
The file system namespace is maintained by the NameNode, which records any changes to it. An application can specify the number of replicas to be saved for each file; that number is the replication factor for the file. The replication factor can be set when the file is created and modified later as needed.
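To make the replication factor concrete, the sketch below (with a placeholder NameNode address and file path) creates a file with a replication factor of 2 through the FileSystem API and raises it to 3 afterward.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020"); // placeholder

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/data/reports/summary.txt");

            // Create the file with a replication factor of 2.
            try (FSDataOutputStream out = fs.create(file, true, 4096, (short) 2,
                    fs.getDefaultBlockSize(file))) {
                out.writeBytes("example content\n");
            }

            // Later, raise the replication factor to 3; the NameNode
            // schedules the additional copies in the background.
            fs.setReplication(file, (short) 3);
        }
    }
}
```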
To provide reliable storage, HDFS stores large files in multiple locations across a large cluster, with each file kept as a sequence of blocks. All blocks in a file are the same size, except for the final block, which fills as data is added.
For added protection, HDFS files are write-once, with only one writer at any time. To help ensure that all data is replicated as instructed, the NameNode receives a heartbeat (a periodic status report) and a blockreport (the block ID, generation stamp and length of every block replica) from every DataNode attached to the cluster. Receiving a heartbeat indicates that the DataNode is working correctly.
The NameNode selects the rack ID for each DataNode by using a process called Hadoop Rack Awareness to help prevent the loss of data if an entire rack fails. This also enables the use of bandwidth from multiple racks when reading data.
Consider a file that includes phone numbers for an entire country. The numbers for people with a surname starting with A might be stored on server 1, B on server 2 and so on. With Hadoop, pieces of this telephone directory would be stored across a single cluster, and to reconstruct the entire phonebook, an application would need the blocks from every server in the cluster.
To help ensure high availability if and when a server fails, HDFS replicates these smaller pieces onto two more servers by default. (This redundancy can be increased or decreased on a per-file basis or for a whole environment. For example, a development Hadoop cluster typically doesn’t need any data redundancy.)
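As an illustration of reducing redundancy for a whole environment, the sketch below lowers the client-side default replication (the dfs.replication property) to 1 for a hypothetical development cluster; the NameNode address and paths are placeholders, and cluster-wide defaults would normally be set in hdfs-site.xml.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DevReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://dev-namenode.example.com:8020"); // placeholder

        // For a development cluster where redundancy is not needed,
        // lower the default replication for files created by this client.
        // (Cluster-wide defaults live in hdfs-site.xml under dfs.replication.)
        conf.setInt("dfs.replication", 1);

        try (FileSystem fs = FileSystem.get(conf)) {
            fs.copyFromLocalFile(new Path("/tmp/test-data.csv"), new Path("/data/test-data.csv"));
        }
    }
}
```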
This redundancy also enables the Hadoop cluster to break up work into smaller chunks and run those jobs on all the servers in the cluster for better scalability. Finally, an organization gains the benefit of data locality, which is critical when working with large data sets.
HDFS can also enable artificial intelligence (AI) and machine learning (ML) by scaling effectively: first to store data in the large quantities required to train ML models, and then to provide access to those enormous data sets.
Any organization that captures, stores and uses large datasets—up to petabytes—might consider using HDFS. A few industry-based use cases show how HDFS might be implemented.
The origin of Hadoop, according to cofounders Mike Cafarella and Doug Cutting, was a Google File System paper, published in 2003. A second paper followed, "MapReduce: Simplified Data Processing on Large Clusters." Development of an early search engine named Apache Nutch was begun, but then work moved with Doug Cutting over to Yahoo in 2006.
Hadoop was named after a toy elephant belonging to Cutting’s son. (Hence the logo.) The initial Hadoop code was largely based on Nutch—but overcame its scalability limitations—and contained both the early versions of HDFS and MapReduce.
The suite of programs in the Hadoop Ecosystem continues to grow. In addition to HDFS, there is also: HBase (a NoSQL database), Mahout and Spark MLlib (algorithm libraries for machine learning), MapReduce (programming-based data processing), Oozie (job scheduler), Pig and Hive (query-based data processing services), Solr and Lucene (searching and indexing), Spark (in-memory data processing), YARN (Yet Another Resource Negotiator) and ZooKeeper (cluster coordination).
The open source software within the Hadoop Ecosystem is now managed by the Apache Software Foundation1—a worldwide community for software developers and software contributors.
1 Apache Software Foundation (link resides outside ibm.com)