HDFS has a director/worker architecture:
- An HDFS cluster includes a single NameNode, which is the director server. The NameNode tracks the status of every file, its permissions and the location of each of its blocks. The NameNode software manages the file system namespace, regulates client access to files and performs namespace operations such as opening, closing and renaming directories and files (see the Java sketch after this list).
Files are divided into blocks, and the NameNode maps those blocks to the DataNodes, which form the worker portion of the system. Having only a single NameNode per cluster simplifies the architecture and the management of HDFS metadata. The design also adds a measure of security by keeping user data from ever flowing through the NameNode.
- Most often there is one DataNode per machine in the cluster, managing the data storage attached to that machine. The DataNode software handles block creation, deletion and replication, and serves read and write requests from clients. Each DataNode stores HDFS data in its local file system, keeping each block as a separate file. DataNodes are the worker nodes (Hadoop daemons, that is, processes running in the background) and can run on commodity hardware if an organization wants to economize.
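To make the division of labor concrete, here is a minimal Java sketch using the Hadoop FileSystem API. Every call below is a namespace operation served by the NameNode; no file data passes through it. The hdfs://namenode:8020 address and the /demo paths are placeholders for illustration, not part of any real cluster.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NamespaceOps {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; substitute your cluster's fs.defaultFS.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        fs.mkdirs(new Path("/demo"));              // directory creation: a namespace change on the NameNode
        Path oldName = new Path("/demo/report.txt");
        fs.create(oldName).close();                 // open and close a file
        fs.rename(oldName, new Path("/demo/report-2024.txt")); // rename: metadata-only, handled by the NameNode
        fs.close();
    }
}
```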
Both the NameNode and the DataNode are software designed to run on a wide variety of operating systems (OSes), most often GNU/Linux. HDFS is built in Java, so any machine that supports Java can run the NameNode or DataNode software.
A typical deployment dedicates one machine to run the NameNode software, while every other machine in the cluster runs a single instance of the DataNode software. Running more than one DataNode on the same machine is possible but rarely done.
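This topology shows up directly in the cluster configuration: the single NameNode is named once, and every node points to it. A minimal core-site.xml sketch, with a placeholder host name, might look like this:

```xml
<!-- core-site.xml, distributed to every node in the cluster -->
<configuration>
  <property>
    <!-- All clients and DataNodes locate the single NameNode through this address. -->
    <name>fs.defaultFS</name>
    <value>hdfs://namenode.example.com:8020</value>
  </property>
</configuration>
```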
When data is brought into HDFS, it is broken into blocks and distributed to different nodes in a cluster. Because the data is spread across multiple DataNodes, blocks can be replicated to other nodes, which provides fault tolerance and enables parallel processing. The distributed file system (DFS) includes shell commands to quickly access, retrieve, move and view data. With replicas of each data block held on multiple DataNodes, one copy can be lost without corrupting the remaining copies. The default HDFS block size is 128 MB (Hadoop 2.x), which may seem large, but the large block size minimizes seek time and reduces the amount of metadata the NameNode must track.
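The block layout is visible from the client side. The sketch below asks the NameNode where each block of a hypothetical file lives; on a real cluster it would print one line per 128 MB block along with the DataNodes holding that block's replicas. The address and path are again placeholders.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockReport {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        FileStatus status = fs.getFileStatus(new Path("/demo/big-file.dat"));
        // Ask the NameNode which DataNodes hold each block of the file.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```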
To minimize risk and speed processing, when a DataNode stops sending heartbeat signals to the NameNode, the NameNode marks it dead, removes it from the cluster and continues operations without it, re-replicating the missing blocks from the surviving copies. If that DataNode later becomes operational again, it can rejoin the cluster.
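One rough way to observe this from code: the DistributedFileSystem class can report per-DataNode statistics, much like the summary printed by the hdfs dfsadmin -report command. The sketch below (cluster address hypothetical, as before) prints how recently each DataNode checked in.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

public class DataNodeStatus {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
        DistributedFileSystem dfs = (DistributedFileSystem) fs;

        // One entry per DataNode known to the NameNode, with its last-heartbeat time.
        for (DatanodeInfo node : dfs.getDataNodeStats()) {
            System.out.printf("%s last heartbeat %ds ago%n",
                    node.getHostName(),
                    (System.currentTimeMillis() - node.getLastUpdate()) / 1000);
        }
        fs.close();
    }
}
```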
HDFS provides flexible data access through various interfaces: a native Java API ships with HDFS, a C language wrapper (libhdfs) is available for the Java API, and an HTTP browser can be used to browse the files of an HDFS instance.
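For example, reading a file through the native Java API looks like ordinary stream I/O; the paths and address below are hypothetical. The same FileSystem code can typically be pointed at the HTTP interface by using a webhdfs:// URI instead.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadFile {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), new Configuration());
        // Open the file and stream its contents line by line.
        try (FSDataInputStream in = fs.open(new Path("/demo/report-2024.txt"));
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}
```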