Introduction

The Hadoop ecosystem consists of many open source projects. One of the central components is the Hadoop Distributed File System (HDFS).

HDFS is a distributed file system designed to run on commodity hardware. Other related projects facilitate workflow and the coordination of jobs, support data movement between Hadoop and other systems, and implement scalable machine learning and data mining algorithms. HDFS lacks the enterprise-class functions necessary for reliability, data management, and data governance. IBM® General Parallel File System (GPFS), the technology on which IBM Storage Scale is based, is a POSIX-compliant file system that offers an enterprise-class alternative to HDFS.

Many tuning guides are available for native HDFS. However, applying those tuning steps to IBM Storage Scale usually does not yield the best performance, because of inherent design differences between HDFS and IBM Storage Scale. This section describes how to tune the different Hadoop components when you run Hadoop over IBM Storage Scale and HDFS Transparency, so that you get good performance.

All the tuning configurations mentioned in this section apply to Hadoop 2.7.x and Hadoop 3.0.x.