How to improve LOAD HADOOP performance

You can improve the efficiency of loading data into the Hadoop environment by following some guidelines on using the LOAD HADOOP statement.

Update num.map.tasks

During LOAD processing, the Hadoop MapReduce framework runs the job by using multiple map and reduce tasks. You can control the number of map tasks that are created during the LOAD with the num.map.tasks parameter. You can often improve performance by increasing the number of map tasks that are assigned to the LOAD.

To update the default value of num.map.tasks, include the WITH LOAD PROPERTIES clause in your LOAD statement. If your files reside on a file system that is remote to the cluster, rather than on the distributed file system, set the value to at least the number of data nodes in your cluster. However, as you modify the num.map.tasks parameter, take into account the configuration and connection limits of your remote system and the number of slots that are available for map tasks.
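For example, num.map.tasks can be set in the WITH LOAD PROPERTIES clause of the statement. This is a sketch only: the host, credentials, file path, delimiter, and table name are hypothetical placeholders, and the right task count depends on your cluster.

```sql
-- Hypothetical example: server, credentials, path, and table are placeholders.
LOAD HADOOP USING FILE URL
  'sftp://biadmin:password@myserver.example.com:22/home/biadmin/staff.csv'
WITH SOURCE PROPERTIES ('field.delimiter' = ',')
INTO TABLE staff APPEND
-- For a remote source, request at least as many map tasks as there are
-- data nodes in the cluster (8 is an assumed cluster size here).
WITH LOAD PROPERTIES ('num.map.tasks' = '8');
```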

The num.map.tasks property also influences the number of files that are created on the distributed file system (DFS) after the data is loaded. When you load files from the DFS, block size can affect performance: large source files are broken into smaller blocks, and each map task processes one or more blocks. LOAD uses as many map tasks as possible, up to the limit that is set by num.map.tasks.

Consider file sizes in your LOAD strategy

The number of files that you load affects the performance of the LOAD HADOOP statement. Consider how, and from where, to load data based on the size of the files.
Large files
The most effective LOAD strategy is to copy data from a remote data source to the distributed file system, and then run the LOAD HADOOP statement pointing to that DFS data.

Files on a remote data source are not splittable. Therefore, a map task processes one or more whole files, which can take longer. For example, if you have n TB of data to load that is contained in one file, the data is loaded by a single map task, which is slow. If that same data is split across m files, then m map tasks can load the data in parallel, which is more efficient. Moving large files to the DFS first and then running LOAD allows the files to be split into blocks and takes advantage of parallel map tasks.

Loading directly from a remote file system might conserve space in the distributed file system, but you lose the benefits of LOAD processing in parallel.
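A sketch of this two-step strategy, with hypothetical paths and table name. The file is first copied to the DFS (for example with the hdfs dfs -put shell command) so that it can be split into blocks and loaded in parallel:

```sql
-- Step 1 (shell, run outside SQL): copy the large file to the DFS first:
--   hdfs dfs -put /local/exports/sales_2023.csv /tmp/load/sales_2023.csv
-- Step 2: load from the DFS location, where the file is split into blocks
-- and processed by parallel map tasks. Paths and table are hypothetical.
LOAD HADOOP USING FILE URL '/tmp/load/sales_2023.csv'
WITH SOURCE PROPERTIES ('field.delimiter' = ',')
INTO TABLE sales APPEND;
```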

Small files
For small files, loading directly from a remote source or from a directory on the DFS performs similarly; small files do not benefit from being split into blocks, so the copy-to-DFS-first strategy offers no advantage.

If you are loading many small files, you might want one map task to process several files, so that the output is consolidated into complete blocks rather than many small ones.
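For example, a directory of many small files can be loaded with a deliberately small num.map.tasks value so that each map task handles several files. The directory, delimiter, table, and task count below are hypothetical:

```sql
-- Hypothetical example: many small files in one DFS directory.
-- A small num.map.tasks value lets each map task process several files,
-- producing fewer, fuller output blocks.
LOAD HADOOP USING FILE URL '/tmp/load/daily_extracts/'
WITH SOURCE PROPERTIES ('field.delimiter' = '|')
INTO TABLE daily_extracts APPEND
WITH LOAD PROPERTIES ('num.map.tasks' = '2');
```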

Configure YARN

When a LOAD HADOOP statement runs, it starts a MapReduce job that is managed by the YARN service. YARN allocates an Application Master (AM) container and a container for each map and reduce task that runs. The job starts when at least two containers with the minimum required memory are available. As additional containers and memory become available, they are allocated, run their tasks, and are returned to YARN until the application completes.

If containers with the minimum memory requirement are not available, the application waits until these resources are available. This can delay the load job.

The load job runs in the default YARN queue unless another queue has been created and configured to be the default for the user who is running the job. The load job runs under the bigsql administrator user ID unless impersonation has been enabled.

Long running processes that use YARN containers can block other jobs from acquiring necessary resources. You can use the YARN scheduler web page to see the status of the YARN queue and determine which containers are in use. You can access this page from Ambari by selecting the quick links button on the YARN service and clicking Resource Manager > Scheduler.

On the YARN scheduler page, you can see whether there are applications waiting for resources. If not enough memory or containers are available for the default queue, you can create a queue for the load job that will not be impacted by other long-running processes. Some examples of long-running processes that can impact YARN resource availability are TEZ, LLAP, and SPARK.
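If a dedicated queue has been created, one way to direct a load job to it is to set the MapReduce queue property for the session before running the LOAD. This sketch assumes a queue named loadq already exists in YARN; the queue name, paths, and table are hypothetical:

```sql
-- Hypothetical queue name 'loadq'; assumes the queue exists in YARN
-- and the user is authorized to submit to it.
SET HADOOP PROPERTY 'mapreduce.job.queuename' = 'loadq';

LOAD HADOOP USING FILE URL '/tmp/load/sales_2023.csv'
WITH SOURCE PROPERTIES ('field.delimiter' = ',')
INTO TABLE sales APPEND;
```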

If you encounter a delayed load job in the YARN queue (with the message "Application is added to the scheduler and is not yet activated. Queue's AM resource limit exceeded."), you might need to increase the amount of AM resources. You can do this from the Ambari YARN Queue Manager View by increasing the value for Maximum AM Resources. This enables the application master to proceed without delay.