Configuring Hadoop

You must adjust your Hadoop configuration settings to integrate your Hadoop installation with InfoSphere® Information Server.

About this task

You can perform this task either before or after you install InfoSphere Information Server.

Procedure

  1. Adjust your Hadoop cluster configuration settings.
    Refer to the Hadoop distribution documentation for information about updating the settings. Consider adjusting the following settings (a sample yarn-site.xml sketch follows this procedure):
    Table 1. Hadoop cluster configuration settings
    Each entry lists the parameter, its description, its default value, and, where one is given, the recommended value.

    yarn.log-aggregation-enable
        Description: Manages YARN log files. Set this parameter to false if you want the log files stored in the local file system. If the parameter is set to true, log files are moved from the local file system to HDFS when the job completes.
        Default value: true
        Recommended value: false

    yarn.nodemanager.log.retain-seconds
        Description: Specifies the duration, in seconds, that Hadoop retains container logs. If the log files start consuming a large amount of space on the nodes, reduce the value for this parameter. Note that Hadoop requires the value to be specified without a thousands separator. For example, to specify a value of 10,800, enter 10800.
        Default value: 10800

    yarn.nodemanager.pmem-check-enabled
        Description: Determines whether physical memory limits are enforced for containers. If this parameter is set to the default value of true, a job is stopped if a container uses more than the physical memory limit that you specify. Set this parameter to false if you do not want jobs to fail when containers consume more memory than they are allocated.
        If the parameter is set to true, make sure that the default setting of APT_YARN_CONTAINER_SIZE is appropriate for most of your jobs. For jobs that consume large amounts of memory, such as joins or lookups, set APT_YARN_CONTAINER_SIZE to a value that is large enough that the jobs are not stopped. You can also consider setting APT_YARN_CONTAINER_SIZE_AUTO to true to allow InfoSphere DataStage® to estimate the required container sizes. (A sketch of setting these environment variables follows this procedure.)
        Default value: true

    yarn.nodemanager.resource.memory-mb
        Description: Sets the amount of physical memory that can be allocated for containers. Check this value and verify that it is set accurately based on the available memory of the data nodes in the cluster. If this parameter is not set high enough, the nodes run out of containers while a large amount of memory is still available on the computer.
        Default value: 8192 MB

    yarn.nodemanager.vmem-check-enabled
        Description: Determines whether virtual memory limits are enforced for containers. If this parameter is set to true, a job is stopped if a container uses more than the virtual memory limit that you specify. Set this parameter to false if you do not want jobs to fail when containers consume more memory than they are allocated.
        Default value: true

    yarn.nodemanager.vmem-pmem-ratio
        Description: Sets the ratio of virtual memory to physical memory limits for containers. If yarn.nodemanager.vmem-check-enabled is set to true, jobs might be stopped by YARN if the ratio of the virtual memory that a container consumes to its physical memory allocation is greater than the ratio that you specify. The value is specified as a float. For example, the default value of 2.1 allows a container to use up to 2.1 times as much virtual memory as physical memory.
        Default value: 2.1

    yarn.resourcemanager.nodemanagers.heartbeat-interval-ms
        Description: Controls how often the NodeManagers send heartbeats to the ResourceManager, which affects the start time for parallel jobs. For clusters that have fewer than 50 nodes, the default of 1000 ms is often too long and leads to a longer start time for parallel jobs. Set this value to 50 ms to ensure that parallel jobs start in a timely manner.
        Default value: 1000 ms
        Recommended value: 50 ms

    yarn.resourcemanager.work-preserving-recovery.enabled
        Description: Enables the ResourceManager's work-preserving recovery capabilities. Set this value to true so that InfoSphere DataStage jobs run smoothly when Hadoop is configured with a high availability topology.
        Default value: Varies between distributions of Hadoop
        Recommended value: true

    yarn.scheduler.capacity.maximum-am-resource-percent
        Description: Specifies the maximum percentage of resources across all queues in the cluster that can be used to run application masters, and controls the number of concurrent active applications. Limits on each queue are directly proportional to the queue capacities and user limits. The value is specified as a float; for example, 0.5 = 50 percent. You must use a period as the decimal separator.
        You can override this setting for individual queues by setting the yarn.scheduler.capacity.queue-path.maximum-am-resource-percent parameter.
        Increase the default value if you want to run many jobs concurrently. For example, set a higher value if you have an InfoSphere Data Click activity that offloads to many tables and runs many concurrent job processes.
        Default value: Varies between distributions of Hadoop

    yarn.scheduler.capacity.queue-path.maximum-am-resource-percent
        Description: Specifies the maximum percentage of resources for a single queue in the cluster that can be used to run application masters, and controls the number of concurrent active applications. Specify the queue as part of the parameter name, replacing the variable queue-path. This parameter overrides yarn.scheduler.capacity.maximum-am-resource-percent, which specifies the maximum percentage for all queues.
        Default value: Varies between distributions of Hadoop

    yarn.scheduler.increment-allocation-mb
        Description: Indicates the increment in which container sizes are allocated. If you submit tasks with resource requests lower than the minimum-allocation value, the requests are raised to the minimum-allocation value. If you submit tasks with resource requests that are not multiples of the increment-allocation value, the requests are rounded up to the nearest multiple of this increment value. Set the value of this parameter to be less than or equal to the value of the yarn.scheduler.minimum-allocation-mb parameter.
        Default value: 512 MB on Cloudera

    yarn.scheduler.minimum-allocation-mb
        Description: Helps conserve resources on the cluster by setting the minimum amount of memory that can be requested for a container. The amount of memory that is required for containers in an InfoSphere Information Server job varies depending on the job complexity, the stages used, and the data volume. Parallel jobs tend to require much less memory than the typical default for this setting, which causes resources to be wasted on the cluster. The default container size for parallel processes is 64 MB.
        Consider reducing the value of yarn.scheduler.minimum-allocation-mb if you find that parallel jobs use significantly less memory than the cluster setting.
        Note: If you change the yarn.scheduler.minimum-allocation-mb value with Ambari 2.1, you must specify whether the changes are applied to the MapReduce-specific resource settings. If you are significantly reducing the value of yarn.scheduler.minimum-allocation-mb, do not change the MapReduce values based on the new value, because doing so could cause MapReduce jobs to fail.
        Default value: 1024 MB for most Hadoop distributions
        Recommended value: 256 MB or less
  2. Run a sample MapReduce job on the cluster to verify that the job completes and that the configuration was successfully updated. An example command is shown after this procedure.
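
Example

The following sketch shows how some of the recommended values from Table 1 might look in yarn-site.xml. It is illustrative only: the property names and values are taken from Table 1, but whether you edit yarn-site.xml directly or change the settings through a management console such as Ambari or Cloudera Manager depends on your Hadoop distribution.

    <!-- Sketch only: values reflect the recommendations in Table 1. -->
    <property>
      <name>yarn.log-aggregation-enable</name>
      <value>false</value>
    </property>
    <property>
      <name>yarn.resourcemanager.nodemanagers.heartbeat-interval-ms</name>
      <value>50</value>
    </property>
    <property>
      <name>yarn.resourcemanager.work-preserving-recovery.enabled</name>
      <value>true</value>
    </property>
    <property>
      <name>yarn.scheduler.minimum-allocation-mb</name>
      <value>256</value>
    </property>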
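
If yarn.nodemanager.pmem-check-enabled is left at true, you can size the containers for memory-intensive jobs through the APT_YARN_CONTAINER_SIZE and APT_YARN_CONTAINER_SIZE_AUTO variables that are described in Table 1. The sketch below assumes that you set them as environment variables for the job run; the exact mechanism (project defaults, job parameters, or the dsenv file) and the units expected by APT_YARN_CONTAINER_SIZE depend on your InfoSphere Information Server configuration, so treat the value shown as a placeholder.

    # Assumption: environment variables for a memory-intensive job run; the value is a placeholder.
    export APT_YARN_CONTAINER_SIZE=1024
    # Alternatively, let InfoSphere DataStage estimate the required container sizes.
    export APT_YARN_CONTAINER_SIZE_AUTO=true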
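
One way to run a sample MapReduce job for the verification in step 2 is the pi example that ships with Hadoop distributions. The path to the examples JAR file is an assumption here and varies by distribution and version; locate hadoop-mapreduce-examples*.jar on your cluster.

    # Path shown is an example; adjust it for your Hadoop distribution.
    yarn jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar pi 2 10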