Troubleshooting MapReduce
Review the following information for troubleshooting some common MapReduce issues.
Authentication failed during job submission
The following error occurs when submitting a MapReduce job:
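Exception caught: com.platform.symphony.soam.SoamException: Security error: Authentication failed. Incorrect user name or password.
To resolve this issue, check that the following properties specify a valid user name and password for the user who submits the job: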
<property>
  <name>mapreduce.job.login.user</name>
  <value>Admin</value>
  <description>The user to submit job.</description>
</property>
<property>
  <name>mapreduce.job.login.password</name>
  <value>Admin</value>
  <description>The password of submit user.</description>
</property>
Exception thrown when JVM option is set
If the maximum heap size, specified as a JVM option in the pmr-env.sh configuration file or in the application profile, conflicts with the io.sort.mb property, a NullPointerException or IndexOutOfBoundsException is thrown.
The io.sort.mb property (or mapreduce.task.io.sort.mb in Hadoop 2.4.x) specifies the total amount of buffer memory to use while sorting files. By default, this value is 100 MB. If the maximum heap size is smaller than the total amount of buffer memory, the application throws an exception. To avoid this issue, set compatible values for the maximum heap size and the total amount of buffer memory: twice the value of io.sort.mb cannot exceed the maximum heap size. For example, if you use the default 100 MB for io.sort.mb, set the maximum heap size to 200 MB or higher in the pmr-env.sh file or in the application profile.
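The sizing rule above can be checked before a job is submitted. The following is a minimal bash sketch, not part of the product: the function name heap_fits_sort_buffer is hypothetical, and both values are assumed to be whole numbers of megabytes that you supply yourself.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not a product command): succeeds when twice the
# io.sort.mb value does not exceed the maximum heap size, both in MB.
heap_fits_sort_buffer() {
  local heap_mb="$1"      # maximum heap size in MB (for example, from -Xmx)
  local io_sort_mb="$2"   # value of io.sort.mb in MB
  (( io_sort_mb * 2 <= heap_mb ))
}

# Default io.sort.mb of 100 MB with a 200 MB heap satisfies the rule.
if heap_fits_sort_buffer 200 100; then
  echo "OK: 200 MB heap is enough for io.sort.mb=100"
fi

# A 128 MB heap is too small for the same buffer size.
if ! heap_fits_sort_buffer 128 100; then
  echo "Too small: 128 MB heap conflicts with io.sort.mb=100"
fi
```

If the check fails, either raise the maximum heap size or lower io.sort.mb until the rule holds.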
Client shows no job progress and appears to hang
If the application manager goes down and then recovers, the client may not be able to detect it and may assume the application manager is lost, even though it has actually been successfully restarted. The cluster management console eventually shows that all of the tasks have finished, but the job is still open. (Since the client cannot get information from the new application manager that all tasks are finished, it cannot close the job.)
To enable the client to check back with the application manager more frequently and thereby determine if there is a new application manager, try decreasing the value of the TCP connection attribute PLATCOMMDRV_TCP_KEEPALIVE_TIME on the client side.
- Set the attribute as an environment variable. For example:
  - For tcsh:
    setenv PLATCOMMDRV_TCP_KEEPALIVE_TIME 180
  - For bash:
    export PLATCOMMDRV_TCP_KEEPALIVE_TIME=180
- Add the following line to the end of the pmr-env.sh file in $PMR_HOME/conf/:
  export PLATCOMMDRV_TCP_KEEPALIVE_TIME=180
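If you script the pmr-env.sh change, it helps to avoid appending the same line more than once. The following bash sketch shows one way to do that. The function name ensure_keepalive is hypothetical, and the sketch assumes the file already exists and is writable.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not a product command): appends the export line to an
# environment file unless the variable is already exported there.
ensure_keepalive() {
  local env_file="$1"   # path to the environment file, e.g. pmr-env.sh
  local seconds="$2"    # keepalive time to set
  if grep -q '^export PLATCOMMDRV_TCP_KEEPALIVE_TIME=' "$env_file"; then
    echo "already set in $env_file"
  else
    # Add a newline first if the file does not end with one, so the new
    # line is not glued to the previous line.
    if [ -n "$(tail -c 1 "$env_file")" ]; then
      echo >> "$env_file"
    fi
    printf 'export PLATCOMMDRV_TCP_KEEPALIVE_TIME=%s\n' "$seconds" >> "$env_file"
    echo "appended to $env_file"
  fi
}

# Example usage (assumes PMR_HOME is set for your installation):
# ensure_keepalive "$PMR_HOME/conf/pmr-env.sh" 180
```

The helper only detects an existing export line. It does not change a value that is already set, so edit the file by hand if you need a different value.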
Counters are different from Hadoop
Shuffle Errors
BAD_ID=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
Job Counters
Total time spent by all maps waiting after reserving slots (ms)=0
Total time spent by all reduces waiting after reserving slots (ms)=0
Rack-local map tasks=1
SLOTS_MILLIS_MAPS=6859
SLOTS_MILLIS_REDUCES=4874
Launched map tasks=1
Launched reduce tasks=1
Shuffle Errors
WRONG_PATH=0
where WRONG_PATH means that a reduce task tried to fetch an intermediate file that does not exist.
Duplicate messages in task logs
Messages relating to the combine function for a MapReduce job are duplicated in the map task and reduce task logs. This occurs because the combine function runs twice:
- After the map task and before output from the map task is sent to the host that will run the reduce task, and
- After the host that ran the reduce task gets data from the map and before it runs the reduce task.
As a result, messages from the combine function appear in the map task and reduce task logs.