Troubleshooting MapReduce

Review the following information for troubleshooting some common MapReduce issues.

Tip: For more troubleshooting information for IBM® Spectrum Symphony, see Troubleshooting and FAQs.

Authentication failed during job submission

The following error occurs when submitting a MapReduce job:

Exception caught: com.platform.symphony.soam.SoamException: Security error: Authentication failed. Incorrect user name or password.

To resolve the error, edit the $PMR_HOME/conf/pmr-site.xml file and set the correct user name and password. For example:

<property>
  <name>mapreduce.job.login.user</name>
  <value>Admin</value>
  <description>The user to submit job.</description>
</property>

<property>
  <name>mapreduce.job.login.password</name>
  <value>Admin</value>
  <description>The password of submit user.</description>
</property>

Exception thrown when JVM option is set

If the maximum heap size, specified as a JVM option in the pmr-env.sh configuration file or in the application profile, conflicts with the value of the io.sort.mb property, a NullPointerException or IndexOutOfBoundsException is thrown.

The io.sort.mb property (or mapreduce.task.io.sort.mb in Hadoop 2.4.x) specifies the total amount of buffer memory to use while sorting files. By default, this value is 100 MB. If your maximum heap size is too small for this buffer, the application throws an exception. To avoid this issue, set compatible values for the maximum heap size and the total amount of buffer memory: twice the value of io.sort.mb cannot exceed the maximum heap size. For example, if you are using the default 100 MB for io.sort.mb, set 200 MB or higher as the maximum heap size in the pmr-env.sh file or in the application profile.
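The sizing rule above can be checked before a job is submitted. The following is a minimal bash sketch, not part of IBM Spectrum Symphony: the function name check_sort_buffer and its two arguments (maximum heap size and io.sort.mb, both in whole megabytes) are hypothetical and chosen only for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not a Symphony command): checks that twice the sort
# buffer size (io.sort.mb) does not exceed the maximum heap size.
# Both arguments are whole numbers of megabytes.
check_sort_buffer() {
  local heap_mb=$1 sort_mb=$2
  if (( sort_mb * 2 <= heap_mb )); then
    echo "ok"
  else
    echo "conflict: need a maximum heap size of at least $(( sort_mb * 2 )) MB"
    return 1
  fi
}

check_sort_buffer 200 100            # default io.sort.mb with a 200 MB heap
check_sort_buffer 128 100 || true    # heap too small for the default buffer
```

The first call passes because 2 x 100 MB fits in a 200 MB heap; the second reports a conflict and returns a nonzero status, which a wrapper script could use to stop before submission.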

Client shows no job progress and appears to hang

If the application manager goes down and then recovers, the client might not detect the recovery and might assume that the application manager is lost, even though it was restarted successfully. The cluster management console eventually shows that all of the tasks have finished, but the job is still open. (Because the client cannot learn from the new application manager that all tasks are finished, it cannot close the job.)

To enable the client to check back with the application manager more frequently and thereby determine if there is a new application manager, try decreasing the value of the TCP connection attribute PLATCOMMDRV_TCP_KEEPALIVE_TIME on the client side.

To adjust the value of PLATCOMMDRV_TCP_KEEPALIVE_TIME, do one of the following:

Set the attribute as an environment variable. For example:

For tcsh:

setenv PLATCOMMDRV_TCP_KEEPALIVE_TIME 180

For bash:

export PLATCOMMDRV_TCP_KEEPALIVE_TIME=180

Add the following line to the end of the pmr-env.sh file in $PMR_HOME/conf/:
```
export PLATCOMMDRV_TCP_KEEPALIVE_TIME=180
```
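If the setting is changed more than once, the file can end up with several export lines for the same variable. The following bash sketch shows one way to keep a single line; the helper name set_keepalive is hypothetical, and the demonstration runs against a scratch file rather than the real $PMR_HOME/conf/pmr-env.sh.

```shell
#!/usr/bin/env bash
# Hypothetical helper: sets PLATCOMMDRV_TCP_KEEPALIVE_TIME in an env file,
# replacing any previous export of the variable so that it appears only once.
set_keepalive() {
  local env_file=$1 seconds=$2 tmp
  tmp=$(mktemp)
  # Keep every line except an earlier export of the same variable.
  grep -v '^export PLATCOMMDRV_TCP_KEEPALIVE_TIME=' "$env_file" > "$tmp" || true
  echo "export PLATCOMMDRV_TCP_KEEPALIVE_TIME=$seconds" >> "$tmp"
  cat "$tmp" > "$env_file"   # overwrite the contents, keeping file permissions
  rm -f "$tmp"
}

# Demonstration on a scratch file; the real file is $PMR_HOME/conf/pmr-env.sh.
demo_file=$(mktemp)
echo '# existing settings' > "$demo_file"
set_keepalive "$demo_file" 240
set_keepalive "$demo_file" 180
cat "$demo_file"
```

After both calls, the scratch file still contains its original comment line followed by one export line with the most recent value.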

Counters are different from Hadoop

The counter information that the MapReduce framework in IBM Spectrum Symphony displays when a job finishes differs from the counter information that Hadoop displays. Because of implementation differences and limitations, the MapReduce framework in IBM Spectrum Symphony removes the following counters:

Shuffle Errors
       BAD_ID=0
       WRONG_LENGTH=0
       WRONG_MAP=0
       WRONG_REDUCE=0
Job Counters
       Total time spent by all maps waiting after reserving slots (ms)=0
       Total time spent by all reduces waiting after reserving slots (ms)=0
       Rack-local map tasks=1
       SLOTS_MILLIS_MAPS=6859
       SLOTS_MILLIS_REDUCES=4874
       Launched map tasks=1
       Launched reduce tasks=1

At the same time, it adds the following counter:

Shuffle Errors
       WRONG_PATH=0

where WRONG_PATH means that a reduce task tried to fetch an intermediate file that does not exist.

Duplicate messages in task logs

Messages relating to the combine function for a MapReduce job are duplicated in the map task and reduce task logs.

This occurs because setting the combiner class within the MapReduce framework in IBM Spectrum Symphony causes the combiner class to run twice:

After the map task and before output from the map task is sent to the host that will run the reduce task, and
After the host that ran the reduce task gets data from the map and before it runs the reduce task.

As a result, messages from the combine function appear in the map task and reduce task logs.
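The two combiner passes described above can be illustrated with a toy simulation in bash. This is not Symphony or Hadoop code: the combine function, its stage label, and the sample key and count data are all hypothetical, and awk stands in for a simple sum combiner that logs one message each time it runs.

```shell
#!/usr/bin/env bash
# Toy simulation, not Symphony code: a sum combiner over "key<TAB>count" lines.
# It logs one message to stderr per invocation, labeled with the calling stage.
combine() {
  echo "combine invoked during $1" >&2
  awk -F'\t' '{ sum[$1] += $2 } END { for (k in sum) printf "%s\t%d\n", k, sum[k] }' | sort
}

# Map side: the combiner runs on each map task's output before it is sent on.
map1=$(printf 'a\t1\na\t1\nb\t1\n' | combine "map task 1")
map2=$(printf 'a\t1\nb\t1\nb\t1\n' | combine "map task 2")

# Reduce side: the combiner runs again on the fetched map outputs,
# before the reduce function itself would run.
printf '%s\n%s\n' "$map1" "$map2" | combine "reduce task"
```

In this sketch the log message is written once per map task and once more on the reduce side, which mirrors the duplicated messages in the task logs. A sum combiner gives the same totals however many times it is applied, which is why running it in both places does not change the job result.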