Configuring jobs to run on Hadoop

To run jobs on Hadoop, an administrator must create and set the APT_YARN_CONFIG environment variable for each project.

Before you begin

Verify that the Linux computer on which you run the jobs has Java™ Development Kit (JDK) 1.7 installed. To check, log on to the Linux computer as the InfoSphere® DataStage® administrator and display the Java version. For example:
    su - dsadm
    java -version
If the reported version is not 1.7, install Java Development Kit 1.7.
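
The version check can also be scripted. The following is a minimal sketch, not part of the product: the function name java_version_prefix is hypothetical, and the sketch assumes that the first line of the java -version output has the form java version "1.7.0_80". Run it as the InfoSphere DataStage administrator (for example, after su - dsadm).

```shell
# Extract the major.minor prefix from the first line of `java -version` output.
# Assumed input form: java version "1.7.0_80"
java_version_prefix() {
  printf '%s\n' "$1" | sed -n 's/.*version "\([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}

# `java -version` writes to stderr, so redirect stderr to stdout before capturing.
if command -v java >/dev/null 2>&1; then
  first_line=$(java -version 2>&1 | head -n 1)
  if [ "$(java_version_prefix "$first_line")" = "1.7" ]; then
    echo "JDK 1.7 found"
  else
    echo "JDK 1.7 not found: $first_line"
  fi
else
  echo "java is not on the PATH"
fi
```

The parsing is kept in its own function so that it can be checked against a sample string without a Java installation.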

About this task

The APT_YARN_CONFIG environment variable provides a path for InfoSphere DataStage to read the yarnconfig.cfg file, which specifies all the environment variables that you need to run InfoSphere Information Server on Hadoop. To take advantage of the resource management functionality of Hadoop when running jobs, you must do the following:
  • Set APT_YARN_CONFIG.
  • Ensure that APT_YARN_CONFIG points to a yarnconfig.cfg file where APT_YARN_MODE is set to the default value of true or 1.
Otherwise, the jobs do not run on Hadoop. They run in the standard manner, without YARN resource management.
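
The two conditions above can be checked from a shell with a sketch like the following. It rests on several assumptions: that yarnconfig.cfg holds one NAME=value setting per line, that the values are written in lowercase, and that APT_YARN_CONFIG is exported in the current shell (in practice it is set per project in the Administrator client). The function name yarn_mode_enabled is hypothetical. The sketch also treats a missing APT_YARN_MODE line as disabled, which may be stricter than the product's actual behavior.

```shell
# Return 0 when the given file sets APT_YARN_MODE to true or 1.
# Assumes one NAME=value setting per line (an assumption about the file format).
yarn_mode_enabled() {
  grep -Eq '^[[:space:]]*APT_YARN_MODE[[:space:]]*=[[:space:]]*(true|1)[[:space:]]*$' "$1"
}

# Use the exported value if there is one; an empty value means "not set".
cfg="${APT_YARN_CONFIG:-}"
if [ -n "$cfg" ] && [ -r "$cfg" ] && yarn_mode_enabled "$cfg"; then
  echo "Jobs are configured to run on Hadoop"
else
  echo "Jobs will run in the standard manner, without YARN"
fi
```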

Procedure

  1. Open the InfoSphere DataStage and QualityStage Administrator client.
  2. In the Administrator window, click the Project tab.
  3. Select the project that you want to run on Hadoop. The default project is dstage1.

    For InfoSphere Information Analyzer, select the InfoSphere DataStage project that is set in the InfoSphere Information Analyzer global or project properties to be used by InfoSphere Information Analyzer. The default project is ANALYZERPROJECT.

  4. Click Properties.
  5. On the General tab, click Environment, and then click User Defined.
  6. Enter the following information to define an environment variable:
    1. For the name, enter APT_YARN_CONFIG.
    2. For the type, enter string.
    3. In the Prompt field, enter DataStage Hadoop Configuration file.
    4. In the Value field, enter /IS_install/Server/PXEngine/etc/yarn_conf/yarnconfig.cfg. IS_install is the InfoSphere Information Server installation directory. The default directory is /opt/IBM/InformationServer/.
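
As an illustration of step 6, the following sketch builds the value for the Value field from an installation directory. The function name yarn_config_path is hypothetical. The example call uses the default directory named above.

```shell
# Build the APT_YARN_CONFIG value from the installation directory (IS_install).
yarn_config_path() {
  # Strip one trailing slash so that the result contains no double slash.
  printf '%s/Server/PXEngine/etc/yarn_conf/yarnconfig.cfg\n' "${1%/}"
}

# With the default installation directory, this prints:
# /opt/IBM/InformationServer/Server/PXEngine/etc/yarn_conf/yarnconfig.cfg
yarn_config_path /opt/IBM/InformationServer/
```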

What to do next

Run a sample job with the new environment variable set, and verify that the job runs successfully. In the Operations Console, verify that the job logs contain messages that indicate that the job successfully connected to the YARN Application Master. If Hadoop is set up successfully, the project that contains the job that you created includes (Hadoop) after the project name.