Submitting MapReduce jobs

Submit MapReduce jobs from the browser-based cluster management console or from the mrsh utility.

Procedure

  1. Job submission from the cluster management console.
    1. Log on to the cluster management console:
      1. Open the cluster management console, which is available by default at https://host_name:8443/platform.

        If you are unsure about which host the cluster management console is running on, run the egosh service list command and check the resource for the WEBGUI service.

      2. Log on with your user account. The default administrator account is Admin with password Admin.
    2. From the cluster management console Dashboard, select Workload > MapReduce > Jobs.
    3. Click New.

      The Submit Job window appears.

    4. Enter parameters for the job:
      1. Enter the following details:
        1. Application name: Choose an application from the drop-down list.
        2. Job priority: Set the priority for the job to a value between 1 and 10000 (default 5000).
        3. Application JAR file: Upload the application JAR file that is to be used for the job:
          • Click Add Local File to upload an application JAR file from your local host.
          • Click Add Server File to upload an application JAR file from the MapReduce server.
        4. Main class: Enter the class that is to be invoked.
        5. Main class options: Enter any additional options for the main class.
      2. To enter more job configuration parameters, click Add, choose an option from the drop-down list, and enter a value for the parameter. You can enter as many parameters as you want.

        For a list of supported properties and their default values, see Configuration properties for MapReduce jobs.

    5. Click Submit.
      Note: If you set the DFS_GUI_HOSTNAME variable, you can view job output through the cluster management console from the HDFS web interface (click Resources > Storage (HDFS)).
  2. Job submission from the mrsh utility.
    Besides submitting jobs by running your programs with Java™, you can use the mrsh utility to submit MapReduce jobs. The mrsh utility is a shell script that automatically sets up the environment for you.
    1. To submit a job using mrsh, from the command line, enter:

      mrsh jar jarfile [classname] [-Dproperty=value -Dproperty=value] [args]

      where:
      • jarfile specifies the file name of the application packaged as a JAR file that includes the MapReduce code.

      • (Optional) classname specifies the class to be invoked. If the class is not specified, the class that is specified by the JAR manifest is run.

      • (Optional) -Dproperty=value specifies settings for a job:
        • property specifies the name of a job configuration property.

        • value specifies the value for the job configuration property.

        For a list of supported properties and their default values, see Configuration properties for MapReduce jobs.
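The mrsh syntax above can be sketched as a small dry-run helper that assembles the command line and prints it instead of running it. This is only an illustration: the helper name build_mrsh_cmd, the JAR name wordcount.jar, the class com.example.WordCount, and the use of `--` to separate properties from program arguments are all assumptions made for this sketch, and the property mapreduce.job.reduces is a standard Hadoop 2.x name that should be checked against the list of supported properties.

```shell
#!/bin/sh
# Dry-run sketch: print an mrsh command line in the documented order
#   mrsh jar jarfile classname -Dproperty=value ... args
# Usage (hypothetical helper):
#   build_mrsh_cmd jarfile classname [property=value ...] [-- args ...]
build_mrsh_cmd() {
  jar=$1
  class=$2
  shift 2
  cmd="mrsh jar $jar $class"
  props_done=0
  for a in "$@"; do
    if [ "$props_done" -eq 0 ] && [ "$a" = "--" ]; then
      # Everything after "--" is passed through as a program argument.
      props_done=1
    elif [ "$props_done" -eq 0 ]; then
      # Each property=value pair becomes one -D option.
      cmd="$cmd -D$a"
    else
      cmd="$cmd $a"
    fi
  done
  printf '%s\n' "$cmd"
}

# Example (names are placeholders, not shipped samples):
#   build_mrsh_cmd wordcount.jar com.example.WordCount mapreduce.job.reduces=4 -- indir outdir
```

Printing the command first makes it easy to review the option order before submitting the job for real.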

  3. Job submission for samples.

    The MapReduce framework in IBM® Spectrum Symphony provides sample MapReduce applications and jobs, including Hadoop samples. All code samples and binary packages are placed under $PMR_HOME/version/os_type/samples/.

    Before running the samples, source the environment in the IBM Spectrum Symphony installation directory (/opt/ibm/spectrumcomputing by default):

    (bsh) . /opt/ibm/spectrumcomputing/profile.platform

    (csh) source /opt/ibm/spectrumcomputing/cshrc.platform
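For Bourne-compatible shells, the sourcing step can be wrapped in a quick sanity check, sketched below. The function name check_mr_env is hypothetical, and the sketch assumes that sourcing profile.platform sets PMR_HOME and puts mrsh on the PATH; adjust the checks if your installation behaves differently.

```shell
#!/bin/sh
# Sketch: source the environment and verify that the samples can be run.
# Usage: check_mr_env [install_dir]
# Return codes: 0 ok, 1 profile missing, 2 PMR_HOME unset, 3 mrsh not found.
check_mr_env() {
  ego_top=${1:-/opt/ibm/spectrumcomputing}
  if [ ! -r "$ego_top/profile.platform" ]; then
    echo "profile.platform not found under $ego_top" >&2
    return 1
  fi
  # Source into the current shell so the variables persist for later commands.
  . "$ego_top/profile.platform"
  if [ -z "${PMR_HOME:-}" ]; then
    echo "PMR_HOME is not set after sourcing profile.platform" >&2
    return 2
  fi
  if ! command -v mrsh >/dev/null 2>&1; then
    echo "mrsh is not on the PATH" >&2
    return 3
  fi
  return 0
}
```

Running check_mr_env with no argument uses the default installation directory; pass a different directory if the product is installed elsewhere.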

    Use a syntax similar to the following descriptions for each sample:
    • The Grep sample extracts matching strings from text files and counts how many times they occur. For example:

      mrsh jar $PMR_HOME/version/os_type/samples/hadoop-examples-2.7.2.jar grep indir outdir regex [group]

    • The Streaming utility creates and runs MapReduce jobs with any executable or script as the mapper, the reducer, or both. For example:

      mrsh jar $PMR_HOME/version/os_type/lib/hadoop-2.7.2/hadoop-streaming-2.7.2.jar \
        -input myInputDirs \
        -output myOutputDir \
        -mapper mapper.py \
        -reducer reducer.py \
        -file $PMR_HOME/version/os_type/samples/mapper.py \
        -file $PMR_HOME/version/os_type/samples/reducer.py

      Note: This example uses the JAR file for Hadoop 2.7.2. Use the JAR file corresponding to the Hadoop version that you are running. If you are running Cloudera, use the appropriate Cloudera JAR file.

      The -file parameter deploys a script file from the local host to all compute hosts when the file exists only locally. To use a command that is already available on every compute host, specify its absolute path instead, for example "-reducer /usr/bin/wc".
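Because the Streaming utility accepts any executable or script, the mapper and reducer can also be plain shell. The following is a minimal word-count sketch, not the shipped mapper.py and reducer.py: the function names map_words and reduce_counts are hypothetical, and each function body would normally live in its own script (for example, hypothetical mapper.sh and reducer.sh files passed with -mapper, -reducer, and -file). It relies on the usual Hadoop Streaming convention that keys and values are separated by a tab and that the reducer receives its input sorted by key.

```shell
#!/bin/sh
# Mapper sketch: emit "word<TAB>1" for every whitespace-separated word on stdin.
map_words() {
  tr -s '[:space:]' '\n' | awk 'NF { print $0 "\t1" }'
}

# Reducer sketch: input is sorted by key, so equal keys arrive on adjacent lines.
# Sum the counts of each run of equal keys and emit "word<TAB>total".
reduce_counts() {
  awk -F '\t' '
    $1 != key { if (NR > 1) print key "\t" sum; key = $1; sum = 0 }
    { sum += $2 }
    END { if (NR > 0) print key "\t" sum }
  '
}

# Local dry run that imitates the map, sort, and reduce phases without a cluster:
#   cat input.txt | map_words | LC_ALL=C sort | reduce_counts
```

Testing the pipeline locally in this way catches script errors before the job is submitted through mrsh.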

    • The DBCountPageView sample uses DBInputFormat for reading the input data from a database, and DBOutputFormat for writing the data to the database.

      The program first creates the necessary tables, populates the input table, and runs the MapReduce job. The input data is a mini access log, with a <url,referrer,time> schema. The output is the number of page views of each URL in the log, having the schema <url,pageview>.

      1. Copy a database driver (for example, ojdbc14.jar) into the $PMR_HOME/version/os_type/lib/ directory.
      2. Submit a job using the following syntax:
        dbcount driver database_url user password
        For example, for Oracle:
        mrsh jar $PMR_HOME/version/os_type/samples/hadoop-examples-2.7.2.jar dbcount oracle.jdbc.driver.OracleDriver jdbc:oracle:thin:@172.20.1.2:1521:orcl user password
      3. Log on to the database to check the results:
        Select * from MyAccess;
        Select * from Pageview;