Using mixed data sources
This tutorial guides you through the process of using different data sources and destinations for input and output within the MapReduce framework in IBM® Spectrum Symphony.
About this task
This sample includes two Java executables:
- HDFSToDBCountPageView: Reads input data from an HDFS file and writes output data to a database.
- DBToHDFSCountPageView: Reads input data from a database and writes output data to an HDFS file.
In this tutorial, you learn how to:
- Build the sample
- Package the sample
- Run the sample
- Walk through the code
Procedure
- Build the sample. Note: This sample only supports the Hadoop 0.21.0 API and the Oracle or MySQL database.
Before you build the sample, ensure that you complete the following steps:
- On each host, copy the driver JAR to the $PMR_HOME/version/os_type/lib directory. The packages required for compiling are located in the $PMR_HOME/version/os_type/lib/hadoop-0.21.0 directory.
- If required, add Hadoop settings to IBM Spectrum Symphony.
- Change to the root directory under the directory in which you installed IBM Spectrum Symphony Developer Edition. For example, if you used the default installation directory, change to the /opt/ibm/platformsymphonyde/de/de731 directory.
- Set the environment:
  - (csh) source cshrc.platform
  - (bash) . profile.platform
- Change to the $SOAM_HOME/mapreduce/version/samples/MixedData directory and run the make command:
  make
  The sample is compiled with the Hadoop 0.21.0 API.
- Package the sample.
You must package the files to create a service package. When you build the sample, the service package is automatically created for you.
- The service package for your Hadoop version, pmr-mixeddata-examples-0.21.0.jar, is created in the following directory:
  $SOAM_HOME/mapreduce/version/samples/pmr-mixeddata-examples
- Run the sample:
- To read data from an HDFS file and write data to a database, run HDFSToDBCountPageView using the following syntax:
$ mrsh jar $PMR_HOME/version/os_type/samples/pmr-mixeddata-examples-0.21.0.jar com.platform.mapreduce.examples.mixeddata.HDFSToDBCountPageView input_path driver_name db_url db_user db_pwd
where:
- input_path: Specifies the HDFS path to which the program generates data and from which it reads input data while running the MapReduce job.
- driver_name: Specifies the driver for the database to which output data is written.
- db_url: Specifies the URL for the database to which output data is written.
- db_user: Specifies the user name on the database server to which output data is written.
- db_pwd: Specifies the password for the user name on the database server to which output data is written.
For example:
- On Oracle:
$ mrsh jar $PMR_HOME/version/os_type/samples/pmr-mixeddata-examples-0.21.0.jar com.platform.mapreduce.examples.mixeddata.HDFSToDBCountPageView /tmp/Populated_Access_Data oracle.jdbc.driver.OracleDriver jdbc:oracle:thin:@127.0.0.1:1521:orcl dbuser dbpwd
- On MySQL:
$ mrsh jar $PMR_HOME/version/os_type/samples/pmr-mixeddata-examples-0.21.0.jar com.platform.mapreduce.examples.mixeddata.HDFSToDBCountPageView /tmp/Populated_Access_Data com.mysql.jdbc.Driver jdbc:mysql://127.0.0.1:3306/dbname dbuser dbpwd
The result is written to the table named Pageview in the database. The console displays a comparison of the input and output after the job is done (a JDBC sketch for inspecting this table follows these run steps). For example:
- totalPageview=71
- sumPageview=71
where:
- totalPageview specifies the number of access log records.
- sumPageview specifies the sum of URL page views.
- To read data from a database and write data to an HDFS file, run DBToHDFSCountPageView using the following syntax:
$ mrsh jar $PMR_HOME/version/os_type/samples/pmr-mixeddata-examples-0.21.0.jar com.platform.mapreduce.examples.mixeddata.DBToHDFSCountPageView driver_name db_url db_user db_pwd output_path
where:
- driver_name: Specifies the driver for the database from which input data is read.
- db_url: Specifies the URL for the database from which input data is read.
- db_user: Specifies the user name on the database server from which input data is read.
- db_pwd: Specifies the password for the user name on the database server from which input data is read.
- output_path: Specifies the HDFS path to which the program generates output data.
For example:
- On Oracle:
$ mrsh jar $PMR_HOME/version/os_type/samples/pmr-mixeddata-examples-0.21.0.jar com.platform.mapreduce.examples.mixeddata.DBToHDFSCountPageView oracle.jdbc.driver.OracleDriver jdbc:oracle:thin:@127.0.0.1:1521:orcl dbuser dbpwd /tmp/PageviewDataDir
- On MySQL:
$ mrsh jar $PMR_HOME/version/os_type/samples/pmr-mixeddata-examples-0.21.0.jar com.platform.mapreduce.examples.mixeddata.DBToHDFSCountPageView com.mysql.jdbc.Driver jdbc:mysql://127.0.0.1:3306/dbname dbuser dbpwd /tmp/PageviewDataDir
The result is written to the path /tmp/PageviewDataDir in HDFS. The console displays a comparison of the input and output after the job is done. For example:
- totalPageview=82
- sumPageview=82
where:
- totalPageview specifies the number of access log records.
- sumPageview specifies the sum of URL page views.
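To confirm the database write from HDFSToDBCountPageView, you can query the Pageview table directly with JDBC, outside the MapReduce framework. The code below is a minimal sketch and is not part of the sample: it assumes a MySQL connection and assumes the table exposes url and pageview columns, so adjust the driver class, connection URL, credentials, and column names to match your environment.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Hypothetical helper: prints every row of the Pageview table and sums the page-view counts.
// Column names (url, pageview) are assumptions; match them to the schema the sample created.
public class CheckPageview {
    public static void main(String[] args) throws Exception {
        Class.forName("com.mysql.jdbc.Driver");              // same driver_name passed to the sample
        String dbUrl = "jdbc:mysql://127.0.0.1:3306/dbname"; // same db_url passed to the sample
        try (Connection conn = DriverManager.getConnection(dbUrl, "dbuser", "dbpwd");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT url, pageview FROM Pageview")) {
            long sum = 0;
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                sum += rs.getLong(2);
            }
            System.out.println("sumPageview=" + sum);        // should match the sumPageview value that the job prints
        }
    }
}

If the sum printed here matches the sumPageview value that the job reports, the output was written to the database as expected.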
- Walk through the code. The source code for the Mixed Data Source examples is packaged in the pmr-mixeddata-examples-0.21.0.jar file, which includes the following files:
- HDFSToDBCountPageView.java: The main program that reads input data from HDFS and writes output data to the database.
- HDFSDataInputFormat.java: The InputFormat that reads from HDFS and supports HDFSToDBCountPageView.
- OracleDataOutputFormat.java: The OutputFormat that supports HDFSToDBCountPageView and handles Oracle SQL statement issues when inserting data, which the original DBOutputFormat does not support.
- DBToHDFSCountPageView.java: The main program that reads input data from the database and writes output data to HDFS.
- HDFSDataOutputFormat.java: The OutputFormat that writes to HDFS and supports DBToHDFSCountPageView.
- Locate the code samples on Linux® 64-bit hosts:
Table 1. Code samples
- pmr-mixeddata-examples-0.21.0.jar: $SOAM_HOME/mapreduce/version/os_type/samples/
- HDFSToDBCountPageView.java: $SOAM_HOME/mapreduce/version/samples/MixedData/com/platform/mapreduce/examples/mixeddata/HDFSToDBCountPageView.java
- HDFSDataInputFormat.java: $SOAM_HOME/mapreduce/version/samples/MixedData/com/platform/mapreduce/examples/mixeddata/HDFSDataInputFormat.java
- OracleDataOutputFormat.java: $SOAM_HOME/mapreduce/version/samples/MixedData/com/platform/mapreduce/examples/mixeddata/OracleDataOutputFormat.java
- DBToHDFSCountPageView.java: $SOAM_HOME/mapreduce/version/samples/MixedData/com/platform/mapreduce/examples/mixeddata/DBToHDFSCountPageView.java
- HDFSDataOutputFormat.java: $SOAM_HOME/mapreduce/version/samples/MixedData/com/platform/mapreduce/examples/mixeddata/HDFSDataOutputFormat.java
- Understand what the sample does.
The Mixed Data Source example implements the Simplified PageRank algorithm, which assigns a numerical weighting to each element of a hyperlinked set of documents (such as the World Wide Web) to measure its relative importance within the set. The input data is a mini access log with a <url, referrer, time> schema. The output is the number of page views of each URL in the log, with the schema <url, pageview>.
- HDFSToDBCountPageView reads input data from an HDFS file and writes output data to a database. The program creates the necessary tables, populates an input file, and runs the MapReduce job. Based on the specified parameters, the program generates data to the specified path, reads input data from that path, and writes the results to the specified database.
- DBToHDFSCountPageView reads input data from a database and writes output data to an HDFS file. This application has the same function as HDFSToDBCountPageView. Based on the specified parameters, the program reads data from the specified database and writes the results to the specified output path in HDFS.
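To make the counting logic concrete, the following is a minimal, self-contained sketch of a page-view counter written against the Hadoop 0.21.0 mapreduce API. It is not the shipped sample source: it reads plain <url,referrer,time> text lines with the stock file-based input and output formats, whereas the actual sample plugs the same map and reduce logic into HDFSDataInputFormat, OracleDataOutputFormat, and HDFSDataOutputFormat to mix HDFS and database endpoints. All class and field names in the sketch are illustrative.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Simplified page-view counter: emits one <url, pageview> record per distinct URL.
public class CountPageViewSketch {

    // Each input line is assumed to look like "url,referrer,time".
    public static class PageViewMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        private final Text url = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");
            if (fields.length > 0 && !fields[0].isEmpty()) {
                url.set(fields[0]);
                context.write(url, ONE);   // one access log record for this URL
            }
        }
    }

    // Sums the per-URL counts; the total across all URLs equals sumPageview.
    public static class PageViewReducer
            extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable v : values) {
                sum += v.get();
            }
            context.write(key, new LongWritable(sum));   // <url, pageview>
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "count page views");
        job.setJarByClass(CountPageViewSketch.class);
        job.setMapperClass(PageViewMapper.class);
        job.setCombinerClass(PageViewReducer.class);
        job.setReducerClass(PageViewReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input path
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output path
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

In the real sample, the same counting pattern applies, but the job sets its InputFormat or OutputFormat to the database-aware classes so that one side of the job reads from or writes to the database instead of HDFS.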