Mixed data source

The Mixed Data Source example consists of two MapReduce applications that run within the MapReduce framework in IBM® Spectrum Symphony. Each application reads its input from one kind of data source and writes its output to a different kind of destination.

The Mixed Data Source example implements the Simplified PageRank algorithm, which assigns a numerical weighting to each element of a hyperlinked set of documents, such as the World Wide Web, to measure its relative importance within the set. The input data is a mini access log with the schema <url,referrer,time>. The output is the number of page views of each URL in the log, with the schema <url,pageview>.
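The transformation from <url,referrer,time> records to <url,pageview> rows can be sketched in plain Java as follows. This is a minimal illustration, not the shipped sample code: the class name PageViewSketch, the comma-separated record layout, and the sample log lines are all assumptions made for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the page view count; not the shipped sample code.
class PageViewSketch {

    // "Map" step: extract the url field from one <url,referrer,time> record.
    // Assumes comma-separated fields, which the documentation does not specify.
    static String urlOf(String record) {
        return record.split(",", 3)[0].trim();
    }

    // "Reduce" step: add 1 to the running count for each url that is seen.
    static Map<String, Integer> countPageViews(List<String> records) {
        Map<String, Integer> pageviews = new TreeMap<>();
        for (String record : records) {
            pageviews.merge(urlOf(record), 1, Integer::sum);
        }
        return pageviews;
    }

    public static void main(String[] args) {
        // Made-up access log records for illustration only.
        List<String> log = Arrays.asList(
            "/index.html,/home.html,1001",
            "/about.html,/index.html,1002",
            "/index.html,/about.html,1003");
        // Each printed line corresponds to one <url,pageview> output row.
        countPageViews(log).forEach((url, n) -> System.out.println(url + "," + n));
    }
}
```

In the real applications this logic would be split between a mapper and a reducer and wired to HDFS or a database through the input and output formats; the sketch keeps everything in memory to show only the counting step.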

The program supports only the Hadoop 0.21.0 API and Oracle or MySQL databases. Before running this example within the MapReduce framework in IBM Spectrum Symphony, make sure that the database driver JAR has been copied to $PMR_HOME/version/os_type/lib on each host. You also need to add the Hadoop settings to IBM Spectrum Symphony.

HDFS to database

HDFSToDBCountPageView is a MapReduce program that reads input data from an HDFS file and writes output data to a database. The program creates the necessary tables, populates an input file, and then runs the MapReduce job.

The following command shows the usage of HDFSToDBCountPageView within the MapReduce framework in IBM Spectrum Symphony:
$ mrsh jar $PMR_HOME/version/os_type/samples/pmr-mixeddata-examples-0.21.0.jar com.platform.mapreduce.examples.mixeddata.HDFSToDBCountPageView -h
The following command runs the example:
$ mrsh jar $PMR_HOME/version/os_type/samples/pmr-mixeddata-examples-0.21.0.jar com.platform.mapreduce.examples.mixeddata.HDFSToDBCountPageView path_of_input driver_name db_url db_user db_pwd

where path_of_input is the HDFS path to which the program writes its generated input data and from which the job then reads it. The last four parameters identify the database to which the result is written.
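The four database parameters have the same meaning as in any JDBC program. The sketch below shows how such values would be used in plain JDBC; it is an illustration under that assumption, not a description of how the sample itself opens its connection. The class DbArgsSketch and its method names are hypothetical.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

// Hypothetical holder for the four database arguments; not part of the sample JAR.
final class DbArgsSketch {
    final String driverName;
    final String dbUrl;
    final String dbUser;
    final String dbPwd;

    DbArgsSketch(String driverName, String dbUrl, String dbUser, String dbPwd) {
        this.driverName = driverName;
        this.dbUrl = dbUrl;
        this.dbUser = dbUser;
        this.dbPwd = dbPwd;
    }

    // Reads the four values that follow path_of_input on the command line.
    static DbArgsSketch fromHdfsToDbArgs(String[] args) {
        if (args.length != 5) {
            throw new IllegalArgumentException(
                "expected: path_of_input driver_name db_url db_user db_pwd");
        }
        return new DbArgsSketch(args[1], args[2], args[3], args[4]);
    }

    // Standard JDBC: load the driver class, then open a connection.
    // Needs the driver JAR on the classpath and a reachable database.
    Connection connect() throws ClassNotFoundException, SQLException {
        Class.forName(driverName);
        return DriverManager.getConnection(dbUrl, dbUser, dbPwd);
    }
}
```

Calling connect() fails with ClassNotFoundException when the driver class named by driver_name is not on the classpath, which is why the driver JAR must be present on each host.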

For example, if the database is Oracle, the command will be as follows:
$ mrsh jar $PMR_HOME/version/os_type/samples/pmr-mixeddata-examples-0.21.0.jar com.platform.mapreduce.examples.mixeddata.HDFSToDBCountPageView /tmp/Populated_Access_Data oracle.jdbc.driver.OracleDriver jdbc:oracle:thin:@127.0.0.1:1521:orcl dbuser dbpwd
For MySQL, the command will be as follows:
$ mrsh jar $PMR_HOME/version/os_type/samples/pmr-mixeddata-examples-0.21.0.jar com.platform.mapreduce.examples.mixeddata.HDFSToDBCountPageView /tmp/Populated_Access_Data com.mysql.jdbc.Driver jdbc:mysql://127.0.0.1:3306/dbname dbuser dbpwd
The result is output to the table named Pageview in the database. The console displays the comparison of input and output after the job is done. For example:
  • - totalPageview=71
  • - sumPageview=71

totalPageview is the total number of access log records, and sumPageview is the sum of the page views across all URLs.
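Because every access log record contributes exactly one page view to exactly one URL, the two numbers should be equal when the job has neither dropped nor duplicated records. A minimal sketch of that comparison follows; the class PageviewCheckSketch and the per-URL counts are made up for illustration and are not taken from the sample.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper illustrating the totalPageview/sumPageview comparison.
class PageviewCheckSketch {

    // Sum of the pageview column across all <url,pageview> rows.
    static long sumPageview(Map<String, Integer> pageviews) {
        long sum = 0;
        for (int n : pageviews.values()) {
            sum += n;
        }
        return sum;
    }

    // True only if the record count and the summed page views agree.
    static boolean isConsistent(long totalPageview, Map<String, Integer> pageviews) {
        return totalPageview == sumPageview(pageviews);
    }

    public static void main(String[] args) {
        // Illustrative per-URL counts; the real split depends on the generated data.
        Map<String, Integer> pageviews = new HashMap<>();
        pageviews.put("/index.html", 40);
        pageviews.put("/about.html", 31);
        long totalPageview = 71; // number of access log records read
        System.out.println("totalPageview=" + totalPageview);
        System.out.println("sumPageview=" + sumPageview(pageviews));
        System.out.println("consistent=" + isConsistent(totalPageview, pageviews));
    }
}
```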

Database to HDFS

DBToHDFSCountPageView is a MapReduce program that reads input data from a database and writes output data to an HDFS file. It performs the same page view count as HDFSToDBCountPageView.

The following command shows the usage of DBToHDFSCountPageView:
$ mrsh jar $PMR_HOME/version/os_type/samples/pmr-mixeddata-examples-0.21.0.jar com.platform.mapreduce.examples.mixeddata.DBToHDFSCountPageView -h
The following command runs the example:
$ mrsh jar $PMR_HOME/version/os_type/samples/pmr-mixeddata-examples-0.21.0.jar com.platform.mapreduce.examples.mixeddata.DBToHDFSCountPageView driver_name db_url db_user db_pwd path_of_output

The first four parameters identify the database from which the input data is read; the program generates this data automatically. The path_of_output parameter is the HDFS path to which the program writes the result.

For example, if the database is Oracle, the command will be as follows:
$ mrsh jar $PMR_HOME/version/os_type/samples/pmr-mixeddata-examples-0.21.0.jar com.platform.mapreduce.examples.mixeddata.DBToHDFSCountPageView oracle.jdbc.driver.OracleDriver jdbc:oracle:thin:@127.0.0.1:1521:orcl dbuser dbpwd /tmp/PageviewDataDir
For MySQL, the command will be as follows:
$ mrsh jar $PMR_HOME/version/os_type/samples/pmr-mixeddata-examples-0.21.0.jar com.platform.mapreduce.examples.mixeddata.DBToHDFSCountPageView com.mysql.jdbc.Driver jdbc:mysql://127.0.0.1:3306/dbname dbuser dbpwd /tmp/PageviewDataDir
The result is output to the path /tmp/PageviewDataDir in HDFS. The console displays the comparison of input and output after the job is done. For example:
  • - totalPageview=82
  • - sumPageview=82

totalPageview is the total number of access log records, and sumPageview is the sum of the page views across all URLs.

Source code overview

The source code for the Mixed Data Source examples is included in the JAR file $PMR_HOME/version/os_type/samples/pmr-mixeddata-examples-0.21.0.jar, under the src subdirectory. The following table lists the source files.

File                         Description
HDFSToDBCountPageView.java   The main program that reads input data from HDFS and writes output data into the database.
HDFSDataInputFormat.java     The InputFormat that reads from HDFS and supports HDFSToDBCountPageView.
OracleDataOutputFormat.java  The OutputFormat that supports HDFSToDBCountPageView and handles Oracle SQL statement issues, when inserting data, that the original DBOutputFormat does not support.
DBToHDFSCountPageView.java   The main program that reads input data from the database and writes output data into HDFS.
HDFSDataOutputFormat.java    The OutputFormat that writes to HDFS and supports DBToHDFSCountPageView.

The dependent packages required for compiling are located at ${PMR_HOME}/version/os_type/lib/hadoop-0.21.0.