Mixed data source
The Mixed Data Source example consists of two MapReduce applications that run within the MapReduce framework in IBM® Spectrum Symphony and use different data sources and destinations for their input and output.
The Mixed Data Source example implements the Simplified PageRank algorithm that assigns a numerical weighting to each element of a hyperlinked set of documents, such as the World Wide Web, with the purpose of measuring its relative importance within the set. The input data is a mini access log, with a <url,referrer,time> schema. The output is the number of page views of each URL in the log, having the schema <url,pageview>.
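The page view count described above can be sketched in plain Java, without the Hadoop API, as a map step that emits a (url, 1) pair for each log record and a reduce step that sums the pairs for each URL. This is a minimal illustration only: the class, method, and field names are hypothetical and are not taken from the example source code.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PageViewCountSketch {

    /** One access log record with the <url,referrer,time> schema (hypothetical type). */
    static final class AccessRecord {
        final String url;
        final String referrer;
        final long time;

        AccessRecord(String url, String referrer, long time) {
            this.url = url;
            this.referrer = referrer;
            this.time = time;
        }
    }

    /** Map step: emit a (url, 1) pair for a single log record. */
    static Map.Entry<String, Long> map(AccessRecord rec) {
        return new AbstractMap.SimpleEntry<>(rec.url, 1L);
    }

    /** Reduce step: sum the emitted values for each URL, giving <url,pageview>. */
    static Map<String, Long> reduce(List<Map.Entry<String, Long>> pairs) {
        Map<String, Long> pageviews = new TreeMap<>();
        for (Map.Entry<String, Long> pair : pairs) {
            pageviews.merge(pair.getKey(), pair.getValue(), Long::sum);
        }
        return pageviews;
    }

    /** Runs the map step over every record, then the reduce step over all pairs. */
    static Map<String, Long> countPageViews(List<AccessRecord> log) {
        List<Map.Entry<String, Long>> pairs = new ArrayList<>();
        for (AccessRecord rec : log) {
            pairs.add(map(rec));
        }
        return reduce(pairs);
    }

    public static void main(String[] args) {
        List<AccessRecord> log = Arrays.asList(
            new AccessRecord("/a", "/b", 1L),
            new AccessRecord("/b", "/a", 2L),
            new AccessRecord("/a", "/c", 3L));
        System.out.println(countPageViews(log)); // prints {/a=2, /b=1}
    }
}
```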
The program supports only the Hadoop 0.21.0 API and Oracle or MySQL databases. Before running this example within the MapReduce framework in IBM Spectrum Symphony, make sure that the database driver JAR has been copied to $PMR_HOME/version/os_type/lib on each host. You also need to add the Hadoop settings to IBM Spectrum Symphony.
HDFS to database
HDFSToDBCountPageView is a MapReduce program that reads input data from an HDFS file and writes the output data to a database. The program creates the necessary tables, populates an input file, and then runs the MapReduce job.
$ mrsh jar $PMR_HOME/version/os_type/samples/pmr-mixeddata-examples-0.21.0.jar com.platform.mapreduce.examples.mixeddata.HDFSToDBCountPageView -h
$ mrsh jar $PMR_HOME/version/os_type/samples/pmr-mixeddata-examples-0.21.0.jar com.platform.mapreduce.examples.mixeddata.HDFSToDBCountPageView path_of_input driver_name db_url db_user db_pwd
where path_of_input is the path where the program generates the input data and from which it reads that data while running. The last four parameters specify the database to which the result is written.
$ mrsh jar $PMR_HOME/version/os_type/samples/pmr-mixeddata-examples-0.21.0.jar com.platform.mapreduce.examples.mixeddata.HDFSToDBCountPageView /tmp/Populated_Access_Data oracle.jdbc.driver.OracleDriver jdbc:oracle:thin:@127.0.0.1:1521:orcl dbuser dbpwd
$ mrsh jar $PMR_HOME/version/os_type/samples/pmr-mixeddata-examples-0.21.0.jar com.platform.mapreduce.examples.mixeddata.HDFSToDBCountPageView /tmp/Populated_Access_Data com.mysql.jdbc.Driver jdbc:mysql://127.0.0.1:3306/dbname dbuser dbpwd
The results are stored in the Pageview table in the database. The console displays a comparison of the input and output after the job is done. For example:
- - totalPageview=71
- - sumPageview=71
The totalPageview value is the total number of access log records, and sumPageview is the sum of the page views of all URLs.
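Because every access log record contributes exactly one page view to exactly one URL, the two values are expected to match when the job ran correctly. A minimal sketch of that comparison follows. Only the names totalPageview and sumPageview come from the console output; the class and helper methods are hypothetical and do not reflect how the example program performs the check.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PageviewCheckSketch {

    /** Sum of the per-URL counts in the output, corresponding to sumPageview. */
    static long sumPageview(Map<String, Long> pageviews) {
        long sum = 0;
        for (long count : pageviews.values()) {
            sum += count;
        }
        return sum;
    }

    /** The output is consistent when the summed counts equal the number of input records. */
    static boolean isConsistent(long totalPageview, Map<String, Long> pageviews) {
        return totalPageview == sumPageview(pageviews);
    }

    public static void main(String[] args) {
        // Hypothetical job output for an input of three access log records.
        Map<String, Long> pageviews = new LinkedHashMap<>();
        pageviews.put("/a", 2L);
        pageviews.put("/b", 1L);

        long totalPageview = 3; // number of input records
        System.out.println("totalPageview=" + totalPageview);
        System.out.println("sumPageview=" + sumPageview(pageviews));
        System.out.println("consistent=" + isConsistent(totalPageview, pageviews));
    }
}
```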
Database to HDFS
DBToHDFSCountPageView reads input data from a database and writes output data into an HDFS file. This application performs the same page view count as HDFSToDBCountPageView, with the data source and destination reversed.
$ mrsh jar $PMR_HOME/version/os_type/samples/pmr-mixeddata-examples-0.21.0.jar com.platform.mapreduce.examples.mixeddata.DBToHDFSCountPageView -h
$ mrsh jar $PMR_HOME/version/os_type/samples/pmr-mixeddata-examples-0.21.0.jar com.platform.mapreduce.examples.mixeddata.DBToHDFSCountPageView driver_name db_url db_user db_pwd path_of_output
The first four variables specify the database from which the data is read. The input data is generated automatically. The path_of_output variable is the path to which the program writes the result.
$ mrsh jar $PMR_HOME/version/os_type/samples/pmr-mixeddata-examples-0.21.0.jar com.platform.mapreduce.examples.mixeddata.DBToHDFSCountPageView oracle.jdbc.driver.OracleDriver jdbc:oracle:thin:@127.0.0.1:1521:orcl dbuser dbpwd /tmp/PageviewDataDir
$ mrsh jar $PMR_HOME/version/os_type/samples/pmr-mixeddata-examples-0.21.0.jar com.platform.mapreduce.examples.mixeddata.DBToHDFSCountPageView com.mysql.jdbc.Driver jdbc:mysql://127.0.0.1:3306/dbname dbuser dbpwd /tmp/PageviewDataDir
- - totalPageview=82
- - sumPageview=82
The totalPageview value is the total number of access log records, and sumPageview is the sum of the page views of all URLs.
Source code overview
The source code for the Mixed Data Source examples is included in the JAR file $PMR_HOME/version/os_type/samples/pmr-mixeddata-examples-0.21.0.jar under the src subdirectory. The following table lists the source files in the JAR file.
File | Description |
---|---|
HDFSToDBCountPageView.java | The main program that reads input data from HDFS and writes output data into the database. |
HDFSDataInputFormat.java | The InputFormat that reads from HDFS and supports HDFSToDBCountPageView. |
OracleDataOutputFormat.java | The OutputFormat that supports HDFSToDBCountPageView and handles Oracle SQL statement issues when inserting data, which the original DBOutputFormat does not support. |
DBToHDFSCountPageView.java | The main program that reads input data from the database and writes output data into HDFS. |
HDFSDataOutputFormat.java | The OutputFormat that writes to HDFS and supports DBToHDFSCountPageView. |
The dependent packages required for compiling are located at ${PMR_HOME}/version/os_type/lib/hadoop-0.21.0.
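The table does not say which Oracle SQL statement issues OracleDataOutputFormat handles. One commonly reported problem with the DBOutputFormat in older Hadoop releases is that the INSERT statement it generates ends with a semicolon, which the Oracle JDBC driver rejects. Assuming that this is the kind of issue involved, the following sketch shows how a parameterized INSERT statement can be built without the trailing semicolon. The class and method names are hypothetical; the table and column names follow the Pageview table and the <url,pageview> schema.

```java
import java.util.Collections;

public class InsertStatementSketch {

    /**
     * Builds a parameterized INSERT statement with one "?" placeholder per field
     * and no trailing semicolon, so it can be passed to a JDBC PreparedStatement.
     */
    static String buildInsert(String table, String... fieldNames) {
        if (fieldNames.length == 0) {
            throw new IllegalArgumentException("At least one field name is required");
        }
        String columns = String.join(",", fieldNames);
        String placeholders = String.join(",", Collections.nCopies(fieldNames.length, "?"));
        return "INSERT INTO " + table + " (" + columns + ") VALUES (" + placeholders + ")";
    }

    public static void main(String[] args) {
        // prints INSERT INTO Pageview (url,pageview) VALUES (?,?)
        System.out.println(buildInsert("Pageview", "url", "pageview"));
    }
}
```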