Sqoop: Big data conduit between NoSQL and RDBMS

Surajit Paul (surajit.paul@in.ibm.com), Advisory Consultant, IBM

Summary:  Sqoop is an integral part of a Hadoop ecosystem, helping transfer data between NoSQL data storage and the traditional RDBMS. Numerous technical articles have been published featuring the Sqoop command-line interface (CLI) usage. However, as of Sqoop 1.4.3, there is not much insight publicly available about the usage of the Sqoop Java API. This article covers the usage of the Sqoop CLI, with additional emphasis on the Sqoop Java API, using an example of data from the Bombay Stock Exchange. The article is intended to provide preliminary exposure to technical architects, solution architects, technical managers, consultants, data scientists, technical leads, and developers interested in and working in the big data space.

Date:  23 Jul 2013
Level:  Intermediate

Apache Sqoop is designed to transfer data efficiently between a Hadoop ecosystem and structured data storage in a relational database management system such as MySQL, Oracle, Microsoft SQL Server, PostgreSQL, DB2, and others. As an integral component of the Hadoop ecosystem, Sqoop executes its tasks by launching MapReduce jobs, which provide highly fault-tolerant, distributed, parallel processing. Another benefit of Sqoop is that it fully automates the transfer of large volumes of structured or semi-structured data.


Figure 1. Sqoop data flow architecture

In this article, we will explore various means of executing the tasks, such as using the Sqoop CLI and the Java™ API. We will see how to import data from RDBMS, manipulate the data in the Hadoop environment, and export the manipulated data back to the RDBMS tables. We will use 12 years of Sensex data for indices BSE30 and BSE FMCG for this analysis.

Prerequisites

The example is executed on a pseudo-distributed Hadoop ecosystem configured on the Red Hat Enterprise Linux® 6 (RHEL6) OS. A detailed discussion of how to configure the relevant frameworks — Apache Hadoop, ZooKeeper, HBase, Hive, and others — is beyond the scope of this article. The prerequisites are:

  • Distributed/pseudo-distributed Hadoop 1.1.2 installed and running
  • MySQL database installed and running
  • HBase 0.94.5
  • Hive 0.10.0
  • Sqoop 1.4.3
  • ZooKeeper 3.4.5
  • A shared meta-store schema in MySQL

Data preparation in MySQL database


Listing 1. SQL script to create the bse30 and bsefmcg tables in the database
         

CREATE TABLE bse30 (
date DATE PRIMARY KEY NOT NULL,
index_name VARCHAR(20) NOT NULL,
open DOUBLE,
high DOUBLE,
low DOUBLE,
close DOUBLE);


CREATE TABLE bsefmcg (
date DATE PRIMARY KEY NOT NULL,
index_name VARCHAR(20) NOT NULL,
open DOUBLE,
high DOUBLE,
low DOUBLE,
close DOUBLE);


Listing 2. SQL script to load data into the tables
          
LOAD DATA LOCAL INFILE '/home/surajit/hadoop/hivestuff/BSE30.csv'
INTO TABLE hadoop.bse30
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
(@datVar, `index_name`, `open`, `high`, `low`, `close`)
SET `date` = STR_TO_DATE(@datVar, '%d-%b-%y');

LOAD DATA LOCAL INFILE '/home/surajit/hadoop/hivestuff/BSEFMCG.csv'
INTO TABLE hadoop.bsefmcg
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
(@datVar, `index_name`, `open`, `high`, `low`, `close`)
SET `date` = STR_TO_DATE(@datVar, '%d-%b-%y');


Listing 3. A snapshot of the bsefmcg table

+------------+--------------+---------+---------+---------+---------+
| date       | index_name   | open    | high    | low     | close   |
+------------+--------------+---------+---------+---------+---------+
| 2012-12-03 | S&P BSE FMCG | 6030.46 | 6045.94 | 6001.85 | 6017.39 |
| 2012-12-04 | S&P BSE FMCG | 6021.03 | 6040.69 |  5992.6 | 6031.57 |
| 2012-12-05 | S&P BSE FMCG | 6059.01 |  6079.9 | 6030.04 | 6045.15 |
| 2012-12-06 | S&P BSE FMCG | 6071.92 | 6087.24 | 5994.67 |  6075.4 |
| 2012-12-07 | S&P BSE FMCG | 6081.08 |  6102.7 | 6041.48 | 6066.73 |
| 2012-12-10 | S&P BSE FMCG | 6054.81 | 6092.25 | 6046.52 | 6081.14 |
| 2012-12-11 | S&P BSE FMCG | 6090.31 |  6176.5 |  6083.6 | 6142.21 |
| 2012-12-12 | S&P BSE FMCG | 6148.52 | 6171.81 | 6113.93 | 6132.31 |
| 2012-12-13 | S&P BSE FMCG | 6124.06 | 6124.06 | 5960.12 | 5970.55 |
| 2012-12-14 | S&P BSE FMCG | 5950.67 | 5996.69 | 5913.77 |  5975.6 |
| 2012-12-17 | S&P BSE FMCG | 5973.57 |  5979.5 | 5929.84 | 5944.28 |
| 2012-12-18 | S&P BSE FMCG | 5938.92 | 5988.08 | 5897.83 | 5964.05 |
| 2012-12-19 | S&P BSE FMCG |  5962.3 | 5983.85 | 5928.04 | 5941.64 |
| 2012-12-20 | S&P BSE FMCG | 5942.52 | 5970.25 | 5914.15 |  5949.4 |
| 2012-12-21 | S&P BSE FMCG | 5929.25 | 5949.62 |  5883.6 | 5924.29 |
| 2012-12-24 | S&P BSE FMCG |  5956.1 | 5963.48 | 5910.51 | 5920.24 |
| 2012-12-26 | S&P BSE FMCG | 5923.69 | 5963.57 | 5900.83 | 5945.29 |
| 2012-12-27 | S&P BSE FMCG | 5971.29 | 5973.44 |  5903.4 | 5916.24 |
| 2012-12-28 | S&P BSE FMCG | 5907.14 | 5944.86 | 5907.14 | 5932.07 |
| 2012-12-31 | S&P BSE FMCG | 5939.92 | 5939.92 |  5902.9 | 5916.22 |
+------------+--------------+---------+---------+---------+---------+


Data migration

Importing data into HDFS

Sqoop reads data from a database table row by row and uploads it into the HDFS file system. The import process runs in parallel and writes files in various formats — binary Avro, SequenceFile, or delimited text (for example, comma- or tab-separated) — as per the configured parameters. As a by-product of the import process, a Java class is generated that can parse a text file and serialize or de-serialize a SequenceFile or Avro file. Sqoop can be used with any database for which a JDBC driver is available. Download the appropriate JDBC driver for the database you want to import from, and install the JAR file in the $SQOOP_HOME/lib directory on the client machine.
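
The by-product class can also be generated on its own through the Java API, without running an import. The following is a minimal sketch, assuming that Sqoop 1.4.3 exposes this step as CodeGenTool in the com.cloudera.sqoop.tool package (mirrored under org.apache.sqoop.tool in newer builds) and that it accepts the same SqoopOptions used throughout this article; verify the class and package names against your Sqoop installation.

// A hedged sketch (not from the article): generate only the record class for a
// table, without running an import. CodeGenTool is assumed to be available in
// Sqoop 1.4.3 under com.cloudera.sqoop.tool; verify against your installation.
import com.cloudera.sqoop.SqoopOptions;
import com.cloudera.sqoop.tool.CodeGenTool;

public class GenerateRecordClass {
    public static void main(String[] args) {
        SqoopOptions options = new SqoopOptions("jdbc:mysql://localhost:3306/hadoop", "bse30");
        options.setDriverClassName("com.mysql.jdbc.Driver");
        options.setUsername("root");
        options.setPassword("root");
        // Directory where the generated class artifacts are written
        options.setJarOutputDir("/home/surajit/tmp/sqoop-surajit/compile");

        int retVal = new CodeGenTool().run(options);
        System.out.println(retVal == 0 ? "Code generation succeeded" : "Code generation failed");
    }
}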


Listing 4. Sqoop command-line script to import data into the HDFS file system

$ bin/sqoop import --connect jdbc:mysql://localhost:3306/hadoop 
   --driver com.mysql.jdbc.Driver --username root --password root --table bse30

$ bin/sqoop import --connect jdbc:mysql://localhost:3306/hadoop 
   --driver com.mysql.jdbc.Driver --username root --password root --table bsefmcg

The import tool imports an individual table from an RDBMS into HDFS. Each row of the table is represented as a separate record in HDFS. Records can be stored as text files (one record per line) or in binary representation as Avro data files or SequenceFiles. The --connect argument takes the database connection URL as input. If Sqoop is used in a distributed Hadoop cluster, localhost can be replaced with the host name or IP address of the remote database server; the connection URL is then used by the TaskTracker nodes throughout the Hadoop cluster. Additionally, the following optional arguments can be used:

  • --verbose: Print more information while working
  • --connection-param-file <filename>: Optional properties file that provides connection parameters
  • --append: Append data to an existing dataset in HDFS
  • --as-avrodatafile: Imports data to Avro data files
  • --as-sequencefile: Imports data to SequenceFiles
  • --as-textfile: Imports data as plain text (default)
  • --columns <col,col,col…>: Columns to import from table, in case entire table is not required
  • --direct: Use direct import fast path
  • --direct-split-size <n>: Split the input stream every n bytes when importing in direct mode
  • --fetch-size <n>: Number of entries to read from database at once
  • --inline-lob-limit <n>: Set the maximum size for an inline LOB
  • -m,--num-mappers <n>: Use n map tasks to import in parallel; by default, four tasks are used
  • -e,--query <statement>: Import the results of statement
  • --split-by <column-name>: Column of the table used to split work units
  • --table <table-name>: Table to read
  • --target-dir <dir>: HDFS destination dir
  • --warehouse-dir <dir>: HDFS parent for table destination
  • --where <where clause>: Clause to use during import; filters records to be fetched
  • -z,--compress: Enable compression
  • --compression-codec <c>: Use Hadoop codec (default gzip)
  • --null-string <null-string>: String to be written for a null value for string columns
  • --null-non-string <null-string>: String to be written for a null value for non-string columns


While performing parallel imports, you must specify a criterion that can split the workload. Sqoop uses a splitting column to split the workload. By default, Sqoop identifies the primary key column (if present) in a table and uses it as the splitting column. The high and low values for the splitting column are retrieved from the database, and the map tasks operate on evenly sized components of the total range. For example, if the table has a primary key column of ID whose minimum value is 0 and maximum value is 1,000, and Sqoop is directed to use four tasks, Sqoop would run four processes that each execute SQL statements of the form SELECT * FROM sometable WHERE id >= lo AND id < hi, with (lo, hi) set to (0, 250), (250, 500), (500, 750), and (750, 1001) in the different tasks. If the actual values for the primary key are not uniformly distributed across its range, this can result in unbalanced tasks. In that case, explicitly choose a different column with the --split-by argument (for example, --split-by employee_id). Currently, Sqoop cannot split on multi-column indices. If the table has no index column, or has a multi-column key, a splitting column must be specified manually.
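
Several of the optional import arguments listed above map directly to setters on SqoopOptions in the Java API. The fragment below is a minimal sketch, assuming that setColumns, setWhereClause, and setSplitByCol are the corresponding setter names in Sqoop 1.4.3 (the remaining setters are the ones used in Listing 5); treat the method names, and the illustrative target directory, as assumptions to verify against your Sqoop version.

// A hedged sketch: the Java API equivalent of
//   --columns date,close --where "close > 6000" --split-by date -m 2 --target-dir ...
// Setter names (setColumns, setWhereClause, setSplitByCol) are assumptions to
// verify against the SqoopOptions class shipped with your Sqoop version.
import com.cloudera.sqoop.SqoopOptions;
import com.cloudera.sqoop.tool.ImportTool;

public class SelectiveImportSketch {
    public static void main(String[] args) {
        SqoopOptions options = new SqoopOptions("jdbc:mysql://localhost:3306/hadoop", "bsefmcg");
        options.setDriverClassName("com.mysql.jdbc.Driver");
        options.setUsername("root");
        options.setPassword("root");
        options.setColumns(new String[] {"date", "close"});    // --columns date,close
        options.setWhereClause("close > 6000");                // --where "close > 6000"
        options.setSplitByCol("date");                         // --split-by date
        options.setNumMappers(2);                              // -m 2
        options.setTargetDir("hdfs://localhost:9000/user/surajit/bsefmcg_subset"); // --target-dir (illustrative path)

        int retVal = new ImportTool().run(options);
        System.out.println(retVal == 0 ? "Import succeeded" : "Import failed");
    }
}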


Listing 5. Various JDBC and SqoopOptions configuration parameters

/* CONSTANTS */
private static final String JOB_NAME = "Sqoop HDFS Job";
private static final String MAPREDUCE_JOB = "HDFS Map Reduce Job";
private static final String DBURL = "jdbc:mysql://localhost:3306/hadoop";
private static final String DRIVER = "com.mysql.jdbc.Driver";
private static final String USERNAME = "root";
private static final String PASSWORD = "root";	
private static final String HADOOP_HOME = "/home/surajit/hadoop/hadoop-1.1.2";
private static final String HDFS_DIR = "/user/surajit/";
private static final String JAR_OUTPUT_DIR = "/home/surajit/tmp/sqoop-surajit/compile";
private static final String TARGET_DIR = "hdfs://localhost:9000/user/surajit/";
private static final String SUCCESS = "SUCCESS !!!";
private static final String FAIL = "FAIL !!!";

/**
  * Imports data from RDBMS MySQL and uploads into the Hadoop file system
  * @param table RDBMS table name
  */
public void importToHadoop(String table){
	log.info("Importing data into HDFS .....");
	SqoopOptions options = new SqoopOptions(DBURL,table);
	options.setDriverClassName(DRIVER);
	options.setUsername(USERNAME);
	options.setPassword(PASSWORD);
	options.setFieldsTerminatedBy('\t');
	options.setHadoopMapRedHome(HADOOP_HOME);
	options.setJobName(JOB_NAME);
	options.setLinesTerminatedBy('\n');
	options.setMapreduceJobName(MAPREDUCE_JOB);
	options.setTableName(table);
	options.setJarOutputDir(JAR_OUTPUT_DIR);
	options.setTargetDir(TARGET_DIR + table);
	
	ImportTool it = new ImportTool();
	int retVal = it.run(options);
	if(retVal == 0){
		log.info(SUCCESS);
	}
	else{
		log.info(FAIL);
	}
}
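
For completeness, importToHadoop() can be driven from a simple main() method. The sketch below assumes the method lives in a class named SqoopHdfsJob alongside the log field and constants shown above; the class name is hypothetical and not taken from the article.

// A hedged usage sketch: run the HDFS import for both tables used in this
// article. SqoopHdfsJob is a hypothetical wrapper class for importToHadoop().
public class SqoopHdfsJobDriver {
    public static void main(String[] args) {
        SqoopHdfsJob job = new SqoopHdfsJob();
        job.importToHadoop("bse30");    // lands in hdfs://localhost:9000/user/surajit/bse30
        job.importToHadoop("bsefmcg");  // lands in hdfs://localhost:9000/user/surajit/bsefmcg
    }
}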

Exporting data from HDFS

The export tool exports a set of files from HDFS back to an RDBMS table. The target table must already exist in the database. The input files are read and parsed into a set of records according to the specified delimiters. The default operation transforms these into a set of INSERT statements that inject the records into the database. In update mode, Sqoop will generate UPDATE statements that replace existing records in the database, and in call mode, Sqoop will make a stored procedure call for each record.

Exports are performed by multiple writers in parallel, and each writer uses a separate connection and transaction. Sqoop uses the multi-row INSERT syntax to insert up to 100 records per statement. Every 100 statements, the current transaction within a writer task is committed, causing a commit every 10,000 rows. This ensures that transaction buffers do not grow without bound and cause out-of-memory conditions. Therefore, an export is not an atomic process: partial results from the export will become visible before the export is complete.


Listing 6. Sqoop command-line script to export data from the HDFS file system

$ bin/sqoop export --connect jdbc:mysql://localhost:3306/hadoop 
  --driver com.mysql.jdbc.Driver --username root --password root --table bse30 
  --export-dir /user/surajit/bse30

$ bin/sqoop export --connect jdbc:mysql://localhost:3306/hadoop 
  --driver com.mysql.jdbc.Driver --username root --password root --table bsefmcg 
  --export-dir /user/surajit/bsefmcg

The following arguments can also be used while performing an export operation:

  • --export-dir <dir>: HDFS source path for the export
  • -m,--num-mappers <n>: Use n map tasks to export in parallel
  • --table <table-name>: Table to populate
  • --call <stored-proc-name>: Stored procedure to call
  • --update-key <col-name>: Anchor column to use for updates; use a comma-separated list of columns if there is more than one column
  • --update-mode <mode>: Specify how updates are performed when new rows are found with non-matching keys in the database; valid values are updateonly (default) and allowinsert (a Java API sketch of these update options follows this list)
  • --input-null-string <null-string>: String to be interpreted as null for string columns
  • --input-null-non-string <null-string>: String to be interpreted as null for non-string columns
  • --staging-table <staging-table-name>: Table in which data will be staged before being inserted into the destination table
  • --clear-staging-table: Indicates that any data present in the staging table can be deleted
  • --batch: Use batch mode for underlying statement execution
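
The update arguments also have counterparts in the Java API. The fragment below is a minimal sketch of an upsert-style export, assuming setUpdateKeyCol is the SqoopOptions setter corresponding to --update-key in Sqoop 1.4.3 (the remaining setters appear in Listing 7); verify the method name against your Sqoop version.

// A hedged sketch: export with upsert semantics, roughly equivalent to
//   --update-key date --update-mode allowinsert --export-dir /user/surajit/bse30
// setUpdateKeyCol is an assumed setter name; verify against your Sqoop version.
import com.cloudera.sqoop.SqoopOptions;
import com.cloudera.sqoop.tool.ExportTool;

public class UpsertExportSketch {
    public static void main(String[] args) {
        SqoopOptions options = new SqoopOptions("jdbc:mysql://localhost:3306/hadoop", "bse30");
        options.setDriverClassName("com.mysql.jdbc.Driver");
        options.setUsername("root");
        options.setPassword("root");
        options.setExportDir("/user/surajit/bse30");                  // --export-dir
        options.setFieldsTerminatedBy(',');
        options.setUpdateKeyCol("date");                              // --update-key date
        options.setUpdateMode(SqoopOptions.UpdateMode.AllowInsert);   // --update-mode allowinsert

        int retVal = new ExportTool().run(options);
        System.out.println(retVal == 0 ? "Export succeeded" : "Export failed");
    }
}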

Listing 7. Various JDBC and SqoopOptions configuration parameters

/* CONSTANTS */
private static final String JOB_NAME = "Sqoop HDFS Job";
private static final String MAPREDUCE_JOB = "HDFS Map Reduce Job";
private static final String DBURL = "jdbc:mysql://localhost:3306/hadoop";
private static final String DRIVER = "com.mysql.jdbc.Driver";
private static final String USERNAME = "root";
private static final String PASSWORD = "root";	
private static final String HADOOP_HOME = "/home/surajit/hadoop/hadoop-1.1.2";
private static final String HDFS_DIR = "/user/surajit/";
private static final String JAR_OUTPUT_DIR = "/home/surajit/tmp/sqoop-surajit/compile";
private static final String TARGET_DIR = "hdfs://localhost:9000/user/surajit/";
private static final String SUCCESS = "SUCCESS !!!";
private static final String FAIL = "FAIL !!!";	

/**
 * Exports data from the HDFS file system to the RDBMS / MySQL schema
  * @param table RDBMS table name
 */
public void exportFromHadoop(String table){
	log.info("Exporting data from HDFS .....");
		
	/* RDBMS table columns in order */
	String outCols[] = {"month", "index_name", "open", "high", "low", "close"};
	SqoopOptions options = new SqoopOptions(DBURL, table);
	options.setDriverClassName(DRIVER);
	options.setUsername(USERNAME);
	options.setPassword(PASSWORD);
	options.setFieldsTerminatedBy(',');
	options.setHadoopMapRedHome(HADOOP_HOME);
	options.setJobName(JOB_NAME);
	options.setLinesTerminatedBy('\n');
	options.setMapreduceJobName(MAPREDUCE_JOB);
	options.setTableName(table);
	options.setJarOutputDir(JAR_OUTPUT_DIR);
	options.setClearStagingTable(true);
	options.setExportDir(HDFS_DIR + table);
	options.setDbOutputColumns(outCols);
	options.setUpdateMode(SqoopOptions.UpdateMode.AllowInsert);		
	ExportTool it = new ExportTool();
	int retVal = it.run(options);
	if(retVal == 0){
		log.info(SUCCESS);
	}
	else{
		log.info(FAIL);
	}
    }

Importing data into Hive

A Hive metastore must be associated with HDFS for Sqoop to import data into Hive; Sqoop defines the data's layout in Hive by generating and executing a CREATE TABLE statement. The --hive-import option on the Sqoop command line imports data into Hive. If the Hive table already exists, --hive-overwrite replaces the existing table. After the data is imported into HDFS, Sqoop generates a Hive script containing a CREATE TABLE operation that defines the columns using Hive's types, and a LOAD DATA INPATH statement to move the data files into Hive's warehouse directory.


Listing 8. Sqoop command-line script to create Hive tables from RDBMS tables

$ bin/sqoop create-hive-table --connect jdbc:mysql://localhost:3306/hadoop 
  --driver com.mysql.jdbc.Driver --username root --P --table bse30

$ bin/sqoop create-hive-table --connect jdbc:mysql://localhost:3306/hadoop 
  --driver com.mysql.jdbc.Driver --username root --P --table bsefmcg

The create-hive-table tool populates a Hive metastore with a table definition based on a database table that was previously imported to HDFS or that will be imported. This effectively performs the --hive-import step of sqoop-import without running the preceding import. If the data is already loaded in HDFS, this tool can be used to finish the pipeline of importing it into Hive. Hive tables can also be created with this tool and then populated after a preprocessing step. Additionally, the following optional arguments can be used; a Java API sketch of the create-hive-table step follows the list:

  • --hive-table <table-name>: Sets the table name to use when importing to Hive
  • --hive-drop-import-delims: Drops \n, \r, and \01 from string fields when importing to Hive
  • --hive-delims-replacement: Replace \n, \r, and \01 from string fields with user-defined string when importing to Hive
  • --hive-partition-key: Name of the Hive field that the partition is sharded on
  • --hive-partition-value <v>: String value that serves as the partition key for data imported into Hive in this job
  • --map-column-hive <map>: Override default mapping from SQL type to Hive type for configured columns
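
The create-hive-table step can also be driven from Java. The following is a minimal sketch, assuming that Sqoop 1.4.3 exposes the tool as CreateHiveTableTool in the com.cloudera.sqoop.tool package and that it accepts the same SqoopOptions used elsewhere in this article; the Hive table name bse30h is hypothetical, following the naming pattern used in Listing 9.

// A hedged sketch: create the Hive table definition for bse30 without running
// an import, roughly equivalent to the create-hive-table command in Listing 8.
// CreateHiveTableTool is assumed to exist in com.cloudera.sqoop.tool for 1.4.3.
import com.cloudera.sqoop.SqoopOptions;
import com.cloudera.sqoop.tool.CreateHiveTableTool;

public class CreateHiveTableSketch {
    public static void main(String[] args) {
        SqoopOptions options = new SqoopOptions("jdbc:mysql://localhost:3306/hadoop", "bse30");
        options.setDriverClassName("com.mysql.jdbc.Driver");
        options.setUsername("root");
        options.setPassword("root");
        options.setHiveHome("/home/surajit/hadoop/hive-0.10.0");
        options.setHiveTableName("bse30h");   // hypothetical Hive table name

        int retVal = new CreateHiveTableTool().run(options);
        System.out.println(retVal == 0 ? "Hive table created" : "Hive table creation failed");
    }
}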

Listing 9. Various JDBC and SqoopOptions configuration parameters

/* CONSTANTS */
private static final String JOB_NAME = "Sqoop Hive Job";
private static final String MAPREDUCE_JOB = "Hive Map Reduce Job";
private static final String DBURL = "jdbc:mysql://localhost:3306/hadoop";
private static final String DRIVER = "com.mysql.jdbc.Driver";
private static final String USERNAME = "root";
private static final String PASSWORD = "root";		
private static final String HADOOP_HOME = "/home/surajit/hadoop/hadoop-1.1.2";
private static final String JAR_OUTPUT_DIR = "/home/surajit/tmp/sqoop-surajit/compile";
private static final String HIVE_HOME = "/home/surajit/hadoop/hive-0.10.0";
private static final String HIVE_DIR = "/user/hive/warehouse/";
private static final String WAREHOUSE_DIR = "hdfs://localhost:9000/user/hive/warehouse";
private static final String SUCCESS = "SUCCESS !!!";
private static final String FAIL = "FAIL !!!";

/**
 * Imports data from RDBMS MySQL and uploads into the Hive environment
 * @param table RDBMS table name
 */
public void importToHive(String table){
	System.out.println("SqoopOptions loading .....");

	/* MySQL connection parameters */
	SqoopOptions options = new SqoopOptions();
	options.setConnectString(DBURL);
	options.setTableName(table);
	options.setDriverClassName(DRIVER);
	options.setUsername(USERNAME);
	options.setPassword(PASSWORD);
	options.setHadoopMapRedHome(HADOOP_HOME);

	/* Hive connection parameters */
	options.setHiveHome(HIVE_HOME);
	options.setHiveImport(true);
	options.setHiveTableName("bsefmcgh");
	options.setOverwriteHiveTable(true);
	options.setFailIfHiveTableExists(false);
	options.setFieldsTerminatedBy(',');
	options.setDirectMode(true);
	options.setNumMappers(1); // No. of mappers to be launched for the job
	options.setWarehouseDir(WAREHOUSE_DIR);
	options.setJobName(JOB_NAME);
	options.setMapreduceJobName(MAPREDUCE_JOB);
	options.setJarOutputDir(JAR_OUTPUT_DIR);

	System.out.println("Import Tool running ....");
	ImportTool it = new ImportTool();
	int retVal = it.run(options);
	if(retVal == 0){
		log.info(SUCCESS);
	}
	else{
		log.info(FAIL);
	}
}

Exporting data from Hive


Listing 10. Sqoop command-line script to export data from the Hive warehouse

$ bin/sqoop export --connect jdbc:mysql://localhost:3306/hadoop 
  --driver com.mysql.jdbc.Driver --username root --password root --table bse30 
  --export-dir /user/hive/warehouse/bse30

$ bin/sqoop export --connect jdbc:mysql://localhost:3306/hadoop 
  --driver com.mysql.jdbc.Driver --username root --password root --table bsefmcg 
  --export-dir /user/hive/warehouse/bsefmcg


Listing 11. Various JDBC and SqoopOptions configuration parameters

/* CONSTANTS */
private static final String JOB_NAME = "Sqoop Hive Job";
private static final String MAPREDUCE_JOB = "Hive Map Reduce Job";
private static final String DBURL = "jdbc:mysql://localhost:3306/hadoop";
private static final String DRIVER = "com.mysql.jdbc.Driver";
private static final String USERNAME = "root";
private static final String PASSWORD = "root";		
private static final String HADOOP_HOME = "/home/surajit/hadoop/hadoop-1.1.2";
private static final String JAR_OUTPUT_DIR = "/home/surajit/tmp/sqoop-surajit/compile";
private static final String HIVE_HOME = "/home/surajit/hadoop/hive-0.10.0";
private static final String HIVE_DIR = "/user/hive/warehouse/";
private static final String WAREHOUSE_DIR = "hdfs://localhost:9000/user/hive/warehouse";
private static final String SUCCESS = "SUCCESS !!!";
private static final String FAIL = "FAIL !!!";
	
/**
 * Exports data from the Hive file system to the RDBMS / MySQL schema
 * @param table RDBMS table name
 */
public void exportFromHive(String table){		
	log.info("Exporting data from HIVE.....");
	/* RDBMS table columns in order */
	String outCols[] = {"month", "index_name", "open", "high", "low", "close"};
	/* MySQL connection parameters */
	SqoopOptions options = new SqoopOptions(DBURL,table);
	options.setDriverClassName(DRIVER);
	options.setUsername(USERNAME);
	options.setPassword(PASSWORD);

	/* Hive connection parameters */
	options.setFieldsTerminatedBy(',');
	options.setHadoopMapRedHome(HADOOP_HOME);
	options.setJobName(JOB_NAME);
	options.setLinesTerminatedBy('\n');
	options.setMapreduceJobName(MAPREDUCE_JOB);
	options.setTableName(table);
	options.setJarOutputDir(JAR_OUTPUT_DIR);
	options.setClearStagingTable(true);
	options.setExportDir(HIVE_DIR + table);
	options.setDbOutputColumns(outCols);
	options.setUpdateMode(SqoopOptions.UpdateMode.AllowInsert);
	
	ExportTool it = new ExportTool();
	int retVal = it.run(options);
	if(retVal == 0){
		log.info(SUCCESS);
	}
	else{
		log.info(FAIL);
	}
}

Importing data into HBase

When --hbase-table is specified on the Sqoop command line, Sqoop imports the data into an HBase table instead of a directory in HDFS. Each row of the input table is transformed into a row in the HBase table through an HBase put operation. By default, Sqoop uses the split-by column as the row key column; if that is not specified, it falls back to the primary key column of the source table. The row key column can also be set explicitly with --hbase-row-key. Each output column is placed in the same column family, which is specified by --column-family. If the target table and column family do not exist, the Sqoop job exits with an error; this limitation can be overcome with the --hbase-create-table argument, which creates them. Additionally, the following optional arguments can also be used:

  • --bindir <dir>: Output directory for compiled objects
  • --class-name <name>: Sets the generated class name; overrides --package-name and when combined with --jar-file, sets the input class
  • --jar-file <file>: Disable code generation; use specified JAR
  • --outdir <dir>: Output directory for generated code
  • --package-name <name>: Put auto-generated classes in this package
  • --map-column-java <m>: Override default mapping from SQL type to Java type for configured columns

Listing 12. Sqoop command-line script for importing an entire table from RDBMS into HBase

$ bin/sqoop import --connect jdbc:mysql://localhost:3306/hadoop 
  --driver com.mysql.jdbc.Driver --username root --P --table bse30 --hbase-row-key date 
  --hbase-create-table --column-family firstdecade --hbase-table bse30

$ bin/sqoop import --connect jdbc:mysql://localhost:3306/hadoop 
  --driver com.mysql.jdbc.Driver --username root --P --table bsefmcg --hbase-row-key date 
  --hbase-create-table --column-family firstdecade --hbase-table bsefmcg


Listing 13. Various JDBC and SqoopOptions configuration parameters

/* CONSTANTS */
private static final String JOB_NAME = "Sqoop HBase Job";
private static final String MAPREDUCE_JOB = "HBase Map Reduce Job";
private static final String DBURL = "jdbc:mysql://localhost:3306/hadoop";
private static final String DRIVER = "com.mysql.jdbc.Driver";
private static final String USERNAME = "root";
private static final String PASSWORD = "root";		
private static final String HADOOP_HOME = "/home/surajit/hadoop/hadoop-1.1.2";
private static final String JAR_OUTPUT_DIR = "/home/surajit/tmp/sqoop-surajit/compile";
private static final String SUCCESS = "SUCCESS !!!";
private static final String FAIL = "FAIL !!!";


/**
 * Imports data into HBase table
 * @param table
 */
public void importToHbase(String table){
    int retVal = 0;
    SqoopOptions options = new SqoopOptions(DBURL,table);
    options.setDriverClassName(DRIVER);
    options.setUsername(USERNAME);
    options.setPassword(PASSWORD);
    options.setHadoopMapRedHome(HADOOP_HOME);
    options.setHBaseTable(table);
    options.setHBaseRowKeyColumn("date");
    options.setHBaseColFamily("firstdecade");
    options.setJobName(JOB_NAME);
    options.setMapreduceJobName(MAPREDUCE_JOB);
    options.setTableName(table);
    options.setJarOutputDir(JAR_OUTPUT_DIR);
    options.setCreateHBaseTable(true);
    options.setDirectMode(true);

    ImportTool it = new ImportTool();
    retVal = it.run(options);
    if(retVal == 0){
        log.info(SUCCESS);
    }
    else{
        log.info(FAIL);
    }
}
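
Once the import in Listing 12 or 13 has run, each RDBMS row is stored as an HBase row keyed by the date column, with the remaining columns as cells in the firstdecade column family. The following is a hedged verification sketch using the HBase 0.94 client API (HTable, Get); it is not part of the Sqoop API and assumes the HBase configuration files are on the classpath.

// A hedged verification sketch (HBase 0.94 client API, not Sqoop): read back
// one imported row from the bsefmcg HBase table. Assumes hbase-site.xml is on
// the classpath so HBaseConfiguration.create() can find the cluster.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class VerifyHBaseImport {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "bsefmcg");
        try {
            // The row key is the value of the date column used with --hbase-row-key
            Get get = new Get(Bytes.toBytes("2012-12-03"));
            Result result = table.get(get);
            byte[] close = result.getValue(Bytes.toBytes("firstdecade"), Bytes.toBytes("close"));
            System.out.println("close on 2012-12-03 = "
                    + (close == null ? "<missing>" : Bytes.toString(close)));
        } finally {
            table.close();
        }
    }
}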

Exporting data from HBase

When the --hbase-table parameter is specified for an import, Sqoop writes the data into the HBase table instead of an HDFS directory. Sqoop serializes all values to HBase by converting each value to its string representation and inserting the UTF-8 bytes of that string into the target cell of the HBase table. When exporting data from HBase to HDFS, Sqoop creates SequenceFiles. One limitation of SequenceFiles is that there is no generic way to access the data they contain: access requires the writable class that was used to write the data, and Sqoop code-generates this class file. This introduces a serious problem: when Sqoop is upgraded to a newer version, the code-generated class file will not be able to access SequenceFiles created by the older version of Sqoop. Because of this limitation, using Sqoop to export HBase SequenceFiles is not recommended. There are other means of exporting HBase tables to an RDBMS, but they are beyond the scope of this article.


Conclusion

As the volume of transactional data grows, along with semi-structured and unstructured data, legacy systems fail to scale well enough to support the real-time analytics needed to turn that data into profitable business decisions. Although Apache Hadoop and related solutions can bridge the gap, these technologies have their own limitations and cannot process data directly from an RDBMS. Hence, data migration between heterogeneous platforms using tools such as Sqoop is essential to achieve the required scalability. With consistent effort from the open source developer community, Sqoop is evolving into Sqoop2, with significant improvements to various features. As these features mature, Sqoop will certainly continue to play a significant role among big data analytics solutions.



Download

Description: Sample code
Name: sqoop-artifacts.zip
Size: 100 KB
Download method: HTTP


About the author

Surajit Paul

Surajit Paul is a techno-functional consultant with a decade of experience in product development and consulting in various industry verticals, including supply-chain management, retail distribution, and investment banking. He is a PMI Certified Project Management Professional, an APICS Certified Supply Chain Professional, and an ASQ Certified Six Sigma Green Belt.
