Extract data from Excel sources in IBM InfoSphere Information Server using DataStage Java Integration Stage and Java Pack

A step-by-step guide

Explore the functions of the Java™ Integration Stage (DataStage Connector) introduced in IBM® InfoSphere® Information Server version 9.1. This article addresses the Excel data source connectivity problem in older releases of IBM InfoSphere Information Server (7.5.x, 8.0.x, 8.1.x, 8.5.x, and 8.7.x) using Java Pack Plug-ins and the Java Pack API. The older releases of Information Server do not have a dedicated component for Excel connectivity. Third-party ODBC to ODBC bridges were used as an alternative, but they are licensed. The Java Pack Stages coupled with the Java Pack API and Apache POI API can be used to fetch Excel data into DataStage in a cost-effective way.

Santhosh Kumar Kotte (sankotte@in.ibm.com), Staff Software Engineer, IBM  

Photo of Santhosh KotteSanthosh is a staff software engineer working for IBM InfoSphere Information Server Group at the India Software Lab, Hyderabad. He joined IBM in 2007 as an Associate Software Engineer. He works on the development & maintenance of Information Server DataStage ASB Packs (XML Pack 2.0, Java Pack 2.0, and Web Services Pack 2.0) and InfoSphere FastTrack. His primary interests are working with emerging technologies like XML, DB2, pureXML, SOA and web services, and exploring various information integration technologies. His programming interests include C, C++, Java, and XML.



23 May 2013

Also available in Chinese

Introduction

IBM InfoSphere Information server consists of a suite of data integration products that help organizations to derive business value from the information that is spread across their many source systems. It helps to profile, cleanse, and integrate information from multiple heterogeneous sources in a cost-effective way. With IBM InfoSphere Information Server, organizations can improve their operational efficiency at a lower cost and lower risk.

IBM Information Server DataStage is an ETL product that helps organizations to extract, transform, and load data from heterogeneous data sources. With its high performance parallel framework, DataStage solves complex business problems by integrating with various data sources.

DataStage Java Integration Stage and DataStage Java Pack plug-ins

The Java Integration Stage is a DataStage Connector through which you can call a custom Java application from InfoSphere Data Stage and Quality Stage parallel jobs. The Java Integration Stage is available from IBM InfoSphere Information Server version 9.1 and higher. The Java Integration Stage can be used in the following topologies: as a source, as a target, as a transformer, and as a lookup stage. For more information on the Java Integration Stage, see Resources.

The DataStage Java Pack is a collection of two plug-in stages, Java Transformer and Java Client, through which you can call Java applications from DataStage. The Java Pack is available from DataStage version 7.5.x and higher.

The Java Transformer stage is an active stage that can be used to call a Java application that reads incoming data, transforms it, and writes it to an output link defined in a DataStage job. The Java Client stage is a passive stage that can be used as a source, as a target, and as a lookup stage. When used as a Source, the Java Client will be producing data. When used as a target, the Java Client Stage will be consuming data. When used as a lookup, the Java Client Stage will perform lookup functions.

For more information on the Java Pack Stages, see Resources.

Depending on the Information Server version deployed, you can either use the Java Integration Stage, the Unstructured Data Stage, or any Java Pack Stages to extract Excel data. If you are using Information Server version 9.1, you can use either the Unstructured Data Stage or the Java Integration Stage to extract Excel data. If you are using a version prior to 9.1 (such as 8.7.x, 8.5.x, 8.1.x, 8.0.1, and 7.5.x), you must use the Java Pack Stages (Java Transformer Stage or Java Client Stage) to extract the data from Excel documents. This article discuss how Excel data can be extracted using both the Java Integration Stage and the Java Pack Client Stage.


DataStage Java Integration Stage and Java Client Stage GUI properties

Java Integration Stage GUI

The Java Integration Stage and Java Client Stage are available under Real time stages in the DataStage Designer Palette. To use the Java Integration Stage and Java Pack Stages, launch the DataStage Designer, create a new job, expand the Real Time Stages category in the Palette, and then drag and drop the stages.

Figure 1. Java Integration Stage and Java Pack Stages in the Palette
The Pallete window with a square drawn around Java Client and Jsva transformer.

Depending on the context of usage (source, target, and lookup), the Java Integration stage supports different properties. The stage properties must be configured to define Java virtual machine configurations and your Java class that is to be invoked from the Java Integration stage. The input link and output link properties must be configured based on the job design.

Figure 2. Stage Properties
The properties tab on the Java Integration window
  • Stage name contains the user-defined Stage name.
  • Description contains the description of the Stage.
  • Java contains the JVM properties.
  • Classpath contains the Java classpath. Use a semicolon ";" as the path separator.
  • Heap size (MB) contains the heap sizes (in MB). This corresponds to the -Xmx command-line option, with a minimum allowed value of 128. The default value is 256.
  • Optional JVM options contains the command-line options for the Java Virtual Machine.
  • User class contains the Java class name to be run from the Stage.
  • Use user-defined function determines whether to use a user-defined function or an implementation of the Java Integration stage class.
  • User function contains the Java Class method to be invoked by the stage.
  • Custom properties contains the custom stage properties passed to the Java Class at runtime.
Figure 3. Output properties
Output tab on the Java integration window

Use the properties in the Properties tab to configure Java Integration Stage as a source. The Properties tab is available only for the output link and not for a reject link.

  • Custom properties specifies the custom stage properties passed to the user's class at runtime.
  • Is reject link (Compatibility) indicates whether the stream link is a reject link. This is used only when running Java Transformer code. The default value is No.
  • Column mapping specifies the mapping between a DataStage column and a bean property. This property is set by the Column Mapping Editor window.
  • JavaBeans class specifies the JavaBeans class associated with the link. This property is set by the Column Mapping Editor window.

Java Pack Client Stage GUI

Figure 4. Java Client Stage properties
The Stage properties on the JClientPX stage window
  • Transformer Class Name contains the name of the class to be run by the stage.
  • User's Classpath contains the Java classpath with a semicolon ";" as the path separator.
  • Description contains a description of the stage.
Figure 5. Java Client Stage User Properties
The properties tab on the JClientPX stage window
  • User's Properties Contains the user properties passed to the Java class at runtime.
Figure 6. Java Client Stage Options
The options tab on the JClientPX stage window

You can specify different JVM options in the Option field.


Java Integration Stage and Java Pack API

Java Integration Stage and Java Pack Stages allow external Java applications to access DataStage and Quality Stage by exposing their API. The external Java applications must be developed using the Java Integration API to get access to DataStage and Quality Stage.

  • Java Integration Stage API: A subclass of com.ibm.is.cc.javastage.api.Processor. This subclass is available as part of the ccjava-api.jar file.
  • Java Pack API: A subclass of com.ascentialsoftware.jds.Stage. This subclass is available as part of the tr4j.jar file.

In this article, the tr4j.jar API is used to develop the Java applications. The Java application developed using the tr4j.jar API can be invoked either from the Java Integration Stage or the Java Client Stage.

Developing a Java application using the Java Pack API

Your Java program must implement a subclass of the Stage class, which is provided as part of the tr4j.jar file. The Stage class consists of methods for manipulating rows and querying metadata. The Stage class is the root of all classes that implement a Java Client or Java Transformer stage in your jobs.

  • Override the Stage.initialize() method to define initialization logic.
  • Override the Stage.process() method to define processing logic.
  • Override the Stage.terminate() method to define termination logic.

The intialize() and terminate() methods are optional. Listing 1 shows the skeleton of a simple Java class developed using the Java Pack API. One input row is read and processed and one output row is generated in the process() method.

Listing 1. Skeleton of the Java Application that uses the Java Integration API
public class Mytransformer extends Stage 
{ 
	public void initialize() 
	{ 
	// ...initialization logic 
	} 
	public void terminate() 
	{ 
	// ...termination logic 
	} 
				
	//The core processing logic of the Java application.
	public int process() 
	{	 
	// ...process input row... 
	Row inputRow = readRow(); 
	
	// ...fill output row... 
	Row outputRow = createOutputRow(); 
	writeRow(outputRow); 
	return OUTPUT_STATUS_READY; 
	} 
}

Apache POI API

Apache POI is a project run by the Apache Software Foundation that is used to develop Java applications for reading and writing files in Microsoft™ Office formats such as Word, Power Point, and Excel. Java ports of file formats based on Office Open XML (ooxml) can also be developed using the POI API. The Apache POI distribution provides support for various document file formats as part of several jar files. POI-HSSF and POI-XSSF are the Java APIs to access Microsoft Excel documents. HSSF is the POI Project's pure Java implementation of the Excel (97-2007) file format, and XSSF is the pure Java implementation of the Excel 2007 OOXML (.xlsx) file format. HSSF and XSSF provide ways to create, modify, read, and write Excel spreadsheets. In short, to read .xls files you need the HSSF API, and to read .xlsx files you need the XSSF API. In this article, you will be working with the HSSF API to extract data from a .xls document. For more details on Apache POI, refer to the official Apache website (see Resources).

Reading Excel documents using HSSF API

To read in a file, create a new instance of org.apache.poi.poifs.Filesystem, passing in an open InputStream, such as a FileInputStream for your XLS, to the constructor. Construct a new instance of org.apache.poi.hssf.usermodel.HSSFWorkbook, passing the Filesystem instance to the constructor. From there, you have access to all of the high-level model objects through their accessor methods (workbook.getSheet(sheetNum), sheet.getRow(rownum), and so on).

The following packages need to be imported to work with the Excel document:

  • org.apache.poi.hssf.usermodel.HSSFWorkbook is a high-level representation of a workbook object. This is the primary object needed for performing read and write operations on a workbook. This is the top-level object needed for creating sheet objects.
  • org.apache.poi.hssf.usermodel.HSSFSheet is a high-level representation of the worksheet object of a workbook.
  • org.apache.poi.hssf.usermodel.HSSFRow is a high-level representation of the row object of a worksheet.
  • org.apache.poi.hssf.usermodel.HSSFCell is a high-level representation of a cell object in a row of a spreadsheet. Cell type specifies whether the cell content is numeric, string, formulas, and so on. Conversion logic of cell content is left to the user applications.

Excel application using the Java Integration API and Apache POI HSSF API

To read Excel data in DataStage, a Java application must be developed using the combination of the DataStage Java Integration API and the Apache POI HSSF API. The Java Integration Stage API acts as an integration point between the external Java Application and DataStage. The Apache POI HSSF API acts as an integration point between the Excel data source and the Java Application. A brief picture of what code goes into the Java Application methods is:

The initialize() method

This is a standard Java Pack API method. All the initialization logic goes here. The following objects are initialized here.

  • The FileInputStream object to read the input Excel file.
  • The HSSFWorkbook object, which holds the Excel workbook object.
  • The HSSFSheet object, which contains the sheet object of the workbook.
  • The HSSFFormulaEvaluator object needed to work with cell formulas, if any. This example does not work with these formulas.
Listing 2. The intialize() method
public void initialize()
{
	//Java Pack API Logging method to log messages to Director Log.
	info("*****Initializing Excel Application*****"); 
		
	//Code block to initalize and read user property values provided in the Stage GUI.
	byte[] userProperties 		= getUserProperties().getBytes();
	InputStream propertyStream 	= new ByteArrayInputStream(userProperties);
	Properties properties 		= new Properties();
	
	try
	{
		properties.load(propertyStream);  //User properties loaded.
	}
	
	//Log a message if user properties cannot be loaded.
	catch(IOException e)
	{
		info("*****Could not load user properties");
	} 	
	
	//Read the value of the user property named ExcelFileName
	String propertyValue 	= properties.getProperty("ExcelFileName" ); 		
	
	//Code block to create the objects of 
	//the File input stream,workbook and sheet.
	try 
	{
		FileInputStream fis		=new FileInputStream(propertyValue);
		workbook			=new HSSFWorkbook(fis);
		formulaEval			=new HSSFFormulaEvaluator(workbook);
	
	//Get the first sheet object
	sheet					=workbook.getSheetAt(0);
	
	//Create a row iterator for the sheet  
		rows				=sheet.rowIterator(); 
	}	 
	catch (FileNotFoundException e) 
	{
		//Log a message if the input excel file is not found
		info("The input Excel file is not found"); 
	}
	catch(IOException ioExp)
	{
		//Log the IO Exception
		info("*****IO Exception*****"); 
	}	
	catch(org.apache.poi.hssf.OldExcelFormatException UnsupportedExcelFormatException)
	{
		//Log a message if the Excel format is not supported.
		info("*****Cannot read old Excel format*****"); 
	}		
}

The process() method

This is a standard Java Pack API method. The core processing logic goes here, and the logic to extract data from the Excel document is defined here. This method first checks if there are any rows in the sheet. If yes, it creates an output row object using the createOutputRow() method. Each row is checked for the presence of cells, and a cell iterator is used to iterate through the cells. The cell value is extracted by the user-defined method extractCellValue(). The extracted cell value is assigned to the output row object using the outputRow.setValueAsString() method. Finally, the output row is written to the DataStage job output link using the writeRow() method.

Listing 3. The process() method
public int process()
{ 
	//If rows exist
	if(rows.hasNext())								
	{
		int colCount		= 0;
			
		//Create an output row object to hold the row data
		Row outputRow 		= createOutputRow(); 		
			
		//Create an Excel row object
		HSSFRow hrow 		= (HSSFRow) rows.next();	
			
		//Create a cell iterator for the row
		Iterator<Cell> cells 	= hrow.cellIterator();	
			
		//If cells exist
		while (cells.hasNext())						
		{
			HSSFCell hcell 	= (HSSFCell) cells.next();
				
			//Extract cell value
			String cellData = extractCellValue(hcell,hcell.getCellType());	
				
			//Assign the row data to the outout row object
			outputRow.setValueAsString(colCount,cellData);		
			colCount++;
		}
		info("*****Writing Excel data to Target*****");			
		//Write the row to the output link in the job
		writeRow(outputRow);
		return OUTPUT_STATUS_READY;				
	}
		return OUTPUT_STATUS_END_OF_DATA;
}

The extractCellValue() method

This is a user-defined method that is called by the process() method. An Excel file can contain different data type values like strings, booleans, and integers. Therefore, the cell values must be extracted based on their data types. The extractCellValue() method returns the cell contents based on their type. It inspects each cell for its type and returns the content to the process() method, which actually writes the data to the output row.

Listing 4. The extractCellValue() method
protected String extractCellValue(HSSFCell cell, int hcellType)
{
	String cell_value;
	formatter = new DataFormatter();
	switch (hcellType) 					
	{
		//If Cell is blank
		case HSSFCell.CELL_TYPE_BLANK:	
		cell_value = "";
		break;
			
		//If Cell value is boolean
		case HSSFCell.CELL_TYPE_BOOLEAN:
		cell_value = "" + cell.getBooleanCellValue();
		break;
			
		//If Cell value is string
		case HSSFCell.CELL_TYPE_STRING:
		cell_value = cell.getRichStringCellValue().toString();
		break;
			
		//Invalid cell
		case HSSFCell.CELL_TYPE_ERROR:		
		cell_value = "ERROR";
		break;
			
		//If Cell value is numeric
		case HSSFCell.CELL_TYPE_NUMERIC:
		cell_value = formatter.formatCellValue(cell, formulaEval);
		break;
			
		//Default Cell value
		default:							
		cell_value = "DEFAULT_VALUE";
		break;
		}
	return cell_value;
}

The terminate() method

This is a standard Java Pack API method. This method is called by the Java Integration Stage when the core processing is complete.

Listing 5. The terminate() method
public void terminate()
{
	// Log information message while terminating the Application.	
	info("*****Terminating Excel Application*****");
}

Note: This example has worked only with simple data types. You can always extend the functions using the Apache POI API to work with complex data types and to even work with formulas. The example also assumes that there is only one sheet in the Excel document, but in reality, there will be many sheets. Apache POI API provides functions to work with multiple sheets. Explore the Apache POI API to learn more about the supported functions.


Invoking the Excel application from DataStage

After the Excel Java application is developed, it must be compiled by keeping the DataStage tr4j.jar and the Apache poiXX.jar files in the Java classpath. The resulting Java class file must be used in the DataStage job to extract the Excel data and transform it into other formats (such as, relational data).

The DataStage job design

Drag and drop either the Java Integration Stage or the Java Client Stage from the Real Time Packs palette and create an output link to the target Stage. In this example, the target stage is a Sequential File Stage. The job design looks like Figure 7 and Figure 8.

Figure 7. The job design using Java Pack Client Stage
The Job design using Java Pack Client Stage
Figure 8. The job design using Java Integration Stage
The Job design using Java Integration Stage

Note that the target Stage could be anything as per your requirement. For example, if you want to write the Excel data to a database, you can use Orace EE Stage, DB2 Connector Stage, DB2 Plug-in Stage, and so on, as the target.

The Java Intgeration Stage properties

In the Classpath field, mention the absolute path of the directory name that contains the .class file. In this example, the ReadExcelData.class is located in the c:\santhosh directory.

In the Optional JVM options field, mention the POI jar file location in the classpath option. This example's jar file is located in the c:\santhosh directory. You are advised to download the Apache POI jar file from the Apache download site.

In the Users Class field, mention the name of the Java class that extracts the Excel data. The name of this example's Java class is ReadExcelData.

In the Custom Properties field, mention the absolute path of the input Excel file. This example's input Excel file named EmployeeData.xls is present in the c:\santhosh directory.

Note: The Java application expects the path names to contain double slashes. Therefore, provide the path name carefully by using double slashes.

Figure 9. The Stage properties
The Stage Properties

The Java Integration Stage Output column definitions

The sample Excel file has employee data such as name, age, salary, division, and location. Because you must extract and write this data to the target stage, five columns must be defined in the Output tab. I have named the five columns as EmpName, EmpAge, EmpDivision, EmpSalary, and EmpLocation. The columns definitions look like Figure 10.

Figure 10. The output column definitions
The output column definitions

When the output columns are defined, they automatically get propagated into the target Sequential File Stage column properties. In the Sequential File Stage input properties, you just need to mention the name of the output file that will contain the extracted Excel data as relational data. The name of my output file is ExtractedData.txt.

Figure 11. The Sequential File Stage properties
The Sequential File Stage properties

If the Java Client Stage is used as the source instead of the Java Integration Stage, fill out the Stage properties accordingly. Figure 12 and Figure 13 show the Java Client Stage properties.

Figure 12. Java Client Stage properties
Java Client Stage Properties
Figure 13. Java Client Stage user properties
Java Client Stage User Properties

Compiling and executing the DataStage job

Save the job after configuring all the job properties, then press F7 to compile the job. After the job compiles successfully, press Ctrl+F5 to execute the job. When the job is executed, the Java Integration Stage calls the ReadExcelData class, extracts the Excel data from the Excel file, and writes the extracted data to the EmployeeData.txt file. You can see that five rows are read from the Excel file and written to the target Sequential File Stage.

Figure 14. The Java Integration Stage job execution
The Java Integration Stage job execution
Figure 15. The Java Client Stage job execution
Web Service Client SSL Options
Figure 16. The extracted Excel data
The extracted Excel data

Note: If the number of columns in the input Excel file does not match the number of columns defined in the output link, a java.lang.IndexOutOfBoundsException is generated in the Director Log. Therefore, you should ensure that the number of columns in the Excel file and the DataStage job output link match. To make the Java Application more effective, you can add robust exception handling mechanism to your code.


Conclusion

In this article, you received an introduction to IBM InfoSphere Information Server and DataStage as well as the Java Integration Stage, Java Pack, and the Java Pack API. You learned about the Excel processing capabilities of the Apache POI and how you can develop a Java application to extract Excel data by using the Java Pack API and the Apache POI API. Additionally, the article described how a DataStage job can be designed using the Java Integration Stage and the Java Pack Client Stage to call the Java application that extracts the Excel data. The solution described in this article can be effectively used in Information Server releases that do not have a dedicated component for reading Excel files.

The Apache POI API and Java Pack API can be used in DataStage Java Pack jobs to create Excel documents from non-Excel formats as well. To work with such a scenario, follow the approach described in this article.


Download

DescriptionNameSize
Sample DataStage jobs for this articleSampleJobs.zip30KB

Resources

Learn

Get products and technologies

Discuss

Comments

developerWorks: Sign in

Required fields are indicated with an asterisk (*).


Need an IBM ID?
Forgot your IBM ID?


Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. Select information in your profile (name, country/region, and company) is displayed to the public and will accompany any content you post. You may update your IBM account at any time.

All information submitted is secure.

Choose your display name



The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

Required fields are indicated with an asterisk (*).

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.

 


All information submitted is secure.

Dig deeper into Information management on developerWorks


static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Information Management
ArticleID=930835
ArticleTitle=Extract data from Excel sources in IBM InfoSphere Information Server using DataStage Java Integration Stage and Java Pack
publish-date=05232013