Read, recycle, and reuse: Reporting made easy with Excel, XML, and Java technologies, Part 1

Read Excel files and write them to new files using Java and XML technologies

Extracting business data is a challenge every company faces. Discover some of the secrets to extracting data from Excel and converting it between Excel and XML using Java™ technology.

Shaene M Siders (consult@DragonUnderGlass.com), Writer and Consultant, Dragon Under Glass

Photo of Shaene SidersShaene Siders is a consultant and writer for technology, training, and entertainment. She started a writing and Web development company in 1996 and has been an IBM Rational Certified Instructor for ClearCase and ClearQuest since 2003. She is the author and producer of the training courseware series Java for Mad Scientists, and she is also an IBM Rational Certified Instructor for Build Forge, Rational Functional Tester, Rational Performance Tester, and Rational Manual Tester. You can reach Shaene at consult@DragonUnderGlass.com.



02 March 2010 (First published 16 February 2010)

Also available in Chinese Japanese Vietnamese Portuguese

02 Mar 2010 Added links to Part 2 in sections The sample application, Conclusion, and Resources

Introduction

Creative excuse number 432: Reporting is bad for the environment.

Charts and reports consume trees. Electronic versions use electricity and fossil fuels. Perhaps a company could "go green" and stop writing reports?

Developing reports for senior staff can make Java programmers anxious, especially when programmers prefer developer-friendly output, such as XML, but senior management only speaks the language of GUI-assisted spreadsheets.

Frequently used acronyms

  • API: Application Programming Interface
  • GUI: Graphical User Interface
  • IDE: Integrated Development Environment
  • PNG: Portable Network Graphics
  • SQL: Structured Query Language
  • XML: Extensible Markup Language

Anxiety may increase greenhouse gas "output" by the programmers involved, especially if they start hyperventilating. Although reporting takes time and resources, it is essential for providing project visibility to upper management. Management might mistrust hard-working teams simply because the upper levels can't see what their teams are doing.

So, reporting is vital for the work environment.

A few Java tricks can increase a programmer's "miles per gallon" when working with reports from spreadsheet lovers. After walking through the steps of this article, an intermediate-level Java programmer should understand the basic principles behind converting data programmatically between Microsoft® Excel® and XML.


Excel to Java APIs

Tools used in this article

To work through the examples in this article, you need the following tools:

  • The XML Object Model (XOM) API version 1.2.1
  • Apache POI version 3.6, the Java API for Microsoft Office documents
  • Eclipse Classic version 3.5.1 (an IDE with Java Platform, Standard Edition [Java SE] version 6 built in)
  • Microsoft Excel

For more information about and links to these products, see Resources.

Several Java APIs manipulate Excel files. Which one is the best? It depends on your individual needs and level of experience.

Andy Khan's Java Excel API

Andy Khan created an API called the Java Excel API, or JExcel API (see Resources). This open source API can read and write Excel spreadsheets. Also, because it's lightweight, it's a good choice for beginning Java developers. A handy Unified Modeling Language (UML) diagram makes it easier to use and so does the API's support community.

The API also has a few disadvantages. Although it reads Microsoft Excel 95, Excel 97, Excel 2000, Excel 2002, and Excel 2003 file formats, it does not currently work with any of the newer Excel formats. It cannot create charts, graphs, or macros, and it supports only PNG files for images.

The xlSQL Excel JDBC Driver

Another Excel-to-Java API is the xlSQL Excel Java Database Connectivity (JDBC) driver (see Resources), an open source API for querying Excel files as if they were databases. With it, developers treat Excel sheets like tables in a database. If you're familiar with SQL and JDBC, this might be the simplest method of retrieving data. You can also add data by using SQL insert commands. Unfortunately, the xlSQL Excel JDBC driver doesn't appear to be currently or actively supported.

OpenXLS

OpenXLS (see Resources) is the open source version of a commercial product by Extentech called ExtenXLS. OpenXLS has extensive capabilities. It can programmatically modify formulas and use numerous formatting options. Unlike some of the open source products, it can work with complex objects, including named ranges, pivot tables, split frames, and charts. Also, compared to similar products, OpenXLS has more thorough upfront documentation about which features are available. Unfortunately, although the open source version supports Excel formats from 97 through 2003, only the commercial version supports Excel 2007.

Apache POI

Apache POI is a set of Java APIs for working with both older and newer Microsoft standard documents. In addition to working with Excel versions from 97 onward, the Apache POI can work with Microsoft Word and PowerPoint® files. You can leverage your knowledge of working with Excel files to learn more quickly how to work with these additional types of files. There is also an active community of support for the API. However, because Apache POI has so much functionality and can work with other files, it might be more than a developer trying to work with only Excel files needs to learn.

This article uses the Apache POI because of its support community and rich functionality.


Java XML APIs

XML is a popular data format, and there are several ways to work with XML in Java technology. You can choose whichever XML API works best for your project. However, this article uses Elliotte Rusty Harold's elegant XML API called XOM. For more information about XML APIs, see Resources.


The sample application

The sample application for this article starts with an Excel spreadsheet file provided by the Human Resources department of the fictional Planet Power corporation. The spreadsheet is called Employee_List.xls.

This article demonstrates how to use Java technology and the Apache POI to read from Employee_List.xls. The Java class file for the demonstration is ExcelReader.java and is included inside an Eclipse project. Download the .zip file containing the sample spreadsheet and Eclipse project in Downloads. A Readme.txt file in the Eclipse project explains additional sample code included in the download.

Part 2 of this article series will demonstrate how to convert the information to XML and create a new spreadsheet with a few modifications to the original data.

Setting up

To prepare your computer to run the samples in this article, complete the following steps:

  1. Download the Excel spreadsheet and sample code.
  2. Create the directory C:\Planet Power, and extract the files to it.
  3. Download Eclipse using the link in Resources. After downloading Eclipse, extract it to the C:\Program Files\Eclipse directory.
  4. From the link in Resources, download XOM using the Complete zip link on the XOM site. Then, extract the files to the C:\Program Files\Eclipse\lib directory.
  5. Download Apache POI using the link in Resources. Extract the files to C:\Program Files\Eclipse\lib. (You will need to create the lib directory.)

Now you're ready to start working in Eclipse.

Start Eclipse

To begin working in the Eclipse IDE, complete these steps:

  1. Start Eclipse by navigating to C:\Program Files\Eclipse\eclipse and double-clicking eclipse.exe. If Windows® displays a security warning, click Run.
  2. In the Workspace Launcher window, replace the path labeled Workspace with C:\Eclipse_Projects, and then click OK.
  3. When Eclipse finishes loading, click the Workbench icon on the right side of the window (see Figure 1).
    Figure 1. Workbench icon
    Screen capture of Workbench icon
  4. Right-click in the Package Explorer pane, and then click Import.
  5. Expand General, and select Existing Projects into Workspace. Click Next (see Figure 2).
    Figure 2. Bring an existing project into the workspace
    Bring an existing project into the workspace, screen capture of the Importdialog, Select option
  6. Click Browse (located beside Select root directory), and navigate to C:\Planet Power\Employees. Select the Employees folder, and then click OK.
  7. Click Finish, as in Figure 3.
    Figure 3. Import a project into Eclipse
    Import a project into Eclipse, screen capture of the Import dialog, Import Projects option

The Employees folder should now appear in the Package Explorer pane.

Making XOM and Apache POI available to Eclipse

Technically, the steps in this section have already been performed in the imported Employees Eclipse project. However, in case you start your own project from scratch, you will need to tell the Eclipse project to use the new XOM and Apache POI downloads. Complete these steps:

  1. Right-click the Employees folder in the Package Explorer, and then click Properties.
  2. Click Java Build Path in the left pane.
  3. Click the Libraries tab.
  4. Click Add External JARs, as in Figure 4.
    Figure 4. Add external JAR files to the build path
    Add the build path, screen capture of the Properties for Employees dialog
  5. Select the Java archive (JAR) file containing the parts of Apache POI you will use in this sample. (If you are using the same version of POI as this article, the path will be C:\Program Files\Eclipse\lib\poi-3.6\poi-3.6-20091214.jar). Click Open.
  6. Click Add External JARs again.
  7. Select the JAR file containing XOM (If you are using the same version of XOM as this article, the path will be C:\Program Files\Eclipse\lib\XOM\xom-1.2.1.jar). Click Open.
  8. Click OK.

Using the ExcelReader.java file

For this article, use the file ExcelReader.java, which resides in the Employees project folder beneath src\(default package). Figure 5 shows the file.

Figure 5. Opening the Employees project
Screen capture of the ExcelReader.java file open in Eclipse SDK

To run the file, click the Run arrow button at the top of the screen, as in Figure 6.

Figure 6. Running a Java file
Screen capture of showing Run arrow in the Eclipse SDK tool bar

Running ExcelReader.java reads information from the cells in the Employee_List.xls spreadsheet and displays it using Eclipse's Console tab, as in Figure 7.

Figure 7. The Java output in Eclipse's Console
Screen capture of the Console tab when running ExcelReader.java in Eclipse SDK

Getting started

The key to understanding Java technology is being familiar with the idea of working with objects and instantiating (that is, creating) those objects. The standard format for creating objects to work with is:

class objectName = new class();

The objectName is a name for the newly created object. It is like a variable that identifies and provides a way to work with that specific object. Also, information, often in the form of other existing objects, might be inside the parentheses (()) after class. The information inside the parentheses is used for creating the new object.

Working with files

Whenever you work with files in a Java environment, you might experience problems with the file. For example, it could be missing. So, trying to read a file could cause errors. Catch any exceptions that could be caused by manipulating files.

To work with Excel files, this article uses the FileInputStream class (java.io.FileInputStream). FileInputStream represents a file that might not be made of regular text. Because Excel files contain binary data, use FileInputStream instead of the FileReader class, which reads files containing only text characters.

Start programming

The first step in reading an Excel workbook is to prepare to use the Apache POI and other necessary classes. The classes required in ExcelReader.java include some Apache POI classes, some exception (error) classes, and some file handling classes. Listing 1 shows the code at the top of ExcelReader.java that imports these classes to make them available for use.

Listing 1. Importing classes (ExcelReader.java)
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.poi.hssf.usermodel.HSSFCell;
import org.apache.poi.hssf.usermodel.HSSFRow;
import org.apache.poi.hssf.usermodel.HSSFSheet;
import org.apache.poi.hssf.usermodel.HSSFWorkbook;

After importing the relevant classes, you can begin to program inside the body of the main method using the Apache POI.


What does HSSF mean in Apache POI?

Some Classes for Reading Excel Spreadsheets

Here are a few classes used for reading Excel spreadsheets:

  • java.io.FileInputStream
  • java.io.IOException
  • org.apache.poi.hssf.usermodel.HSSFWorkbook
  • org.apache.poi.hssf.usermodel.HSSFSheet
  • org.apache.poi.hssf.usermodel.HSSFRow
  • org.apache.poi.hssf.usermodel.HSSFCell

The Apache POI API programmers chose an unusual naming convention for their classes involving Excel workbooks; they used the prefix HSSF. According to its JavaDocs, the prefix actually stands for Horrible Spread Sheet Format. In fact, according to Wikipedia, POI started out meaning Poor Obfuscation Implementation. Who says programmers don't know how to have fun?

The HSSF classes are for working with Excel versions before 2007 (that is, .xls files). A different set of classes—XSSF—is for Excel 2007 and later (that is, .xlsx files). Yet another set of classes—the SS classes—works with both versions. For the sake of simplicity, this article uses HSSF classes. Code samples using the other classes are identified in the Readme.txt file in the Eclipse project you downloaded.


Workbooks

The HSSF class that represents an Excel workbook in the Apache POI is org.apache.poi.hssf.usermodel.HSSFWorkbook. Pass HSSFWorkbook a FileInputStream into its constructor, and you can expect it to represent the file upon which the FileInputStream is based.

But wait! In the JavaDocs for the Apache POI, there is no constructor for HSSFWorkbook that states it can take a FileInputStream. Is using a FileInputStream an undocumented feature? No. There is a constructor that accepts an InputStream.

Because FileInputStream is a subclass of InputStream, technically it is also an InputStream, so it can be passed into the constructor. In fact, InputStream is abstract, so some sort of subclass is necessary. FileInputStream will do nicely.

When instantiating the FileInputStream, pass it a String describing the path to the Excel file to read. For Windows® files, escape any special characters in the file path, especially the directory-separator backslash (\). Use double backslashes (\\) to create one escaped backslash in the file path string.

The code in Listing 2 instantiates a new FileInputStream and then instantiates a new HSSFworkbook based on that FileInputStream.

Listing 2. Reading an Excel file (ExcelReader.java)
public static void main(String[] args) {
 // Create a FileInputStream to hold the file.
 // Use double back-slashes to create one "escaped" slash.
 // Use error handling (try/catch) around its creation in case
 // the file to read does not exist.
 // Be sure to import java.io.FileNotFoundException and java.io.IOException, or use 
 // the superclass IOException to handle both. 

 try {
    FileInputStream excelFIS = new FileInputStream("C:\\Planet Power\\Employee_List.xls");

    // Create an Excel Workbook Object using the FileInputStream created above
    // (which contains the file).
    // Use error handling around its creation in case of Input/Output Exception

    HSSFWorkbook excelWB = new HSSFWorkbook(excelFIS);

 }
 catch (IOException e) {
    System.out.println("Input/Output Exception!");
 }

//End Main Method
}

Now, staying within the try error handling statement, proceed to gather information from the Excel workbook, starting with its sheets.


Sheets and rows

A workbook can have several layers of pages called sheets. A sheet object is represented by the HSSFSheet class (org.apache.poi.hssf.usermodel.HSSFSheet).

How many sheets does a workbook have? To find out, you can use the method getNumberOfSheets() on the workbook. For this exercise, however, there's only one sheet, so using its number is simpler. The number of the first sheet is zero (computers like to count starting with zero instead of one). The code will look like Listing 3.

Listing 3. Get the sheet (ExcelReader.java)
      // Start by getting the Spreadsheet (Excel books can have several 
      // sheets). Assuming there is just one sheet, it's the zero sheet.

      HSSFSheet topSheet = excelWB.getSheetAt(0);

After obtaining the sheet object, move across the sheet and work with its data. Helpful methods and properties like these have names that indicate what they do:

  • HSSFSheet.getFirstRowNum() and getLastRowNum()
  • HSSFSheet.getHeader() and getFooter()
  • HSSFSheet.getRow()
  • HSSFSheet.getPhysicalNumberOfRows()

To work with data on the sheet, start by getting an HSSFRow (org.apache.poi.hssf.usermodel.HSSFRow) object, which represents a row in the sheet. One way to get a row is to use getRow() on the sheet and ask for the row by number, as in Listing 4.

Listing 4. Get the Row (ExcelReader.java)
      // getRow() returns an HSSFRow object, but the numbering
      // system is logical, not physical, and zero based.
      // for example, use getRow(2) to get the third row.

      HSSFRow thirdRow = topSheet.getRow(2);

Remember, topSheet is the sheet obtained earlier in Listing 3.

After obtaining the row from the sheet, use the row to drill down to the cell level.


Cells

To dig down to an individual cell's data, use the row to get an HSSFCell object (org.apache.poi.hssf.usermodel.HSSFCell) representing that cell. To get cell information that is in String format, use the getStringCellValue() method on the HSSFCell, as in Listing 5.

Listing 5. Get the cells and strings inside (ExcelReader.java)
      // Get the first two cells in the row
      HSSFCell lastnameCell = thirdRow.getCell(0);
      HSSFCell firstnameCell = thirdRow.getCell(1);

      // Get the string information in the cells
      String firstName = firstnameCell.getStringCellValue();
      String lastName = lastnameCell.getStringCellValue();

      // Print out the value of the cells
      System.out.println(firstName + " " + lastName);

To gather all of the information from the workbook, iterate through all of the sheets, each row in each sheet, and each cell in each row. But there is a catch: Try running the code below, and it works for some of the cells. However, it will quit with an error attempting to extract the cell value and print it out (see the comment in Listing 6). Why?

Listing 6. Loop through all cells and print out values. Broken!
// Traverse the sheets by looping through sheets, rows, and cells.
// Remember, excelWB is the workbook object obtained earlier.
// Outer Loop: Loop through each sheet

for (int sheetNumber = 0; sheetNumber < excelWB.getNumberOfSheets(); sheetNumber++) {
   HSSFSheet oneSheet = excelWB.getSheetAt(sheetNumber);

// Now get the number of rows in the sheet
   int rows = oneSheet.getPhysicalNumberOfRows();

   // Middle Loop: Loop through rows in the sheet

   for (int rowNumber = 0; rowNumber < rows; rowNumber++) {
      HSSFRow oneRow = oneSheet.getRow(rowNumber);

      // Skip empty (null) rows.
      if (oneRow == null) {
         continue;
      }

      // Get the number of cells in the row
      int cells = oneRow.getPhysicalNumberOfCells();

      // Inner Loop: Loop through each cell in the row

      for (int cellNumber = 0; cellNumber < cells; cellNumber++) {
         HSSFCell oneCell = oneRow.getCell(cellNumber);

         // Get the value of the string in the cell.
         // Print out the String value of the Cell 
         // This section will result in an error. Why?

         String cellValue = oneCell.getStringCellValue();
         System.out.println(cellValue + ", ");

      // End Inner Loop
      }
   // End Middle Loop
   }
// End Outer Loop
}

What goes wrong? The method getStringCellValue() only works for Strings. Thus the name.

Some of the cells contain numeric values. To avoid the error, test the cell's data type and use the appropriate method for getting that data type out of the cell. Use getCellType() to determine what type of data the cell contains. The type of data is returned as an integer that represents the data type. The following static fields (constants) represent the data types:

  • HSSFCELL.CELL_TYPE_STRING. Use getStringCellValue().
  • HSSFCELL.CELL_TYPE_FORMULA. Use getCellFormula().
  • HSSFCELL.CELL_TYPE_NUMERIC. Use getNumericCellValue().
  • HSSFCELL.CELL_TYPE_BOOLEAN. Use getBooleanCellValue().

The cell might also contain an Excel error. If so, getCellType() returns the integer HSSFCELL.CELL_TYPE_ERROR.

While iterating through the cells, test their data types, as in Listing 7.

Listing 7. Test for the cell value type (ExcelReader.java)
      // Inner Loop: Loop through each cell in the row

      for (int cellNumber = 0; cellNumber < cells; cellNumber++) {
         HSSFCell oneCell = oneRow.getCell(cellNumber);

         // Test the value of the cell.
         // Based on the value type, use the proper 
         // method for working with the value.

         // If the cell is blank, the cell object is null, so don't 
         // try to use it. It will cause errors.
         // Use continue to skip it and just keep going.

        if (oneCell == null) {
            continue;
         }

         switch (oneCell.getCellType()) {

         case HSSFCell.CELL_TYPE_STRING:
            System.out.println(oneCell.getStringCellValue());
            break;

         case HSSFCell.CELL_TYPE_FORMULA:
            System.out.println(oneCell.getCellFormula());
            break;

         case HSSFCell.CELL_TYPE_NUMERIC:
            System.out.println(oneCell.getNumericCellValue());
            break;

         case HSSFCell.CELL_TYPE_ERROR:
            System.out.println("Error!");
            break;
         }	

      // End Inner Loop
      }

When the code runs, notice that dates are displayed as numbers, not dates. That's because the format of the date is not stored in its value. It's a formatting choice for the cell. Part 2 of this article series will discuss how to preserve the format for dates.

One more important note: In the following lines of code from Listing 7, broken out in Listing 8 below, the code tests each cell to make sure it's not null.

Listing 8. Don't forget to test for null (ExcelReader.java)
// If the cell is blank, the cell object is null, so don't 
// try to use it. It will cause errors.
// Use continue to skip it and just keep going.

if (oneCell == null) {
   continue;
}

In a Java environment, if an object is null, trying to manipulate it causes an error. Before using objects like rows and cells, be sure to test them to make sure they are not null.


Conclusion

Armed with the basics of reading Excel spreadsheets, you can begin to convert Excel data into arrays, XML, or other formats to perform calculations or create new spreadsheets. Part 2 of this article series will demonstrate how to convert spreadsheet information to XML and create a new spreadsheet with modifications to the original data.

When finished with the code in these articles, be sure to recycle it for greener reporting.


Download

DescriptionNameSize
Sample Excel spreadsheet and Java codeJava-Excel-XML-Planet-Power.zip30KB

Resources

Learn

Get products and technologies

Discuss

Comments

developerWorks: Sign in

Required fields are indicated with an asterisk (*).


Need an IBM ID?
Forgot your IBM ID?


Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. Information in your profile (your name, country/region, and company name) is displayed to the public and will accompany any content you post, unless you opt to hide your company name. You may update your IBM account at any time.

All information submitted is secure.

Choose your display name



The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

Required fields are indicated with an asterisk (*).

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.

 


All information submitted is secure.

Dig deeper into XML on developerWorks


static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=XML, Java technology
ArticleID=467940
ArticleTitle=Read, recycle, and reuse: Reporting made easy with Excel, XML, and Java technologies, Part 1
publish-date=03022010