Skip to main content

By clicking Submit, you agree to the developerWorks terms of use.

The first time you sign into developerWorks, a profile is created for you. Select information in your profile (name, country/region, and company) is displayed to the public and will accompany any content you post. You may update your IBM account at any time.

All information submitted is secure.

  • Close [x]

The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerworks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

By clicking Submit, you agree to the developerWorks terms of use.

All information submitted is secure.

  • Close [x]

Cultured Perl: Reading and writing Excel files with Perl

Using the Spreadsheet::WriteExcel and Spreadsheet::ParseExcel modules

Teodor Zlatanov (tzz@bu.edu), Programmer, Gold Software Systems
Author photo
Teodor Zlatanov graduated with an M.S. in computer engineering from Boston University in 1999. He has worked as a programmer since 1992, using Perl, Java, C, and C++. His interests are in open source work on text parsing, 3-tier client-server database architectures, UNIX system administration, CORBA, and project management. Suggestions and corrections are welcome over e-mail. Contact Teodor at tzz@iglou.com.

Summary:  Only recently have the doors been open to Microsoft Excel, the most popular spreadsheet application for the desktop. This article takes a look at reading and writing Excel files in Windows and Linux, using Perl and a few simple modules. The author of this article, Teodor Zlatanov, is an expert in Perl who has been working in the community since 1992 and who specializes in, among other things, open source work in text parsing.

Date:  01 Sep 2001
Level:  Introductory
Also available in:   Japanese

Activity:  190958 views
Comments:  

Parsing Excel files presents a conundrum any way you look at it. Until last year, UNIX modules were completely unavailable, and data from Excel files for Windows could only be retrieved with the Win32::OLE modules. But things have finally changed, thanks to two Perl hackers and a lot of volunteer help and contributions!

Spreadsheet::WriteExcel and Spreadsheet::ParseExcel

In 2000, Takanori Kawai and John McNamara produced the Spreadsheet::WriteExcel and Spreadsheet::ParseExcel modules and posted them on CPAN, which made it possible, though not easy, to extract data from Excel files on any platform.

As we'll see later, Win32::OLE still offers a simpler, more reliable solution if you're working with Windows, and is recommended by the Spreadsheet::WriteExcel module for more powerful manipulations of data and worksheets. Win32::OLE comes with the ActiveState Perl toolkit, and can be used to drive a lot of other Windows applications through OLE. Note that to use this module, you still need to have the Excel engine (usually installed with Excel itself) installed and licensed on your machine.

The applications that need to parse Excel data number in the thousands, but here are a few examples: exporting Excel to CSV, interacting with a spreadsheet stored on a shared drive, moving financial data to a database for reporting, and analyzing data not provided in any other format.

To follow along with the examples given here, you must have Perl 5.6.0 installed on your system. Preferably, your system should be a recent (2000 or later) mainstream UNIX installation (Linux, Solaris, BSD). Although the examples may work with earlier versions of Perl and UNIX, and with other operating systems, you should consider cases where they fail to function as exercises to solve.


Windows example: parsing

This section applies to Windows machines only. All the other sections apply to Linux.

Before you proceed, install ActiveState Perl (version 628 used here) or the ActiveState Komodo IDE for editing and debugging Perl. Komodo comes with a free license for home users, which you can get in a matter of minutes. (See Resources later in this article for the download sites.)

Installing the Spreadsheet::ParseExcel and Spreadsheet::WriteExcel modules using the ActiveState PPM package manager is difficult. PPM has no history, options are hard to set, help scrolls off the screen, and the default is to install modules ignoring dependencies. You can invoke PPM from the command line by typing "ppm" and issuing the following commands:

ppm> install OLE::Storage_Lite
ppm> install Spreadsheet::ParseExcel
ppm> install Spreadsheet::WriteExcel

The module install will fail in this case, because IO::Scalar is not yet available, so you may want to give up trying to find the problem with PPM, and switch to the built-in Win32::OLE module. However, by the time you read this, ActiveState may have released a fix for this problem.

With Win32::OLE from the ActiveState toolkit, you can dump a worksheet, cell by cell, using the code listed below:

Download win32excel.pl

#!/usr/bin/perl -w

use strict;
use Win32::OLE qw(in with);
use Win32::OLE::Const 'Microsoft Excel';

$Win32::OLE::Warn = 3;                                # die on errors...

# get already active Excel application or open new
my $Excel = Win32::OLE->GetActiveObject('Excel.Application')
    || Win32::OLE->new('Excel.Application', 'Quit');  

# open Excel file
my $Book = $Excel->Workbooks->Open("c:/komodo projects/test.xls"); 

# You can dynamically obtain the number of worksheets, rows, and columns
# through the Excel OLE interface.  Excel's Visual Basic Editor has more
# information on the Excel OLE interface.  Here we just use the first
# worksheet, rows 1 through 4 and columns 1 through 3.

# select worksheet number 1 (you can also select a worksheet by name)
my $Sheet = $Book->Worksheets(1);

foreach my $row (1..4)
{
 foreach my $col (1..3)
 {
  # skip empty cells
  next unless defined $Sheet->Cells($row,$col)->{'Value'};

 # print out the contents of a cell  
  printf "At ($row, $col) the value is %s and the formula is %s\n",
   $Sheet->Cells($row,$col)->{'Value'},
   $Sheet->Cells($row,$col)->{'Formula'};        
 }
}

# clean up after ourselves
$Book->Close;

Note that you can assign values to cells very easily in the following way:

$sheet->Cells($row, $col)->{'Value'} = 1;


Linux example: parsing

This section applies to UNIX, and specifically Linux. It has not been tested under Windows.

It would be difficult to give a better example of parsing with Linux than the one provided in the documentation for the Spreadsheet::ParseExcel module, so I will show that example and then explain how it works.

Download parse-excel.pl


Listing 3: parse-excel.pl
#!/usr/bin/perl -w

use strict;
use Spreadsheet::ParseExcel;

my $oExcel = new Spreadsheet::ParseExcel;

die "You must provide a filename to $0 to be parsed as an Excel file" unless @ARGV;

my $oBook = $oExcel->Parse($ARGV[0]);
my($iR, $iC, $oWkS, $oWkC);
print "FILE  :", $oBook->{File} , "\n";
print "COUNT :", $oBook->{SheetCount} , "\n";

print "AUTHOR:", $oBook->{Author} , "\n"
 if defined $oBook->{Author};

for(my $iSheet=0; $iSheet < $oBook->{SheetCount} ; $iSheet++)
{
 $oWkS = $oBook->{Worksheet}[$iSheet];
 print "--------- SHEET:", $oWkS->{Name}, "\n";
 for(my $iR = $oWkS->{MinRow} ;
     defined $oWkS->{MaxRow} && $iR <= $oWkS->{MaxRow} ;
     $iR++)
 {
  for(my $iC = $oWkS->{MinCol} ;
      defined $oWkS->{MaxCol} && $iC <= $oWkS->{MaxCol} ;
      $iC++)
  {
   $oWkC = $oWkS->{Cells}[$iR][$iC];
   print "( $iR , $iC ) =>", $oWkC->Value, "\n" if($oWkC);
  }
 }
}

This example was tested with Excel 97. If it does not work, try converting to the Excel 97 format. The perldoc page for Spreadsheet::ParseExcel claims Excel 95 and 2000 compatibility as well.

The spreadsheet is parsed into a top-level object called $oBook. $oBook has properties to aid the program, such as "File," "SheetCount," and "Author." The properties are documented in the perldoc page for Spreadsheet::ParseExcel, in the workbook section.

The workbook contains several worksheets; iterate through them by using the workbook SheetCount property. Each worksheet has a MinRow and MinCol and corresponding MaxRow and MaxCol properties, which can be used to figure out the range the worksheet can access. The properties are documented in the perldoc page for Spreadsheet::ParseExcel, in the worksheet section.

Cells can be obtained from a worksheet through the Cells property; that's how the $oWkC object is obtained in Listing 3. Cell properties are documented in the perldoc page for Spreadsheet::ParseExcel, in the Cell section. There does not seem to be a way, according to the documentation, to obtain the formula listed in a particular cell.


Linux example: writing

This section applies to UNIX, and specifically Linux. It has not been tested under Windows.

Spreadsheet::WriteExcel comes with a lot of example scripts in the Examples directory, usually found under /usr/lib/perl5/site_perl/5.6.0/Spreadsheet/WriteExcel/examples. It may have been installed elsewhere; consult with your local Perl administrator if you can't find that directory.

The bad news is that Spreadsheet::WriteExcel can not be used to write to an existing Excel file. You have to import data from an existing Excel file yourself, using Spreadsheet::ParseExcel. The good news is that Spreadsheet::WriteExcel is compatible with Excel 5 up to Excel 2000.

Here's a program that will demonstrate how data can be extracted from an Excel file, modified (all the numbers are multiplied by 2), and written to a new Excel file. Only the data is preserved, without formatting or any properties. Formulas are dropped.

Download excel-x2.pl


Listing 4: excel-x2.pl
#!/usr/bin/perl -w

use strict;
use Spreadsheet::ParseExcel;
use Spreadsheet::WriteExcel;
use Data::Dumper;

# cobbled together from examples for the Spreadsheet::ParseExcel and
# Spreadsheet::WriteExcel modules

my $sourcename = shift @ARGV;
my $destname = shift @ARGV or die "invocation: $0 <source file> <destination file>";

my $source_excel = new Spreadsheet::ParseExcel;

my $source_book = $source_excel->Parse($sourcename)
 or die "Could not open source Excel file $sourcename: $!";

my $storage_book;

foreach my $source_sheet_number (0 .. $source_book->{SheetCount}-1)
{
 my $source_sheet = $source_book->{Worksheet}[$source_sheet_number];

 print "--------- SHEET:", $source_sheet->{Name}, "\n";

 # sanity checking on the source file: rows and columns should be sensible
 next unless defined $source_sheet->{MaxRow};
 next unless $source_sheet->{MinRow} <= $source_sheet->{MaxRow};
 next unless defined $source_sheet->{MaxCol};
 next unless $source_sheet->{MinCol} <= $source_sheet->{MaxCol};

 foreach my $row_index ($source_sheet->{MinRow} .. $source_sheet->{MaxRow})
 {
  foreach my $col_index ($source_sheet->{MinCol} .. $source_sheet->{MaxCol})
  {
   my $source_cell = $source_sheet->{Cells}[$row_index][$col_index];
   if ($source_cell)
   {
    print "( $row_index , $col_index ) =>", $source_cell->Value, "\n";

    if ($source_cell->{Type} eq 'Numeric')
    {
  $storage_book->{$source_sheet->{Name}}->{$row_index}->{$col_index} = $source_cell->Value*2;
    }
    else
    {
  $storage_book->{$source_sheet->{Name}}->{$row_index}->{$col_index} = $source_cell->Value;
    } # end of if/else
   } # end of source_cell check
  } # foreach col_index
 } # foreach row_index
} # foreach source_sheet_number

print "Perl recognized the following data (sheet/row/column order):\n";
print Dumper $storage_book;

my $dest_book  = Spreadsheet::WriteExcel->new("$destname")
 or die "Could not create a new Excel file in $destname: $!";

print "\n\nSaving recognized data in $destname...";

foreach my $sheet (keys %$storage_book)
{
 my $dest_sheet = $dest_book->addworksheet($sheet);
 foreach my $row (keys %{$storage_book->{$sheet}})
 {
  foreach my $col (keys %{$storage_book->{$sheet}->{$row}})
  {
   $dest_sheet->write($row, $col, $storage_book->{$sheet}->{$row}->{$col});
  } # foreach column
 } # foreach row
} # foreach sheet

$dest_book->close();

print "done!\n";

It is noteworthy that the data extraction and storage parts of the program are forcibly separated. They could have been done at the same time, but by separating them, bug fixes and improvements can be easily made.

A much better solution to the problem above could be achieved with the XML::Excel CPAN module, but a special converter from XML back to Excel would have to be written. You can also use the DBI interface through the DBD::Excel module, if you want to import data that way. Finally, Spreadsheet::ParseExcel comes with the Spreadsheet::ParseExcel::SaveParser module, which claims to convert between two Excel files but comes with no documentation or examples. My Web site (see Resources) shows an example of using SaveParser. Be forewarned that this is experimental and highly combustible.


Conclusion

If you are using a Windows machine, stick with the Win32::OLE modules unless you don't have Excel at all on your machine. Win32::OLE is the easiest way to get Excel data right now, although the Spreadsheet::WriteExcel and Spreadsheet::ParseExcel modules are catching up.

On UNIX, especially Linux, go with the Spreadsheet::WriteExcel and Spreadsheet::ParseExcel modules for programmatic access to Excel data. But be forewarned that these are fairly young modules, and they may not be perfect for you if you need stability.

You may also consider packages like Gnumeric and StarOffice (see Resources), which are freely available and offer a full GUI interface and import/export capabilities for Excel files. These are useful if you don't need programmatic access to the Excel data. I have used both applications and find them wonderful for day-to-day tasks.


Resources

About the author

Author photo

Teodor Zlatanov graduated with an M.S. in computer engineering from Boston University in 1999. He has worked as a programmer since 1992, using Perl, Java, C, and C++. His interests are in open source work on text parsing, 3-tier client-server database architectures, UNIX system administration, CORBA, and project management. Suggestions and corrections are welcome over e-mail. Contact Teodor at tzz@iglou.com.

Report abuse help

Report abuse

Thank you. This entry has been flagged for moderator attention.


Report abuse help

Report abuse

Report abuse submission failed. Please try again later.


developerWorks: Sign in


Need an IBM ID?
Forgot your IBM ID?


Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. Select information in your profile (name, country/region, and company) is displayed to the public and will accompany any content you post. You may update your IBM account at any time.

Choose your display name

The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.

 


Rate this article

Comments

static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Linux, Open source
ArticleID=11153
ArticleTitle=Cultured Perl: Reading and writing Excel files with Perl
publish-date=09012001
author1-email=tzz@bu.edu
author1-email-cc=