Manage ODF and Microsoft Office 2007 documents with DB2 9 pureXML

Storing and re-purposing data with PHP's PDO and XQuery using IBM DB2 9

Integrate your ODF and Microsoft® Office 2007 documents into your enterprise and Internet applications more easily than ever before with IBM® DB2® 9. Review older methods of data interchange with MS Office documents, and learn how MS Office 2007 offers better data interchange. This article discusses interchange with DB2 9 XQuery, Zend Core for IBM®, HP: Hypertext Preprocessor (PHP), and PHP Data Objects (PDO) technologies.

Share:

Chris C. Gruber (gruber@ca.ibm.com), IBM Technical Marketing, Developer Initiatives, Federated Integration Test team, IBM China Development Lab 

Chris GruberWith more than fifteen years of industry experience, Chris Gruber, Technical Manager for Developer Initiatives, works with IBM Information Management on industry-leading Data Server products. Previously, Chris has worked as a Senior Product Manager for Sybase iAnywhere Solutions. He has worked very closely with engineering and development partners accelerating partners time to the market. He currently works at IBM Information Management as Product Manager in Web 2.0 product lines.



16 August 2007

Also available in Chinese Russian

Introduction

What's new with desktop documents? A great deal, if you follow the Microsoft Windows® community. Microsoft has introduced a new format for MS Office 2007 products that leverages XML internally. You could exchange data between MS Office applications before; but MS Office 2007 and Windows Vista® allow enterprises to tap into islands of data that were historically very difficult to consume and re-purpose (that is, to utilize data intended for one purpose for another purpose). While Microsoft is pushing this concept, OpenOffice applications have implemented ODF for some time and benefit from the XML advantages already identified in the marketplace.

Let's look at a typical desktop application problem. Consider a company whose marketing department is planning a conference. They ask prospective speakers to submit proposals in a MS Word template document, and the document contents are then used for the following tasks:

  • Reviewing submissions
  • E-mailing the potential speakers of successful submissions
  • Posting related information on a conference Web site
  • Publishing content through a new feed (ATOM/RSS) for the conference or company

Traditionally, marketing staff would cut text from the Word document and paste the information into the various systems, which was time consuming. Retaining the document's original format and extracting relevant information for specific tasks saves substantial time and effort. Using XQuery, you can do this with a surprisingly small amount of code.

In this article, we'll briefly review older document data interchange methods for MS Office applications. Then, we'll look at the new format and discuss how this format can be interchanged and re-purposed. We'll use cross platform technologies like Zend Core for IBM, PHP, PDO and XQuery -- all tools you can use with DB2.

This technique can be a valuable part of content management and document management solutions. After reading this article, I hope you will find that consuming and re-purposing ODF and MS Office 2007 documents is easy with IBM DB2 pureXML™ features. Not much code is required, so this solution is easy to implement. Later, you may want to index these Office 2007 documents to take further advantage of the pureXML hybrid storage features.

Overview of desktop applications and XML

This article focuses on the OpenOffice default format ODF and MS Office 2007 formats. You should be aware that the concepts presented here are not limited to these formats and vendors. In fact, you'll find ODF in a variety of vendors including Google Applications, KOffice, and StarOffice. It is worth noting that Microsoft has a translator "plug-in" for consuming and importing ODF formats and that there is a compatibility pack that can open and save MS Office 2007 file formats with older versions of Office. Furthermore, Lotus Notes and Corel have plans for the ODF format as well.

Wikipedia discusses the ODF format and its use in desktop applications as well as providing information about the supporters of ODF.

Resources required

While you can download Apache 2.0, PHP Version 5.21 or greater, DB2 Express-C, and the DB2 extensions for PHP, downloading and installing Zend Core for IBM is a better option. The installer completely configures your environment for PHP and DB2. The installer should include links for downloading DB2 Express-C 9.1.2. You may need to tweak the configuration to add ZIP support, which we'll discuss later.

Old document data interchange

Data Application Programming Interfaces (APIs) on the Windows operating system started with Data Access Objects (DAO). They progressed from DAO (not Dead On Arrival) to Remote Data Objects (RDO), and then Open Database Connectivity (ODBC). The transition to ODBC was great and marked a milestone in access. It enabled a common means for developers to code to one API regardless of the database. Using ODBC, you can import relational data into MS Office applications through the likes of Mail Merge.

The introduction of the Object Linking and Embedding Data Base (OLE/DB), which enables data interchange regardless of data source, only confused the market. For MS Office productivity tools, this meant that you could embed documents within each other. In fact, you see this today when pasting a spreadsheet into a presentation. However, the OLE/DB is a Windows-only solution. Dissecting documents was not very easy with these technologies, and automating the process was difficult.

An overview of the OpenOffice and the new MS Office 2007 format

These formats are effectively a ZIP file containing resources and a folder for XML documents. The actual text content of a Word document is stored as an XML file. In our sample document, it's found in "word/document.xml" within the archive. This file format allows developers to capture the true data within these documents.

While OpenOffice has been using this format for sometime, MS Office 2007 just recently introduced their new document format. Microsoft touts that this format is accessible and easily shared. MS Office documents have been proprietary for many years, and it is refreshing that the Office team within Microsoft is opening the doors. See the ZIP file contents provided below, in the Downloads section, for ODF:

Figure 1. Contents of opendoc.odt within WinZip
WinZip file illustrating that there are XML documents with a Word 2007 document

See the Microsoft format below, note that there are some differences but the concept and format is similar:

Figure 1.1 Contents of submit.docx within WinZip
WinZip file illustrating that there are XML documents with a Word 2007 document

Setting up the database

Of course, you'll need to create a table. You'll store the XML within the new Word format within the XML column rather than in a Character Large Object (CLOB). To do this, we'll need to create an XML-enabled or UTF-8 database.

Listing 1. Creating an XML-enabled database
CREATE DATABASE ODF AUTOMATIC STORAGE YES USING CODESET UTF-8 TERRITORY US

Now, you'll want to connect to the database and create a table with the following definition. The code below assumes you have a username of "db2admin" with CREATETAB authorities. If you do not have this user setup, you can adjust the scripts for an appropriate user on your database.

Listing 2. Creating an XML-enabled table to contain our ODF and MS Office 2007 documents
CREATE TABLE DB2ADMIN.DOCUMENT (
         ID INT NOT NULL PRIMARY KEY,
         OWNER VARCHAR(128),
         DOC XML
       )

Setting up PHP and Zend Core for IBM

The code will use PHP's ZIP utilities and PDO. PDO is already set up for you with Zend Core for IBM, but you need to add "zip" to the configuration:

  1. Open the Zend Core Administration Console.
  2. Switch to the Configuration tab and Extensions sub-tab.
  3. Scroll down to the ZIP extension.
  4. Click on the switch icon to turn it on.
  5. Click on Save Settings.
  6. Restart Apache2.
    1. If it's not already running as a red feather in the system tray, start C:\Program Files\Zend\Apache2\bin\ApacheMonitor.exe.
    2. Click on it, and restart the Apache2 HTTP Server from the menu.

If this does not work for you, try editing the file and make sure the following lines are found in php.ini within the "C:\Program Files\Zend\Core for IBM\etc" folder. Add the line in bold below to your php.ini:

Listing 3. Modifications for php.ini
extension=php_zip.dll
extension=php_pdo.dll
extension=php_pdo_ibm.dll
extension_dir="c:\program files\Zend\Core for IBM\lib\phpext"

Consuming ODF and MS Office 2007 documents

Now, you're ready to consume the opendoc.odt which is an ODF file included in the download. While Figure 2 illustrates a MS Word images of the document, this ODF file would appear similar to Listing 4. The document provided utilizes styles like "Heading 1" and "Heading 2" for formatting. Later, we'll query against the Heading 2 style as it is used in OpenOffice format. Note that in production code you will want to include proper error handling among other things.

I have also included a document submit.docx file, which uses the MS Office 2007 format in the download. Figure 2 represents what this document looks like when viewed in Word. Like ODF, the document utilizes Word styles like "Heading 1" and "Heading 2" for formatting. Later, we'll query against the Heading 2 style.

Figure 2. Sample document as shown in OpenOffice or MS Word
Word 2007 document as displayed in Word 2007

Let's consume this document with the following PHP code, which effectively loops through the entries in the manifest of the ZIP and extracts the XML document with relevant data. Listing 4 provides the code. Save this code under the filename "odfconsume.php" for reference purposes.

Listing 4. PHP code to consume OpenOffice (ODF) format (odfconsume.php)
<?php
$user = "db2admin";
$password = "secret";
$zip = zip_open("c:\opendoc.odt");


$db = new PDO("ibm:DRIVER={IBM DB2 ODBC DRIVER};DATABASE=ODF;" .
  "HOSTNAME=localhost;PORT=50000;PROTOCOL=TCPIP;", $user, $password);
echo $user." Connected\n";

if ($zip) {

    while ($zip_entry = zip_read($zip)) {

        if (zip_entry_open($zip, $zip_entry, "r")
        && zip_entry_name($zip_entry) == 'content.xml' ) {

            $buf = zip_entry_read($zip_entry, zip_entry_filesize($zip_entry));

            $insstr ="INSERT INTO DB2ADMIN.DOCUMENT (ID,OWNER,DOC)
            VALUES (2,'Linux Office',:buf)";

            $stmt = $db->prepare( $insstr );
            $stmt->bindParam( ':buf' , $buf , PDO::PARAM_LOB , strlen($buf) );
            $stmt->execute();

            echo "\nResult: ".$db->errorCode()."\n";

            zip_entry_close($zip_entry);

        }
        echo "\n";
    }
    zip_close($zip);

}
?>

Run this code with the following command line:

 php odfconsume.php

This inserts an XML document in the database and prints out an error code. Hopefully, the code will be zero.

Listing 4.1 is the code for consuming MS Word. Save this code under the filename "msconsume.php" for reference purposes.

Listing 4.1 PHP code to consume MS Office 2007 format (msconsume.php)
<?php
  $user = "db2admin";
  $password = "secret";	
  $zip = zip_open("c:\submit.docx");
  $db = new PDO("ibm:DRIVER={IBM DB2 ODBC DRIVER};DATABASE=ODF;" .
    "HOSTNAME=localhost;PORT=50000;PROTOCOL=TCPIP;", $user, $password);
  echo $user." Connected\n";


  if ($zip) {

     while ($zip_entry = zip_read($zip)) {

        if (zip_entry_open($zip, $zip_entry, "r") & 
                 zip_entry_name($zip_entry) == 'word/document.xml' ) {

            $buf = zip_entry_read($zip_entry, zip_entry_filesize($zip_entry));
	    $insstr ="INSERT INTO DB2ADMIN.DOCUMENT VALUES (1,'BILLY ONAIRE',:buf)";
	    $stmt = $db->prepare( $insstr );
	    $stmt->bindParam( ':buf' , $buf , PDO::PARAM_LOB , strlen($buf) );
  	    $stmt->execute();
	    echo "Result: ".$db->errorCode();
            zip_entry_close($zip_entry);
        }
        echo "\n";
    }
    zip_close($zip);

  }

?>

Like above, run this code with the following command line:

 php msconsume.php

This inserts an XML document in the database and prints out an error code. Again, hopefully the code will be zero.

Re-purposing ODF and Word 2007 content

Now that you have the contents of the OpenOffice and Word 2007 documents in the database, you need to re-purpose this document for your Web site. This is what the Marketing department would have to do in our example. The code for ODF is in Listing 5, which you can save as "odfrepurpose.php" for reference purposes. From the output (an HTML snippet), you can easily see how to reformat this into a news feed or incorporate it into an Ajax application!

Listing 5. Re-purposing the ODF format (opendoc.odt) into a simple HTML paragraph
<?php
$user = "db2admin";
$password = "secret";	

$db = new PDO("ibm:DRIVER={IBM DB2 ODBC DRIVER};DATABASE=ODF;" .
  "HOSTNAME=localhost;PORT=50000;PROTOCOL=TCPIP;", $user, $password);
echo $user." Connected\n";


$xqry = 
<<<TXT
values( 
  XMLSERIALIZE( XMLQUERY(' 
     declare boundary-space strip;
     declare namespace text0="urn:oasis:names:tc:opendocument:xmlns:text:1.0";
     declare namespace office0="urn:oasis:names:tc:opendocument:xmlns:office:1.0";

    for \$t0 in db2-fn:xmlcolumn("DB2ADMIN.DOCUMENT.DOC")
    /office0:document-content/office0:body
       let \$p0 := \$body0/office0:text/text0:p
       let \$h0 := \$body0/office0:text/text0:h
       where fn:exists(\$p0) or \$h0:h/@text:style-name ="Heading_20_2"
       return  
         if ( fn:exists( \$style ) ) then 
            if (\$h0:h/@text:style-name ="Heading_20_2" )  then <h1>{\$txt}</h1>
            else  () 
         else <p>{\$txt}</p>') 
  as VARCHAR(2000)))
TXT;

$result=$db->query( $xqry );
$arr = $result->fetch();
echo $arr[1];

?>

The code for Word 2007 is found in Listing 5.1, which you can save as "msrepurpose.php" for reference purposes. From the output (an HTML snippet), you can easily see how to reformat this into a news feed or incorporate it into an Ajax application!

Listing 5.1 Re-purposing the Word 2007 format (submit.docx) into a simple HTML paragraph
                <?php
                    $user = "db2admin";
                    $password = "secret";
                    $db = new PDO("ibm:DRIVER={IBM DB2 ODBC DRIVER};DATABASE=ODF;" .
                    "HOSTNAME=localhost;PORT=50000;PROTOCOL=TCPIP;", $user, $password);
                    echo $user." Connected\n";
                    $xqry =
                <<<TXT
                    values( 
                    XMLSERIALIZE( XMLQUERY('
                    declare boundary-space strip;
                    declare namespace text0=
                    "urn:oasis:names:tc:opendocument:xmlns:text:1.0"; 
                    declare namespace office0=
                    "urn:oasis:names:tc:opendocument:xmlns:office:1.0"; 
                    
                    for \$t0 in db2-fn:xmlcolumn("DB2ADMIN.DOCUMENT.DOC")
                    /office0:document-content/office0:body/office0:text
                    let \$p := (for \$pp in \$t0/text0:p return <p>{\$pp/text()}</p>)
                    let \$h := (for \$hh in \$t0/text0:h[@text0:style-name=
                    "Heading_20_2"] 
                    return <hl>{\$hh/text()}</hl>)
                    return
                    
                    (\$h,\$p)')
                    as varchar(3000)))
                    TXT;
                    
                    echo "\n";
                    $result=$db->query( $xqry );
                    
                    $arr = $result->fetch();
                    echo $arr[1];
                ?>

In order to run the code, enter:

php.exe msrepurpose.php

Tip: I was assisted in building this XQuery statement by DB2 Developer Workbench XQuery support. I managed to build the query by way of point-and-click. This saved a fair bit of time.

Now you may ask, why wouldn't you just save this as HTML? The XML generation could jeopardize the integrity of the original document and lose MS Office features. With that said, the document does get much smaller and you can still re-purpose it with XQuery, as with the Word 2007 file format. As your documents get bigger and there are more of them (such as in the case of big conferences), you'll want to apply this logic for each section of each document in your repertoire; then the power of XQuery really shines through! Like SQL, XQuery lets you handle sets of documents with one query. In fact, you can bring the results into SQL queries again if you so wish because DB2 can interchange the languages with DB2.

The following listing illustrates the result of the odfrepurpose.php or msrepurpose.php:

Listing 6. The output
db2admin Connected
<h1>Introduction to Web 2.0</h1><p>This is a document
 that will impress upon ...</p><h1>Abstract</h1><p>The nature of the industry is a
gain turning to the browser ...</p>

Conclusion

The techniques discussed in this article can be useful in content management and document management solutions. Hopefully, you've found that consuming and re-purposing OpenOffice and MS Office 2007 documents is easy with IBM DB2 pureXML features. There really is not much code involved, which is conducive to great performance. Consider indexing these XML documents later to take further advantage of the pureXML hybrid storage features.

To learn more on XQuery, work with the XQuery editor in DB2 Developer Workbench or browse DB2 9: pureXML Overview and Fast Start (IBM Redbooks). See the IBM_PDO site on pecl.php.net for more information on PDO.


Download

DescriptionNameSize
Sample PHP scripts and submit.docx file mngingodf.zip16KB

Resources

Learn

Get products and technologies

  • DB2 Express-C can be freely distributed with your applications.
  • Zend Core for IBM bundles Apache, PHP, and all the necessary modules for DB2 with a convenient administration console.
  • Build your next development project with IBM trial software, available for download directly from developerWorks.

Discuss

Comments

developerWorks: Sign in

Required fields are indicated with an asterisk (*).


Need an IBM ID?
Forgot your IBM ID?


Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. Information in your profile (your name, country/region, and company name) is displayed to the public and will accompany any content you post, unless you opt to hide your company name. You may update your IBM account at any time.

All information submitted is secure.

Choose your display name



The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

Required fields are indicated with an asterisk (*).

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.

 


All information submitted is secure.

Dig deeper into Information management on developerWorks


static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Information Management, XML, Open source
ArticleID=226250
ArticleTitle=Manage ODF and Microsoft Office 2007 documents with DB2 9 pureXML
publish-date=08162007