 | Level: Intermediate Chris C. Gruber (gruber@ca.ibm.com), IBM Technical Marketing, Developer Initiatives, IBM
16 Aug 2007 Integrate your ODF and Microsoft® Office 2007 documents into your enterprise and Internet applications more easily than ever before with IBM® DB2® 9. Review older methods of data interchange with MS Office documents, and learn how MS Office 2007 offers better data interchange. This article discusses interchange with DB2 9 XQuery, Zend Core for IBM®, HP: Hypertext Preprocessor (PHP), and PHP Data Objects (PDO) technologies.
Introduction
What's new with desktop documents? A great deal, if you follow the Microsoft Windows® community. Microsoft has introduced a new format for MS Office 2007 products that leverages XML internally.
You could exchange data between MS Office applications before; but MS Office 2007 and Windows Vista® allow enterprises to tap into islands of data that were historically very difficult to consume and re-purpose (that is, to utilize data intended for one purpose for another purpose).
While Microsoft is pushing this concept, OpenOffice applications have implemented ODF for some time and benefit from the XML advantages already identified in the marketplace.
Let's look at a typical desktop application problem. Consider a company whose marketing department is planning a conference.
They ask prospective speakers to submit proposals in a MS Word template document, and the document contents are then used for the following tasks:
- Reviewing submissions
- E-mailing the potential speakers of successful submissions
- Posting related information on a conference Web site
- Publishing content through a new feed (ATOM/RSS) for the conference or company
Traditionally, marketing staff would cut text from the Word document and paste the information into the various systems, which was time consuming. Retaining the document's original format and extracting relevant information for specific tasks saves substantial time and effort. Using XQuery, you can do this with a surprisingly small amount of code.
In this article, we'll briefly review older document data interchange methods for MS Office applications. Then, we'll look at the new format and discuss how this format can be interchanged and re-purposed. We'll use cross platform technologies like Zend Core for IBM, PHP, PDO and XQuery -- all tools you can use with DB2.
This technique can be a valuable part of content management and document management solutions. After reading this article, I hope you will find that consuming and re-purposing ODF and MS Office 2007 documents is easy with IBM DB2 pureXML™ features. Not much code is required, so this solution is easy to implement. Later, you may want to index these Office 2007 documents to take further advantage of the pureXML hybrid storage features.
Overview of desktop applications and XML
This article focuses on the OpenOffice default format ODF and MS Office 2007 formats. You should be aware that the concepts presented here are not limited to these formats and vendors. In fact, you'll find ODF in a variety of vendors including Google Applications, KOffice, and StarOffice. It is worth noting that Microsoft has a translator "plug-in" for consuming and importing ODF formats and that there is a compatibility pack that can open and save MS Office 2007 file formats with older versions of Office. Furthermore, Lotus Notes and Corel have plans for the ODF format as well.
Wikipedia discusses the ODF format and its use in desktop applications as well as providing information about the supporters of ODF.
Resources required
While you can download Apache 2.0, PHP Version 5.21 or greater, DB2 Express-C, and the DB2 extensions for PHP, downloading and installing Zend Core for IBM is a better option. The installer completely configures your environment for PHP and DB2. The installer should include links for downloading DB2 Express-C 9.1.2. You may need to tweak the configuration to add ZIP support, which we'll discuss later.
Old document data interchange
Data Application Programming Interfaces (APIs) on the Windows operating system started with Data Access Objects (DAO). They progressed from DAO (not Dead On Arrival) to Remote Data Objects (RDO), and then Open Database Connectivity (ODBC). The transition to ODBC was great and marked a milestone in access. It enabled a common means for developers to code to one API regardless of the database. Using ODBC, you can import relational data into MS Office applications through the likes of Mail Merge.
The introduction of the Object Linking and Embedding Data Base (OLE/DB), which enables data interchange regardless of data source, only confused the market. For MS Office productivity tools, this meant that you could embed documents within each other. In fact, you see this today when pasting a spreadsheet into a presentation. However, the OLE/DB is a Windows-only solution. Dissecting documents was not very easy with these technologies, and automating the process was difficult.
An overview of the OpenOffice and the new MS Office 2007 format
These formats are effectively a ZIP file containing resources and a folder for XML documents. The actual text content of a Word document is stored as an XML file. In our sample document, it's found in "word/document.xml" within the archive. This file format allows developers to capture the true data within these documents.
While OpenOffice has been using this format for sometime, MS Office 2007 just recently introduced their new document format. Microsoft touts that this format is accessible and easily shared. MS Office documents have been proprietary for many years, and it is refreshing that the Office team within Microsoft is opening the doors. See the ZIP file contents provided below, in the Downloads section, for ODF:
Figure 1. Contents of opendoc.odt within WinZip
See the Microsoft format below, note that there are some differences but the concept and format is similar:
Figure 1.1 Contents of submit.docx within WinZip
Setting up the database
Of course, you'll need to create a table. You'll store the XML within the new Word format within the XML column rather than in a Character Large Object (CLOB). To do this, we'll need to create an XML-enabled or UTF-8 database.
Listing 1. Creating an XML-enabled database
CREATE DATABASE ODF AUTOMATIC STORAGE YES USING CODESET UTF-8 TERRITORY US
|
Now, you'll want to connect to the database and create a table with the following definition. The code below assumes you have a username of "db2admin" with CREATETAB authorities. If you do not have this user setup, you can adjust the scripts for an appropriate user on your database.
Listing 2. Creating an XML-enabled table to contain our ODF and MS Office 2007 documents
CREATE TABLE DB2ADMIN.DOCUMENT (
ID INT NOT NULL PRIMARY KEY,
OWNER VARCHAR(128),
DOC XML
)
|
Setting up PHP and Zend Core for IBM
The code will use PHP's ZIP utilities and PDO. PDO is already set up for you with Zend Core for IBM, but you need to add "zip" to the configuration:
- Open the Zend Core Administration Console.
- Switch to the Configuration tab and Extensions sub-tab.
- Scroll down to the ZIP extension.
- Click on the switch icon to turn it on.
- Click on Save Settings.
- Restart Apache2.
- If it's not already running as a red feather in the system tray, start C:\Program Files\Zend\Apache2\bin\ApacheMonitor.exe.
- Click on it, and restart the Apache2 HTTP Server from the menu.
If this does not work for you, try editing the file and make sure the following lines are found in php.ini within the "C:\Program Files\Zend\Core for IBM\etc" folder. Add the line in bold below to your php.ini:
Listing 3. Modifications for php.ini
extension=php_zip.dll
extension=php_pdo.dll
extension=php_pdo_ibm.dll
extension_dir="c:\program files\Zend\Core for IBM\lib\phpext"
|
Consuming ODF and MS Office 2007 documents
Now, you're ready to consume the opendoc.odt which is an ODF file included in the download. While Figure 2 illustrates a MS Word images of the document, this ODF file would appear similar to Listing 4. The document provided utilizes styles like "Heading 1" and "Heading 2" for formatting. Later, we'll query against the Heading 2 style as it is used in OpenOffice format. Note that in production code you will want to include proper error handling among other things.
I have also included a document submit.docx file, which uses the MS Office 2007 format in the download. Figure 2 represents what this document looks like when viewed in Word. Like ODF, the document utilizes Word styles like "Heading 1" and "Heading 2" for formatting. Later, we'll query against the Heading 2 style.
Figure 2. Sample document as shown in OpenOffice or MS Word
Let's consume this document with the following PHP code, which effectively loops through the entries in the manifest of the ZIP and extracts the XML document with relevant data. Listing 4 provides the code. Save this code under the filename "odfconsume.php" for reference purposes.
Listing 4. PHP code to consume OpenOffice (ODF) format (odfconsume.php)
<?php
$user = "db2admin";
$password = "secret";
$zip = zip_open("c:\opendoc.odt");
$db = new PDO("ibm:DRIVER={IBM DB2 ODBC DRIVER};DATABASE=ODF;" .
"HOSTNAME=localhost;PORT=50000;PROTOCOL=TCPIP;", $user, $password);
echo $user." Connected\n";
if ($zip) {
while ($zip_entry = zip_read($zip)) {
if (zip_entry_open($zip, $zip_entry, "r")
&& zip_entry_name($zip_entry) == 'content.xml' ) {
$buf = zip_entry_read($zip_entry, zip_entry_filesize($zip_entry));
$insstr ="INSERT INTO DB2ADMIN.DOCUMENT (ID,OWNER,DOC)
VALUES (2,'Linux Office',:buf)";
$stmt = $db->prepare( $insstr );
$stmt->bindParam( ':buf' , $buf , PDO::PARAM_LOB , strlen($buf) );
$stmt->execute();
echo "\nResult: ".$db->errorCode()."\n";
zip_entry_close($zip_entry);
}
echo "\n";
}
zip_close($zip);
}
?>
|
Run this code with the following command line:
This inserts an XML document in the database and prints out an error code. Hopefully, the code will be zero.
Listing 4.1 is the code for consuming MS Word. Save this code under the filename "msconsume.php" for reference purposes.
Listing 4.1 PHP code to consume MS Office 2007 format (msconsume.php)
<?php
$user = "db2admin";
$password = "secret";
$zip = zip_open("c:\submit.docx");
$db = new PDO("ibm:DRIVER={IBM DB2 ODBC DRIVER};DATABASE=ODF;" .
"HOSTNAME=localhost;PORT=50000;PROTOCOL=TCPIP;", $user, $password);
echo $user." Connected\n";
if ($zip) {
while ($zip_entry = zip_read($zip)) {
if (zip_entry_open($zip, $zip_entry, "r") &
zip_entry_name($zip_entry) == 'word/document.xml' ) {
$buf = zip_entry_read($zip_entry, zip_entry_filesize($zip_entry));
$insstr ="INSERT INTO DB2ADMIN.DOCUMENT VALUES (1,'BILLY ONAIRE',:buf)";
$stmt = $db->prepare( $insstr );
$stmt->bindParam( ':buf' , $buf , PDO::PARAM_LOB , strlen($buf) );
$stmt->execute();
echo "Result: ".$db->errorCode();
zip_entry_close($zip_entry);
}
echo "\n";
}
zip_close($zip);
}
?>
|
Like above, run this code with the following command line:
This inserts an XML document in the database and prints out an error code. Again, hopefully the code will be zero.
Re-purposing ODF and Word 2007 content
Now that you have the contents of the OpenOffice and Word 2007 documents in the database, you need to re-purpose this document for your Web site. This is what the Marketing department would have to do in our example. The code for ODF is in Listing 5, which you can save as "odfrepurpose.php" for reference purposes. From the output (an HTML snippet), you can easily see how to reformat this into a news feed or incorporate it into an Ajax application!
Listing 5. Re-purposing the ODF format (opendoc.odt) into a simple HTML paragraph
<?php
$user = "db2admin";
$password = "secret";
$db = new PDO("ibm:DRIVER={IBM DB2 ODBC DRIVER};DATABASE=ODF;" .
"HOSTNAME=localhost;PORT=50000;PROTOCOL=TCPIP;", $user, $password);
echo $user." Connected\n";
$xqry =
<<<TXT
values(
XMLSERIALIZE( XMLQUERY('
declare boundary-space strip;
declare namespace text0="urn:oasis:names:tc:opendocument:xmlns:text:1.0";
declare namespace office0="urn:oasis:names:tc:opendocument:xmlns:office:1.0";
for \$t0 in db2-fn:xmlcolumn("DB2ADMIN.DOCUMENT.DOC")
/office0:document-content/office0:body
let \$p0 := \$body0/office0:text/text0:p
let \$h0 := \$body0/office0:text/text0:h
where fn:exists(\$p0) or \$h0:h/@text:style-name ="Heading_20_2"
return
if ( fn:exists( \$style ) ) then
if (\$h0:h/@text:style-name ="Heading_20_2" ) then <h1>{\$txt}</h1>
else ()
else <p>{\$txt}</p>')
as VARCHAR(2000)))
TXT;
$result=$db->query( $xqry );
$arr = $result->fetch();
echo $arr[1];
?>
|
The code for Word 2007 is found in Listing 5.1, which you can save as "msrepurpose.php" for reference purposes. From the output (an HTML snippet), you can easily see how to reformat this into a news feed or incorporate it into an Ajax application!
Listing 5.1 Re-purposing the Word 2007 format (submit.docx) into a simple HTML paragraph
<?php
$user = "db2admin";
$password = "secret";
$db = new PDO("ibm:DRIVER={IBM DB2 ODBC DRIVER};DATABASE=ODF;" .
"HOSTNAME=localhost;PORT=50000;PROTOCOL=TCPIP;", $user, $password);
echo $user." Connected\n";
$xqry =
<<<TXT
values(
XMLSERIALIZE( XMLQUERY('
declare boundary-space strip;
declare namespace text0=
"urn:oasis:names:tc:opendocument:xmlns:text:1.0";
declare namespace office0=
"urn:oasis:names:tc:opendocument:xmlns:office:1.0";
for \$t0 in db2-fn:xmlcolumn("DB2ADMIN.DOCUMENT.DOC")
/office0:document-content/office0:body/office0:text
let \$p := (for \$pp in \$t0/text0:p return <p>{\$pp/text()}</p>)
let \$h := (for \$hh in \$t0/text0:h[@text0:style-name=
"Heading_20_2"]
return <hl>{\$hh/text()}</hl>)
return
(\$h,\$p)')
as varchar(3000)))
TXT;
echo "\n";
$result=$db->query( $xqry );
$arr = $result->fetch();
echo $arr[1];
?>
|
In order to run the code, enter:
 | |
Tip: I was assisted in building this XQuery statement by DB2 Developer Workbench XQuery support. I managed to build the query by way of point-and-click. This saved a fair bit of time. |
|
Now you may ask, why wouldn't you just save this as HTML? The XML generation could jeopardize the integrity of the original document and lose MS Office features. With that said, the document does get much smaller and you can still re-purpose it with XQuery, as with the Word 2007 file format. As your documents get bigger and there are more of them (such as in the case of big conferences), you'll want to apply this logic for each section of each document in your repertoire; then the power of XQuery really shines through! Like SQL, XQuery lets you handle sets of documents with one query. In fact, you can bring the results into SQL queries again if you so wish because DB2 can interchange the languages with DB2.
The following listing illustrates the result of the odfrepurpose.php or msrepurpose.php:
Listing 6. The output
db2admin Connected
<h1>Introduction to Web 2.0</h1><p>This is a document
that will impress upon ...</p><h1>Abstract</h1><p>The nature of the industry is a
gain turning to the browser ...</p>
|
Conclusion
The techniques discussed in this article can be useful in content management and document management solutions. Hopefully, you've found that consuming and re-purposing OpenOffice and MS Office 2007 documents is easy with IBM DB2 pureXML features. There really is not much code involved, which is conducive to great performance. Consider indexing these XML documents later to take further advantage of the pureXML hybrid storage features.
To learn more on XQuery, work with the XQuery editor in DB2 Developer Workbench or browse
DB2 9: pureXML Overview and Fast Start
(IBM Redbooks). See the IBM_PDO site on pecl.php.net for more information on PDO.
Download | Description | Name | Size | Download method |
|---|
| Sample PHP scripts and submit.docx file | mngingodf.zip | 16KB | HTTP |
|---|
Resources Learn
Get products and technologies
-
DB2 Express-C can be freely distributed with your applications.
-
Zend Core for IBM bundles Apache, PHP, and all the necessary modules for DB2 with a convenient administration console.
-
Build your next development project with
IBM
trial software, available for download directly from developerWorks.
Discuss
About the author  | 
|  | With more than thirteen years of industry experience, Chris Gruber, Technical Manager for Developer Initiatives, works with IBM Information Management on industry-leading Data Server products. Previously, Chris has worked as a Senior Product Manager for Sybase iAnywhere Solutions. He has worked very closely with engineering and development partners accelerating partners time to the market. |
Rate this page
|  |