Manipulate OpenOffice.org's XML-based document formats

Simple steps to edit OpenDocument Format (ODF) text files to perform common or tedious tasks

Roger McCoy (rogermccoy@gmail.com), IT Specialist, Freelance
Roger McCoy is a developer who has worked with many different programming languages, including C, Java, JavaScript, Perl, PHP, Visual Basic and many others. He has five years of experience developing PHP applications but is perhaps best known for his work as a technician in the call center industry.

Summary:  In this article, learn to take advantage of the compressed Extensible Markup Language (XML) format used by OpenOffice.org and similar programs to automate document editing. Learn to dissect OpenOffice.org's OpenDocument Format (ODF) text files and make changes to your documents using scripts or simple search-and-replace functions.

Date:  26 Jun 2007
Level:  Introductory

Things you can't do with your average office suite

You've run into it at some point. The office suite that you use has 500,000 features, just not the one that you need. No matter how many features are added to any particular office suite, you can always find a tedious task that, if automated, will save you endless repetition.

For example, OpenOffice.org is great at generating references for indexes. You can generate hundreds or thousands of references in a few keystrokes. But what happens when you want to remove them? You have to delete them one at a time. This can be a pain if you decide to change the keywords that you want indexed.

You can probably come up with a few examples of other procedures yourself. Depending on your office suite, you might be able to handle some problems programmatically using StarOffice Basic or Microsoft® Visual Basic® for Applications (VBA), but that's contingent on you already having a good knowledge of those languages. What if you're not a programmer? Or what if you do know how to program but realize that it will simply take more time than it's worth?

Depending on your problem, you might even be able to do a quick fix without programming, thanks to the increasingly popular XML document formats. Using your existing knowledge of XML, common XML extensions (such as namespaces), and common XML-based file formats (such as Extensible Hypertext Markup Language [XHTML] and Scalable Vector Graphics [SVG]), you can quickly make massive changes to documents that are not easily accomplished in the office suite. The ODF used in OpenOffice.org is particularly handy for this, so I'll focus on OpenOffice formats in my examples.

The OpenDocument Format

The ODF has the benefit of extreme simplicity. As used by OpenOffice.org, the format consists of a simple Java™ Archive (JAR) file, which is basically a compressed (.zip) file with a manifest included. This compressed file contains a series of XML files that describe different parts of the document.

Look at an OpenOffice.org text document as an example. Using OpenOffice.org (see Resources for download information), create a new text document. Type Hello world!, and then save and close the document. Make sure the document is saved in OpenDocument Text (ODT) format (using an .odt extension).

Again, OpenOffice.org text documents are basically compressed files, so the simplest way for most people to take one apart is:

  1. Copy the document (so as not to damage the original).
  2. Rename the copy with a .zip extension.
  3. Extract the compressed file with your favorite compression utility (such as unzip, WinZip, or Microsoft Windows® Explorer).
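
The three steps above can also be scripted. What follows is a minimal Python sketch, not the only way to do it. It assumes a document named hello.odt in the current directory (a placeholder name) and uses a hypothetical helper, extract_odf. Because the ODT file is already a compressed archive, Python's standard zipfile module can read it without the rename step.

```python
import shutil
import zipfile
from pathlib import Path


def extract_odf(document, dest_dir):
    """Copy an ODF document, then unpack the copy into dest_dir.

    Returns the names of the archive members in their stored order.
    """
    document = Path(document)
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)

    # Step 1: work on a copy so the original stays untouched.
    working_copy = dest_dir / (document.stem + "-copy" + document.suffix)
    shutil.copyfile(document, working_copy)

    # Steps 2 and 3: no rename is needed; zipfile reads the .odt directly.
    unpacked = dest_dir / "unpacked"
    with zipfile.ZipFile(working_copy) as package:
        package.extractall(unpacked)
        return package.namelist()


if __name__ == "__main__":
    # "hello.odt" is a placeholder for the document you saved earlier.
    if Path("hello.odt").exists():
        for name in extract_odf("hello.odt", "hello-work"):
            print(name)
```

The extracted files end up in hello-work/unpacked, next to the working copy of the document.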

When you extract the compressed file, the contents should be roughly as follows:

  • Configurations2 (directory)
  • META-INF (directory)
  • Thumbnails (directory)
  • content.xml
  • meta.xml
  • mimetype
  • settings.xml
  • styles.xml

The directories don't contain anything that you're likely to need to edit. In fact, the only files that I'll bother with for the moment are described in Table 1.


Table 1. ODF text document files
File | Description
content.xml | Contains all of the document text, as well as index markers, links to style information, and more. It is the bulk of the document.
meta.xml | Contains file metadata, such as the author and document title.
styles.xml | Defines formatting on text, such as changes in font, paragraph direction, page style, and so on. If you're familiar with Web design, styles.xml fills the place of your Cascading Style Sheets (CSS) style sheet. The ODF keeps style as separate from content as possible, so you won't find descriptions of any of this information mixed in with content.xml. You'll only find links from the content to the style.
Thumbnails/thumbnail.png | Provides a quick thumbnail image of the first page of the document.

Understanding the XML

As you would expect for any reasonably well-designed XML format, the document files aren't particularly difficult to understand. The XML tags generally have self-explanatory names, so you can largely guess what they mean. (Of course, you can always read the ODF documentation if you don't want to guess. See Resources for more information.) Table 2 provides a few examples.


Table 2. XML tags
Tag | Description
<office:document-content> | The root tag. Notice all the XML namespaces (including office) are defined in this tag.
<office:font-face-decls> | Contains a list of all of the fonts used in the document.
<office:automatic-styles> | Contains a list of very basic styles. The styles.xml file elaborates on these.
<office:body> | Contains the document itself.
<text:p> | Mirrors the <p> tag in HTML, surrounding entire paragraphs.
<text:span> | Mirrors the <span> tag in HTML, allowing you to assign styles to specific subsections of paragraphs.
<text:alphabetical-index-mark>, <text:alphabetical-index-mark-start>, and <text:alphabetical-index-mark-end> | Identify index entries.
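
Because the prefixes in Table 2 are ordinary XML namespaces, any namespace-aware parser can read content.xml. Here is a minimal Python sketch that lists the text of each paragraph. The namespace URIs below are the ones I expect OpenOffice.org to declare on the root tag; this article doesn't list them, so check them against your own file. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET
import zipfile

# Namespace URIs as declared on <office:document-content>; verify them
# against the root tag of your own content.xml.
NS = {
    "office": "urn:oasis:names:tc:opendocument:xmlns:office:1.0",
    "text": "urn:oasis:names:tc:opendocument:xmlns:text:1.0",
}


def paragraphs(content_xml):
    """Return the plain text of every <text:p> inside <office:body>."""
    root = ET.fromstring(content_xml)
    body = root.find("office:body", NS)
    if body is None:
        return []
    # itertext() also collects text nested in <text:span> elements.
    return ["".join(p.itertext()) for p in body.iterfind(".//text:p", NS)]


def read_content(odt_path):
    """Read content.xml straight out of the compressed document."""
    with zipfile.ZipFile(odt_path) as package:
        return package.read("content.xml")
```

For the sample document, paragraphs(read_content("hello.odt")) should return a list that includes Hello world! (hello.odt being a placeholder file name).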

Making a simple change

If you haven't already, extract content.xml from the compressed file and open it in your favorite text editor. This simple experiment will give you a feel for editing. You should see a line like this toward the end of the document:

<text:p text:style-name="Standard">Hello world!</text:p>

Change it to the following:

<text:p text:style-name="Standard">Goodbye cruel world!</text:p>

Go ahead and save the modified file. After you save it, update the compressed file with the new copy and rename it with the ODT extension. Now reopen the file with OpenOffice.org. You should see a file identical to your original with the exception of the modified text.
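
The same edit can be made without a text editor. The following Python sketch copies every file in the package to a new document and changes only content.xml on the way through. The helper name rewrite_odt and the file names in the usage note are placeholders. Both strings are treated as raw XML text, so any &, <, or > in them must already be escaped.

```python
import zipfile


def rewrite_odt(source, target, old_text, new_text):
    """Copy source to target, replacing old_text in content.xml.

    Returns the number of replacements made.
    """
    count = 0
    with zipfile.ZipFile(source) as src, zipfile.ZipFile(target, "w") as dst:
        # infolist() keeps the original member order.
        for info in src.infolist():
            data = src.read(info)
            if info.filename == "content.xml":
                text = data.decode("utf-8")
                count = text.count(old_text)
                data = text.replace(old_text, new_text).encode("utf-8")
            # Passing the original ZipInfo reuses each member's name
            # and compression method.
            dst.writestr(info, data)
    return count
```

For example, rewrite_odt("hello.odt", "goodbye.odt", "Hello world!", "Goodbye cruel world!") writes the modified document to goodbye.odt and leaves hello.odt untouched.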

If everything worked okay, feel free to skip to the next section. However, if the document didn't open correctly, there are three quick things you can check for:

1. Make sure you didn't corrupt the XML.

If you erased a closing tag, typed a tag name incorrectly, or perhaps overtyped a symbol, such as a less than sign (<), you might run into this problem. This is probably the most common issue when you make edits, and it's a good reason to keep your original file safe and secure.

2. Make sure you saved the XML as plain text (that is, 8-bit Unicode Transformation Format [UTF-8]).

If you use a simple text editor, this shouldn't be an issue. However, if you choose to use OpenOffice.org itself (or another rich-text editor, such as Microsoft WordPad, Microsoft Office Word, or WordPerfect), make sure that you didn't save in a document format that preserves formatting. If this is the issue, you can usually correct it by using Save as type > Text document when you save your XML file.

3. Make sure you didn't alter the compression.

I don't know if this is still an issue with the latest versions of OpenOffice.org, but occasionally I've had problems where I recompressed a file at a different compression level than the original, and OpenOffice.org wasn't quite sure what to make of it. If this problem still exists, you can probably avoid it by updating the existing compressed file rather than creating a new compressed file of the same documents.
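
The three checks above can be roughed out in Python. This is a sketch under assumptions: check_package is a hypothetical helper, and the third check reflects my reading of the OpenDocument packaging rules (the mimetype file comes first in the archive and is stored uncompressed), which is one plausible cause of the compression trouble described above but not one this article confirms.

```python
import xml.etree.ElementTree as ET
import zipfile


def check_package(odt_path):
    """Return a list of problems found in an edited ODF document."""
    problems = []
    with zipfile.ZipFile(odt_path) as package:
        members = package.infolist()
        raw = package.read("content.xml")

    # Check 2: the file must decode as UTF-8 text.
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as error:
        problems.append(f"content.xml is not UTF-8: {error}")

    # Check 1: the XML must still be well formed.
    try:
        ET.fromstring(raw)
    except ET.ParseError as error:
        problems.append(f"content.xml is not well-formed XML: {error}")

    # Check 3 (assumption): "mimetype" should be the first member of
    # the archive and should be stored without compression.
    first = members[0]
    if first.filename != "mimetype":
        problems.append("mimetype is not the first member of the archive")
    elif first.compress_type != zipfile.ZIP_STORED:
        problems.append("mimetype is compressed instead of stored")
    return problems
```

An empty list means none of these three problems was detected; it doesn't prove that OpenOffice.org will accept the file.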

Removing index tags

Now take a look at the project that I mentioned earlier: removing unnecessary index tags. Listing 1 is a sample content.xml file with several index tags in place.


Listing 1. Content.xml using index tags
<text:p text:style-name="P3">A lot of people would think that
you&apos;re out of your<text:alphabetical-index-mark
text:string-value="gourd" text:key1="mind" text:key2="noggin"/>
mind if you suggest such a thing. Some would say that the ubiquitous
<text:alphabetical-index-mark-start text:id="IMark100896128"/>Microsoft
<text:alphabetical-index-mark-end text:id="IMark100896128"/>
<text:alphabetical-index-mark-start text:id="IMark101662028"/>Office
<text:alphabetical-index-mark-end text:id="IMark101662028"/> is one of
the greatest examples of &quot;feature creep&quot; of any program,
with competitors like Corel and Open<text:alphabetical-index-mark-start
text:id="IMark101661388"/> Office<text:alphabetical-index-mark-end
text:id="IMark101661388"/>.org containing a comparable number of features.

Notice the two different types of index tags here. When OpenOffice.org uses a word in the document as the index entry itself, it surrounds the word with the <text:alphabetical-index-mark-start> and <text:alphabetical-index-mark-end> tags. The two tags are assigned the same text:id attribute to identify that they match each other.

The other type of tag is used when the word appearing in the index is not the same word that appears in the text. It might also appear if multiple keywords are associated with a specific location. In this case, gourd has been entered that way. You can see that it is contained within the <text:alphabetical-index-mark> tag (with no -start or -end).

This makes for a simple enough way to get rid of all of the indexes. A search-and-replace action using wildcards is all you need to correct the issue. You can use many different tools for this, but I think it's only appropriate to fix the problem with OpenOffice.org's regular expression engine:

  1. Open content.xml in OpenOffice.org.
  2. Open the Find & Replace dialog box and open the More Options section.
  3. Check the regular expressions box.
  4. Enter <text:alphabetical-index-mark(-(start|end))? [^>]*/> in the search box. Leave the Replace with section blank.
  5. Click Replace all.
  6. Save and close the document, keeping it in the original text format.

That's it! After the file is replaced in the ODT file, you'll notice that all of the index references were removed from the document. This allows you to recreate the index using OpenOffice.org's tools without duplicating tags or including tags that you no longer want.

It's worth taking a moment to look at the search pattern that was used above. <text:alphabetical-index-mark(-(start|end))? [^>]*/> is a pattern that effectively means any text:alphabetical-index-mark, text:alphabetical-index-mark-start, or text:alphabetical-index-mark-end tag and all of its contents. You can also write this more simply as <text:alphabetical-index-mark[^>]*/> (with no space after mark), but then you risk catching additional, similarly named tags, if there are any.

Notice that this method is not foolproof. For example, if the text of an index marker contains the greater than sign (>), the pattern stops at that character, fails to find the closing />, and leaves the tag in place. Potential issues like this are a good reason to hold on to that backup from earlier. If problems like this occur, performing a one-by-one search and replace might still be the fastest method to remove these tags. Alternatively, you can write a more complex regular expression to deal with the problem.
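
If you would rather script the replacement, the same pattern works in most regular expression engines. Here is a minimal Python sketch; the function names are hypothetical, and the only change to the pattern is \s in place of the literal space so that a tag wrapped across lines (as in Listing 1) still matches. The second function counts any index tags that survived, which is a quick way to spot the greater than sign problem described above.

```python
import re

# The article's pattern, with \s instead of a literal space so that a
# tag broken across lines is still matched.
INDEX_MARK = re.compile(
    r"<text:alphabetical-index-mark(-(start|end))?\s[^>]*/>"
)


def strip_index_marks(content_xml):
    """Remove every alphabetical index mark; return (new_xml, count)."""
    return INDEX_MARK.subn("", content_xml)


def leftover_marks(content_xml):
    """Count index-mark tags that the pattern failed to remove."""
    return content_xml.count("<text:alphabetical-index-mark")
```

Run strip_index_marks on the text of content.xml, and then call leftover_marks on the result. Anything other than zero means some tags need to be removed by hand or with a more careful pattern.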

Pivoting spreadsheet data

Here's another example of a common task. You have a row of spreadsheet data that you need to turn on its side. For example, you need to turn this:

1    2    3    4    5    6    7    8    9    10

into this:

1
2
3
4
5
6
7
8
9
10

If you edit the XML, you can do this to hundreds of cells in a row with one simple find-and-replace function. Extract the files as you did earlier, and look at content.xml. It should contain text something like that shown in Listing 2 (I reformatted a bit for readability):


Listing 2. Content.xml from spreadsheet
<office:body>
   <office:spreadsheet>
      <table:table table:name="Sheet1" table:style-name="ta1" table:print="false">
         <table:table-column table:style-name="co1"
            table:number-columns-repeated="10"
            table:default-cell-style-name="Default"/>
         <table:table-row table:style-name="ro1">
            <table:table-cell office:value-type="float" office:value="1">
               <text:p>1</text:p>
            </table:table-cell>
            <table:table-cell office:value-type="float" office:value="2">
               <text:p>2</text:p>
            </table:table-cell>
            <table:table-cell office:value-type="float" office:value="3">
               <text:p>3</text:p>
            </table:table-cell>

The easiest way to pivot your data is to simply end each row before describing the next cell. Thus, search for </table:table-cell> in your text editor and replace it with </table:table-cell></table:table-row><table:table-row>. This terminates each row and starts a new one after every cell.
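
The replacement described above is a plain string substitution, so it is easy to sketch in Python. The function name pivot_row is hypothetical. The second replace is my own tidy-up step, not part of the method above: the blanket substitution leaves an empty row after the last cell, and this step drops it. The sketch assumes there is no white space between the tags, which is how the un-reformatted content.xml is normally written.

```python
CELL_END = "</table:table-cell>"
ROW_START = "<table:table-row>"
ROW_END = "</table:table-row>"


def pivot_row(content_xml):
    """Turn a single spreadsheet row into one row per cell."""
    # The article's replacement: close the row after every cell and
    # open a new one.
    pivoted = content_xml.replace(CELL_END, CELL_END + ROW_END + ROW_START)
    # Extra tidy-up (an addition to the article's method): remove the
    # empty row left behind after the last cell.
    return pivoted.replace(ROW_START + ROW_END, "")
```

Only the first row keeps its table:style-name attribute; the rows that the substitution creates have none.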

Benefits of XML documents

This article provided simple examples of what you can accomplish if you directly edit ODF documents. Simple text editing can often provide you with features that your office suite might lack. Beyond this, if you know how to manipulate XML from a programming language, you can accomplish almost anything, from massive automated changes to creating a new document. You can even use an existing file as a template and use simple shell scripts and tools to generate fancy reports using ODF text files, spreadsheets, and so on. The possibilities are limited only by your imagination. Happy editing!


Resources

Get products and technologies

  • OpenOffice.org: Download the product and find all sorts of information about it.

  • IBM trial software: Build your next development project with trial software available for download directly from developerWorks.
