
Server clinic: RTF on the server

Automate document handling with low-cost server processes

Cameron Laird (claird@phaseit.net), Vice president, Phaseit, Inc.
Cameron is a full-time consultant for Phaseit, Inc., who writes and speaks frequently on Open Source and other technical topics.

Summary:  Office workers habitually exchange documents encoded in Microsoft Word .DOC format. An abundance of open-source tools makes it feasible to automate management of their content.

Date:  19 Dec 2002
Level:  Introductory

"PDF for the server" was one of the more popular columns in this series. More precisely, it's the one that inspired the most e-mail in response. Several readers asked that Server clinic treat Microsoft Word documents the same way: describe how to manage them programmatically.

It's important to do so. Few office workers get the point of automation, despite the large investments Microsoft and others have made in scripting, "active documents," and related technologies. "Civilians" are largely habituated to routines of typing in data that come from computer print-outs. I see plenty of workplaces where it's unusual even to question such practices.

On the other side, many of the system programmers with the expertise to help end-users integrate complex work-flows don't regard Microsoft Word formats as feasible targets for server-side programming. Commercial document management packages are available, but only at prices of $20,000 and up.

In fact, there's plenty you can do with Word documents on a Linux or other UNIX server, at a modest cost. Consider the possibilities:

Simplest first

First, for quick human readability, rough word counts, and so on, it's often enough to scan a .DOC document with strings. A command such as

strings something.doc | wc -w

generally returns a word count within 10% of the correct one.
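For use inside a larger script, the same pipeline can be wrapped in a function. What follows is a minimal sketch rather than a finished tool: the name approx_words is made up for illustration, and the fallback branch (for hosts without strings) only approximates what strings does.

```shell
#!/bin/sh
# approx_words FILE: print a rough word count for a binary document.
# The function name is hypothetical; the result is only an estimate.
approx_words() {
    if [ ! -r "$1" ]; then
        echo "approx_words: cannot read $1" >&2
        return 1
    fi
    if command -v strings >/dev/null 2>&1; then
        # strings prints runs of at least four printable characters
        strings "$1" | wc -w | tr -d ' '
    else
        # Fallback (an assumption, not part of the column): split on
        # non-printable bytes and keep runs of four or more characters
        LC_ALL=C tr -c '[:print:]' '\n' < "$1" | awk 'length >= 4' | wc -w | tr -d ' '
    fi
}
```

A call such as approx_words something.doc prints a single number, which makes the estimate easy to capture in a variable or write to a log.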

It's surprisingly difficult to improve on that crude approach. The heart of the problem is that .DOC, as a format, has changed quite a bit through the years. It's hard to track.

The related .RTF has several advantages: it's encoded in ASCII and is almost human-readable, and it's far less likely to carry a working infection. Moreover, it appears to have been less volatile through the years; a reader from 1997 will probably be able to digest a .RTF written this year, and vice-versa. On some of the networks I manage, I restrict traffic to exclude .DOC in favor of .RTF as prophylaxis against malicious code. In principle, this deprives users of certain word-processing features only available with .DOC. As a practical matter, I have never had a user who actually relied on an effect that can't be achieved with .RTF.
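A filter of this kind is more robust when it looks at content rather than file names, since a .DOC can simply be renamed. The sketch below classifies a file by its leading bytes: RTF files begin with the characters {\rtf, and binary Word documents are stored in the OLE2 compound-file container, which begins with the bytes D0 CF 11 E0 A1 B1 1A E1. The function name doc_kind and its output labels are invented for this example.

```shell
#!/bin/sh
# doc_kind FILE: classify a file by its first bytes, not its extension.
# Prints "rtf", "ole2" (the container used by binary .DOC), or "other".
# The function name and labels are hypothetical.
doc_kind() {
    # First eight bytes as one lowercase hex string, e.g. 7b5c727466...
    sig=$(od -An -tx1 -N8 "$1" | tr -d ' \n')
    case "$sig" in
        7b5c727466*)      echo rtf ;;    # the characters {\rtf
        d0cf11e0a1b11ae1) echo ole2 ;;   # OLE2 compound-file signature
        *)                echo other ;;
    esac
}
```

Note that "ole2" does not prove a file is a Word document, because other Office formats share the same container; for a policy of "RTF only" that distinction usually doesn't matter.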

The Resources section, below, lists several lightweight Word readers: wvWare, catdoc, and so on. These are generally quick and easy both to install and to use. Most UNIX desktop users now know that OpenOffice-on-UNIX is entirely viable as a replacement for common uses of Windows Word, and is quite adept at both reading and writing .DOC documents. OpenOffice exposes scriptable interfaces which make it feasible to program document content in Java, C++, Python, OpenOffice.org Basic, StarScript, CORBA, or OLE Automation. OpenOffice also integrates macro recording with this technology. Essentially the same is true of the StarOffice(TM) commercially-licensed product.

In fact, while StarOffice is formally distinct from OpenOffice, this column focuses entirely on the latter, as "[f]uture versions of StarOffice software, beginning with 6.0, will be built using the OpenOffice.org source, APIs [application programming interfaces], file formats, and reference implementation", according to the latter's Web site (see Resources). In future OpenOffice implementations, "UNO (Universal Network Objects) is the interface-based component model."

OpenOffice is quite a "heavy" way to work with Word documents, though. It requires at least a graphical user interface (GUI) service, often a rather delicate installation, and multiple programmed processes. XML-oriented "formatting objects" (FO) is much the same: while powerful, it demands quite a bit of machinery be in place before it begins to work. If you're out to do the simple kinds of operations I most often encounter -- generate an .RTF invoice of a fixed format, "scrape" an incoming weekly status report, customize a Web download with reader-specific information, that sort of thing -- you should look into the direct language bindings of .RTF libraries. Best among these is Robert Rothenberg's Perl API.
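Because .RTF is plain ASCII, the very simplest "scraping" jobs can sometimes be handled with standard text tools before you reach for a library. The following sketch pulls the text after a label such as "Hours:" out of an RTF file. It assumes the label and its value sit together on one line with no control words between them, which is often not true of files saved by a word processor; the helper name rtf_field is hypothetical.

```shell
#!/bin/sh
# rtf_field FILE LABEL: print the text that follows "LABEL:" in an RTF
# file, up to the next control word or brace. Hypothetical helper; LABEL
# is assumed to contain no slashes or regular-expression metacharacters.
rtf_field() {
    sed -n 's/.*'"$2"': *\([^\\{}]*\).*/\1/p' "$1" | head -n 1
}
```

With a report containing "Hours: 40\par", the call rtf_field report.rtf Hours would print the value alone. Anything more structured than that is a job for a real RTF parser.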


RTF::Document

For the simplest .RTF generations, simple cutting and pasting is enough. You can parameterize production of a form like Figure 1 with a shell script.


Listing 1. Source code for invoice.sh (partial)
				   
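    #!/bin/sh

    # Values to merge into the form (hard-coded for this example)
    AMOUNT="1234.56"
    DATE="06 October 2002"
    NUMBER="9999/3333"
    PO="6543"

    # The RTF form itself. Inside double quotes the shell expands
    # $DATE, $NUMBER, and $AMOUNT; \$ yields a literal dollar sign.
    FORM="{\rtf1\ansi\deff0\deftab720{\fonttbl...
    \par \pard\plain\f3\fs20 
    \par \pard\qr\plain\f2\fs24\cf0 $DATE
    \par \pard\plain\f2\fs24\cf0 Phaseit, Inc.
    \par #$NUMBER
    \par 
    \par Please pay \$$AMOUNT to
    ...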
    


Figure 1. Screen shot of simple Word document generated on Linux server
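Listing 1 is deliberately partial. For readers who want something they can run end to end, here is a much smaller, self-contained sketch of the same cut-and-paste idea. It is not the invoice form shown in the figure: the layout, the single-font table, and the function names rtf_escape and make_invoice are all assumptions made for illustration. The one detail worth copying is the escaping step, since backslashes and braces in the data would otherwise be read as RTF markup.

```shell
#!/bin/sh
# rtf_escape: read text on stdin and escape the three characters RTF
# treats specially (backslash, "{" and "}"). Hypothetical helper.
rtf_escape() {
    sed -e 's/\\/\\\\/g' -e 's/{/\\{/g' -e 's/}/\\}/g'
}

# make_invoice DATE NUMBER AMOUNT: write a complete, minimal RTF
# document to stdout. The layout is illustrative only.
make_invoice() {
    inv_date=$(printf '%s' "$1" | rtf_escape)
    inv_number=$(printf '%s' "$2" | rtf_escape)
    inv_amount=$(printf '%s' "$3" | rtf_escape)
    # In an unquoted here-document a backslash stays literal unless it
    # precedes $, a backquote, or another backslash; \$ gives a plain $.
    cat <<EOF
{\rtf1\ansi\deff0{\fonttbl{\f0 Times New Roman;}}
\pard\qr\f0\fs24 $inv_date\par
\pard Invoice #$inv_number\par
Please pay \$$inv_amount\par
}
EOF
}
```

A call such as make_invoice "06 October 2002" "9999/3333" "1234.56" > invoice.rtf should produce a file that Word or OpenOffice opens as a three-line document.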

For more structured, scalable, and maintainable programming, use Perl's RTF modules. These make it possible to write code as in Listing 2.


Listing 2. Source code for invoice.pl (partial)
				   
     use RTF::Document;

     # Document properties
     $rtf = new RTF::Document({
         doc_page_width => '8.5in',
         doc_page_height => '11in'
     });

     # Font definitions
     $fCourier = $rtf->add_font ("Courier",
         { family=>'monospace', pitch=>'fixed',
           alternates=>["Courier New", "American Typewriter"]
         }
     );
     $fTimes = $rtf->add_font ("Times New Roman",
         { default => 1
         }
     );

     $rtf->add_text( $rtf->root(), "Invoice", ...
    

With this approach, of course, I have all the power and productivity of Perl immediately at hand to tap into external data sources, transform content, and so on.


Summary

Don't count on problems to fix themselves. Part of your responsibility as a server-side developer is to be on the prowl for frictions in the operations around you. If there's a report that frequently gets lost, or miscoded, one approach is to exhort employees to put in longer hours, or to be more careful. Sometimes that works. With automation tools, though, you can systematically engineer effective processes.

Your automation can do even more than just reduce error. When you automate content generation or processing, you open up new possibilities for customization and qualitatively better service. Pick the resources below that best fit your situation, use them to solve problems already consuming time in your organization, and move on to more interesting and rewarding challenges.


Resources

  • Check out the other installments of Server clinic.

  • "Rich Text Format (RTF) Specification, version 1.6" is a document Microsoft first published in 1999.

  • Antiword is a free MS Word reader for Linux and RISC operating systems. Some commercial UNIX distributions also include a proprietary reader of common Windows file formats. As these are not available for Linux, this column doesn't describe them further.

  • The catdoc Word reader is compact and straightforward.

  • wvWare is a library for converting Word documents.

  • Read Cameron's article "PDF for the server" (developerWorks, September 2002).

  • The CPAN RTF directory includes Perl modules both to parse and generate RTF documents. While author Robert Rothenberg labels them "experimental" and "alpha", they can be quite useful even in production situations.

  • Docserver is a Perl-coded application that renders .DOC and related formats into more standard text formats. It depends on a (licensed) installation of Microsoft Office running on a Windows host accessible through the network.

  • The Open Office home page and Star Office home page lead to plenty of information about working with .DOC and .RTF.

  • The UNO Development Kit Project describes the OpenOffice approach to scripting. For more details, see the UNO technical documents.

  • NuxDocument is a "Zope product" that converts from Microsoft Word and other formats into HTML and plain text. Zope is the popular Python-based content management and application server.

  • Windward Reports is a commercial Java-coded product that includes .RTF->{XML,TXT,HTML,...} functionality. Windward is not FO-based (xsl:fo), although it has a similar external appearance.

  • HtmlToHlp is a Java-coded conversion tool that translates HTML files to RTF.

  • "Using XSL-FO to create printable documents" introduces FO and touches on its potential to work with RTF (developerWorks, November 2001).

  • Find more resources for Linux developers in the developerWorks Linux zone.
