"PDF for the server" was one of the more popular columns in this series. More precisely, it's the one that inspired the most e-mail in response. Several readers asked that Server clinic treat Microsoft Word documents the same way: describe how to manage them programmatically.
It's important to do so. Few office workers get the point of automation, despite the large investments Microsoft and others have made in scripting, "active documents," and related technologies. "Civilians" are largely habituated to routines of typing in data that come from computer print-outs. I see plenty of workplaces where it's unusual even to question such practices.
On the other side, many of the system programmers with the expertise to help end-users integrate complex work-flows don't regard Microsoft Word formats as feasible targets for server-side programming. Commercial document management packages are available, but only at prices of $20,000 and up.
In fact, there's plenty you can do with Word documents on a Linux or other UNIX server, at a modest cost. Consider the possibilities:
First, for quick human readability, rough word counts, and so on, it's
often enough to scan a .DOC document with
strings. A command such as
strings something.doc | wc -w
generally returns a word count within 10% of the correct one.
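As a rough illustration, the same pipeline can be wrapped in a small shell function and run over several files at once. The name doc_word_count and the tab-separated output are illustrative choices, not part of any standard tool, and the sketch assumes strings (usually supplied by GNU binutils on Linux) is installed.

```shell
#!/bin/sh
# Approximate word count for one .DOC file, using the strings | wc -w
# pipeline described above. doc_word_count is a hypothetical helper name.
doc_word_count() {
    # $(( ... )) strips the padding some wc implementations add
    echo $(( $(strings "$1" | wc -w) ))
}

# Print "count<TAB>filename" for every file named on the command line.
for f in "$@"; do
    printf '%s\t%s\n' "$(doc_word_count "$f")" "$f"
done
```

Saved as, say, doccount.sh, it could be invoked as sh doccount.sh *.doc; the counts carry the same rough margin of error as the one-line command.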
It's surprisingly difficult to improve on that crude approach. The
heart of the problem is that .DOC, as a format,
has changed quite a bit through the years. It's hard to track.
The related .RTF has several advantages:
it's encoded in ASCII and is almost human-readable, and it's far less
likely to bear effective infections. Moreover, it appears to have been
less volatile through the years; a reader from 1997 will probably be able
to digest a .RTF written this year, and
vice-versa. On some of the networks I manage, I restrict traffic to
exclude .DOC in favor of .RTF as prophylaxis against malicious code. In
principle, this deprives users of certain word-processing features only
available with .DOC. As a practical matter, I
have never had a user who actually uses an effect that can't be
achieved with .RTF.
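A filter of that kind is more dependable when it inspects file contents rather than file extensions. The sketch below is one hedged way to do that in shell, not a description of any particular production filter. It relies on two facts: RTF files begin with the characters {\rtf, and binary .DOC files written by Word 97 and later are OLE2 compound files that begin with the bytes D0 CF 11 E0 A1 B1 1A E1. The function name doc_kind is invented for illustration.

```shell
#!/bin/sh
# Classify a file as "rtf", "ole2" (the container used by binary .DOC
# files), or "other" by its leading bytes. doc_kind is a hypothetical name.
doc_kind() {
    # Hex dump of the first 8 bytes, with spaces and newlines removed
    magic=$(head -c 8 "$1" | od -An -tx1 | tr -d ' \n')
    case "$magic" in
        7b5c727466*)      echo rtf ;;    # "{\rtf" in ASCII
        d0cf11e0a1b11ae1) echo ole2 ;;   # OLE2 compound-file signature
        *)                echo other ;;
    esac
}

# Report the detected kind of every file named on the command line.
for f in "$@"; do
    printf '%s\t%s\n' "$(doc_kind "$f")" "$f"
done
```

Note that other Office formats share the OLE2 container, so "ole2" means "possibly a .DOC", not "certainly a .DOC".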
The Resources section, below, lists several lightweight Word readers:
wvWare, catdoc, and so on. These are generally quick and easy both to
install and use. Most UNIX desktop users now know that OpenOffice-on-UNIX
is entirely viable as a replacement for common uses of Windows Word, and
is quite adept at both reading and writing .DOC
documents. OpenOffice exposes scriptable interfaces that make it feasible
to program document content in Java, C++, Python, OpenOffice.org
Basic, or StarScript, or through CORBA or OLE Automation. OpenOffice also
integrates macro recording with this technology. Essentially the same is
true of the commercially licensed StarOffice(TM) product.
In fact, while StarOffice is formally distinct from OpenOffice, this column focuses entirely on the latter, as "[f]uture versions of StarOffice software, beginning with 6.0, will be built using the OpenOffice.org source, APIs [application programming interfaces], file formats, and reference implementation", according to the latter's Web site (see Resources). In future OpenOffice implementations, "UNO (Universal Network Objects) is the interface-based component model."
OpenOffice is quite a "heavy" way to work with Word documents, though.
It requires at least a graphical user interface (GUI) service, often a
rather delicate installation, and multiple programmed processes.
XML-oriented "formatting objects" (FO) is much the same: while powerful,
it demands quite a bit of machinery be in place before it begins to work.
If you're out to do the simple kinds of operations I most often encounter
-- generate an .RTF invoice of a fixed format,
"scrape" an incoming weekly status report, customize a Web download with
reader-specific information, that sort of thing -- you should look into
the direct language bindings of .RTF libraries.
Best among these is Robert Rothenberg's Perl API.
For the simplest .RTF generation, plain
cutting and pasting is enough. You can parameterize production of a form
like Figure 1 with a shell script.
Listing 1. Source code for invoice.sh (partial)
#!/bin/sh
AMOUNT="1234.56"
DATE="06 October 2002"
NUMBER="9999/3333"
PO="6543"
FORM="{\rtf1\ansi\deff0\deftab720{\fonttbl...
\par \pard\plain\f3\fs20
\par \pard\qr\plain\f2\fs24\cf0 $DATE
\par \pard\plain\f2\fs24\cf0 Phaseit, Inc.
\par #$NUMBER
\par
\par Please pay \$$AMOUNT to
...
Figure 1. Screen shot of simple Word document generated on Linux server
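For a complete, if minimal, picture of the cut-and-paste approach, here is a self-contained sketch. It is not the invoice.sh of Listing 1, whose RTF body is elided; the font table and layout are a pared-down stand-in, and the names rtf_escape and make_invoice are invented for illustration. The one addition beyond plain substitution is escaping of backslashes and braces, which RTF treats as markup.

```shell
#!/bin/sh
# Escape the three characters RTF treats specially: \ { }
rtf_escape() {
    printf '%s' "$1" | sed -e 's/[\\{}]/\\&/g'
}

# Print a minimal, complete RTF invoice on standard output.
# make_invoice and its argument order are hypothetical.
make_invoice() {
    amount=$(rtf_escape "$1")
    date=$(rtf_escape "$2")
    number=$(rtf_escape "$3")
    # In an unquoted here-document the shell leaves \rtf1, \par and so on
    # alone; only \$ (a literal dollar sign) and $name are interpreted.
    cat <<EOF
{\rtf1\ansi\deff0{\fonttbl{\f0 Times New Roman;}}
\pard\qr\f0\fs24 $date\par
\pard Phaseit, Inc.\par
#$number\par
\par
Please pay \$$amount\par
}
EOF
}

make_invoice "1234.56" "06 October 2002" "9999/3333"
```

Redirecting the output to a file with an .rtf extension gives something a word processor can open directly. In real use the three arguments would come from a database query or an order file rather than literals.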

For more structured, scalable, and maintainable programming, use Perl's RTF modules. These make it possible to write code as in Listing 2.
Listing 2. Source code for invoice.pl (partial)
use RTF::Document;
# Create the document with US Letter page dimensions.
$rtf = new RTF::Document({
    doc_page_width => '8.5in',
    doc_page_height => '11in'
});
# Register a fixed-pitch font, with fallbacks.
$fCourier = $rtf->add_font ("Courier",
    { family => 'monospace', pitch => 'fixed',
      alternates => ["Courier New", "American Typewriter"]
    }
);
# Register the document's default font.
$fTimes = $rtf->add_font ("Times New Roman",
    { default => 1
    }
);
$rtf->add_text( $rtf->root(), "Invoice", ...
With this approach, of course, I have all the power and productivity of Perl immediately at hand to tap into external data sources, transform content, and so on.
Don't count on problems to fix themselves. Part of your responsibility as a server-side developer is to be on the prowl for frictions in the operations around you. If there's a report that frequently gets lost, or miscoded, one approach is to exhort employees to put in longer hours, or to be more careful. Sometimes that works. With automation tools, though, you can systematically engineer effective processes.
Your automation can do even more than just reduce error. When you automate content generation or processing, you open up new possibilities for customization and qualitatively better service. Pick the resources below that best fit your situation, use them to solve problems already consuming time in your organization, and move on to more interesting and rewarding challenges.
- Check out the other installments of Server clinic.
- "Rich Text Format (RTF) Specification, version 1.6"
is a document Microsoft first published in 1999.
- Antiword is a
free MS Word reader for Linux and RISC OS. Some commercial
UNIX distributions also include a proprietary reader of common
Windows file formats. As these are not available for Linux,
this column doesn't describe them further.
- The
catdoc Word reader
is compact and straightforward.
-
wvWare is a library for
converting Word documents.
- Read Cameron's article "PDF for the server" (developerWorks, September 2002).
- The CPAN RTF
directory includes Perl modules both to parse and generate RTF documents.
While author Robert Rothenberg labels them "experimental" and "alpha",
they can be quite useful even in production situations.
-
Docserver
is a Perl-coded application that renders
.DOC and related formats into more standard text formats. It depends on a (licensed) installation of Microsoft Office running on a Windows host accessible through the network.
- The Open Office home page
and Star Office home page
lead to plenty of information about working with
.DOC and .RTF.
- The UNO Development Kit Project
describes the OpenOffice approach to scripting. For more details, see the UNO technical documents.
-
NuxDocument
is a "Zope product" that converts from Microsoft Word and other
formats into HTML and plain text.
Zope is the popular
Python-based content management and application server.
-
Windward Reports
is a commercial Java-coded product that includes
.RTF->{XML,TXT,HTML,...} functionality. Windward is not FO-based (xsl:fo), although it has a similar external appearance.
-
HtmlToHlp
is a Java-coded conversion tool that translates HTML files to RTF.
- "Using XSL-FO to create printable documents" introduces FO and touches on its potential to work with RTF (developerWorks, November 2001).
- Find more resources for Linux developers in the developerWorks Linux zone.



