Server clinic: PDF for the server

Automate generation of professional-quality output

Level: Introductory

Cameron Laird (claird@phaseit.net), Vice president, Phaseit, Inc.

17 Sep 2002

PDF is the recognized standard for several categories of top-quality displayable output. While most programmers regard it as a "desktop" technology, a format that a content specialist chooses through a SaveAs operation, you can make your document management processes more powerful through server-side automation of PDF creation. This month, Cameron introduces the ReportLab library for PDF management and programming.

You know PDF. When someone in marketing wants a brochure that looks "just so," or legal needs a document that shouldn't be changed, they publish it as Portable Document Format (PDF). PDF is a standard defined by Adobe Systems for platform-independent, device-independent rendering and display of documents. PDF builds on the fantastic success of Adobe's PostScript (PS), first released in 1984 to improve the printing sophistication possible with common hardware. In principle, PDF has a fixed appearance, invariant across different Web browsers and different devices including printers; the content of PDF documents is "locked down."

While neither of these propositions is strictly true, they're close enough for most purposes. Moreover, PDF generally prints well; only a plain text document is more likely to be compatible with any particular printer.

What does that have to do with you? As a systems or server-side programmer, perhaps you think of PDF as just another opaque content type. Your desktop users or document specialists occasionally update instances on your servers, and you serve up the files just as you would any others. That, you say, should be the limit of your involvement.

Programmatic PDF generation

That model misses out on several interesting server-side possibilities for processing PDF, though. When you automate generation of PDF, you can begin to use all the techniques of software engineering: version control, abstraction, professional-quality backups, regression tests, and so on. Programmatic PDF generation means you can customize deliverables in a manageable way. Perhaps your organization's habit with PDF is to have someone adept with a particular desktop word processor set up a "mail merge" sort of operation to parameterize document output. Automation can reach far deeper, though.

Desktop software vendors have a partial appreciation of this. Several word-processing or desktop-publishing packages have scripting capabilities that reach at least part of the way to PDF. Some shops create PostScript images and transform them into PDF with Ghostscript or similar packages.

My favorite way to automate PDF generation, though, is with one of three actively maintained open source libraries: ReportLab, PJ, and PDFlib. They're all roughly comparable, and I've had medium to good success on projects that relied on each. Pointers to all three, along with several other tools, appear in Resources, below.

Among these, ReportLab is the one I currently use most: it handles the multi-megabyte PDFs with which I work, its use of Python as a scripting language suits me, its library includes all the functionality I need for daily work, and the ReportLab company behind the library appears to run a sustainable business. Moreover, its convenient integration into the Python interactive shell makes for a delightfully productive development environment. The rest of this month's "Server clinic" illustrates how you can start to program PDF.





PDF's "Hello, world"

While you probably have a good Python installation on your servers already, Python.org's download page can help ensure that you're current. Version 2.2.1 is a good choice.

With Python installed, you need to visit the ReportLab Download page before you begin your PDF programming career. Even over slow connections, downloading and installing both Python and the ReportLab Toolkit takes well under an hour (see Resources for links to both downloads).

The source code for your first application can be as simple as this:


Source code for a "Hello, world" page
		
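      from reportlab.pdfgen import canvas
      from reportlab.lib.units import inch

      # Everything about the page is an ordinary variable.
      font = "Helvetica"
      font_size = 26
      text = "Hello, world"
      # Coordinates are in points, measured from the lower-left corner of the page.
      x = 5.0 * inch
      y = 8.0 * inch
      destination_file = "/tmp/first.pdf"

      my_canvas = canvas.Canvas(destination_file)
      my_canvas.setFont(font, font_size)
      # drawRightString right-aligns the text so that it ends at x.
      my_canvas.drawRightString(x, y, text)
      # save() finishes the page and writes the file.
      my_canvas.save()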

This code simply puts a headline on an otherwise blank piece of paper. While mundane, it hints at new horizons: font style and size, content, and formatting are all programmable. When your organization decides to publish in Times New Roman rather than Helvetica, you can, in principle, change one configuration assignment and regenerate everything, rather than having to open each of thousands of documents, alter them, and write them back out. The same is true for other effects: if you want to enlarge the type on information targeted to older readers, for instance, your application can automate that.

Don't think you have to develop your own word processor to accomplish anything meaningful, though. While the ReportLab library is broad and deep enough to allow that, it also supports a couple of specific shortcuts that enormously simplify my PDF programming. First is the import_HTML method. This renders valid HTML source into PDF pages. For many applications, I find it convenient to prototype in HTML, get "stakeholder sign-off" for a sample document, parameterize the HTML generation, then complete an implementation with:

my_document.import_HTML(my_html_source)

This gives me a very fast, easily maintained, fully programmatic way to pour content into PDF. ReportLab's processing efficiency is so good that I can comfortably generate all kinds of PDF documents for Web display on the fly. This gives me the opportunity to keep critical financial or engineering reports fully current with the latest data while preserving an appropriate visual appearance. Print documents enjoy the same choices for customization, of course.
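The "parameterize the HTML generation" step can be sketched with nothing more than the Python standard library. The template text, the field names, and the build_report_html function are all assumptions made up for this illustration. The sketch shows only one plausible way to produce the string that is then handed to the rendering call shown above.

```python
from html import escape
from string import Template

# Hypothetical template, standing in for the HTML the stakeholders signed off on.
REPORT_TEMPLATE = Template(
    "<html><body>\n"
    "<h1>$title</h1>\n"
    "<p>Prepared for $customer</p>\n"
    "</body></html>"
)


def build_report_html(title, customer):
    """Fill the approved template, escaping values so the HTML stays valid."""
    return REPORT_TEMPLATE.substitute(
        title=escape(title),
        customer=escape(customer),
    )


my_html_source = build_report_html("Q3 results", "Smith & Sons")
```

Escaping each value before substitution matters here because the renderer is described as expecting valid HTML, and a stray ampersand or angle bracket in the data would otherwise break it.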





Putting together PDF pieces

A second crucial library function is copyPages. It appends an existing PDF document to a Canvas instance. copyPages makes it easy to construct a PDF document as a concatenation of several pieces.

For more sophisticated effects, ReportLab, like other PDF tool vendors, licenses a for-fee product. In ReportLab's case, its PageCatcher product annotates existing PDF documents, reorders their pages, reformats them for different printing methods, adds backgrounds (including watermarks), and fills in PDF forms. ReportLab documents several interesting uses for PageCatcher. One example is programmatic preparation of completed Internal Revenue Service (IRS) forms.

A final ReportLab capability I've found important is its management of tables of contents. Online document readers appreciate these navigational aids, which Adobe calls "bookmarks" or "outlines." Most PDF viewers show them as a menu in a left-hand pane. The ReportLab Reference itself is a nice example of a bookmarked document. ReportLab functions such as copyPages include an option either to import an existing outline properly into a larger document or to discard it.





Conclusion

Whenever a computing job seems tedious or error-prone -- updating documents "by hand," for example -- you should be on the lookout for a way to automate the process. If you have questions about how far this attitude can take you, refer back to my review of the Limoncelli and Hogan book ("Server clinic", May 2002). Those authors even constructed the drafts for their book as the output of make processes. Although many systems programmers don't seem to realize it, management of PDF documents presents rich opportunities for automation and abstraction. Use the ReportLab libraries or other available PDF-savvy tools to teach your server to do your PDF work. That should free your time for more productive pursuits.

Future installments of "Server clinic" are likely to touch on other underappreciated fields for server-side automation, including generation of Excel and Word documents.

Disclaimer: I'm on cordial personal terms with the employees of several companies that specialize in PDF-related products. However, I've never had a financial interest in any of the companies, nor any contractual relationship other than as an ordinary customer.



Resources

  • Participate in the discussion forum.

  • Check out the other installments of Server clinic.

  • You can think of PDF as a variation on PostScript. "The history of PostScript" gives the chronological perspective on their relationship.

  • "The history of PDF" explains PDF's development through a decade of new technical and commercial demands.

  • AFPL Ghostscript is an enormously popular software interpreter for PostScript and PDF, along with related tools. It is most commonly known as a no-fee viewer for displaying PostScript images online.

  • GSview is a part of the Ghostscript family that views and edits PDF files.

  • ps2pdf is an online service that converts PostScript sources to PDF. It uses Ghostscript behind the scenes.

  • PDFlib is a PDF-aware library with bindings to several development languages, including C, C++, C#, Java, Perl, PHP, Python, RPG, Tcl, and more.

  • ReportLab's several document-related products include the ReportLab Toolkit for PDF manipulation. Note, in particular, the ReportLab ToolKit Reference Manual.

  • Download Python 2.2.1.

  • My personal page of PDF comments is where I keep new notes and references on the subject of this month's column.

  • Read Cameron's previous columns on developerWorks:

  • If you now need to do something with all those PDFs you've generated, IBM Content Manager provides an infrastructure for integrating digital content with various applications.

  • Find more Linux articles in the developerWorks Linux zone.


About the author

Cameron is a full-time consultant for Phaseit, Inc., who writes and speaks frequently on open source and other technical topics. You can contact Cameron at claird@phaseit.net.



