The Internet has enabled companies to do business in an international marketplace. This makes it imperative to have Web-enabled ways of delivering international content to customers. Portable Document Format (PDF) is a popular format for delivering content on the Web; you can easily download a PDF document using any popular browser and then view it using Adobe Acrobat Reader, or you can use Adobe plug-ins for viewing within a browser. Generating PDF content for an international audience poses challenges, especially since the double-byte nature of languages like Japanese, Chinese, and Korean requires special considerations. A Unicode font is usually a good solution, but this may be platform-specific. Another important consideration is to avoid changing the application business logic just because you desire the content in PDF format, in addition to the usual HTML.
We start by discussing fonts and languages in general, and then we describe an approach for generating PDF documents in Java Web applications using open source technologies.
A font is a collection of character images, called glyphs, and the mappings from character codes to the glyphs. A character is a symbol that represents items like letters and numbers in a particular writing system. When a particular character is rendered, the shape representing this character is called a glyph. A popular font used in computer systems is the Unicode TrueType font. However, the TrueType font alone is not enough to render data in double-byte languages. The TrueType font is the best choice for rendering a wide range of characters, but it is system specific:
- In the Windows operating system, Microsoft provides the Arial Unicode MS font, which enables the display of characters in most languages including double-byte languages. This font comes in a single file; by default it is not installed on the user's system. Windows' international support feature may be required in order to use this font.
- Similarly, in IBM's AIX operating system TrueType fonts are not automatically installed if your system is installed and configured using English by default. You must install multiple font files such as AIXwindows Unicode TrueType fonts-CJK for languages like Japanese, Korean, and Chinese (simplified and traditional). If your operating system is Linux, we suggest you do additional research to see how to install TrueType fonts.
In a standard Web application that supports internationalization, it's important to be careful about locale and encoding issues. However, for internationalizing PDF documents, you need to consider an additional font-related issue.
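To make the encoding concern concrete, here is a minimal sketch using only the JDK. It round-trips the Japanese text for "Product Service" through two charsets; the class and method names are hypothetical and are not part of the sample application. Note that "double-byte" is a legacy term: in UTF-8, each of these characters occupies three bytes.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingSketch {

    // Encode the text to bytes and decode it again with the same charset.
    // Characters the charset cannot represent are replaced (typically by '?').
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        // "Product Service" in Japanese, written with Unicode escapes so the
        // source file itself does not depend on the compiler's encoding.
        String jp = "\u88FD\u54C1\u30B5\u30FC\u30D3\u30B9";

        System.out.println("UTF-8 preserved:      "
            + roundTrip(jp, StandardCharsets.UTF_8).equals(jp));
        System.out.println("ISO-8859-1 preserved: "
            + roundTrip(jp, StandardCharsets.ISO_8859_1).equals(jp));
        System.out.println("UTF-8 bytes for one character: "
            + "\u88FD".getBytes(StandardCharsets.UTF_8).length);
    }
}
```

A correct encoding only guarantees that the character codes survive; whether the glyphs can be drawn in the PDF is the separate font issue discussed next.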
In PDF, embedded fonts make documents portable so that they can be viewed on any operating system with Adobe Acrobat Reader. If fonts are not embedded, a localized version of Reader on the user's system picks up the native available fonts for displaying that language. For example, if a system is English-language enabled, Base 14 PostScript fonts are used for substituting fonts on screen and for printing; this covers most single-byte languages, but does not cover any double-byte languages. The other option is to prompt the user to download Font Pack from Adobe for viewing international documents; however it's not a good idea to ask the customer to download additional font utilities. By using embedded fonts in PDF, you need not worry about whether a remote user or machine has the fonts required to display your document. So using embedded Unicode TrueType fonts for rendering internationalized PDF data is a good solution.
Our approach to generating PDF documents in Web applications is based on the Formatting Objects Processor (FOP). FOP is the world's first print formatter driven by XSL Formatting Objects (XSL-FO). It is an open source Java project developed under the Apache XML project.
Figure 1 illustrates the general flow for generating PDF documents using open source FOP. The input data must be XML encoded in UTF-8. The transformer takes the XML data, applies the stylesheet to it, and generates the PDF. To lay out the PDF, the FOP formatter needs to know the details about the fonts to be used in the document, particularly the widths of all the glyphs used. It needs these details to calculate line lengths, hyphenation, justification, and so on. This information is known as the metrics of the font, and is stored with each font. When the metrics are available to the formatter, the FOP formatter can successfully lay out the PDF. Later in this article, we discuss how the metrics file is generated.
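The first half of this flow, applying a stylesheet to UTF-8 XML data to produce XSL-FO, can be sketched with the JAXP transformer that ships with the JDK. This is an illustration under stated assumptions, not the sample application's code: the `product` and `name` elements, the stylesheet, and the class name are hypothetical, and the sketch stops before FOP renders the XSL-FO to PDF. The page markup follows XSL-FO 1.0; whether a particular FOP release accepts it unchanged is not verified here.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class XmlToFoSketch {

    // Hypothetical stylesheet: places the <name> text in a one-block XSL-FO page.
    static final String XSL =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
      + " xmlns:fo='http://www.w3.org/1999/XSL/Format'>"
      + "<xsl:template match='/product'>"
      + "<fo:root>"
      + "<fo:layout-master-set>"
      + "<fo:simple-page-master master-name='page'>"
      + "<fo:region-body/>"
      + "</fo:simple-page-master>"
      + "</fo:layout-master-set>"
      + "<fo:page-sequence master-reference='page'>"
      + "<fo:flow flow-name='xsl-region-body'>"
      + "<fo:block font-family='arialuni'>"
      + "<xsl:value-of select='name'/>"
      + "</fo:block>"
      + "</fo:flow>"
      + "</fo:page-sequence>"
      + "</fo:root>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Apply the stylesheet to the XML data and return the resulting XSL-FO.
    static String toFo(String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSL)));
            StringWriter out = new StringWriter();
            transformer.transform(
                new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSL transformation failed", e);
        }
    }

    public static void main(String[] args) {
        // "Product Service" in Japanese, as Unicode escapes.
        String xml = "<product><name>\u88FD\u54C1\u30B5\u30FC\u30D3\u30B9</name></product>";
        System.out.println(toFo(xml));
    }
}
```

The resulting XSL-FO is what the formatter consumes; it needs the font metrics at that point to measure the text in the block.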
Figure 1. PDF transformation using Apache open-source FOP
The font-family attribute is also used in the stylesheet to generate PDF from XSL-FO. Based on the W3C definition, either a font family name or a generic family name can be used. The font-family values are Helvetica, Times, Courier, and Symbol. The generic families are serif, sans-serif, cursive, fantasy, and monospace.
In XSL-FO, the font-family property is a prioritized list of font family names, which are attempted in sequence to find an available font that matches the selection criteria (shown in Listing 1). However, the current FOP does not support the font-family list; it only uses the first font in the list, if that exists. For example, if you specify font-family="A,B,C" (as in Listing 1) and A doesn't exist, then B and C are ignored and not used. Please refer to the W3C site for more information.
Listing 1. Typical XSL-FO syntax using font list in font-family is not supported in FOP
<fo:block text-align="center" font-family="A,B,C" font-size="16pt">
  <xsl:text>Welcome! </xsl:text>
</fo:block>
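The difference between the prioritized lookup that XSL-FO describes and the first-entry-only behavior described above can be modeled in a few lines. This is a simplified model, not FOP's actual code; the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class FontFamilySketch {

    // Prioritized lookup as XSL-FO describes it: return the first listed
    // family that is available, or null if none is.
    static String resolvePrioritized(String fontFamily, Set<String> available) {
        for (String name : fontFamily.split(",")) {
            String trimmed = name.trim();
            if (available.contains(trimmed)) {
                return trimmed;
            }
        }
        return null;
    }

    // Behavior described for the FOP release used in this article:
    // only the first entry is tried; later entries are never consulted.
    static String resolveFirstOnly(String fontFamily, Set<String> available) {
        String first = fontFamily.split(",")[0].trim();
        return available.contains(first) ? first : null;
    }

    public static void main(String[] args) {
        Set<String> available = new HashSet<>(Arrays.asList("B", "C"));
        System.out.println("prioritized: " + resolvePrioritized("A,B,C", available));
        System.out.println("first only:  " + resolveFirstOnly("A,B,C", available));
    }
}
```

In practice this means the font you registered for embedding should be named first (or alone) in font-family.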
We have discussed that FOP is an open-source Java program for PDF transformation that uses XSL-FO and XML. Another open source PDF transformer called iText (see Resources) uses an object-oriented approach and provides Java objects to render the PDF documents. Both approaches have their pros and cons, but we believe that the XSL-FO model fits better in the Model-View-Controller (MVC) architecture at the view tier. It is well supported by the Struts for Transforming XML and XSL (stxx) extensions, which are based on the Struts application framework. Additionally, this approach is better suited to generating PDF reports where the templates for the reports are specified using XSL, which is endorsed by W3C.
We will use a sample Java Web application to demonstrate the steps involved in setting up Unicode fonts, generating the font metrics files, and using stxx to generate PDF documents.
The sample application takes some application data and transforms it into PDF. We assume that you have some general knowledge of XML, XSL-FO, Struts, and J2EE. The sample application has two versions:
- The first version shows how the generated PDF cannot display double-byte languages properly without using the Unicode-embedded fonts
- The second version demonstrates the additional steps needed to generate the PDF for double-byte languages
Essentially, it is the same application running in two modes, which are controlled by a flag in an XML configuration file. This approach requires the data to be available in XML format with Unicode encoding. Many Web applications generate XML at some point, in which case that XML can be fed into the stxx FOP along with the appropriate XSL to do the transformation. However, if the XML is not already generated in the application, an additional step is required to transform the application data into XML and make it ready for transformation. The XML data that's fed into stxx is enhanced by adding the locale information. You can use the locale to pick the appropriate static messages displayed in the PDF document, along with the dynamic data. For example, in an order status application that generates PDF output, the static labels can be picked from a resource bundle using the appropriate locale.
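When the application data is not yet XML, the extra step might look like the following sketch, which builds a small DOM document, records the locale on it, and serializes it as UTF-8 using only JDK classes. The `order` and `product` element names and the `locale` attribute are hypothetical; they are not the stxx XML format, which stxx produces itself.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class AppDataToXmlSketch {

    // Wrap one piece of application data in XML, tag it with the locale,
    // and serialize the document as UTF-8 bytes.
    static byte[] toUtf8Xml(String productName, Locale locale) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
            Element root = doc.createElement("order");          // hypothetical name
            root.setAttribute("locale", locale.toString());     // for example "ja_JP"
            Element product = doc.createElement("product");     // hypothetical name
            product.setTextContent(productName);
            root.appendChild(product);
            doc.appendChild(root);

            // An identity transform is the standard JAXP way to serialize a DOM.
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            identity.transform(new DOMSource(doc), new StreamResult(out));
            return out.toByteArray();
        } catch (ParserConfigurationException | TransformerException e) {
            throw new IllegalStateException("Could not build the XML document", e);
        }
    }

    public static void main(String[] args) {
        byte[] xml = toUtf8Xml("\u88FD\u54C1\u30B5\u30FC\u30D3\u30B9", Locale.JAPAN);
        System.out.println(new String(xml, StandardCharsets.UTF_8));
    }
}
```

A stylesheet or action can then read the locale from the document to choose which resource bundle supplies the static labels.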
To run the sample application, we have used Tomcat 4.1.30 and stxx 1.3. Tomcat is an open source servlet container. You are free to use another servlet container -- either an open-source one or a commercial product like IBM WebSphere Application Server. stxx is an extension of the Struts framework that supports XML and XSL without changing the runtime behavior of Struts. The current release of stxx is version 1.3, which uses FOP 0.20.4.
The sample application is a simple Struts-based application. We used IBM's WebSphere Studio as our development environment. (WebSphere Studio uses the Eclipse tooling framework, integrates well with Tomcat server, and has Java and XML development features). The data to be rendered as PDF comes from an XML file that contains multilingual content (see Figure 2). It includes the text "Product Service" translated into different double-byte languages. When the application is initialized, the embedded fonts are loaded into the Struts system. Using a stxx action, the XML document is converted to stxx XML format, which includes the Web application locale information. We used the Apache Jakarta Digester pattern to load the XML configuration file into a Struts plug-in. The stxx FOP transformer does the PDF transform by using the XSL-FO stylesheet. Arial Unicode MS, which belongs to the Helvetica font family, is specifically used in the stylesheet, and ultimately the font information is embedded in the PDF.
If you extract the provided sample.war file, you should be able to see all the configuration files and the source code.
Figure 2. Sample XML data
Steps for deploying the sample application
The sample shown in this article uses an English language machine that doesn't have the Unicode font installed. If you are using a different machine (for example, a Japanese machine), the steps remain the same.
You can deploy the sample application on any J2EE-compliant servlet container. Instructions on how to install the application on Tomcat are provided below. (See Resources for additional detail.)
- Once Tomcat is installed, start the server.
- Using Tomcat Web Application Manager at a URL such as http://localhost:8080/manager/html/list, install the sample Web application by uploading the sample.war file provided in Download.
- Copy all the necessary jar files, such as stxx-1.3.jar and struts.jar, from stxx to the Tomcat sample Web application lib directory. The jar files list should look like those in Figure 3.
Figure 3. jar files in sample Web application lib directory
- Bring up homepage.html using a URL such as http://localhost:8080/sample/homepage.html. You will see the home page of the sample application as in Figure 4. You have the option of viewing the PDF in a browser or downloading it to your system.
Figure 4. Sample Web Application Homepage
- Click View PDF to see the PDF rendered as in Figure 5.
Figure 5. PDF rendered with the wrong font
As you can see, junk symbols like "#####" show up for the double-byte sample data. This means the default font setting in the system cannot display the characters for double-byte languages. At this point, you are finished running the first version of the application.
To ensure that the double-byte data is rendered correctly, take the following steps to embed the Unicode font for use with FOP:
- Install the Unicode TrueType font if your system doesn't support it.
- Generate a Unicode TrueType font metrics file.
- Register the embedded font with FOP.
- Transform using FOP in a servlet engine.
Step 1: Install the Unicode TrueType font
This sample is developed on a Windows system. Refer to Microsoft's international support site to install the Arial Unicode MS font (see Resources). If you use another operating system, please follow the standard instructions to install the font.
Step 2: Generate a Unicode TrueType font metrics file
TrueType font files come in two types: a TrueType Font file (.ttf extension) and a TrueType Collection file (.ttc extension). FOP allows both of them to be embedded.
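If you are unsure which kind of file you have, the first four bytes tell you: a TrueType Collection starts with the tag `ttcf`, while a single TrueType font starts with the version value 0x00010000 (or the tag `true` in some Apple fonts). The helper below is a minimal sketch with hypothetical names; it is not part of FOP.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class TrueTypeKindSketch {

    // Classify a font by its first four bytes (big-endian tag).
    static String kindOf(byte[] header) {
        if (header.length < 4) {
            return "unknown";
        }
        int tag = ByteBuffer.wrap(header, 0, 4).getInt();
        if (tag == 0x74746366) {                      // 'ttcf': TrueType Collection
            return "ttc";
        }
        if (tag == 0x00010000 || tag == 0x74727565) { // version 1.0 or 'true'
            return "ttf";
        }
        return "unknown";
    }

    // Read the header from a file and classify it.
    static String kindOf(Path fontFile) {
        byte[] header = new byte[4];
        try (DataInputStream in = new DataInputStream(Files.newInputStream(fontFile))) {
            in.readFully(header);
        } catch (EOFException e) {
            return "unknown";                         // file shorter than four bytes
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return kindOf(header);
    }

    public static void main(String[] args) {
        // Pass the path to a font file, for example the one installed in Step 1.
        if (args.length == 1) {
            System.out.println(kindOf(Paths.get(args[0])));
        }
    }
}
```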
After the Unicode font is installed, check to see if the font file is in your font directory (for example, C:\windows\fonts\ARIALUNI.TTF). Then use the FOP command in Listing 2 to generate the font metrics file. Make sure that your classpath is set correctly. Save the generated XML file for the Web application to use at run time.
Listing 2. Using FOP to generate a font metrics file
$ java org.apache.fop.fonts.apps.TTFReader C:\windows\fonts\Arialuni.ttf arialuni.xml
Step 3: Register the embedded font with FOP
It is a good idea to register the embedded font with FOP when you initialize the application. In Struts, you can use a plug-in to load the metrics file. You can create an XML file that specifies the font metrics file location. The format we chose is shown in Listing 3.
Modify userconfig.xml to pick up the font metrics file created above.
Listing 3. Use userconfig.xml to register the embedded Unicode TrueType font
<font metrics-file="C:/temp/font/arialuni/arialuni.xml"
      embed-file="C:/windows/Fonts/arialuni.ttf" kerning="yes">
  <font-triplet name="arialuni" style="normal" weight="normal"/>
  <font-triplet name="arialuni" style="normal" weight="bold"/>
  <font-triplet name="arialuni" style="italic" weight="normal"/>
  <font-triplet name="arialuni" style="italic" weight="bold"/>
</font>
The application configuration is specified in a separate configuration file (pdf-userconfig.xml), which is selected by the Struts plug-in. It is in the "WEB-INF" folder in the sample application. The content is shown in Listing 4. By default, the enabled attribute is set to false.
Listing 4. pdf-userconfig.xml Web application configuration
<configuration>
  <pdf-fonts>
    <userconfig name="pdf-unicode" path="font\userconfig.xml"
                enabled="false" comment="for Unicode Font"/>
  </pdf-fonts>
</configuration>
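Reading this flag at startup could be sketched as follows. The sample application loads the file with Digester; this sketch substitutes the JDK's DOM parser so that it stays self-contained, and the class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class PdfUserConfigSketch {

    // Return the path of the first <userconfig> whose enabled attribute is
    // "true", or null when no entry is enabled (the default in Listing 4).
    static String enabledUserConfigPath(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList entries = doc.getElementsByTagName("userconfig");
            for (int i = 0; i < entries.getLength(); i++) {
                Element entry = (Element) entries.item(i);
                if (Boolean.parseBoolean(entry.getAttribute("enabled"))) {
                    return entry.getAttribute("path");
                }
            }
            return null;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not read the configuration", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<configuration><pdf-fonts>"
            + "<userconfig name=\"pdf-unicode\" path=\"font\\userconfig.xml\""
            + " enabled=\"true\" comment=\"for Unicode Font\"/>"
            + "</pdf-fonts></configuration>";
        System.out.println(enabledUserConfigPath(xml));
    }
}
```

Only when a path is returned would the plug-in go on to load userconfig.xml into FOP.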
To load the font metrics file, load userconfig.xml using the Struts plug-in and Java code as shown in Listing 5.
Listing 5. Java code used in Struts plug-in to enable embedded font
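...
try {
    // In FOP 0.20.x, constructing Options loads the user configuration
    // (including the registered fonts) into FOP's configuration.
    File userConfigFile = new File("userconfig.xml");
    org.apache.fop.apps.Options options
        = new org.apache.fop.apps.Options(userConfigFile);
} catch (FOPException fe) {
    fe.printStackTrace();
}
...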
Step 4: Transform using FOP in a servlet engine
Following the Struts architecture, you can create an action that takes three steps to generate the PDF document:
- Construct an XML document.
- Process the document view option over the Web.
- Render the PDF.
Listing 6 shows the code.
Listing 6. Struts action for rendering PDF
public class SampleXslFoAction extends Action {
    private String xmlUsed = "/xml/sample_transform.xml"; //default
    private String successFwd = "success"; //default

    public org.apache.struts.action.ActionForward execute(
            ActionMapping mapping,
            ActionForm form,
            HttpServletRequest request,
            HttpServletResponse response)
            throws IOException, ServletException {
        //**************************
        // make user selections
        //**************************
        decideSuccessFwd(request);
        decideXmlUsed(request);
        //*******************
        // Construct XML
        //*******************
        Document doc = null;
        try {
            String fileName = request.getRealPath(xmlUsed);
            FileInputStream fis = new FileInputStream(fileName);
            doc = new SAXBuilder().build(fis);
        }
        catch (Exception ex) {
            ex.printStackTrace();
        }
        saveDocument(request, doc);
        //****************************
        // process PDF doc view option
        //****************************
        // Comparing from the literal avoids a NullPointerException
        // when the viewformat parameter is missing.
        if ("download".equalsIgnoreCase(request.getParameter("viewformat"))) {
            response.setContentType("application/pdf");
            response.setHeader("Content-Disposition",
                "attachment;filename=pdfdoc.pdf");
        }
        //**************************
        //Go forward rendering it
        //**************************
        return mapping.findForward(successFwd);
    }
    ...
}
All of the configuration and code is already implemented in the sample. Once you have followed the steps to generate the font metrics file and change the userconfig.xml file, you have to change the enabled property to true in the pdf-userconfig.xml file. Then, restart the Tomcat server and access homepage.html again. This time, if you click View PDF, you should see all the content displayed correctly as in Figure 6.
Figure 6. PDF view of FOP-transformed multilingual content
As you can see, the sample text "Product Service" is rendered correctly in various languages.
You can use XSL-FO and FOP to dynamically generate international PDF documents in Web applications. In the absence of a universal solution that can satisfy the font requirements for different operating systems, XSL-FO provides a system-dependent solution for rendering international PDF documents. The stxx feature used in this article fits in well with the MVC architecture for Web applications. In this approach, the PDF transformation becomes transparent to the application developer. Once an XML DTD is finalized, the application developer can focus on the business logic and the XSL developer can define the XSL for transformation.
| Description | Name | Size | Download method |
|---|---|---|---|
| Sample Web app to generate internationalized PDFs | x-ospdf_OpenSourcePDF.zip | 43 KB | HTTP |
- Get more information on Acrobat Reader at the Adobe site.
- Learn more about font standards and definitions at the World Wide Web Consortium (W3C) site.
- Download the open source Formatting Objects Processor (FOP) from the Apache site.
- Take a closer look at Struts for transforming XML and XSL (stxx) on SourceForge.net.
- To run the sample application, the authors have used Apache's Tomcat, an open source servlet container.
- Generate PDF files on the fly with the iText Java library.
- Read about Apache Struts, an open source framework for building Java Web applications.
- Learn how to build and run a Web application through these helpful tutorials on Eclipse.org.
- Read James Goodwill's article "Deploying Web Applications to Tomcat" at ONJava.com.
- Check out Wang Yu's article "Multibyte-character processing in J2EE" and learn how to develop J2EE applications with multibyte characters.
- Visit Microsoft support for the Arial Unicode MS font.
- Let IBM's Doug Tidwell show you the ropes of XSL-FO:
  - XSL-FO basics tutorial (February 2003)
  - XSL-FO advanced techniques tutorial (February 2003)
  - HTML-to-FO conversion guide (February 2003)
- Learn more about IBM WebSphere Application Server right here on developerWorks.
- Browse for books on these and other technical topics.
- Find more XML resources on the developerWorks XML zone.
- Learn how you can become an IBM Certified Developer in XML and related technologies.
Ning Yan is a Software Engineer at IBM. His expertise is in Web application development and business solutions that include DB2, WebSphere, and open-source technologies. He is an IBM Certified DB2 Specialist and Brainbench Certified Web Service Engineering and Web Developer, and he received an M.S. in computer science in 1996 from the State University of New York at Albany. You can contact Ning at nyan@us.ibm.com.
Ajay Raina is a Team Lead in the IBM Corporate Webmaster team and helps in IBM’s business transformation using WebSphere technologies. He is the Lead Architect of Order Status OnLine application deployed on ibm.com. Earlier, he led the Authorization module of WebSphere Commerce Server. His expertise is in J2EE, WebSphere, and DB2. Ajay received his B.E. (Hons) in Electrical and Electronics Engineering from BITS, Pilani, India and an M.S. in Computer Science from New York University. You can contact Ajay at ajraina@us.ibm.com.




