
Managing ezines with JavaMail and XSLT, Part 1

Use XML and XSLT to automatically produce both plain text and HTML newsletters

Benoit Marchal (bmarchal@pineapplesoft.com), Consultant, Pineapplesoft
Benoît Marchal is a consultant and writer based in Namur, Belgium. He wrote both XML by Example and Applied XML Solutions. He is a columnist for Gamelan.
Ben learned first hand about e-zine publishing when he launched Pineapplesoft Link in 1998. You can subscribe to his e-zine and find details on his latest projects at www.marchal.com.

Summary:  In part one of this two-part series, Benoît Marchal demonstrates how to automate e-mail publishing chores with Java and XML. The article walks through an e-mail newsletter (e-zine) publishing application, built on XML and XSLT, that outputs both HTML and plain text e-mail messages. Six reusable code samples include a sample newsletter marked up in DocBook, an XSL style sheet to convert the DocBook sample to a custom text output, a Java text formatter (in the form of a SAX ContentHandler), two SAX filters, and the Java code that puts it all together in a multistep transformation. (The next part of this article covers the JavaMail API.)

Date:  01 Mar 2001
Level:  Introductory

So you've learned XML. You've worked your way through DTDs, XSLT, SAX, and DOM. You unraveled the secrets of namespaces, and you think you're on top of the X-rated acronym soup. Congratulations! Now what?

Judging from the developer feedback I hear, you're not alone in asking yourself that crucial question. This article proposes one answer through a practical application: it demonstrates how to automate publishing chores with Java and XML. I hope it will prove inspiring.

This article does not include an introduction to XML. I assume you are familiar with XSLT and have some notion of SAX parsing. Even if you need background on those topics, you might still want to read through the article, as it will inspire you to learn more. But make sure you consult the Resources section for some basic XML references.

XML ... and e-mail?

XML may not seem like a natural technology match with e-mail. Stay with me, and you may be surprised by the utility of this strange combination.

As you probably know, Eudora, Outlook, Netscape, and other modern e-mail clients let you send HTML e-mails. Originally e-mail messages were limited to plain text and they would not support bold, italics or hyperlinks. Modern e-mail clients recognize HTML, and so you can now send either plain text messages or richly decorated documents.

This choice of e-mail formats poses a problem to e-mail magazine (e-zine) publishers. Indeed the choice plays a part in the strategies e-zine publishers develop to confront their two biggest problems: acquiring and retaining subscribers. Unfortunately, subscribers have strong positions for or against HTML e-mails.

To make things worse, some e-mail clients (including the popular AOL 4.0 to 5.0) do not support HTML at all. Unless you are extra careful, subscribers with those older e-mail clients see only garbage.

Traditionally, e-zine publishers have gone to great lengths to ensure their readers' comfort. In the days of plain-text e-mails, savvy publishers would manually format their prose. Some continue this fine tradition with HTML e-mails, painstakingly preparing two versions of each document: plain text for older e-mail clients and HTML for newer ones. When I heard about that, a lightbulb popped over my head and I thought "XSLT style sheets." (This may be a sure sign that I should get a life.)


Principles

In this two-part article, you'll see how XML, XSLT and some Java programming can simplify things. In the process of doing so, you'll use various XML techniques. Let's start by reviewing them all:

  • XML itself, of course. The e-zine will be written in XML and, more specifically, in DocBook. DocBook is a popular XML vocabulary for technical documentation.
  • XSLT is typically used to convert XML documents to HTML. That would solve half of our problem (preparing the HTML version of the e-zine).
  • A special text formatter that enhances XSLT support for text. Indeed, as you might have understood, top-notch text formatting is a priority for e-zines.
  • JavaMail, the standard Java API to send e-mail.

Figure 1 illustrates the relationship between these components. From left to right, the ultimate goal is to prepare a so-called multipart e-mail with both text and HTML versions of the e-zine.


Figure 1. How the components of the solution interact (workflow diagram)

Preparing the e-mail involves going through two style sheets: one creates the text output, the other outputs the HTML version. The text formatter assists the text style sheet. JavaMail picks up both copies and sends them to subscribers.

This first installment of this series concentrates on the text transformation. The second installment will wrap things up with JavaMail.


The DocBook document

The starting point is the article in article.xml in Listing 1. It is written in DocBook, meaning that the XML tags (<article>, <title>, <para>) are all tags defined by DocBook.


Listing 1. article.xml
				

<?xml version="1.0"?>
<article>
<articleinfo>
 <title>XSL -- First Step in Learning XML</title>
 <author><firstname>Benoît</firstname>
  <surname>Marchal</surname></author>
</articleinfo>
<sect1><title>The Value of XSL</title>
 <para>This is an excerpt from the September 2000 issue of
  Pineapplesoft Link. To subscribe free visit
  <ulink url="http://www.marchal.com">marchal.com</ulink>.</para>
 <para>Where do you start learning XML? Increasingly my answer
  is with XSL. XSL is a very powerful tool with many
  applications. Many XML applications depend on it. Let's take
  two examples.</para>
</sect1>
<sect1>
 <title>XSL and Web Publishing</title>
 <para>As a webmaster you would benefit from using XSL.</para>
 <para>Let's suppose that you decide to support smartphones.
  You will need to redo your web site using WML, the
  <emphasis>wireless markup language</emphasis>, instead of
  HTML. While learning WML is easy, it can take days if not
  months to redo a large web site. Imagine having to edit every
  single page by hand!</para>
 <para>In contrast with XSL, it suffices to update one style
  sheet the changes flow across the entire web site.</para>
</sect1>
<sect1>
 <title>XSL and Programming</title>
 <para>The second facet of XSL is the scripting language. XSL
  has many features of scripting languages including loops,
  function calls, variables and more.</para>
 <para>In that respect, XSL is a valuable addition to any
  programmer toolbox. Indeed, as XML popularity keeps growing,
  you will find that you need to manipulate XML documents
  frequently and XSL is the language for so doing.</para>
</sect1>
<sect1>
 <title>Conclusion</title>
 <para>If you're serious about learning XML, learn XSL. XSL is
  a tool to manipulate XML documents for web publishing or
  programming.</para>
</sect1>
</article>


Text-markup language

Now let's see how to convert DocBook to text. XSLT has some support for text formatting (in the form of <xsl:output method="text"/>) but, in my experience, it is inadequate for e-zine publishing. More specifically, with XSLT text output, it's:

  • Impossible to break lines at a specific length (a requirement with old e-mail clients)
  • Difficult to remove accented characters (another limitation with old e-mail clients)
  • Troublesome to remove duplicate spaces in the original document

At first sight it would appear that XSLT cannot help, but a small dose of Java programming can make it work. The trick is to define a special XML vocabulary, which I'll call the text-markup language, to describe text documents.

I created this text-markup language specifically for this article, so it's as simple as it needs to be. Indeed it has only two tags: <txt:root> (the root of the document) and <txt:block> (a paragraph with a line break before and after it). Both are defined in the http://www.psol.com/xns/xslist/xml2text namespace. Incidentally, remember that a namespace is just an identifier; it looks just like a URL, but it does not point to anything.

<txt:root> has a lineWidth attribute for the ... that's right: the line width. <txt:block> has a linesAfter attribute with the number of line breaks after the block.

Next, you write a Java application to convert text-markup language to plain text. For example, the document below (Input) will become the following document (Output). Notice that the lines are wrapped so that none exceeds the 65 characters specified by the lineWidth attribute:


Input
				

<?xml version="1.0" encoding="UTF-8"?>
<txt:root lineWidth="65"
          xmlns:txt="http://www.psol.com/xns/xslist/xml2text">
<txt:block linesAfter="1">This is an excerpt from the September
2000 issue of Pineapplesoft Link. To subscribe free visit
marchal.com.</txt:block>
</txt:root>


Output
				

This is an excerpt from the September 2000 issue of
Pineapplesoft Link. To subscribe free visit marchal.com.
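
The line-breaking rule can be tried in isolation. What follows is only a minimal sketch: the class name WrapSketch and the method wrap() are hypothetical and not part of the article's download. It assumes the same scan that the formatter in Listing 3 performs: look at a window of lineWidth characters, remember the last whitespace in it, and break there whenever more text remains.

```java
import java.util.ArrayList;
import java.util.List;

class WrapSketch {
    // Splits text into lines of at most lineWidth characters,
    // breaking at the last whitespace found in each window.
    static List<String> wrap(String text, int lineWidth) {
        List<String> lines = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            int end = Math.min(start + lineWidth, text.length());
            int lastSpace = -1;
            for (int i = start; i < end; i++) {
                if (Character.isWhitespace(text.charAt(i))) {
                    lastSpace = i;
                }
            }
            if (end < text.length() && lastSpace > start) {
                // More text follows: cut at the last whitespace and skip it.
                lines.add(text.substring(start, lastSpace));
                start = lastSpace + 1;
            } else {
                // Last window, or no usable whitespace: emit the window as is.
                lines.add(text.substring(start, end));
                start = end;
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        String text = "This is an excerpt from the September 2000 issue of "
                    + "Pineapplesoft Link. To subscribe free visit marchal.com.";
        for (String line : wrap(text, 65)) {
            System.out.println(line);
        }
    }
}
```

Run with a width of 65, the sketch prints the same two lines as the Output sample above. Because the scan breaks at the last whitespace inside the window whenever text remains, a word that would end exactly at the 65th character is moved to the next line.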

To convert from the original XML document to text-markup language, I'll use XSLT (of course). Incidentally, why bother with the text-markup language? If I'm going to write Java code, why not process DocBook directly? In a nutshell, because it's easier this way. For example:

  • Instead of parsing all the many tags in DocBook, I need to process only the two tags in the text markup language.
  • To change the text output, it suffices to edit the style sheet and, because XSLT is a scripting language, that's easier than hacking around in Java.
  • Last but not least, the combination of text-markup language and XSLT works with DocBook and any other XML vocabulary.

If you're familiar with XSL, this approach is similar to using XSL Formatting Objects (FO) as an intermediate vocabulary to create PDF files.


The style sheet

The style sheet to convert from DocBook to the text-markup language is text.xsl in Listing 2. Notice the <xsl:output method="xml"/> tag: this style sheet converts from XML (DocBook) to XML (text-markup language), not to HTML.


Listing 2. text.xsl
				

<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
   xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
   xmlns:txt="http://www.psol.com/xns/xslist/xml2text">

<xsl:output method="xml"/>

<xsl:template match="/">
<txt:root lineWidth="65">
   <xsl:apply-templates/>
</txt:root>
</xsl:template>

<xsl:template match="articleinfo">
   <txt:block linesAfter="0">&gt; <xsl:value-of
       select="title"/> &lt;</txt:block>
   <txt:block linesAfter="2">
       by <xsl:value-of select="author/firstname"/>
       <xsl:text> </xsl:text>
       <xsl:value-of select="author/surname"/>
    </txt:block>
</xsl:template>

<xsl:template match="sect1/title">
   <txt:block linesAfter="1">* <xsl:apply-templates/> *</txt:block>
</xsl:template>

<xsl:template match="ulink">
   <xsl:apply-templates/>
   <xsl:text> &lt;</xsl:text>
   <xsl:value-of select="@url"/>
   <xsl:text>&gt;</xsl:text>
</xsl:template>

<xsl:template match="emphasis">
   <xsl:text>*</xsl:text>
   <xsl:apply-templates/>
   <xsl:text>*</xsl:text>
</xsl:template>

<xsl:template match="para">
   <txt:block linesAfter="1"><xsl:apply-templates/></txt:block>
</xsl:template>

</xsl:stylesheet>
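
To see what such a style sheet produces before any text formatting happens, the transformation can be run on its own. The sketch below is an illustration under stated assumptions: the class StyleSheetSketch and the method toTextMarkup() are hypothetical names, and the embedded style sheet is a cut-down stand-in for text.xsl that keeps only the root and para templates. It uses the standard javax.xml.transform API.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class StyleSheetSketch {
    // Cut-down stand-in for text.xsl (assumption: root and para only).
    static final String XSL =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
      + " xmlns:txt='http://www.psol.com/xns/xslist/xml2text'>"
      + "<xsl:output method='xml'/>"
      + "<xsl:template match='/'>"
      + "<txt:root lineWidth='65'><xsl:apply-templates/></txt:root>"
      + "</xsl:template>"
      + "<xsl:template match='para'>"
      + "<txt:block linesAfter='1'><xsl:apply-templates/></txt:block>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Transforms a DocBook fragment into text-markup language, as a string.
    static String toTextMarkup(String docbook) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSL)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(docbook)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toTextMarkup(
            "<article><para>Hello, <emphasis>XSL</emphasis>!</para></article>"));
    }
}
```

The result is still XML: a txt:root element wrapping one txt:block per paragraph. Since this cut-down style sheet has no template for emphasis, the built-in XSLT rules simply copy its text through.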


The text formatter

You can see the text formatter itself, Xml2Text.java, in Listing 3. Xml2Text is a SAX ContentHandler. (If you're not familiar with SAX, see the sidebar SAX defined.) As SAX handlers go, this one is easy. In the startElement() and characters() events, it buffers the content of <txt:block>. In endElement(), Xml2Text writes the text and inserts line breaks as appropriate.


Listing 3. Xml2Text.java
				

package com.psol.xslist;

import java.io.*;
import org.xml.sax.*;
import org.xml.sax.helpers.*;

public class Xml2Text
   extends DefaultHandler
{
   protected static final String
      NAMESPACE_URI = "http://www.psol.com/xns/xslist/xml2text";
   protected static final int NONE = 0,
                              ROOT = 1,
                              BLOCK = 2;
   protected StringBuffer buffer;
   protected int state,
                 lineWidth,
                 linesAfter;
   protected PrintWriter writer = null;

   public Xml2Text(PrintWriter writer)
   {
      this.writer = writer;
   }

   public void startElement(String uri,
                            String name,
                            String qualifiedName,
                            Attributes atts)
   {
      if(!uri.equals(NAMESPACE_URI))
         return;
      if(state == ROOT && name.equals("block"))
      {
         state = BLOCK;
         buffer = new StringBuffer(128);
         try
         {
            linesAfter =
               Integer.parseInt(atts.getValue("linesAfter"));
         }
         catch(NumberFormatException e)
         {
            linesAfter = 0;
         }
      }
      else if(state == NONE && name.equals("root"))
      {
         state = ROOT;
         try
         {
            lineWidth =
               Integer.parseInt(atts.getValue("lineWidth"));
         }
         catch(NumberFormatException e)
         {
            lineWidth = 65;
         }
      }
   }

   public void endElement(String uri,
                          String name,
                          String qualifiedName)
   {
      if(!uri.equals(NAMESPACE_URI))
         return;
      if(state == BLOCK && name.equals("block"))
      {
         state = ROOT;
         int start = 0,
             current = start,
             lastSpace = start - 1;
         while(current < buffer.length())
         {
            while(current < start + lineWidth &&
                  current < buffer.length())
            {
               if(Character.isWhitespace(buffer.charAt(current)))
                  lastSpace = current;
               current++;
            }
            if(current < buffer.length() && start < lastSpace)
            {
               for(int i = start;i < lastSpace;i++)
                  writer.print(buffer.charAt(i));
               start = lastSpace + 1;
            }
            else
            {
               for(int i = start;i < current;i++)
                  writer.print(buffer.charAt(i));
               start = current;
            }
            current = start;
            lastSpace = start - 1;
            writer.println();
         }
         for(int i = 0;i < linesAfter;i++)
            writer.println();
         buffer.delete(0,buffer.length());
      }
      else if(state == ROOT && name.equals("root"))
         state = NONE;
   }

   public void characters(char[] chars,int start,int length)
   {
      if(state == BLOCK)
         buffer.append(chars,start,length);
   }

   public void startDocument()
   {
      state = NONE;
   }

   public void endDocument()
   {
      writer.flush();
   }
}


Two handy SAX filters

SAX defined


SAX is the Simple API for XML, one of the most efficient solutions for processing XML documents. SAX is an event-based API, meaning that the parser sends events to your application (rather than reading the document's entire node tree into memory).

The most important events are startElement(), endElement(), and characters().

SAX filters are special event handlers designed to be chained with each other. The second installment of this article series will revisit SAX.
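
The event model is easiest to grasp by logging the events for a tiny document. This is a minimal sketch, not code from the article's download; the class SaxEventsSketch and the method events() are hypothetical names. It uses the SAX parser bundled with the JDK.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class SaxEventsSketch {
    // Parses the document and records each SAX event as a short string.
    static List<String> events(String xml) {
        List<String> log = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String name,
                                     String qualifiedName, Attributes atts) {
                log.add("start " + name);
            }

            @Override
            public void endElement(String uri, String name, String qualifiedName) {
                log.add("end " + name);
            }

            @Override
            public void characters(char[] chars, int start, int length) {
                // A parser may deliver one run of text in several chunks.
                log.add("text " + new String(chars, start, length));
            }
        };
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // needed for local names
            factory.newSAXParser()
                   .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
        return log;
    }

    public static void main(String[] args) {
        for (String event : events("<para>Learn <emphasis>XSL</emphasis> first</para>")) {
            System.out.println(event);
        }
    }
}
```

The handler never sees a tree, only a stream of calls in document order, which is why a handler such as Xml2Text has to keep its own state between events.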

Xml2Text lacks the ability to remove unwanted spaces and accented letters. Instead of cramming these features in Xml2Text, it makes more sense to implement them as two SAX filters. The beauty of SAX filters is that you can freely combine them.

I can think of several other cases when I could use these two filters, for example, to remove unwanted spaces as preprocessing before publishing HTML documents.
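
To illustrate how filters combine, here is a small sketch with two deliberately trivial filters. UpperCaseFilter and DashFilter are hypothetical examples, not the filters from this article. Each one extends XMLFilterImpl, changes the character data, and passes everything else through untouched.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Locale;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

class FilterChainSketch {
    /** Hypothetical filter: upper-cases character data. */
    static class UpperCaseFilter extends XMLFilterImpl {
        UpperCaseFilter(XMLReader parent) { super(parent); }

        @Override
        public void characters(char[] chars, int start, int length)
                throws SAXException {
            char[] upper = new String(chars, start, length)
                .toUpperCase(Locale.ROOT).toCharArray();
            super.characters(upper, 0, upper.length);
        }
    }

    /** Hypothetical filter: replaces spaces with dashes. */
    static class DashFilter extends XMLFilterImpl {
        DashFilter(XMLReader parent) { super(parent); }

        @Override
        public void characters(char[] chars, int start, int length)
                throws SAXException {
            char[] dashed = new String(chars, start, length)
                .replace(' ', '-').toCharArray();
            super.characters(dashed, 0, dashed.length);
        }
    }

    // Chains parser -> UpperCaseFilter -> DashFilter -> collecting handler.
    static String run(String xml) {
        StringBuilder out = new StringBuilder();
        try {
            XMLReader parser =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            XMLFilterImpl chain = new DashFilter(new UpperCaseFilter(parser));
            chain.setContentHandler(new DefaultHandler() {
                @Override
                public void characters(char[] chars, int start, int length) {
                    out.append(chars, start, length);
                }
            });
            chain.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(run("<note>hello sax filters</note>"));
    }
}
```

This sketch chains the filters through their parent reader because the events come from a parser. When the events come from an XSLT transformer instead, as in Listing 6, the same filters are wired together with setContentHandler().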

WhitespaceFilter.java in Listing 4 is the SAX filter that removes duplicate spaces. Again, if you are familiar with SAX handlers, this class is easy. In startElement() and characters(), it buffers the text. endElement() removes duplicate spaces. Note that this code is optimized for clarity, not efficiency: it buffers too much.

The filter also recognizes the standard xml:space attribute. You have probably forgotten about xml:space, but it is defined in the XML 1.0 recommendation. It takes one of two values: preserve (keep duplicate spaces, as with the HTML <pre> element) and default, which leaves whitespace handling to the application, so duplicate spaces can be removed.


Listing 4. WhitespaceFilter.java
				

package com.psol.xslist;

import java.util.*;
import org.xml.sax.*;
import org.xml.sax.helpers.*;

public class WhitespaceFilter
   extends XMLFilterImpl
{
   protected Stack stack;

   public WhitespaceFilter()
   {
      super();
   }

   public WhitespaceFilter(XMLReader reader)
   {
      super(reader);
   }

   public void startElement(String uri,
                            String name,
                            String qualifiedName,
                            Attributes atts)
      throws SAXException
   {
      String space = atts.getValue("xml:space");
      if(null != space && space.equals("preserve"))
         stack.push(null);
      else
         stack.push(new StringBuffer());
      super.startElement(uri,name,qualifiedName,atts);
   }

   public void endElement(String uri,
                          String name,
                          String qualifiedName)
      throws SAXException
   {
      Object object = stack.pop();
      if(object instanceof StringBuffer)
      {
         StringBuffer input = (StringBuffer)object,
                      output = new StringBuffer();
         boolean wasWhitespace = false;
         for(int current = 0;current < input.length();current++)
         {
            char c = input.charAt(current);
            if(c == '\n' || c == '\r')
               c = ' ';
            if(Character.isWhitespace(c))
            {
               if(!wasWhitespace)
                  output.append(c);
               wasWhitespace = true;
            }
            else
            {
               output.append(c);
               wasWhitespace = false;
            }
         }
         char[] chars = new char[output.length()];
         output.getChars(0,output.length(),chars,0);
         super.characters(chars,0,output.length());
      }
      super.endElement(uri,name,qualifiedName);
   }

   public void characters(char[] chars,int start,int length)
      throws SAXException
   {
      Object object = stack.peek();
      if(object instanceof StringBuffer)
         ((StringBuffer)object).append(chars,start,length);
      else
         super.characters(chars,start,length);
   }

   public void startDocument()
      throws SAXException
   {
      stack = new Stack();
      super.startDocument();
   }
}
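
The collapsing rule applied in endElement() can be checked on a plain string. The sketch below lifts that loop into a standalone helper; the class CollapseSketch and the method collapse() are hypothetical names introduced for illustration.

```java
class CollapseSketch {
    // Turns line breaks into spaces and keeps only the first
    // character of every run of whitespace.
    static String collapse(String input) {
        StringBuilder output = new StringBuilder();
        boolean wasWhitespace = false;
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '\n' || c == '\r') {
                c = ' ';
            }
            if (Character.isWhitespace(c)) {
                if (!wasWhitespace) {
                    output.append(c);
                }
                wasWhitespace = true;
            } else {
                output.append(c);
                wasWhitespace = false;
            }
        }
        return output.toString();
    }

    public static void main(String[] args) {
        System.out.println(collapse("To  subscribe\n   free visit"));
    }
}
```

Note that the rule does not trim: a single leading or trailing space survives, exactly as it does in the filter.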

The second filter, AsciiFilter.java (see Listing 5), removes accented characters and other special characters not recognized by old e-mail clients. All the processing takes place in characters().

Note that AsciiFilter does not filter attributes and, for the simplicity of this example, it's limited to the accents used in the French language. You might want to add more special characters to filter other languages.


Listing 5. AsciiFilter.java
				


package com.psol.xslist;

import java.io.*;
import org.xml.sax.*;
import org.xml.sax.helpers.*;

public class AsciiFilter
   extends XMLFilterImpl
{
   public AsciiFilter()
   {
      super();
   }

   public AsciiFilter(XMLReader reader)
   {
      super(reader);
   }

   public void characters(char[] chars,int start,int length)
      throws SAXException
   {
      StringBuffer filtered =
         new StringBuffer((int)(length * 1.1));
      int i = start,
          stop = start + length;
      while(i < stop)
      {
         char c = chars[i++];
         switch(c)
         {
            case 'œ':
               filtered.append("oe");
               break;
            case '©':
               filtered.append("(c)");
               break;
            case 'à':
            case 'ä':
               filtered.append('a');
               break;
            case 'æ':
               filtered.append("ae");
               break;
            case 'ç':
               filtered.append('c');
               break;
            case 'è':
            case 'é':
            case 'ê':
            case 'ë':
               filtered.append('e');
               break;
            case 'î':
            case 'ï':
               filtered.append('i');
               break;
            case 'ô':
            case 'ö':
               filtered.append('o');
               break;
            case 'ù':
            case 'û':
            case 'ü':
               filtered.append('u');
               break;
            // more characters would come here
            default:
               filtered.append(c);
         }
      }
      char[] newChars = new char[filtered.length()];
      filtered.getChars(0,filtered.length(),newChars,0);
      super.characters(newChars,0,filtered.length());
   }
}
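
AsciiFilter lists each accented letter by hand. One possible way to cover more languages, which is a different technique from the one in Listing 5, is Unicode normalization: decompose each character into a base letter plus combining marks, then drop the marks. The sketch below uses java.text.Normalizer, available in Java 6 and later; the class AccentSketch and the method stripAccents() are hypothetical names.

```java
import java.text.Normalizer;

class AccentSketch {
    // Decomposes accented letters (NFD) and removes the combining marks.
    static String stripAccents(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        // \u00ee is i-circumflex, \u00e0 is a-grave
        System.out.println(stripAccents("Beno\u00eet \u00e0 Namur"));
    }
}
```

Characters without a decomposition, such as the ligatures œ and æ or the © sign, pass through unchanged, so they would still need explicit replacements like the ones in the switch statement.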



Running the project

Console.java in Listing 6 puts all the pieces together. It applies the style sheet through javax.xml.transform (the standard Java transformation API designed by Sun) and runs the result through the two filters and then the text formatter. Note that this really is a multistep transformation: from DocBook to the text-markup language to plain text.


Listing 6. Console.java
				

package com.psol.xslist;

import java.io.*;
import javax.xml.transform.*;
import javax.xml.transform.sax.*;
import javax.xml.transform.stream.*;

public class Console
{
   public static void main(String[] args)
   {
      try
      {
         if(args.length < 3)
         {
            System.out.println("java com.psol.xslist.Console " +
                               "input.xml stylesheet.xsl output.txt");
            return;
         }
         Xml2Text xml2Text =
            new Xml2Text(new PrintWriter(new FileWriter(args[2])));
         WhitespaceFilter whitespaceFilter = new WhitespaceFilter();
         whitespaceFilter.setContentHandler(xml2Text);
         AsciiFilter asciiFilter = new AsciiFilter();
         asciiFilter.setContentHandler(whitespaceFilter);
         TransformerFactory factory = TransformerFactory.newInstance();
         Transformer transformer =
            factory.newTransformer(new StreamSource(new File(args[1])));
         transformer.transform(new StreamSource(new File(args[0])),
                               new SAXResult(asciiFilter));
      }
      catch(IOException e)
      {
         System.err.println(e.getMessage());
      }
      catch(TransformerException e)
      {
         System.err.println(e.getMessage());
      }
   }
}


In summary

This might seem like a lot of work to please a group of subscribers with antiquated e-mail clients. Why bother? Some people would suggest the subscribers should upgrade, but many e-zine publishers are willing to go the extra mile to satisfy their readers. Furthermore, thanks to XML and XSLT, it's not too difficult to automate the repetitive parts of the process, which makes the extra effort practical.

In the second installment, I'll show how to combine the text conversion with JavaMail to completely automate the operation.



Download

Description                    Name                   Size      Download method
Source code for this article   x-xmlist1-xslist.zip   1451 KB   HTTP



Resources
