About WebSphere Transcoding Publisher
Operating systems: AIX, Linux, Microsoft Windows NT and Windows 2000, OS/400, Sun Solaris
Technologies: Java, XML, Wireless
People are using more portable devices like phones and PDAs to access the Web, whether it's to check the latest sports scores, do some personal banking, or trade stocks. Businesses are also looking for ways to take advantage of the growing wireless wave by extending their existing applications to a remote workforce. With the rapid growth in the number of portable devices, the ability to tailor content to these small displays is increasingly important. One way content providers support these devices is to maintain separate sets of information for the different devices, but this can be expensive, both in creating and maintaining the information.
To deal with this problem, IBM WebSphere Transcoding Publisher includes support for document clipping, which lets you identify and extract specific portions of a document. In this way, you can keep those pieces of information that are important while discarding images or complicated markup that the client device can't display. This simplifies what is sent to the device while reducing the amount of data being transmitted, a particularly useful aspect when wireless devices are involved. Moreover, no changes to the source content are required.
Transcoding Publisher gives you two ways to perform document clipping:
- With an annotation language composed of special XML tags
- With a text clipper, which is a custom transcoder that can add, change, and delete parts of a text document. A transcoder is a program written in Java that Transcoding Publisher uses to perform various transformations. (A full discussion of text clippers is beyond the scope of this article, although the following section gives more information about text clippers and how they compare to annotation.)
Here, we describe the XML tagging in Transcoding Publisher's annotation language and then go on to show you several practical examples that demonstrate how you can easily tailor the content of a Web page. To get the most out of this discussion, you'll need a general understanding of markup languages like HTML and XML, as well as the related concepts of the XML Path Language (XPath) and the Document Object Model (DOM) (see Resources).
An annotation is composed of a special set of XML tags that, when combined with an HTML source file, dictate which parts of the HTML document should be clipped. You can put all of your annotations for a particular document in a separate annotation file called an annotator (referred to as external annotation) or you can embed annotations directly in the HTML file itself as comments (referred to as internal annotation). For most of the examples in this article, we'll be using the syntax for external annotation, though later we also include an internal example.
While annotation can handle most common clipping tasks, it may not always provide the capability or flexibility you require. With annotation the changes you can make to the DOM are limited by the capabilities provided by the annotation language. This is where text clipping comes in. Text clipping is more powerful than clipping with annotation, because you have the capability to make unlimited changes to the document. With text clipping you use the org.w3c.dom Java bindings (see Resources) to manipulate the DOM or work with the text (String) form of the document. If you wanted to, you could replace every element in the DOM. However, a text clipper is not limited to the DOM alone; you can extend the reach of the clipper so that you could modify the HTTP header on a request, for example, or you could specify that the response not be cached.
As you might expect, the added flexibility of a text clipper is not without tradeoffs. Because text clippers are written in Java, some programming experience is required, along with an understanding of how to create and use subclasses. Basic knowledge of how Transcoding Publisher handles requests is also necessary, and you'll need to be familiar with the way Transcoding Publisher manages text clippers and other transcoders. Finally, to make your text clipper available to others, you'll need to know how to package the text clipper files in a Java Archive (JAR) or ZIP file.
If you need more flexibility in modifying a document than annotation affords, a text clipper is worth considering, particularly if your clipping task is only one part of a larger, integrated process. There may even be times when you'll want to use annotation and a text clipper together to leverage their respective strengths.
Before we get into the annotation language itself, there are a few things about how Transcoding Publisher processes the annotations that you need to know. When Transcoding Publisher receives an HTML document for clipping, the first thing it does is generate an HTML DOM of the document. The DOM defines the logical structure of the document, representing the document in a hierarchical structure of related nodes. The annotator contains XPath statements that identify the nodes in the HTML DOM that are to be manipulated. When processing an annotator, Transcoding Publisher goes through the DOM and applies the annotations by deleting, adding, and changing nodes as required. When it's finished making its changes, Transcoding Publisher sends the clipped document to the requesting device, either in HTML format or in any other format supported by Transcoding Publisher, such as the Wireless Markup Language (WML) format used by Wireless Application Protocol (WAP) devices.
Let's look at a brief example to see how this works. The following Web page shows a simple arrangement of text and two tables.
Figure 1. Simple Web page as it appears before annotation

So, if you wanted to keep the heading text and then get rid of everything but the second table, you could do so with the following statements in an annotation file:
Annotations for simple annotation example
<description take-effect="after" target="/HTML[1]/BODY[1]/*[1]">
  <remove />
</description>
<description take-effect="before" target="/descendant::TABLE[2]">
  <keep />
</description>
<description take-effect="after" target="/descendant::TABLE[2]">
  <remove />
</description>
As the sample above shows, the <description> tag is the primary
element in the annotation language. The target attribute identifies
the node on which the annotation will be applied, and the take-effect
attribute indicates whether the annotation is applied before or after the target
node. By specifying target="/HTML[1]/BODY[1]/*[1]" as in our example,
you activate clipping after the first node after the <BODY> tag, which in this case is an <H1> ("Simple Annotation Example").
The asterisk in *[1] is a wildcard that you can use so that you don't
have to know exactly what kind of node it is.
The <remove /> tag indicates that all tags encountered are to
be removed, until otherwise instructed by another annotation statement. In the example,
we stop clipping just before the second table with a <keep />
tag and then resume clipping after the second table. It's worth noting that
when you use take-effect="after", any manipulation will be
performed after the closing tag of the target node, so in this case,
any elements inside the second table are safely ignored.
How does the annotated Web page look? Figure 2 shows the resulting page, as it would appear on a desktop browser.
Figure 2. Simple Web page as it appears after annotation

The use of the <keep /> and <remove />
tags in our example illustrates the notion of a clipping state,
one of the main ideas behind the annotation function. The clipping state indicates
whether the content being processed should be preserved or removed, while providing
an easy means of keeping or discarding elements of a document, particularly
when you want to act on large chunks at one time. For example, if you were clipping
a document for display on a mobile phone, you might activate clipping with the
first node within the <BODY> element (<description take-effect="before"
target="/HTML[1]/BODY[1]/*[1]">) and then only keep a few selected pieces
thereafter.
Although the bulk of what you'll do with the annotation language is likely to be turning clipping on and off as we've already discussed, there are other ways you can manipulate a document with the language, both for more finely grained clipping and for specialized treatment of particular elements.
Other features of the language include:
- The ability to specify exception tags that will keep or remove HTML elements on a global basis for a clipping region. For example, <remove tag="IMG" /> specifies that all images are to be removed, regardless of where they occur, until another <keep /> or <remove /> instruction is encountered.
- Special handling of form elements. The annotation language enables you to explicitly reorder, hide, or preset any field in an HTML form. In addition, you can change how a field is represented; for example, you can change a <textarea> field into a <select> menu and change the label text associated with the field.
- Shortcut annotations that streamline the annotation of tables. With these shortcut annotations, you can mark entire rows or columns for clipping. A statement like <column index="2" clipping="remove" /> removes the second column of the selected table. Using XPath statements to perform the same task can be tedious.
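As a rough sketch of how two of these features might sit together in an annotator, the fragment below combines the table shortcut with an exception tag. The element and attribute names are the ones quoted in the list above. The targets (the first table of a hypothetical page), the placement of <remove tag="IMG" /> directly inside a <description>, and the treatment of content that precedes the first annotation are assumptions made for illustration, not behavior this article confirms.

```xml
<?xml version='1.0' ?>
<annot version="1.0">
  <!-- Table shortcut: keep the first table (hypothetical target)
       but drop its second column. -->
  <description take-effect="before" target="/descendant::TABLE[1]">
    <keep />
    <table>
      <column index="2" clipping="remove" />
    </table>
  </description>
  <!-- Exception tag: once the table has closed, strip every image
       until another keep or remove instruction is encountered.
       Assumed placement; check it against the product documentation. -->
  <description take-effect="after" target="/descendant::TABLE[1]">
    <remove tag="IMG" />
  </description>
</annot>
```

The exception tag is placed after the table on purpose: as described above, a later <keep /> or <remove /> ends the exception, so putting it ahead of the <keep /> would cancel it.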
Sample annotator: IBM stock quote
While the example described above demonstrates the fundamental ideas behind Transcoding Publisher's annotation support, it's time to take a look at more practical ways of applying annotation. Figure 3 shows the IBM stock quote page from the company's corporate Web site (http://www.ibm.com/ibm/stock).
Figure 3. Stock quote page from www.ibm.com

This page contains a lot of information, including details of the stock's performance, a search field, and numerous links to other areas of the site that might be of interest to the user. However, let's say you want to access this page through a Web-enabled phone rather than a desktop browser. Now the nested tables and images that made for a nicely laid out page are a hindrance rather than a help, and the sheer amount of information available becomes unwieldy in the small display (and potentially expensive, depending on your wireless service). Who wants to scroll through pages of information when all you're interested in are a few choice nuggets?
Fortunately, paring down the information on this page is not difficult. Suppose all you want to know is the price of the last trade and whether the price has moved up or down. The sample below shows the code for the annotator you can use to extract those details.
Annotator for IBM stock example
<?xml version='1.0' ?>
<annot version="1.0">
  <description take-effect="before" target="/HTML[1]/BODY[1]/*[1]">
    <remove />
  </description>
  <description take-effect="before" target="/descendant::IMG[1]">
    <keep />
    <replace>
      <text>IBM Stock Quote</text>
    </replace>
  </description>
  <description take-effect="after" target="/descendant::IMG[1]">
    <remove />
  </description>
  <description take-effect="before" target="/descendant::TABLE[8]">
    <keep />
    <table>
      <column index="*" clipping="remove" />
      <column index="2" clipping="keep" />
      <column index="3" clipping="keep" />
      <row index="*" clipping="remove" />
      <row index="6" clipping="keep" />
      <row index="7" clipping="keep" />
      <row index="8" clipping="keep" />
    </table>
  </description>
  <description take-effect="after" target="/descendant::TABLE[8]">
    <remove />
  </description>
</annot>
This annotator is doing several things:
- Clipping starts immediately after the <BODY> tag. In effect this sets the clipping state to remove, which will remain in effect until you explicitly change it with another annotation.
- As it happens, the heading for the page ("Stock quote") is actually an image. Because you don't want to have to download any unnecessary images to the phone, use annotations to turn clipping off before the <IMG> element and then replace that element with a <text> node with an alternate heading. In fact, since you don't really care which image you replace with your heading text, you operate on the first image you encounter, just to keep things as simple as possible. Then you turn clipping back on.
- After looking carefully through the HTML source, you find that the information you want is in the eighth table, so just before you hit this table, turn clipping off. While you want to keep this table, you don't necessarily want to keep every row or every column in the table. You can use special table annotations to indicate that you want to delete all the columns in the table (<column index="*" clipping="remove" />), and then indicate the specific columns that you do want to keep (for example, <column index="2" clipping="keep" /> will keep the second column). Then you do the same for the rows that you want to keep.
- When you're finished with the table, turn clipping back on just after the table to remove everything else.
The resulting page is a much smaller HTML document. By taking advantage of Transcoding Publisher's ability to automatically convert HTML documents to WML format, you can display the annotated Web page on a WAP-based phone. Figure 4 shows how the IBM stock quote would appear on the phone. While it's still true that you have to scroll once or twice to see all of the information, this is clearly preferable to the excessive scrolling required to work through the original page. If you wanted to clip at an even more granular level, you could use annotations to target specific cells in the table.
Figure 4. Annotated IBM stock quote page, as viewed on mobile phone
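Cell-level clipping, as mentioned above, could be sketched by pointing the target at a single <TD> instead of using the row and column shortcuts. The XPath below is hypothetical: the article does not say which cell holds the last trade price, so the row and cell positions (6 and 2) are placeholders, and the expression assumes the eighth table contains no nested tables that would shift the counts.

```xml
<?xml version='1.0' ?>
<annot version="1.0">
  <!-- Start in the remove state, as in the stock quote annotator. -->
  <description take-effect="before" target="/HTML[1]/BODY[1]/*[1]">
    <remove />
  </description>
  <!-- Hypothetical cell: second TD of the sixth TR in the eighth table. -->
  <description take-effect="before"
      target="/descendant::TABLE[8]/descendant::TR[6]/descendant::TD[2]">
    <keep />
  </description>
  <!-- Resume removing once that cell has closed. -->
  <description take-effect="after"
      target="/descendant::TABLE[8]/descendant::TR[6]/descendant::TD[2]">
    <remove />
  </description>
</annot>
```

Because each target still counts from the eighth table, this sketch is just as sensitive to layout changes as the original annotator.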

The IBM stock page sample demonstrates how you can use Transcoding Publisher's
annotation language to quickly and easily create a clipped version of an HTML
document. However, if the structure of the original page were to change, the
annotator might not work anymore. For example, if another table was added somewhere
before the target table, the statement <description take-effect="before"
target="/descendant::TABLE[8]"> would key on the wrong table, yielding
irrelevant or even nonsensical information.
Rather than constantly adjusting your annotator every time the target Web page is modified, you could write an annotator that allows for changes in the content or layout of the page. Fortunately, the XPath language gives you a way to do just that.
Sample annotator: matching on comments
One way to take advantage of XPath's capabilities is to key on elements of the HTML page that are consistent, even when the majority of the content and layout change frequently. The Web page shown in Figure 5 is one such example. In this case, the Web page developer has included HTML comments that serve as metadata, describing the content on the page.
Figure 5. XPath example using comments

The HTML page contains an introduction section and two top story sections. Each section is bounded by comments that describe it. For example, the first top story has a <!-- begin story 1 --> comment immediately before it and an <!-- end story 1 --> comment immediately after it. This convention is used throughout the page to delimit various parts of the document.
Because the XPath language can be used to identify comments, as well as match patterns contained in the comments, you can use the comments as anchor points for XPath annotations. In our example, we use <keep /> and <remove /> statements that key on the stories we want to keep, as shown in the annotator for XPath comments example.
The XPath instruction target="/descendant::comment()[contains(.,'begin story 1')]" indicates that clipping is turned off just before the comment node, found anywhere in the document through the descendant axis, that contains the string "begin story 1". Clipping is then turned back on after the "end story 1" comment. The same is then done for other stories of interest. Figure 6 shows what the resulting clipped page looks like.
Figure 6. Annotated page based on comments

This results in annotations that keep the top story sections, and these annotations work regardless of how the rest of the page changes over time (as long as the same commenting convention is used).
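A minimal sketch of what such an annotator might look like follows. The "begin story 1" target and the keep-before, remove-after pattern are taken from the description above. The opening <remove /> at the first child of <BODY> and the "story 2" comment strings are assumptions that extend the page's commenting convention.

```xml
<?xml version='1.0' ?>
<annot version="1.0">
  <!-- Assumed: start in the remove state so the introduction is dropped. -->
  <description take-effect="before" target="/HTML[1]/BODY[1]/*[1]">
    <remove />
  </description>
  <!-- Keep everything between the story 1 delimiter comments. -->
  <description take-effect="before"
      target="/descendant::comment()[contains(.,'begin story 1')]">
    <keep />
  </description>
  <description take-effect="after"
      target="/descendant::comment()[contains(.,'end story 1')]">
    <remove />
  </description>
  <!-- Assumed comment text for the second story. -->
  <description take-effect="before"
      target="/descendant::comment()[contains(.,'begin story 2')]">
    <keep />
  </description>
  <description take-effect="after"
      target="/descendant::comment()[contains(.,'end story 2')]">
    <remove />
  </description>
</annot>
```

Note that contains() matches substrings, so 'begin story 1' would also match a comment such as "begin story 10" if the page ever grew that many stories.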
Sample annotator: matching on element content
Sometimes you might want to leverage the relationships between elements of a page when applying annotations. The Web page in Figure 7 is an example where we want to use the XPath language to identify a table we want to keep.
Figure 7. XPath example with multiple tables

Although the number of tables on the page changes over time, we know the table we want always follows the string "Weather." The sample code, annotator for XPath tables example, shows how we used XPath statements to identify the table based upon this characteristic.
The statement target="/descendant::text()[contains(.,'Weather')]/following::TABLE[1]" enables us to look specifically for the text node containing the string "Weather" and then apply our <keep /> annotation to the first table that follows the identified text. We can use the same target to apply the <remove /> annotation; the only difference is whether the take-effect attribute indicates that the annotation is applied before or after the targeted table. The resulting annotated page is shown in Figure 8.
Figure 8. Annotated page with only desired table retained

This way we can be sure we keep the desired table, even if other tables are added to the page, either before or after the table we want.
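The annotator for this case might be sketched as follows. The target expression is the one discussed above; the opening <remove /> at the first child of <BODY> is an assumption, added so that only the table survives, as Figure 8 suggests.

```xml
<?xml version='1.0' ?>
<annot version="1.0">
  <!-- Assumed: remove everything from the top of the body. -->
  <description take-effect="before" target="/HTML[1]/BODY[1]/*[1]">
    <remove />
  </description>
  <!-- Keep the first table that follows the "Weather" text. -->
  <description take-effect="before"
      target="/descendant::text()[contains(.,'Weather')]/following::TABLE[1]">
    <keep />
  </description>
  <!-- Resume removing after that table closes. -->
  <description take-effect="after"
      target="/descendant::text()[contains(.,'Weather')]/following::TABLE[1]">
    <remove />
  </description>
</annot>
```

In XPath terms, if "Weather" appears in more than one text node, the expression selects the first table following each occurrence, so a more specific string may be needed on pages where the word is repeated.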
Annotating from the inside out
Although what we've been describing up until now has been external annotation, you can also embed annotations directly in the HTML file itself. External annotation is great for working on content you're not responsible for creating, but if you're also creating the content you're delivering, internal annotation has advantages which you ought to consider.
Okay, so what's the difference between internal and external annotation?
Not much. Functionally, internal and external annotation are equivalent. You
add internal annotations as comments to the HTML file that you're clipping,
and Transcoding Publisher operates directly on those annotations. On the other
hand, to process external annotations, Transcoding Publisher has to merge them
with the HTML DOM according to the attributes of the description tags
before it can process the clipped document.
A good way to see how the two approaches differ is to take a look at an example
of internal annotation. Consider again the simple annotation example shown in
Figure 1. Here, we show the HTML file with
internal annotation corresponding to the first remove instruction from
the sample external annotation.
Internal annotation for the simple annotation example
<HTML>
<HEAD>
<TITLE>WebSphere Transcoding Publisher -- Simple Annotation Example</TITLE>
<META name="Annotation_v1.0" content="keep">
</HEAD>
<BODY>
<H1>Simple Annotation Example</H1>
<!-- <?xml version="1.0"?><annot version="1.0"><remove /></annot> -->
<P>This Web page demonstrates how you can use WebSphere Transcoding Publisher's annotation support to keep some parts of a page while discarding others.</P>
<P>For example, we might want to ...
Two pieces of annotation-related markup are embedded within the HTML file: the META tag in the HEAD section and the comment that follows the H1 heading.
- The META tag specifies the default clipping state for the document, as well as the level of the annotation language that's used. This META tag is Transcoding Publisher's clue that the file contains internal annotations. If the tag is not in the HEAD section of the file, Transcoding Publisher will not bother to search for any other annotations in the HTML document.
- The remove instruction is contained within a mini-XML document wrapped in HTML comments. Its location marks the point at which we'll begin removing content. Note that this location is the same as that specified in the description from the sample external annotation (<description take-effect="after" target="/HTML[1]/BODY[1]/*[1]">).
If you use IBM WebSphere Studio, you're already a step closer to creating your own internal annotation. Version 3.5.2 (and later) of WebSphere Studio incorporates visual support for internal annotation. You can manipulate HTML files and JavaServer Pages by clicking on the element you want to annotate (say, an image or table) and then marking it to be kept or removed. Studio takes care of adding the appropriate lines to the file and sorts out the pesky details like syntax.
You can also use internal annotation to perform clipping on dynamic documents,
such as JavaServer Pages. This is especially useful when dealing with dynamic
tables, where the number of rows and columns might vary when the page is generated.
In this case, internal annotation (keep/remove) could be placed around
the repeating rows and columns instead of using the table annotation,
which assumes that you already know the structure of the table.
The landscape of mobile devices is changing fast, and if you're trying to deliver content to these devices, it can be hard to keep up. This article shows you a way you can deal with this challenge by describing how document clipping can be applied to Web pages with Transcoding Publisher's annotation language. Whether it's extracting key chunks of information for display on a small screen or replacing images with text to improve download performance, annotation is a quick and easy way of getting the job done.
Resources
- Learn more about the Document Object Model (DOM) at the W3C Architecture Domain, which includes the discussion of Java bindings.
- For the gory details on the XPath language, check out the XML Path Language information on the W3C's Web site.
- For information on WebSphere Transcoding Publisher, you can take a look at the Transcoding Publisher Web site or browse the product library for other articles and whitepapers.
- WebSphere Studio: Try it! Free download of the fully functional Entry Edition.
- Using Internal Annotation: Part of the WebSphere Studio User's Guide.
- Learn more about how WebSphere products can help, with Developing Web Applications for Pervasive Computing Devices.
Richard Spinks is a software engineer for IBM in Research Triangle Park, North Carolina, and is currently a user assistance developer working on WebSphere Transcoding Publisher. In his so-called spare time Richard does graduate work in Information Science at the University of North Carolina at Chapel Hill, where his interests include the development of user interfaces for managing and searching digital video collections. You can reach him at rspinks@us.ibm.com.

Brad Topol is a senior software engineer in the WebSphere Advanced Technology group. Currently, he is actively involved in advanced technology projects in the areas of transcoding, distributed systems, networking, and graphical user interfaces. Brad can be reached at btopol@us.ibm.com.
Chris Seekamp is a programming consultant and has been the primary developer and development team leader for the text-related portions of the WebSphere Transcoding Publisher since early in its development. He has been applying object-oriented design and development techniques for over 10 years, doing so the last several years on projects involving mobile devices. Chris can be reached at seekamp@us.ibm.com.

Steve Ims is a senior software engineer at IBM. As a member of a WebSphere Advanced Technology group, he spends his days contemplating next-generation services around the J2EE framework: Intelligent network infrastructure, mobile computing, and legacy integration. He's particularly interested in raising the usability of these services through integration with software tools. At night, he just tries to keep up with his perpetual-motion children. You can reach Steve at steveims@us.ibm.com.