Document clipping with annotation

How to keep the good stuff and throw out the rest

Richard Spinks (rspinks@us.ibm.com), Software Engineer, IBM
Richard Spinks is a software engineer for IBM in Research Triangle Park, North Carolina, and is currently a user assistance developer working on WebSphere Transcoding Publisher. In his so-called spare time Richard does graduate work in Information Science at the University of North Carolina at Chapel Hill, where his interests include the development of user interfaces for managing and searching digital video collections. You can reach him at rspinks@us.ibm.com.
Brad Topol (btopol@us.ibm.com), Software Engineer, IBM
Brad Topol is a senior software engineer in the WebSphere Advanced Technology group. Currently, he is actively involved in advanced technology projects in the areas of transcoding, distributed systems, networking, and graphical user interfaces. Brad can be reached at btopol@us.ibm.com.
Chris Seekamp (seekamp@us.ibm.com), Programming Consultant, IBM
Chris Seekamp is a programming consultant and has been the primary developer and development team leader for the text-related portions of the WebSphere Transcoding Publisher since early in its development. He has been applying object-oriented design and development techniques for over 10 years, doing so the last several years on projects involving mobile devices. Chris can be reached at seekamp@us.ibm.com.
Steve Ims (steveims@us.ibm.com), Software Engineer, IBM
Steve Ims is a senior software engineer at IBM. As a member of a WebSphere Advanced Technology group, he spends his days contemplating next-generation services around the J2EE framework: Intelligent network infrastructure, mobile computing, and legacy integration. He's particularly interested in raising the usability of these services through integration with software tools. At night, he just tries to keep up with his perpetual-motion children. You can reach Steve at steveims@us.ibm.com.

Summary:  Many Web sites and browser-based applications are designed for desktop displays that support large windows, high-resolution images, and thousands of colors. But for users who need to access these sites from a mobile phone or personal digital assistant (PDA), these sites often prove frustrating at best and unusable at worst. With WebSphere Transcoding Publisher's XML-based annotation language, you can identify and extract specific portions of a document, all without having to touch the HTML source. In this article, we describe how the annotation language works and provide several examples of how you can use annotation to tailor Web content for different devices.

Date:  01 Apr 2001
Level:  Intermediate

About WebSphere Transcoding Publisher
Operating systems: AIX, Linux, Microsoft Windows NT and Windows 2000, OS/400, Sun Solaris
Technologies: Java, XML, Wireless

People are using more portable devices like phones and PDAs to access the Web, whether it's to check the latest sports scores, do some personal banking, or trade stocks. Businesses are also looking for ways to take advantage of the growing wireless wave by extending their existing applications to a remote workforce. With the rapid growth in the number of portable devices, the ability to tailor content to these small displays is increasingly important. One way content providers support these devices is to maintain separate sets of information for the different devices, but this can be expensive, both in creating and maintaining the information.

To deal with this problem, IBM WebSphere Transcoding Publisher includes support for document clipping, which lets you identify and extract specific portions of a document. In this way, you can keep those pieces of information that are important while discarding images or complicated markup that the client device can't display. This simplifies what is sent to the device while reducing the amount of data being transmitted, a particularly useful aspect when wireless devices are involved. Moreover, no changes to the source content are required.

Transcoding Publisher gives you two ways to perform document clipping:

  • With an annotation language composed of special XML tags
  • With a text clipper, which is a custom transcoder that can add, change, and delete parts of a text document. A transcoder is a program written in Java that Transcoding Publisher uses to perform various transformations. (A full discussion of text clippers is beyond the scope of this article, although the following section gives more information about text clippers and how they compare to annotation.)

Here, we describe the XML tagging in Transcoding Publisher's annotation language and then go on to show you several practical examples that demonstrate how you can easily tailor the content of a Web page. To get the most out of this discussion, you'll need a general understanding of markup languages like HTML and XML, as well as the related concepts of the XML Path Language (XPath) and the Document Object Model (DOM) (see Resources).

What's annotation, anyway?

An annotation is composed of a special set of XML tags that, when combined with an HTML source file, dictate which parts of the HTML document should be clipped. You can put all of your annotations for a particular document in a separate annotation file called an annotator (referred to as external annotation) or you can embed annotations directly in the HTML file itself as comments (referred to as internal annotation). For most of the examples in this article, we'll be using the syntax for external annotation, though later we also include an internal example.

Annotation vs. text clipping

While annotation can handle most common clipping tasks, it may not always provide the capability or flexibility you require. With annotation the changes you can make to the DOM are limited by the capabilities provided by the annotation language. This is where text clipping comes in. Text clipping is more powerful than clipping with annotation, because you have the capability to make unlimited changes to the document. With text clipping you use the org.w3c.dom Java bindings (see Resources) to manipulate the DOM or work with the text (String) form of the document. If you wanted to, you could replace every element in the DOM. However, a text clipper is not limited to the DOM alone; you can extend the reach of the clipper so that you could modify the HTTP header on a request, for example, or you could specify that the response not be cached.

As you might expect, the added flexibility of a text clipper is not without tradeoffs. Because text clippers are written in Java, some programming experience is required, along with an understanding of how to create and use subclasses. Basic knowledge of how Transcoding Publisher handles requests is also necessary, and you'll need to be familiar with the way Transcoding Publisher manages text clippers and other transcoders. Finally, to make your text clipper available to others, you'll need to know how to package the text clipper files in a Java Archive (JAR) or ZIP file.

If you need more flexibility in modifying a document than annotation affords, a text clipper is worth considering, particularly if your clipping task is only one part of a larger, integrated process. There may even be times when you'll want to use annotation and a text clipper together to leverage their respective strengths.

How does it work?

Before we get into the annotation language itself, there are a few things about how Transcoding Publisher processes the annotations that you need to know. When Transcoding Publisher receives an HTML document for clipping, the first thing it does is generate an HTML DOM of the document. The DOM defines the logical structure of the document, representing the document in a hierarchical structure of related nodes. The annotator contains XPath statements that identify the nodes in the HTML DOM that are to be manipulated. When processing an annotator, Transcoding Publisher goes through the DOM and applies the annotations by deleting, adding, and changing nodes as required. When it's finished making its changes, Transcoding Publisher sends the clipped document to the requesting device, either in HTML format or in any other format supported by Transcoding Publisher, such as the Wireless Markup Language (WML) format used by Wireless Application Protocol (WAP) devices.

Let's look at a brief example to see how this works. The following Web page shows a simple arrangement of text and two tables.


Figure 1. Simple Web page as it appears before annotation

So, if you wanted to keep the heading text and then get rid of everything but the second table, you could do so with the following statements in an annotation file:


Annotations for simple annotation example

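<?xml version='1.0' ?>
<annot version="1.0">
    <!-- Start removing content after the first node in BODY (the H1 heading) -->
    <description take-effect="after" target="/HTML[1]/BODY[1]/*[1]">
        <remove />
    </description>

    <!-- Stop removing just before the second table -->
    <description take-effect="before" target="/descendant::TABLE[2]">
        <keep />
    </description>

    <!-- Resume removing after the closing tag of the second table -->
    <description take-effect="after" target="/descendant::TABLE[2]">
        <remove />
    </description>
</annot>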

As the sample above shows, the <description> tag is the primary element in the annotation language. The target attribute identifies the node on which the annotation will be applied, and the take-effect attribute indicates whether the annotation is applied before or after the target node. By specifying target="/HTML[1]/BODY[1]/*[1]" as in our example, you activate clipping after the first node after the <BODY> tag, which in this case is an <H1> ("Simple Annotation Example"). The asterisk in *[1] is a wildcard that you can use so that you don't have to know exactly what kind of node it is.

The <remove /> tag indicates that all tags encountered are to be removed, until otherwise instructed by another annotation statement. In the example, we stop clipping just before the second table with a <keep /> tag and then resume clipping after the second table. It's worth noting that when you use take-effect="after", any manipulation will be performed after the closing tag of the target node, so in this case, any elements inside the second table are safely ignored.

How does the annotated Web page look? Figure 2 shows the resulting page, as it would appear on a desktop browser.


Figure 2. Simple Web page as it appears after annotation

The use of the <keep /> and <remove /> tags in our example illustrates the notion of a clipping state, one of the main ideas behind the annotation function. The clipping state indicates whether the content being processed should be preserved or removed. It gives you an easy means of keeping or discarding elements of a document, particularly when you want to act on large chunks at one time. For example, if you were clipping a document for display on a mobile phone, you might activate clipping with the first node within the <BODY> element (<description take-effect="before" target="/HTML[1]/BODY[1]/*[1]">) and then only keep a few selected pieces thereafter.
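The mobile phone scenario might be sketched as the annotator below. It uses only tags shown elsewhere in this article. The choice of the first paragraph as the one piece to keep is a hypothetical target picked for illustration, not something taken from a real page.

```xml
<?xml version='1.0' ?>
<annot version="1.0">
    <!-- Set the clipping state to remove before the first node in BODY,
         so nothing is kept unless a later annotation says otherwise -->
    <description take-effect="before" target="/HTML[1]/BODY[1]/*[1]">
        <remove />
    </description>

    <!-- Hypothetical target: keep only the first paragraph on the page -->
    <description take-effect="before" target="/descendant::P[1]">
        <keep />
    </description>

    <!-- Return to the remove state once the paragraph has been passed -->
    <description take-effect="after" target="/descendant::P[1]">
        <remove />
    </description>
</annot>
```

Because the first description uses take-effect="before" rather than "after", the first node in the body is removed along with everything else, unlike in the simple example above.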

What else is it good for?

Although the bulk of what you'll do with the annotation language is likely to be turning clipping on and off as we've already discussed, there are other ways you can manipulate a document with the language, both for more finely grained clipping and for specialized treatment of particular elements.

Other features of the language include:

  • The ability to specify exception tags that will keep or remove HTML elements on a global basis for a clipping region. For example, <remove tag="IMG" /> specifies that all images are to be removed, regardless of where they occur, until another <keep /> or <remove /> instruction is encountered.
  • Special handling of form elements. The annotation language enables you to explicitly reorder, hide, or preset any field in an HTML form. In addition, you can change how a field is represented; for example, you can change a <textarea> field into a <select> menu and change the label text associated with the field.
  • Shortcut annotations that streamline the annotation of tables. With these shortcut annotations, you can mark entire rows or columns for clipping. A statement like <column index="2" clipping="remove" /> removes the second column of the selected table. Using XPath statements to perform the same task can be tedious.
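As a rough sketch of the table shortcut from the last item, the fragment below keeps a table but drops its second column. The target (the first table in the document) is a hypothetical choice for illustration, and the placement of the <table> block inside a <description> that sets the keep state follows the pattern used in the stock quote annotator later in this article. Exception tags and form annotations are not shown, because their surrounding syntax is not covered here.

```xml
<?xml version='1.0' ?>
<annot version="1.0">
    <!-- Hypothetical target: the first table in the document -->
    <description take-effect="before" target="/descendant::TABLE[1]">
        <keep />
        <table>
            <!-- Remove the second column; other columns are left alone -->
            <column index="2" clipping="remove" />
        </table>
    </description>
</annot>
```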

Sample annotator: IBM stock quote

While the example described above demonstrates the fundamental ideas behind Transcoding Publisher's annotation support, it's time to take a look at more practical ways of applying annotation. Figure 3 shows the IBM stock quote page from the company's corporate Web site (http://www.ibm.com/ibm/stock).


Figure 3. Stock quote page from www.ibm.com

This page contains a lot of information, including details of the stock's performance, a search field, and numerous links to other areas of the site that might be of interest to the user. However, let's say you want to access this page through a Web-enabled phone rather than a desktop browser. Now the nested tables and images that made for a nicely laid out page are a hindrance rather than a help, and the sheer amount of information available becomes unwieldy in the small display (and potentially expensive, depending on your wireless service). Who wants to scroll through pages of information when all you're interested in are a few choice nuggets?

Fortunately, paring down the information on this page is not difficult. Suppose all you want to know is the price of the last trade and whether the price has moved up or down. The sample below shows the code for the annotator you can use to extract those details.


Annotator for IBM stock example

<?xml version='1.0' ?>
<annot version="1.0">
    <!-- 1. Set the clipping state to remove from the first node in BODY -->
    <description take-effect="before" target="/HTML[1]/BODY[1]/*[1]">
        <remove />
    </description>

    <!-- 2. Keep the first image, but replace it with a text heading -->
    <description take-effect="before" target="/descendant::IMG[1]">
        <keep />
        <replace>
           <text>IBM Stock Quote</text>
        </replace>
    </description>
    <description take-effect="after" target="/descendant::IMG[1]">
        <remove />
    </description>

    <!-- 3. Keep the eighth table, retaining only columns 2 and 3 and rows 6 to 8 -->
    <description take-effect="before" target="/descendant::TABLE[8]">
        <keep />
        <table>
            <column index="*" clipping="remove" />
            <column index="2" clipping="keep" />
            <column index="3" clipping="keep" />
            <row index="*" clipping="remove" />
            <row index="6" clipping="keep" />
            <row index="7" clipping="keep" />
            <row index="8" clipping="keep" />
        </table>
    </description>

    <!-- 4. Remove everything after the table -->
    <description take-effect="after" target="/descendant::TABLE[8]">
        <remove />
    </description>
</annot>

This annotator is doing several things:

  1. Clipping starts immediately after the <BODY> tag. In effect this sets the clipping state to remove, which will remain in effect until you explicitly change it with another annotation.
  2. As it happens, the heading for the page ("Stock quote") is actually an image. Because you don't want to have to download any unnecessary images to the phone, use annotations to turn clipping off before the <IMG> element and then replace that element with a <text> node with an alternate heading. In fact, since you don't really care which image you replace with your heading text, you operate on the first image you encounter, just to keep things as simple as possible. Then you turn clipping back on.
  3. After looking carefully through the HTML source, you find that the information you want is in the eighth table, so just before you hit this table, turn clipping off. While you want to keep this table, you don't necessarily want to keep every row or every column in the table. You can use special table annotations to indicate that you want to delete all the columns in the table (<column index="*" clipping="remove" />), and then indicate the specific columns that you do want to keep (for example, <column index="2" clipping="keep" /> will keep the second column). Then you do the same for the rows that you want to keep.
  4. When you're finished with the table, turn clipping back on just after the table to remove everything else.

The resulting page is a much smaller HTML document. By taking advantage of Transcoding Publisher's ability to automatically convert HTML documents to WML format, you can display the annotated Web page on a WAP-based phone. Figure 4 shows how the IBM stock quote would appear on the phone. While it's still true that you have to scroll once or twice to see all of the information, this is clearly preferable to the excessive scrolling required to work through the original page. If you wanted to clip at an even more granular level, you could use annotations to target specific cells in the table.


Figure 4. Annotated IBM stock quote page, as viewed on mobile phone


Doing more with XPath

The IBM stock page sample demonstrates how you can use Transcoding Publisher's annotation language to quickly and easily create a clipped version of an HTML document. However, if the structure of the original page were to change, the annotator might not work anymore. For example, if another table were added somewhere before the target table, the statement <description take-effect="before" target="/descendant::TABLE[8]"> would key on the wrong table, yielding irrelevant or even nonsensical information.

Rather than having to adjust the annotator every time the target Web page is modified, you could write an annotator that allows for changes in the content or layout of the page. Fortunately, the XPath language gives you a way to do just that.

Sample annotator: matching on comments

One way to take advantage of XPath's capabilities is to key on elements of the HTML page that are consistent, even when the majority of the content and layout change frequently. The Web page shown in Figure 5 is one such example. In this case, the Web page developer has included HTML comments that serve as metadata describing the content on the page.


Figure 5. XPath example using comments

The HTML page has an introduction section and two top story sections. Each section is bounded by comments that describe it. For example, the first top story has a <!-- begin story 1 --> comment immediately before it and an <!-- end story 1 --> comment immediately after it. This convention is used throughout the page to delimit various parts of the document.

Because the XPath language can be used to identify comments, as well as match patterns contained in the comments, you can use the comments as anchor points for XPath annotations. In our example, we use <keep /> and <remove /> statements that key on the stories we want to keep, as shown in the annotator for XPath comments example.
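The listing for this annotator does not appear in this copy of the article. A sketch consistent with the description might look like the following. The initial <remove /> description and the "story 2" comment strings are assumptions based on the commenting convention described above. Only the "begin story 1" and "end story 1" strings are taken from the text.

```xml
<?xml version='1.0' ?>
<annot version="1.0">
    <!-- Assumed starting point: remove everything from the top of BODY -->
    <description take-effect="before" target="/HTML[1]/BODY[1]/*[1]">
        <remove />
    </description>

    <!-- Keep the first story, delimited by its begin and end comments -->
    <description take-effect="before" target="/descendant::comment()[contains(.,'begin story 1')]">
        <keep />
    </description>
    <description take-effect="after" target="/descendant::comment()[contains(.,'end story 1')]">
        <remove />
    </description>

    <!-- Assumed comment text for the second story, following the same convention -->
    <description take-effect="before" target="/descendant::comment()[contains(.,'begin story 2')]">
        <keep />
    </description>
    <description take-effect="after" target="/descendant::comment()[contains(.,'end story 2')]">
        <remove />
    </description>
</annot>
```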

The XPath instruction target="/descendant::comment()[contains(.,'begin story 1')]" selects a comment node, anywhere in the document, that contains the string "begin story 1". Used with take-effect="before", it turns clipping off just before that comment. Clipping is then turned back on after the "end story 1" comment. The same is then done for other stories of interest. Figure 6 shows what the resulting clipped page looks like.


Figure 6. Annotated page based on comments

This results in annotations that keep the top story sections, and these annotations work regardless of how the rest of the page changes over time (as long as the same commenting convention is used).

Sample annotator: matching on element content

Sometimes you might want to leverage the relationships between elements of a page when applying annotations. The Web page in Figure 7 is an example where we want to use the XPath language to identify a table we want to keep.


Figure 7. XPath example with multiple tables

Although the number of tables on the page changes over time, we know the table we want always follows the string "Weather". The sample code, annotator for XPath tables example, shows how we used XPath statements to identify the table based upon this characteristic.
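The listing for this annotator does not appear in this copy of the article. A sketch along the following lines would match the description. The initial <remove /> description is an assumption, added so that content before the table is discarded.

```xml
<?xml version='1.0' ?>
<annot version="1.0">
    <!-- Assumed starting point: remove everything from the top of BODY -->
    <description take-effect="before" target="/HTML[1]/BODY[1]/*[1]">
        <remove />
    </description>

    <!-- Keep the first table that follows the text containing "Weather" -->
    <description take-effect="before" target="/descendant::text()[contains(.,'Weather')]/following::TABLE[1]">
        <keep />
    </description>

    <!-- Resume removing once that table has been passed -->
    <description take-effect="after" target="/descendant::text()[contains(.,'Weather')]/following::TABLE[1]">
        <remove />
    </description>
</annot>
```

Note that if the string "Weather" occurred in more than one text node, the expression would select the first table following each of them, so a key string that appears only once on the page is the safer choice.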

The statement target="/descendant::text()[contains(.,'Weather')]/following::TABLE[1]" enables us to look specifically for the text node containing the string "Weather" and then apply our <keep /> annotation to the first table that follows the identified text. We use the same target to apply the <remove /> annotation; the only difference is whether the take-effect attribute indicates that the annotation be applied before or after the targeted table. The resulting annotated page is shown in Figure 8.


Figure 8. Annotated page with only desired table retained

This way we can be sure we keep the desired table, even if other tables are added to the page, either before or after the table we want.


Annotating from the inside out

Although what we've been describing up until now has been external annotation, you can also embed annotations directly in the HTML file itself. External annotation is great for working on content you're not responsible for creating, but if you're also creating the content you're delivering, internal annotation has advantages which you ought to consider.

Okay, so what's the difference between internal and external annotation?

Not much. Functionally, internal and external annotation are equivalent. You add internal annotations as comments to the HTML file that you're clipping, and Transcoding Publisher operates directly on those annotations. On the other hand, to process external annotations, Transcoding Publisher has to merge them with the HTML DOM according to the attributes of the description tags before it can process the clipped document.

A good way to see how the two approaches differ is to take a look at an example of internal annotation. Consider again the simple annotation example shown in Figure 1. Here, we show the HTML file with internal annotation corresponding to the first remove instruction from the sample external annotation.


Internal annotation for the simple annotation example

<HTML>
<HEAD>
  <TITLE>WebSphere Transcoding Publisher -- Simple Annotation Example</TITLE>
  <META name="Annotation_v1.0" content="keep">                         
</HEAD>
<BODY>
  <H1>Simple Annotation Example</H1>
<!-- <?xml version="1.0"?><annot version="1.0"><remove /></annot> -->
<P>This Web page demonstrates how you can use WebSphere Transcoding
  Publisher's annotation support to keep some parts of a page while
  discarding others.</P>
<P>For example, we might want to ...

Two pieces of annotation-related markup are embedded within the HTML file: the META tag in the HEAD section and the comment that follows the H1 heading.

  • The META tag specifies the default clipping state for the document, as well as the level of the annotation language that's used. This META tag is Transcoding Publisher's clue that the file contains internal annotations. If the tag is not in the HEAD section of the file, Transcoding Publisher will not bother to search for any other annotations in the HTML document.

  • The remove instruction is contained within a mini-XML document wrapped in HTML comments. Its location marks the point at which we'll begin removing content. Note that this location is the same as that specified in the description from the sample external annotation (<description take-effect="after" target="/HTML[1]/BODY[1]/*[1]">).

If you use IBM WebSphere Studio, you're already a step closer to creating your own internal annotation. Version 3.5.2 (and later) of WebSphere Studio incorporates visual support for internal annotation. You can manipulate HTML files and JavaServer Pages by clicking on the element you want to annotate (say, an image or table) and then marking it to be kept or removed. Studio takes care of adding the appropriate lines to the file and sorts out the pesky details like syntax.

Annotating dynamic pages

You can also use internal annotation to perform clipping on dynamic documents, such as JavaServer Pages. This is especially useful when dealing with dynamic tables, where the number of rows and columns might vary when the page is generated. In this case, internal annotation (keep/remove) could be placed around the repeating rows and columns instead of using the table annotation, which assumes that you already know the structure of the table.


Conclusion

The landscape of mobile devices is changing fast, and if you're trying to deliver content to these devices, it can be hard to keep up. This article shows you a way you can deal with this challenge by describing how document clipping can be applied to Web pages with Transcoding Publisher's annotation language. Whether it's extracting key chunks of information for display on a small screen or replacing images with text to improve download performance, annotation is a quick and easy way of getting the job done.


Resources

