An IBM Mashup Center plug-in to convert HTML to XML

An in-depth look at developing an IBM Mashup Center plug-in

Learn how to build a plug-in for the IBM® Mashup Center that can convert HTML into XML, opening the door for some simple data extraction from HTML pages using the Feed Mashup Editor.

Louis Mau (louismau@us.ibm.com), IBM

Louis Mau is part of the InfoSphere MashupHub development team. His current focus is to help customers build situational applications using the IBM Mashup Center. Prior to this role, he was the architect for DB2 Everyplace Sync Server.



18 August 2009


Introduction

IBM Mashup Center comes with feed generators that can access and generate XML feeds directly from many enterprise data sources. At the same time, given the diversity of data stores and software, there will be data sources that cannot be accessed by these built-in generators. To allow you to augment the capability of the IBM Mashup Center, the feed generation capability can be extended by the addition of plug-ins.

This article is a follow-up to the article "Extend the reach of data for IBM Mashup Center" and is based on V1.1 of the software. It assumes that you are already familiar with the basics of writing an IBM Mashup Center plug-in. In particular, you should know how to program in Java™, JSP, and JavaScript. The article shows you how to develop a plug-in that converts HTML into XML, and uses this example to illustrate the writing of a more complex plug-in. As a side benefit, once the HTML is in XML format, it can be read into the Feed Mashup Editor, permitting data extraction.


Tools for converting HTML to XML

We will need a Java package that converts HTML to XML. There are a number of such packages. This project uses JTidy, a Java port of HTML Tidy from W3C. HTML Tidy started as an HTML syntax checker and pretty printer. Like its non-Java cousin, JTidy can be used as a tool for cleaning up malformed and faulty HTML. In addition, it supports the generation of XML. You'll find a link in the Resources section to download and install JTidy.

We will be using the Tidy.jar in the build folder. The JAR is self-contained, so it bundles an older version of the W3C Document Object Model (DOM) classes. Since more recent versions of the W3C DOM classes are now included with the JDK, the classes from the package org.w3c.dom should be removed from Tidy.jar.
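
The article does not say how to remove the bundled DOM classes, and any ZIP tool will do. As one possible approach, the following sketch copies a JAR while skipping every entry under org/w3c/dom/. The class name JarDomStripper and its strip method are hypothetical and are not part of the article's download.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

/** Hypothetical helper; not part of the article's download. */
final class JarDomStripper {

    /** Entries under this path belong to the bundled W3C DOM package. */
    static final String DOM_PREFIX = "org/w3c/dom/";

    /**
     * Copies every entry of the source JAR into the target JAR except
     * those under org/w3c/dom/. Returns the number of entries skipped.
     */
    static int strip(Path source, Path target) throws IOException {
        int skipped = 0;
        try (ZipInputStream in = new ZipInputStream(Files.newInputStream(source));
             ZipOutputStream out = new ZipOutputStream(Files.newOutputStream(target))) {
            byte[] buffer = new byte[8192];
            for (ZipEntry entry = in.getNextEntry(); entry != null; entry = in.getNextEntry()) {
                if (entry.getName().startsWith(DOM_PREFIX)) {
                    skipped++;
                    continue;
                }
                // A fresh entry lets the output stream recompute size and CRC.
                out.putNextEntry(new ZipEntry(entry.getName()));
                for (int n = in.read(buffer); n > 0; n = in.read(buffer)) {
                    out.write(buffer, 0, n);
                }
                out.closeEntry();
            }
        }
        return skipped;
    }
}
```

Classes in other packages, such as org/w3c/tidy/, do not match the prefix and are copied unchanged. Write the result to a new file and rename it to Tidy.jar afterwards rather than overwriting the source in place.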


Setting up an Eclipse project

As described in Section 6.1 of the Application Programming Interface Reference, Version 1.0, the feed generation framework automatically searches for ZIP files containing third party plug-ins placed in the special folder <WebApplication>/WEB-INF/plugins. The ZIP archive must have the folder structure specified below:

  • /client/plugins/PLUGIN_DIR -- Contains files for browsers, like images, JavaScript files and so on.
  • /server/plugins/PLUGIN_DIR -- Contains files used by the plug-in to render itself (HTML files, JSP pages). Additional folders for plug-in files can be included.
  • /WEB-INF/classes -- Contains the plug-in Java classes. This can be a hierarchy of folders. The classes will be copied to <WebApplication>/WEB-INF/classes.
  • /WEB-INF/lib -- Contains JAR files used by the plug-in (third-party).

To simplify the final build and packaging of the plug-in, you might want to create a project using your favorite IDE having the same directory structure as required by the final ZIP archive. Figure 1 shows the layout of the Eclipse project created by the author.

Figure 1. Eclipse project
show hierarchical structure of Eclipse project

Note that the PLUGIN_DIR must be unique and the same as the package name of the plug-in. For this example, I used the package name sample.mashupcenter.tidyhtml. Observe also that we have placed Tidy.jar in the WEB-INF/lib folder and created a folder named lib_noship containing two additional JARs which are provided by the Mashup Center server during runtime but unavailable in the development environment. These JARs should not be included in the final deployment ZIP file. (In fact, loading of the plug-in ZIP file would fail if they were included by mistake.)
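
Because a stray development JAR can stop the ZIP from loading, a quick check of the archive's entry names before deployment may be worthwhile. The sketch below is one way to do that. It assumes that the four folders listed above are the only ones expected in the ZIP, which the article does not state explicitly, and PluginZipLayoutCheck is a hypothetical helper rather than anything supplied by IBM Mashup Center.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical pre-deployment check; not part of the article's download. */
final class PluginZipLayoutCheck {

    /** Returns human-readable problems; an empty list means the layout looks fine. */
    static List<String> problems(List<String> entryNames, String pluginDir) {
        String clientDir = "client/plugins/" + pluginDir + "/";
        String serverDir = "server/plugins/" + pluginDir + "/";
        List<String> problems = new ArrayList<>();
        boolean sawServerDir = false;

        for (String raw : entryNames) {
            // Tolerate a leading slash so names match the article's notation.
            String name = raw.startsWith("/") ? raw.substring(1) : raw;
            if (name.startsWith(serverDir)) {
                sawServerDir = true;
            }
            if (name.endsWith("/")) {
                continue; // directory entry, nothing to validate
            }
            boolean expected = name.startsWith(clientDir)
                    || name.startsWith(serverDir)
                    || name.startsWith("WEB-INF/classes/")
                    || name.startsWith("WEB-INF/lib/");
            if (!expected) {
                problems.add("unexpected entry: " + name);
            }
        }
        if (!sawServerDir) {
            problems.add("missing folder: " + serverDir);
        }
        return problems;
    }
}
```

The entry names can be collected from the finished archive with java.util.zip.ZipFile. An entry such as lib_noship/some.jar would be reported as unexpected.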


The plugin.xml file

Every Mashup Center plug-in has two major operations: an editor to collect creation parameters, and a generator to create the feed. (Feed normally refers to an XML document conforming to the RSS or Atom specification. Note that in this case, the generated XML originates from HTML and will not conform to the RSS or Atom specification.) The plug-in framework finds out which Java classes implement the two key operations by reading the package.xml file, which must be placed in the server/plugins/PLUGIN_DIR folder. Listing 1 shows the package.xml file for this plug-in.

Listing 1. Package XML file
<plugin>
  <name>Tidy Html</name>
  <author>L. Mau</author>
  <version>1.0</version>
  <category>departmental</category>
  <editor>Html2XmlEditorPlugin</editor>
  <generator>Html2XmlGeneratorPlugin</generator>
  <description>convert any Html page into xhtml</description>
  <icon16path>/plugins/sample.mashupcenter.tidyhtml/icons/btn16_hello.gif</icon16path>
  <icon32path>/plugins/sample.mashupcenter.tidyhtml/icons/btn32_hello.gif</icon32path>
  <icon64path>/plugins/sample.mashupcenter.tidyhtml/icons/btn64_hello.gif</icon64path>
  <objectType>feed</objectType>
</plugin>

It is worth mentioning that the name, description, and version elements are for the benefit of the creator and are not used by the IBM Mashup Center plug-in framework. The plug-in framework uses the plugin.name property inside the ui.properties file as the name of the plug-in when presenting the list of options after users select New Feed.

The ui.properties file resides in the /server/plugins/PLUGIN_DIR/nls folder and is loaded using the standard Java resource bundle loading convention. For each supported language, place the translated string for plugin.name in a properties file with the locale appended to "ui." For example, the Japanese version of the file should be named ui_ja.properties.
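
To make the naming convention concrete, the following sketch uses the standard java.util.ResourceBundle.Control class to list the property file names that would be considered for a given locale, from most to least specific. BundleNameDemo is a hypothetical class written for this illustration. The list leaves out the additional fallback to the JVM's default locale that ResourceBundle.getBundle performs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.ResourceBundle;

/** Hypothetical illustration of the resource bundle naming convention. */
final class BundleNameDemo {

    /** Lists candidate .properties file names, most specific first. */
    static List<String> candidateFiles(String baseName, Locale locale) {
        ResourceBundle.Control control =
                ResourceBundle.Control.getControl(ResourceBundle.Control.FORMAT_PROPERTIES);
        List<String> files = new ArrayList<>();
        for (Locale candidate : control.getCandidateLocales(baseName, locale)) {
            // toBundleName appends "_" plus the locale; the root locale adds nothing.
            files.add(control.toBundleName(baseName, candidate) + ".properties");
        }
        return files;
    }
}
```

For the base name "ui" and Locale.JAPAN, the candidates are ui_ja_JP.properties, ui_ja.properties, and ui.properties, so a single ui_ja.properties file serves every Japanese locale.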


Implementing the editor

The Html2XmlEditorPlugin extends BaseEditorPlugin, a base class that requires us to implement the renderEditor method.

Listing 2. Html2XmlEditorPlugin class
package sample.mashupcenter.tidyhtml;

import : : : : :   // omitted from listing

/**
 * This plugin uses JTidy to convert Html to Xml.
 */
public class Html2XmlEditorPlugin extends BaseEditorPlugin {

    private static final Log log = LogFactory.getLog(Html2XmlEditorPlugin.class);

    public  static final String I18N_RESFILE = Html2XmlConstants.PLUGIN_NAME
                                         + ".nls.tidyhtml";
    public  static final String HTTP_BASEURL = "plugins/"
                                         + Html2XmlConstants.PLUGIN_NAME +  "/";
    public  static final String RES_BASEURL  = "/" + HTTP_BASEURL;
    public  static final String HELPPATH     = HTTP_BASEURL + "help/tidyhtml.htm";

The first statement in the class creates a static log instance from the Apache Commons Logging package. This is the same logging infrastructure used by the feed generation framework. Log messages will be interleaved with those from the feed generation framework and will be written to the file <WebApplication>/META-INF/logs/javamashuphub.log. By default, only messages at level WARN and above (for example, ERROR) are written to the log file.

The next statement is the string constant to a resource file. Even if there is no need to translate all the string constants to different languages, it is still good programming practice to keep all text strings displayed by the user interface in a separate resource file. Note that the file tidyhtml.properties is placed in the same directory as the ui.properties file described earlier.

The last few string constants define paths to various resources needed by the plug-in. Note that the runtime location of those resources mimics the structure of the plug-in ZIP file.


renderEditor method

The renderEditor method is called by the framework when users choose to create a new feed using this plug-in, or when users edit an existing feed previously created by this plug-in. As we will see, the method takes two parameters of types RequestData and Entry. These two parameters are actually common to all methods invoked by the framework in response to user actions. RequestData contains information sent from the browser, and Entry contains all the information maintained by the framework for this feed instance.

Listing 3. renderEditor method body
public ViewBean renderEditor(RequestData rdata, Entry entry)
{
    ResourceBundle i18n = ResourceBundle.getBundle(I18N_RESFILE,rdata.getLocale());
    String pluginId = this.getId();

    Html2XmlUrlViewBean htViewBean = new Html2XmlUrlViewBean();
    htViewBean.setEntry(entry);
    htViewBean.setHtmlUrl( entry.getAttribute(Html2XmlUrlViewBean.PARAM_HTMLURL ) );
    htViewBean.setSnapshot( entry.getAttribute(Html2XmlUrlViewBean.PARAM_SNAPSHOT ) );

    FormViewBean form = new FormViewBean();
    form.setSuffix( htViewBean.getSuffix() );
    form.addComponent( htViewBean );
    form.setOnsubmit( PluginHelper.getClientMe(pluginId, entry.getObjectId()) +
            ".invokeServer('displayHtmlPage',"  +
            PluginHelper.getClientId(pluginId, entry.getObjectId()) +
            "_" + form.getSuffix() + ");");
    form.setEntry(entry);

    FrameViewBean frame = new FrameViewBean();
    frame.addComponent(form);
    frame.setLabel(entry.getTitle());
    frame.setTitle(i18n.getString("frame.urltitle"));
    frame.setEntry(entry);
    frame.setHelpPath( HELPPATH );
    return frame;
}

The method returns an instance of type ViewBean. A ViewBean is similar to a Java Bean, with getters and setters for display properties. Its main purpose is to specify the JSP used by the feed generation framework to create HTML for the plug-in specific editor. Since the renderEditor method could be called to edit an existing instance, it retrieves any previously saved data for this instance by calling the Entry::getAttribute method. We will see later when and how these data are saved. The retrieved values are then passed to the Html2XmlUrlViewBean so that the associated JSP can display the values previously supplied by the user. Note that the plug-in specific Html2XmlUrlViewBean is not returned directly, but instead is wrapped inside a FormViewBean instance via the addComponent method. The FormViewBean provides the custom JavaScript logic to send the user-entered information to the plug-in when the Next button from the wizard-like editor interface is clicked. Finally, the FormViewBean is in turn wrapped inside a FrameViewBean. It is the latter which is returned.

One last note before moving on. We call the setOnsubmit method to provide a chunk of JavaScript code to execute when the Next button is clicked. The JavaScript code calls the hub.managers.InvokePlugin's invokeServer function described in Section 6.3.2 of the Application Programming Interface Reference. The first parameter specifies the displayHtmlPage method in this class which will be used to service the next editor page.


Html2XmlUrlViewBean and its associated JSP file

We have briefly described the Html2XmlUrlViewBean already in the previous section. Listing 4 shows part of the class definition:

Listing 4. Html2XmlUrlViewBean
public class Html2XmlUrlViewBean extends ViewBean
{
  public static final String PARAM_HTMLURL     = "htmlurl";
  public static final String PARAM_SNAPSHOT    = "snapshot";

  private String  htmlUrl;
  private String  snapshot;

  public Html2XmlUrlViewBean()
  {
    this.setI18NProperties( Html2XmlConstants.PLUGIN_NAME + ".nls.tidyhtml");
  }

  /* (non-Javadoc)
   * @see com.ibm.mashuphub.model.ViewBean#getJSPPath()
   */
  @Override
  public String getJSPPath() {
    return "/server/plugins/" + Html2XmlConstants.PLUGIN_NAME + "/tidyhtmlUrl.jsp";
  }

  public String getSuffix() {
    return "tidyhtmlUrl";
  }

The getJSPPath method will be called by the FormViewBean when it tries to generate the HTML form for gathering these plug-in specific parameters. The getSuffix method should return a string unique among the various ViewBeans from this plug-in. Before looking at the associated JSP file, it helps to first look at the rendered HTML form:

Figure 2. InfoSphere MashupHub first editor page
Screen capture of page with input of url

Notice that the form has two input elements:

  • A textfield for collecting the URL for the HTML the user wants to convert to XML, and
  • A checkbox to indicate that we will save the generated XML at the first invocation and will simply return the XML in subsequent feed generation requests. This is appropriate when the HTML page is static and rarely changes.

Now that we have seen what is to be generated, it is much easier to understand the JSP file.

Listing 5. tidyhtmlUrl.jsp
<%@page import="java.util.ResourceBundle"%>
<%@page import="sample.mashupcenter.tidyhtml.Html2XmlUrlViewBean"%>
<%
    // Retrieve the ViewBean that the framework associated with this request.
    Html2XmlUrlViewBean htViewBean = new Html2XmlUrlViewBean();
    htViewBean = (Html2XmlUrlViewBean) htViewBean.getViewBeanFromRequest(request);
    ResourceBundle i18n = ResourceBundle.getBundle(htViewBean.getI18NProperties(),
                                                   request.getLocale());

    // Per-instance id prefix, so that several editors can be open at once.
    String objectId = htViewBean.getEntry().getObjectId();
    String id = com.ibm.mashuphub.helper.PluginHelper.getClientId(
                                                htViewBean.getPluginId(), objectId);
%>

<br/>

<label for='<%=id%>_htmlurl'><%=i18n.getString("form.htmlurl.label") %></label>
<div  class="rightCol">
   <input type='text'
          id='<%=id%>_htmlurl'
          name='<%= Html2XmlUrlViewBean.PARAM_HTMLURL %>'
          value='<%= htViewBean.getHtmlUrl() %>'
          maxlength='256' style='width:600px;' />
</div>

<div   class="rightCol">
   <input type='checkbox'
          id='<%=id%>_snapshot'
          name='<%= Html2XmlUrlViewBean.PARAM_SNAPSHOT %>'
          value='y'
          <%= "y".equals(htViewBean.getSnapshot()) ? "checked" : "" %>  />
   <%= i18n.getString("form.snapshot.label") %>
</div>

Setting aside the page import at the top, the purpose of the first two statements is to retrieve the ViewBean associated with the JSP. This differs slightly from the way JSPs typically retrieve their associated Java bean, that is, directly from the request object. Corresponding to the two form fields shown in Figure 2, there are two HTML input elements of type text and checkbox respectively. Note that we use the constants PARAM_HTMLURL and PARAM_SNAPSHOT from the class Html2XmlUrlViewBean to name the two input elements. These names will appear as parameter names in the URL query string sent when the Next button is clicked. Using string constants is the best way to ensure that they correspond exactly to what the server expects. Lastly, we initialize these input elements with any previously saved values retrieved by the renderEditor method.
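
One detail the listing leaves open is what happens when the saved value is missing or contains markup characters: the JSP expression writes the string into a single-quoted attribute as is. The following sketch shows a small helper that could be called from the JSP to guard against both cases. HtmlAttr is a hypothetical class, and the assumption that getHtmlUrl may return null for a brand-new feed is mine, not something the article states.

```java
/** Hypothetical helper for writing values into HTML attributes. */
final class HtmlAttr {

    /** Returns an attribute-safe string; null becomes the empty string. */
    static String escape(String value) {
        if (value == null) {
            return "";
        }
        StringBuilder sb = new StringBuilder(value.length() + 16);
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '&':  sb.append("&amp;");  break;
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                case '"':  sb.append("&quot;"); break;
                case '\'': sb.append("&#39;");  break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

With such a helper on the class path, the value attribute in the JSP could be written as value='<%= HtmlAttr.escape(htViewBean.getHtmlUrl()) %>'.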


displayHtmlPage method

I mentioned in an earlier section that the displayHtmlPage method in the Html2XmlEditorPlugin class will be used to service the next editor page. The displayHtmlPage method is not inherited from the base class BaseEditorPlugin; it takes two parameters of types RequestData and Entry. An EditorPlugin can introduce any number of public methods with the same signature. All such methods may be invoked by the client running in the browser through an AJAX call.

Listing 6. displayHtmlPage method
public  ViewBean  displayHtmlPage(RequestData rdata, Entry entry)
{
    ResourceBundle i18n = ResourceBundle.getBundle(I18N_RESFILE,rdata.getLocale());
    String pluginId = this.getId();

    // do not use "url" since the latter got intercepted in RequestData.init();
    String  sHtmlUrl  = rdata.getParameter( Html2XmlUrlViewBean.PARAM_HTMLURL );
    String  snapshot  = rdata.getParameter( Html2XmlUrlViewBean.PARAM_SNAPSHOT );
    log.debug("snapshot,sHtml=" + snapshot + "," + sHtmlUrl );

    Html2XmlContentViewBean htViewBean = new Html2XmlContentViewBean();
    htViewBean.setEntry(entry);
    htViewBean.setHtmlUrl( sHtmlUrl );
    htViewBean.setSnapshot( snapshot );

    FormViewBean form = new FormViewBean();
    form.setSuffix( htViewBean.getSuffix() );
    form.addComponent( htViewBean );
    form.setOnsubmit( PluginHelper.getClientMe(pluginId, entry.getObjectId())
                      + ".submit();" );
    form.setEntry(entry); // must be set, used to init JS plugin object

    FrameViewBean frame = new FrameViewBean();
    frame.addComponent(form);
    frame.setLabel(entry.getTitle());
    frame.setTitle(i18n.getString("frame.tabtitle"));
    frame.setEntry(entry);
    frame.setHelpPath( HELPPATH );

    JSONAJAXResponseViewBean ajaxViewBean = new JSONAJAXResponseViewBean();
    ajaxViewBean.setMethod(JSONAJAXResponseViewBean.METHOD_SHOW_EDITOR);
    ajaxViewBean.setCode( JSONAJAXResponseViewBean.PAGE_CONTENT );
    ajaxViewBean.addComponent(frame);
    return ajaxViewBean;
}

The purpose of this method is to render a second editor page for users to verify the content of the retrieved HTML. Accordingly, the return type has to be ViewBean. The logic inside the displayHtmlPage method is similar to the renderEditor method we discussed earlier with three notable differences:

  • Instead of retrieving previously entered configuration values from the Entry instance, we retrieved what the user entered during this editing session by calling the RequestData's getParameter method. These parameters correspond to the input elements in the JSP form sent via an AJAX call to the server.
  • Each page requires a different ViewBean. This method instantiates an Html2XmlContentViewBean. As before, it has to be wrapped inside a FormViewBean, FrameViewBean chain. In addition, we need to further wrap the FrameViewBean in a JSONAJAXResponseViewBean instance. The latter happens automatically for the renderEditor method but needs to be done explicitly here.
  • Since we will be providing our own JavaScript, we show a slight variation in the JavaScript passed to the setOnsubmit method. Instead of calling invokeServer directly, we will be calling the submit method in the associated JavaScript.

One additional detail worth pointing out is the call to the static logger instance to log user specified parameters to help with problem determination.


Html2XmlContentViewBean and the associated JSP

The Html2XmlContentViewBean is fairly simple and basically just returns a different JSP path and suffix from the Html2XmlUrlViewBean we looked at earlier. The reader can examine it by downloading the attached package and we will not dwell on it further. The editor page to be generated is also simple, consisting of an area to display the retrieved HTML. The following screen shot shows one corner of the display area:

Figure 3. Preview HTML content page
InfoSphere MashupHub screen cap illustrating corner of the HTML content page

We next examine the associated JSP file tidyhtmlContent.jsp. To generate the display area, you can see that the associated JSP simply includes a single div element at the bottom of the JSP file. Since we will be using the id attribute later, this is a good place to discuss its construction.

The id attribute must be unique among all HTML elements within a browser window. Using the id, the browser-provided API can retrieve HTML elements as JavaScript DOM objects, allowing dynamic manipulation. Since a user could have multiple instances of a given plug-in editor open at the same time, HTML elements in the JSP template will be instantiated multiple times. To ensure that the ids of such elements are unique, we call the PluginHelper's getClientId method to retrieve the unique feed instance id and use it as a prefix for the element's id.

Listing 7. tidyhtmlContent.jsp
<%@page import="java.util.ResourceBundle"%>
<%@page import="sample.mashupcenter.tidyhtml.Html2XmlContentViewBean"%>
<%
    // Retrieve the ViewBean that the framework associated with this request.
    Html2XmlContentViewBean htViewBean = new Html2XmlContentViewBean();
    htViewBean = (Html2XmlContentViewBean) htViewBean.getViewBeanFromRequest(request);
    ResourceBundle i18n = ResourceBundle.getBundle(htViewBean.getI18NProperties(),
                                                   request.getLocale());

    // id prefixes the HTML element ids; me names the client-side plug-in object.
    String objectId = htViewBean.getEntry().getObjectId();
    String id = com.ibm.mashuphub.helper.PluginHelper.getClientId(
                                                 htViewBean.getPluginId(), objectId);
    String me = com.ibm.mashuphub.helper.PluginHelper.getClientMe(
                                                 htViewBean.getPluginId(), objectId);

    // Quote the values so they can be emitted as JavaScript string literals.
    String   snapshot    = "\"" + htViewBean.getSnapshot() + "\"";
    String   htmlUrl     = htViewBean.getHtmlUrl();
    htmlUrl     = ( htmlUrl == null ?  "\"\""   :  "\"" + htmlUrl + "\"" );
%>
<script type="text/javascript">

    dojo.registerModulePath("plugins.tidyhtml" ,
                 "../../../../client/plugins/sample.mashupcenter.tidyhtml/script");
    dojo.require("plugins.tidyhtml.PreviewHtml");

    new plugins.tidyhtml.PreviewHtml(
               <%= me %>.plugin_id,
               <%= me %>.entry_id,
               <%= me %>.workflow);

    <%=me%>.init( <%= htmlUrl %> , <%= snapshot %> );
    <%=me%>.onLoadEditor();

</script>

<div id='<%=id%>_htmlContent' style='width:100%;
     overflow:auto; border: 2px  solid #000000;'>
</div>

One new aspect of this JSP is the inclusion of custom JavaScript to be run on the client side. The IBM Mashup Center feed generation framework uses the Dojo AJAX package. See the Resources section for the link to the Dojo documentation. We will be using the Dojo AJAX package in our custom JavaScript. Most of the custom JavaScript resides in a Dojo class named "plugins.tidyhtml.PreviewHtml".

To use it, we need to import it using a dojo.require function call. The Dojo registerModulePath function call is used to tell Dojo how to locate classes from the "module" plugins.tidyhtml. Note that the specified path is relative to where the Dojo package is located and hence requires the backward reference "../../../..". The above initialization logic is generated inline enclosed inside a script tag. In addition, the inline JavaScript creates an instance of the PreviewHtml class and calls its init and onLoadEditor methods. The next section examines in greater detail the PreviewHtml class.


PreviewHtml Dojo class

The PreviewHtml Dojo class inherits from the hub.managers.InvokePlugin class which is part of the client side feed generation framework. The InvokePlugin class is further described in section 6.3.2 of the Application Programming Interface Reference, Version 1.0. The methods of importance in the PreviewHtml Dojo class are onLoadEditor and populateContent.

Listing 8. PreviewHtml Dojo class
onLoadEditor: function()
{
    this.id = this.getEditorId();
    this.htmlContentNode = dojo.byId( this.id + '_htmlContent' );

    this.populateContent();
},

populateContent: function( )
{
    console.log( "populateContent called" );

    var baseUrl = hub.urls.getAjaxUrl( this.plugin_id,this.entry_id, 'getHtmlContent');
    // encodeURIComponent also encodes characters such as "+" that escape() leaves alone
    var htmlurl = baseUrl + "?htmlurl="   + encodeURIComponent( this.htmlUrl   );
    if ( this.htmlContentInternalNode )
        this.htmlContentNode.removeChild( this.htmlContentInternalNode );
    this.htmlContentInternalNode = document.createElement( 'iframe' );
    this.htmlContentInternalNode.setAttribute( "src", htmlurl ); 
    this.htmlContentInternalNode.setAttribute( "width", "100%" );
    this.htmlContentInternalNode.setAttribute( "height", "400px" );  
    this.htmlContentNode.appendChild( this.htmlContentInternalNode );
},

The function populateContent is called by onLoadEditor at page load time. It dynamically creates an iframe to display the retrieved HTML. The iframe isolates any style sheets and scripts included in that HTML, preventing them from affecting the appearance of the rest of the page. The dynamically created iframe is appended to the static div created by the JSP. To retrieve the DOM node corresponding to the display area, we use the unique id of the div element, which is formed from the unique feed instance id followed by a common suffix.

On the server side, we used a method on the PluginHelper class to get the unique feed instance id. On the browser side, we call the getEditorId function inherited from the PreviewHtml Dojo class's parent, hub.managers.InvokePlugin. To retrieve the HTML content, we take advantage of the iframe src attribute. The iframe automatically retrieves and displays the content pointed to by the src attribute during initialization. We set the src attribute to invoke the editor plug-in's getHtmlContent method. Note the way we create the URL: we pass the method name "getHtmlContent" to the getAjaxUrl function and then append the encoded user-supplied URL to the result as the htmlurl query parameter.


AJAX method to retrieve HTML

I mentioned in an earlier section that any public methods with RequestData and Entry as parameters may be invoked using an AJAX call. In particular, the method getHtmlContent can be called by the PreviewHtml Dojo class to return HTML from the user supplied URL. Because the actual HTML retrieval is common to feed generation and will be covered later, I will not provide any code snippets here. The only thing I want to point out is the return type of the method. In the earlier example, the AJAX method displayHtmlPage returns a ViewBean. AJAX methods in general can return any object and its toString value will be returned. See section 6.3.2 of Application Programming Interface Reference, Version 1.0.
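
The article defers the retrieval code to the download. Purely as an illustration of the kind of logic involved, the following sketch fetches a page with the standard java.net classes and decodes it as UTF-8. HtmlFetchSketch is hypothetical and is not the author's implementation; it uses none of the Mashup Center APIs, and the ten-second timeouts are arbitrary choices.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch of fetching a page as a string; not the article's code. */
final class HtmlFetchSketch {

    /** Reads the whole resource at pageUrl and decodes it as UTF-8. */
    static String fetch(String pageUrl) throws IOException {
        URLConnection conn = URI.create(pageUrl).toURL().openConnection();
        conn.setConnectTimeout(10_000); // milliseconds, arbitrary
        conn.setReadTimeout(10_000);
        try (InputStream in = conn.getInputStream()) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[8192];
            for (int n = in.read(buffer); n > 0; n = in.read(buffer)) {
                out.write(buffer, 0, n);
            }
            // Assumes UTF-8 input; other encodings would need detection.
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        }
    }
}
```

A method such as getHtmlContent could return a string obtained this way, since the framework sends back the toString value of whatever the AJAX method returns.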


Our last editor method: saveFeedEntry

saveFeedEntry is another public method of Html2XmlEditorPlugin invoked via AJAX. It handles the final step in the editing process: saving what the user has entered. It is similar to the save methods in other plug-ins. What's new is "resource" handling. Resources differ from attributes in size and type. Resources can be binary and can be up to one gigabyte in size. In contrast, attributes are limited to strings of up to 10MB. The size limit for attributes should be sufficient for the content, but for illustrative purposes, we will save the converted content as a resource.

When the snapshot option is checked, the generator retrieves the HTML content from the specified URL only once. The HTML content is then converted to XML and saved. All subsequent feed generation requests are satisfied from the saved XML. To handle the case where the user wants to take another snapshot because the site might have changed, we delete the saved copy whenever the feed is edited. The code fragment shows how this is done in a two-step process: retrieve the resource by name, then call the deleteResource method on the returned object.

Listing 9. saveFeedEntry method
try {
    entry.generateURL(rdata.getBaseUrl(), this.getId() );
    entry.addAttribute(Html2XmlUrlViewBean.PARAM_HTMLURL, sHtmlUrl , this.getId() );
    entry.addAttribute(Html2XmlUrlViewBean.PARAM_SNAPSHOT , snapshot , this.getId() );
    // after every edit, cleanup any previously cached snapshot
    Resource  oldRes = entry.getResource( Html2XmlConstants.CACHED_XHTML );
    if ( oldRes != null )
        oldRes.deleteResource();
} catch (HubException ex) {
    log.error("Error adding entry attribute.",ex);
}

We are finally done with the Editor and on to the Generator.


GeneratorPlugin

The Html2XmlGeneratorPlugin class extends the BaseGeneratorPlugin and must implement the abstract method generateFeed. It should be no surprise that its input parameters, of types RequestData and Entry, are the same as those passed to the EditorPlugin methods called by the feed generation framework. To generate the feed, one must first retrieve the attributes containing the configuration information saved during the editing process. This is done by calling the getAttribute method from Entry.

Listing 10. generateFeed method
public FeedContent generateFeed(RequestData rdata, Entry entry) {
    String  sHtmlUrl  = entry.getAttribute(Html2XmlUrlViewBean.PARAM_HTMLURL );
    String  snapshot  = entry.getAttribute(Html2XmlUrlViewBean.PARAM_SNAPSHOT );

Since this plug-in has no parameterization support, we do not need to retrieve the runtime supplied parameters. We will either return the saved XML content or retrieve the HTML content and convert to XML using JTidy. The logic illustrates how resources are created and is fairly straightforward.

Listing 11. generateFeed body
String result = "Html might have changed.  Table not found.";

Resource  oldRes = entry.getResource( Html2XmlConstants.CACHED_XHTML );
if ( "y".equals( snapshot ) &&  oldRes != null  ) {
    log.warn( "returning cached, snapshot=" + snapshot );
    return new FeedContent(oldRes.loadResource(), entry.getLifeTime());
}

String sHtml = getXhtml( sHtmlUrl );
if ( sHtml.length() > 0 ) {
    result = sHtml;
    if ( "y".equals( snapshot ) ) {
        try {
            Resource prepared = new Resource();
            prepared.setObjectid( entry.getObjectId() );
            prepared.setMimetype( "text/xml; charset=utf-8" );
            prepared.setFilename( Html2XmlConstants.CACHED_XHTML );
            prepared.uploadResource( sHtml.getBytes( "utf-8" ) );
        } catch (HubException e) {
            log.error(e);
        }
    }
}
return new FeedContent( result.getBytes( "utf-8" ), entry.getLifeTime());

Further details on how HTML input is converted to XML can be found in the Java source files. I will just mention two key points. To make the output XML usable by the Feed Mashup Editor, we stripped out any DOCTYPE declaration. In addition, the generation logic makes the simplifying assumption that the input HTML is in UTF-8; it needs to be enhanced to support pages in other character encodings.
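
The download contains the author's actual conversion code. As a rough idea of what stripping the DOCTYPE can look like, the sketch below removes the first DOCTYPE declaration with a regular expression. DoctypeStripper is a hypothetical class, and the pattern assumes a DOCTYPE without an internal subset (no square-bracketed section), which holds for the standard XHTML DOCTYPEs.

```java
import java.util.regex.Pattern;

/** Hypothetical sketch; the article's download has its own implementation. */
final class DoctypeStripper {

    // Matches "<!DOCTYPE ... >" up to the first ">" plus trailing whitespace.
    // Assumes the declaration has no internal subset containing ">".
    private static final Pattern DOCTYPE =
            Pattern.compile("<!DOCTYPE[^>]*>\\s*", Pattern.CASE_INSENSITIVE);

    /** Returns the XML text with its first DOCTYPE declaration removed. */
    static String strip(String xml) {
        return DOCTYPE.matcher(xml).replaceFirst("");
    }
}
```

Text that contains no DOCTYPE declaration is returned unchanged.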


Deployment

The complete Eclipse project with all the source files is available as a zip file in the download area. In addition, to make it easy to try out the plug-in, the plug-in zip (sample.mashupcenter.tidyhtml.zip) file is also provided. To install the plug-in, perform the following steps:

  1. Download Tidy.jar from the link in the Resources section.
  2. After removing class files from the package org.w3c.dom, add Tidy.jar to the plug-in zip file under the directory WEB-INF/lib.
  3. Place the plug-in zip file in the <WebApplication>/WEB-INF/plugins directory.
  4. Stop and restart the server.

Conclusion

We have just walked through the construction of a more complicated plug-in involving multiple editor pages, custom JavaScript and saving of resources. You now have the basics to begin extending the feed generation capabilities of IBM Mashup Center. A subsequent article will discuss more advanced topics such as security and parameterization.


Download

Description                 Name           Size
Samples for this article    Download.zip   325KB

Resources

Learn

Get products and technologies

Discuss
