Make HTML5 microdata useful, Part 2: Next generation aggregation with microdata

Create a decentrally managed site with microdata and Drupal

Part 1 of this series showed how to use microdata with Schema.org terms so search engines can display your content better in search results. It also showed how to reuse that same microdata markup to improve the display on your own site. In this article, learn to use microdata to enable a collaborating group of site owners to easily hook up their sites and share content on a centralized site.


Lin Clark, Drupal Developer, Independent Drupal Consultant

Lin Clark is a Drupal developer specializing in Linked Data. She is the maintainer of multiple Drupal modules, such as Microdata and SPARQL Views, and is an active participant in the W3C’s HTML Data Task Force and Drupal's HTML5 initiative. She attended Carnegie Mellon University and is finishing a research master's degree at the Digital Enterprise Research Institute at NUI Galway. More information is available at lin-clark.com.



06 March 2012


Introduction

Large organizations often invest in large, centralized, standardized IT systems, such as a monolithic CMS, then try to get everyone to use that system. Unfortunately, getting everyone to use the system in the correct way is a challenge. Investing in a one-size-fits-all approach rarely delivers its promised productivity gains. It's especially hard to standardize or control loosely coupled organizations where teams interact rarely and make decisions independently. Examples of loosely coupled organizations include:

  • Departments in a university
  • Companies and individuals in an open source community
  • Teams in an amateur sports league

In Part 1 of this series, you learned about using generic scripts on top of microdata. You wrote a snippet of HTML to give you an interactive event map and to enable Google, Bing, and Yahoo to display your page better in search results with Rich Snippets.

Frequently used abbreviations

  • RDF: Resource Description Framework
  • RSS 2.0: Really Simple Syndication

In this article, learn how microdata can enable a collaborating group to easily hook up their sites and share content on a centralized group site. By agreeing on a small set of attributes to place in the HTML markup, loosely coupled organizations can maintain the independence of their information systems while still building a joint project.


Scenario: Creating a decentralized documentation system for Drupal

Many open source projects have difficulty maintaining robust and up-to-date documentation for their software. At the same time, contributors to the project share thorough technical explanations using blog posts, which are often aggregated on a Planet. A Planet is a blog aggregator that pulls in posts from selected authors (see Resources).

The Planet is an effective way to engage in current events and discussions in the community, but it doesn't fulfill its potential as a collaborative technology. It's difficult to filter through archives of Planet posts because they don't retain much of the original structured data. Even when the aggregated posts do contain helpful structured data, such as tags, posts from different sites often don't share terms or they use different spellings for terms. Thus, you can't effectively sort the posts.

You'll solve this problem by creating an aggregator that pulls in the blog posts and important extra information about the posts. You can use the aggregator to navigate the posts and put them in relevant places within the main site of the project.

The hypothetical system will document Drupal. Some widely used subsystems, called modules in Drupal, are well explained in blog posts but lack good documentation in the Drupal.org handbooks. The goal is to move that great documentation from the Planet into an easily searchable structure on Drupal.org.


Planning the system

The first task is to determine what information you want to pass from the blog posts to the central documentation system. For example, you'll want to indicate which modules this post talks about. There are often differences in the way modules work between major versions, so it's a good idea to indicate whether the tutorial is specific to a certain version of the module.

There are different roles in working with Drupal, from content editing to back-end development. It would be beneficial to indicate which roles will find the post helpful. The documentation record in Figure 1 shows the title, a description, audience, and related modules.

Figure 1. Example documentation record
Screen capture of an example documentation record

The data-sharing needs in the scenario are fairly simple. You need to pass only the following from the source blog post to the site:

  • Title
  • Teaser paragraph
  • URL
  • Audience
  • Modules

Title, teaser, and URL are already available in a structured format through RSS. You need to find a solution to pull in the structured data about the audience and modules. For this you'll use microdata. Before starting to work with microdata, though, you need to set up the source and target sites for testing.
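
To make the split concrete, here is a minimal Python sketch of what a consumer gets from RSS alone. The feed text and the example.com URL are made up for illustration, and this is not how the Feeds module is implemented. It only shows that title, teaser, and URL are available as separate elements, while audience and modules are not.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed with a single item; the URL is invented.
RSS = """<rss version="2.0">
  <channel>
    <title>Example blog</title>
    <item>
      <title>Building modules on top of SPARQL Views</title>
      <link>http://example.com/sparql-views-modules</link>
      <description>This video demonstrates how you can build a module.</description>
    </item>
  </channel>
</rss>"""


def read_items(rss_text):
    """Return title, teaser, and URL for every item in an RSS 2.0 feed."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "url": item.findtext("link", default=""),
            "teaser": item.findtext("description", default=""),
        })
    return items


items = read_items(RSS)
```

Nothing in the resulting dictionaries says who the post is for or which modules it covers. That is the gap the microdata fills.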


Source sites

To parse and process the incoming posts, you will use a Microdata Import module. The module expects a feed URL, so the source must be able to output RSS or Atom.

You can use a CMS like Drupal, which has tools to automate the placement of microdata, or you could use another blogging system (as long as that system doesn't strip microdata attributes). For the Microdata Import module, each item that's imported must correlate to a single feed item, so post each tutorial on its own page.

The scenario uses the hosted blogging platforms Blogger and Drupal Gardens. You can set up your own sources, or use these:

All of the necessary information is directly in the HTML markup, so the tool you use for the source doesn't matter. The microdata in the HTML serves as a standard, read-only API without regard to the back-end code that produced it.


Setting up the target site

With sources to test, you can start setting up the aggregating site. First, the basic setup:

  1. Install Drupal 7 and download the following modules:
    • Microdata Import
    • Feeds
    • Ctools
    • Job Scheduler
    • Libraries
    • HTTP Client
  2. Enable Microdata Import and the Feeds Admin UI. You will be asked to enable the other downloaded modules that they depend on.
  3. Download the MicrodataPHP library to sites/all/libraries/MicrodataPHP/MicrodataPhp.php.

    This library takes an HTML page and extracts the microdata.

Configure the import settings to use when you pull in the sources:

  1. Go to Structure -> Content types and create two content types: one to manage the feeds, and one to hold the tutorials themselves. You can call them Tutorial import and Tutorial. Leave all of the settings at their defaults.
  2. Go to Structure -> Feeds importers and add an importer.
  3. Click Settings in the left column in the Basic Settings section. In the Attach to content type dropdown, select the Tutorial import content type you just created and save it, as in Figure 2.
    Figure 2. Configuring the basic settings of the Feeds Importer
    Screen capture of configuring the basic settings of the Feeds Importer
  4. Next to Parser, click Change. Switch to Microdata Import Parser (from RSS/Atom) and Save.

    You will see a confirmation at the top of the screen that says "Changed parser plugin."

  5. Under Processor, click Settings. Change the Update settings to Update existing nodes, as in Figure 3, and change the Content type by selecting Tutorial.

    Change the Text format to Filtered HTML. Because you're importing content from sites that you don't necessarily trust, you should not use Full HTML. It would make your site vulnerable to cross-site scripting.

    Figure 3. Configuring settings for processing the feed item into a node
    Screen capture of configuring the settings for processing the feed item into a node
  6. Under Processor, click Mapping. This is where you define what bits of the source post to add to the target node, and where they are added. Because you haven't yet added the information about the available microdata content, the only elements listed are those that are exposed in RSS/Atom.
  7. Map the URL to URL and click Add. Check the Unique Target check box and click Save. This ensures that on subsequent runs you can match the items and replicate any changes from the source to the target.
  8. Map the Title to Title and click Add.
  9. Map the Description to Body and click Add, as in Figure 4.
Figure 4. Mappings from source to target
Screen capture of the mappings from source to target
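
The effect of the three mappings and the Unique Target setting can be sketched in a few lines of Python. This is only an illustration of the idea, not the Feeds processor itself. The store dictionary, the field names, and the example.com URLs are assumptions made for the sketch.

```python
# Source element -> target field, mirroring the three mappings above.
MAPPING = {"url": "url", "title": "title", "description": "body"}
UNIQUE_TARGET = "url"  # the target marked as unique


def import_items(store, source_items):
    """Create or update one node per source item, keyed by the unique target."""
    created = updated = 0
    for source in source_items:
        node = {target: source.get(src, "") for src, target in MAPPING.items()}
        key = node[UNIQUE_TARGET]
        if key in store:
            updated += 1
        else:
            created += 1
        store[key] = node
    return created, updated


store = {}
first = [{"url": "http://example.com/post-1", "title": "Old title", "description": "Teaser"}]
second = [{"url": "http://example.com/post-1", "title": "New title", "description": "Teaser"}]
first_run = import_items(store, first)    # the node is created
second_run = import_items(store, second)  # the same node is updated
```

Because the URL identifies the node, the second run replaces the title instead of creating a duplicate.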

To test that you can import content:

  1. Click Add content and add a new Tutorial import.
  2. Give it a title of Source 1.
  3. In the Feed URL field, add the source feed and Save.
  4. Select the Import tab and click Import. You should get a message that one or more nodes was created, as in Figure 5.
Figure 5. Nodes imported from the source
Screen capture of the nodes imported from the source

Go to the home page to see the posts that were imported. It might not be clear why you need to do anything more, since you already imported the whole post. However, because you didn't keep the structure from the original post intact, you cannot filter the posts based on audience or module yet. This is where the microdata comes in.

Figure 6. An imported node
Screen capture of an imported node with title, description, audience, and modules information

Marking up source content for consumption

Now that you can pull in the feeds properly, start adding the microdata to the markup and bringing it in with the posts. Listing 1 shows the basic markup for a blog post.

Listing 1. Basic HTML markup for a post
<h2>Building modules on top of SPARQL Views</h2>
<div>

 <p>This video demonstrates how you can build a module that installs a 
 View powered by a SPARQL query whenever it is enabled.</p>
  <b>Audience:</b> Developer <br />
  <b>Modules:</b>
  <ul>
    <li>Views</li>
    <li>SPARQL Views</li>
  </ul>
</div>

You'll want to indicate that the content is an article. This scenario uses the Schema.org vocabulary to mark up the articles because Schema.org has terms for most of the things that need to be annotated (see Resources). You could use a different vocabulary if all of the collaborating authors agree to it. "Combine Drupal, HTML5, and microdata" (see Resources) goes into more detail about how to place microdata, either by hand or automatically with the Microdata module.

You're pulling the title from the RSS feed so you don't have to mark up the title. However, marking it up makes it easier for other consumers to reuse the data. Use the name property, as in Listing 2. Since the title is outside of the article div, you have to add a meta element, which gives the title, inside the div. Use the description property for the teaser paragraph, which will give more fine-grained access than the RSS description does.

Listing 2. Adding basic microdata to the post
<h2>Building modules on top of SPARQL Views</h2>
<div itemscope="" itemtype="http://schema.org/Article">

  <meta itemprop="name" content="Building modules on top of SPARQL Views" />

    <p itemprop="description">This video demonstrates how you can build a module 
       that installs a View powered by a SPARQL query whenever it is enabled.</p>
    ...
</div>
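
To see what a consumer can read from this markup, here is a simplified Python sketch that pulls the name and description properties out of Listing 2. It is a stand-in for illustration only. It is not the MicrodataPHP library and does not implement the full microdata extraction algorithm (it ignores nested items, itemref, and URL-valued properties).

```python
from html.parser import HTMLParser

POST = """
<h2>Building modules on top of SPARQL Views</h2>
<div itemscope="" itemtype="http://schema.org/Article">
  <meta itemprop="name" content="Building modules on top of SPARQL Views" />
  <p itemprop="description">This video demonstrates how you can build a module
     that installs a View powered by a SPARQL query whenever it is enabled.</p>
  <b>Audience:</b> Developer <br />
</div>
"""


class FlatMicrodataParser(HTMLParser):
    """Collect the itemtype and the top-level text properties of one item."""

    def __init__(self):
        super().__init__()
        self.itemtype = None
        self.props = {}
        self._current = None  # property whose text is being collected
        self._tag = None      # tag that opened the current property
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemtype" in attrs:
            self.itemtype = attrs["itemtype"]
        prop = attrs.get("itemprop")
        if not prop:
            return
        if tag == "meta":
            # A meta element carries its value in the content attribute.
            self.props[prop] = attrs.get("content", "")
        else:
            self._current, self._tag, self._buffer = prop, tag, []

    def handle_data(self, data):
        if self._current:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if self._current and tag == self._tag:
            # Collapse line breaks and runs of spaces into single spaces.
            self.props[self._current] = " ".join("".join(self._buffer).split())
            self._current = None


parser = FlatMicrodataParser()
parser.feed(POST)
```

After parsing, parser.props holds the title and the teaser paragraph only. The Audience line is left out because it sits outside the description element.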

Updating the description

Now that it is marked up with microdata, you can pull just the description out of the text. This will exclude the audience and related modules from the description, which is good since you will later pull them into their own fields. Change the mapping to use the microdata description instead of the RSS description.

  1. Go to Structure -> Feeds importers and edit your importer.
  2. Under Parser, click Settings. Enter an example source page in the field, as in Figure 7.

    The example page will be parsed to see which properties are available, so the example should be as complete as possible. Save the settings.

    Figure 7. Providing property paths using an example page
    Screen capture of providing property paths using an example page
  3. Under Processor, click Mapping. In the Description row, check Remove and then Save. This will remove the mapping between the RSS description and the body field.
  4. Click the Select a source drop-down item. The list now includes new source elements, which were determined from the example source.
  5. Select the new description element (the second description element in the list), map it to the body, and click Add, as in Figure 8.
    Figure 8. Updating the description mapping
    Screen capture of updating the description mapping
  6. Find your Source 1 Tutorial import, and click the Import tab. Click the Import button and the nodes will be updated. The Audience and Module text are no longer combined with the description, as in Figure 9.
    Figure 9. Updated node using the microdata description instead of the RSS description
    Screen capture of updated node using the microdata description instead of the RSS description

The update did more than just remove the parts of the description you didn't want. It performed a full update of all of the imported information. If you changed the wording of the source post, the new wording would show up here as well. The update function is configured to run on cron, so you don't need to trigger it manually to get regular updates. You can see the power of the Feeds system. It enables the easy, automated synchronizing that's necessary to create an effective network of collaborating sites.


Adding audience roles

Thus far, you've marked up bits of information that are common to all articles: title and description. In this section, you will move on to information that is specific to the scenario. It's important to work with the collaborating group to learn how they naturally think of the content they're providing. The information should reflect the mental model of the collaborators (rather than conforming to an external ideal).

The Drupal community already has a well defined set of roles, expressed in the division of tracks at conferences. The Drupal Skill Map project defines the roles as:

  • System Architect
  • Developer
  • Themer
  • Site Builder
  • Content Editor
  • Design/UX
  • Project Manager
  • Drupal Marketer

You need to indicate that the audience for the article is one or more of the eight groups. Unfortunately, the Schema.org vocabulary doesn't have a concept of audience, which leaves you with two options for handling the itemprop:

  • Extend Schema.org using its documented extension mechanism: take an existing property and add a slash followed by an extension name, such as /audience, to the end.

    For example, you could extend the keywords property to be keywords/audience, as follows.

    Audience: <span itemprop="keywords/audience">Developer</span>
  • Use a term from another vocabulary or create your own vocabulary.

    For example, if a Tutorial vocabulary had an audience property (and if you could use strings as values of that property), you could use that alternate property. Because you use the http://schema.org/Article itemtype, you have to reference the Tutorial vocabulary property by its full URL instead of by the short property names you've been using. The full URL would be something like http://tutorial-vocabulary.org/audience. The exact URL would be specified in the vocabulary documentation. The URL would be placed in the itemprop attribute, as follows.

    Audience: <span
    itemprop="http://tutorial-vocabulary.org/audience">Developer</span>

For this scenario, go with the first option and extend Schema.org. If you are placing the microdata by hand, add the keywords/audience span shown above to the body of your post. If you are using Drupal to automatically place microdata on the source site, you can create a List (text) field on the post, which gives you check boxes on the post for selecting the audiences. By adding the keywords/audience property in the field settings, the proper microdata will be automatically output.
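
A consumer has to tell the two kinds of property names apart. The following Python sketch shows one way to do that. The function name and the returned dictionary shape are made up for illustration, and http://tutorial-vocabulary.org/audience is the same hypothetical URL used above.

```python
from urllib.parse import urlparse


def describe_itemprop(name):
    """Classify an itemprop value as a plain, extended, or full-URL property."""
    parsed = urlparse(name)
    if parsed.scheme in ("http", "https") and parsed.netloc:
        # Option 2: a property from another vocabulary, given as a full URL.
        return {"kind": "url", "property": name}
    if "/" in name:
        # Option 1: an existing property plus a slash and an extension name.
        base, extension = name.split("/", 1)
        return {"kind": "extension", "base": base, "extension": extension}
    return {"kind": "plain", "property": name}


extended = describe_itemprop("keywords/audience")
external = describe_itemprop("http://tutorial-vocabulary.org/audience")
plain = describe_itemprop("description")
```

With the first option, a consumer that does not know about the audience extension can still fall back to treating the value as a keyword.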


Consuming audience roles

To bring the audience element into the content on the consuming site, you need to create a field for it on the content type and then create the mapping for that field.

  1. Go to Structure -> Content types and click Manage fields for the Tutorial content type.
  2. Add an Audience text field.
  3. Go to Structure -> Feeds importers and edit your importer.
  4. Under Processor, click Mapping.
  5. Map the keywords/audience element to the new Audience field and click Add, as in Figure 10.
    Figure 10. Adding keywords/audience property to the mapping
    Screen capture of adding the keywords/audience property to the mapping
  6. Find your Source 1 Tutorial import again and click Import. Go to an updated Tutorial's page, and you'll see the Audience field is populated, as in Figure 11.
Figure 11. Updated node with audience (in full node view)
Screen capture of updated node with audience (in full node view)

Adding related modules

Arguably, using strings is adequate for categorizing posts by their audience. There are only a handful of audience roles, and the roles don't change much over time. It's not an extraordinary coordination challenge to get people to update their audience field settings if there is a change, such as adding a new role.

Categorizing by related modules is a different case, however. Drupal has 14,000 modules, most of which have multiple versions. At the very least, this means 14,000 different tags. The format of the tags can vary a lot. In addition, the thousands of module maintainers can change the name of their modules at any time.

Something more stable than strings is needed to refer to the module. One identifier that can't change arbitrarily is the module URL on Drupal.org (for example, http://drupal.org/project/views for the Views module). You can use this as a consistent identifier for modules.

To add the version of the module, you could add a property of the module item. However, for this scenario, it's easier to have a different ID for each version. For example, to identify Views 7.x-3.x, you would use the URL http://drupal.org/project/views/7/3. While currently that URL does not display a page, it's easy to imagine a page at that location that displays all of the tutorials for Views 7.x-3.x and provides a download of the release.
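
The per-version identifier can be derived mechanically from the project's machine name and its version string. The Python sketch below assumes the URL pattern described in this scenario, which does not currently resolve to a page on Drupal.org, and assumes version strings in the 7.x-3.x style. The function name is made up for illustration.

```python
import re

BASE = "http://drupal.org/project"


def module_version_id(project, version):
    """Build the per-version ID, for example views + 7.x-3.x -> .../views/7/3."""
    match = re.fullmatch(r"(\d+)\.x-(\d+)\.x", version)
    if match is None:
        raise ValueError(f"unexpected version string: {version!r}")
    core, major = match.groups()
    return f"{BASE}/{project}/{core}/{major}"


views_id = module_version_id("views", "7.x-3.x")
sparql_views_id = module_version_id("sparql_views", "7.x-2.x")
```

Two authors who write about the same release end up with exactly the same identifier, whatever they call the module in their visible text.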

To use an ID instead of a string for the value, use microdata's itemid attribute. The itemid gets placed in the same tag as the itemscope and itemtype attributes. Use a generic Schema.org term, http://schema.org/SoftwareApplication, for the itemtype. Use the about property to say that the Article is about the module.

The visible content will still be the name string. You won't use it for consumption, but it might make it easier for other consumers to work with your data. Expose it as the name property of the module, as in Listing 3.

Listing 3. Adding microdata for related modules
<b>Modules:</b>
<ul>
  <li itemprop="about" itemscope="" 
      itemtype="http://schema.org/SoftwareApplication" 
      itemid="http://drupal.org/project/views/7/3">
    <span itemprop="name">Views</span>
  </li>
  <li itemprop="about" itemscope="" 
      itemtype="http://schema.org/SoftwareApplication" 
      itemid="http://drupal.org/project/sparql_views/7/2">
    <span itemprop="name">SPARQL Views</span>
  </li>
</ul>
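
The following Python sketch shows how a consumer could read the itemid of each about item from the list in Listing 3. It is a simplified illustration, not the MicrodataPHP library or the full microdata algorithm. The class name and the shape of the collected dictionaries are assumptions.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}

MODULES = """
<ul>
  <li itemprop="about" itemscope=""
      itemtype="http://schema.org/SoftwareApplication"
      itemid="http://drupal.org/project/views/7/3">
    <span itemprop="name">Views</span>
  </li>
  <li itemprop="about" itemscope=""
      itemtype="http://schema.org/SoftwareApplication"
      itemid="http://drupal.org/project/sparql_views/7/2">
    <span itemprop="name">SPARQL Views</span>
  </li>
</ul>
"""


class NestedItemParser(HTMLParser):
    """Collect itemscope elements with their itemid and text properties."""

    def __init__(self):
        super().__init__()
        self.items = []   # every item found, in document order
        self._stack = []  # one entry per open element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        entry = {"tag": tag, "item": None, "prop": attrs.get("itemprop"), "text": []}
        if "itemscope" in attrs:
            entry["item"] = {
                "itemprop": attrs.get("itemprop"),
                "itemtype": attrs.get("itemtype"),
                "itemid": attrs.get("itemid"),
                "properties": {},
            }
            self.items.append(entry["item"])
        self._stack.append(entry)

    def handle_data(self, data):
        for entry in self._stack:
            if entry["prop"] and entry["item"] is None:
                entry["text"].append(data)

    def handle_endtag(self, tag):
        if not any(entry["tag"] == tag for entry in self._stack):
            return  # stray or void end tag
        while self._stack:
            entry = self._stack.pop()
            if entry["prop"] and entry["item"] is None:
                self._store(entry)
            if entry["tag"] == tag:
                break

    def _store(self, entry):
        # Attach a text property to the nearest enclosing item, if any.
        for parent in reversed(self._stack):
            if parent["item"] is not None:
                value = " ".join("".join(entry["text"]).split())
                parent["item"]["properties"][entry["prop"]] = value
                return


parser = NestedItemParser()
parser.feed(MODULES)
about_ids = [item["itemid"] for item in parser.items if item["itemprop"] == "about"]
names = [item["properties"].get("name") for item in parser.items]
```

The about_ids list is the stable part that the aggregator keys on. The names are kept only as a convenience for other consumers.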

Adding all of this by hand isn't trivial. If possible, you'll want tools that can help. If you use Drupal for the source, you can use the Web Taxonomy module to help content authors tag their posts. With Web Taxonomy, the autocomplete results come from a taxonomy defined on the web. When you choose a term, it will be imported to your site. This means you have access to the tens of thousands of terms in the Drupal Projects vocabulary without having them stored in your database. Whenever a new tag is added or a tag is changed, your autocomplete field will have access to it—you don't even have to think about it.

You'll also use Web Taxonomy to consume the related modules, so you need to configure it on the target site as well.


Configuring Web Taxonomy for related modules

Download and enable Web Taxonomy. To configure Web Taxonomy, you also need a module that defines which external taxonomy to use and how it can be accessed. The module for Drupal Full Projects is available at http://drupal.org/sandbox/linclark/1363774. When you enable that module, a new Drupal Full Projects vocabulary is added to your site.

Configure Web Taxonomy in the same way on both the target and the source site:

  1. Go to Structure -> Content types and manage fields on the Tutorial content type.
  2. Add the Related Modules field, as in Figure 12. Select the Taxonomy Term Reference field type and the Web Taxonomy autocomplete widget.
    Figure 12. Adding the Web Taxonomy field
    Screen capture of adding the Web Taxonomy field
  3. Choose the Drupal Full Projects vocabulary and Save the field settings.
  4. Change the number of values to Unlimited and Save the settings.

Now you can test the field by editing a tutorial and typing in a module name. The autocomplete field will provide you with suggestions, as in Figure 13. If you select one and save the tutorial, you'll see that the tag shows up when you view it. If you click through to the term page, the URL that you're using for an ID shows up on the term.

Figure 13. Web Taxonomy autocomplete for Drupal Full Projects
Screen capture of the Web Taxonomy autocomplete for Drupal Full Projects

If you are configuring the field on the source, you also need to change the itemid that's assigned to the term. Download and enable the contributed Token module, then go to Structure -> Taxonomy and edit the Drupal Full Projects vocabulary. In the field with the token to use for the itemid, change the value to [term:web_tid]. The global Web Term ID for the term will be used instead of the local path.


Consuming the related modules

At this point, you have the Web Taxonomy field available and have tested it. You can start importing to it now.

  1. Go to Structure -> Feeds importers and edit your importer.
  2. Map about:itemid to Related Modules: Web Term ID and click Add.
  3. Find your Source 1 Tutorial import again and Import. You'll see that the related modules have been added, as in Figure 14.
Figure 14. Updated node with related modules
Screen capture of the updated node with two related modules

Now that you have fully configured the import settings, you can create multiple Tutorial import nodes and import data from multiple sites. (For example, create a second Tutorial import node and add the Source 2 feed.) All of the imported data is now structured in a way that Drupal understands. You can easily set up a user interface that lets you browse the whole collection of tutorials by facets.
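
Browsing by facets becomes a simple grouping once audience and modules are stored as separate fields. The Python sketch below uses plain dictionaries in place of Drupal nodes. The second tutorial and the field names are made up for illustration.

```python
from collections import defaultdict

VIEWS = "http://drupal.org/project/views/7/3"
SPARQL_VIEWS = "http://drupal.org/project/sparql_views/7/2"

# Imported tutorials, reduced to the fields used for filtering.
tutorials = [
    {
        "title": "Building modules on top of SPARQL Views",
        "audience": ["Developer"],
        "modules": [VIEWS, SPARQL_VIEWS],
    },
    {
        "title": "Theming a View",  # hypothetical post from a second source
        "audience": ["Themer", "Site Builder"],
        "modules": [VIEWS],
    },
]


def build_facets(nodes, field):
    """Group tutorial titles by each value of a multi-valued field."""
    index = defaultdict(list)
    for node in nodes:
        for value in node.get(field, []):
            index[value].append(node["title"])
    return dict(index)


by_module = build_facets(tutorials, "modules")
by_audience = build_facets(tutorials, "audience")
```

Both posts land under the same Views entry because they share the module's URL, even though they came from different sites.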


Going beyond Microdata Import

Microdata Import can help share knowledge across organizational boundaries. However, sometimes it isn't quite enough. Microdata Import assumes there is a one-to-one correlation between the page that you're importing and the page on your site. This works for the scenario above because you only want information about the tutorial to be provided from the tutorial page itself.

Sometimes, though, you might want people who publish on different sites to be able to add information about the same item. For example, if a professor at a university has a joint appointment in two departments, each department should be able to add information about that professor to its own site without coordinating with the other department. You might be able to configure your sources and feed importers to support this with Microdata Import, but there are easier ways.

One approach is to convert the microdata to RDF. The HTML Data Task Force is currently finalizing a draft specification of a mapping from microdata to RDF. The mapping will enable parsers, such as the MicrodataPHP library, to reliably generate RDF from pages that are marked up with microdata. RDF distiller, a tool developed by Gregg Kellogg (who is taking the lead in specifying the mapping), already implements this mapping. The RDF distiller is available as a Ruby gem. Kellogg also has an API available on his site.
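
To show why RDF suits the joint-appointment case, the Python sketch below represents statements as (subject, property, value) tuples and merges two sites' statements about one professor. It does not implement the task force's microdata-to-RDF mapping. The example.edu URLs and the data are invented, and an RDF library would normally take the place of the plain sets.

```python
from collections import defaultdict

PROFESSOR = "http://example.edu/people/jane-doe"  # hypothetical shared identifier

# Statements published independently by two department sites.
cs_site = {
    (PROFESSOR, "http://schema.org/name", "Jane Doe"),
    (PROFESSOR, "http://schema.org/worksFor", "http://example.edu/cs"),
}
stats_site = {
    (PROFESSOR, "http://schema.org/name", "Jane Doe"),
    (PROFESSOR, "http://schema.org/worksFor", "http://example.edu/statistics"),
}


def merge(*graphs):
    """Combine statements from any number of sources; duplicates collapse."""
    merged = set()
    for graph in graphs:
        merged |= graph
    return merged


def describe(graph, subject):
    """Collect every value stated for each property of one subject."""
    result = defaultdict(set)
    for s, p, o in graph:
        if s == subject:
            result[p].add(o)
    return dict(result)


combined = merge(cs_site, stats_site)
professor = describe(combined, PROFESSOR)
```

Neither department needs to know about the other. Because both use the same identifier for the professor, their statements combine into a single description.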


Conclusion

Figuring out how to enable collaborative web content authoring across organizational and technical boundaries is a major challenge in IT. By embedding the structure of the content into the HTML itself, microdata helps groups of loosely coupled people and organizations coordinate on joint projects. Most importantly, the collaborators can contribute to the common product without losing the freedom to choose their own framework, even if that framework is just hand-coded HTML.

Resources

Learn

Get products and technologies

Discuss
