Large organizations often invest in large, centralized, standardized IT systems, such as a monolithic CMS, then try to get everyone to use that system. Unfortunately, getting everyone to use the system in the correct way is a challenge. Investing in a one-size-fits-all approach rarely delivers its promised productivity gains. It's especially hard to standardize or control loosely coupled organizations where teams interact rarely and make decisions independently. Examples of loosely coupled organizations include:
- Departments in a university
- Companies and individuals in an open source community
- Teams in an amateur sports league
In Part 1 of this series, you learned about using generic scripts on top of microdata. You wrote a snippet of HTML to give you an interactive event map and to enable Google, Bing, and Yahoo to display your page better in search results with Rich Snippets.
In this article, learn how microdata can enable a collaborating group to easily hook up their sites and share content on a centralized group site. By agreeing on a small set of attributes to place in the HTML markup, loosely coupled organizations can maintain the independence of their information systems while still building a joint project.
Scenario: Creating a decentralized documentation system for Drupal
Many open source projects have difficulty maintaining robust and up-to-date documentation for their software. At the same time, contributors to the project share thorough technical explanations using blog posts, which are often aggregated on a Planet. A Planet is a blog aggregator that pulls in posts from selected authors (see Resources).
The Planet is an effective way to engage in current events and discussions in the community, but it doesn't fulfill its potential as a collaborative technology. It's difficult to filter through archives of Planet posts because they don't retain much of the original structured data. Even when the aggregated posts do contain helpful structured data, such as tags, posts from different sites often don't share terms or they use different spellings for terms. Thus, you can't effectively sort the posts.
You'll solve this problem by creating an aggregator that pulls in the blog posts and important extra information about the posts. You can use the aggregator to navigate the posts and put them in relevant places within the main site of the project.
The hypothetical system will document Drupal. Some widely used subsystems, called modules in Drupal, are well explained in blog posts but lack good documentation in the Drupal.org handbooks. The goal is to move that great documentation from the Planet into an easily searchable structure on Drupal.org.
The first task is to determine what information you want to pass from the blog posts to the central documentation system. For example, you'll want to indicate which modules this post talks about. There are often differences in the way modules work between major versions, so it's a good idea to indicate whether the tutorial is specific to a certain version of the module.
There are different roles in working with Drupal, from content editing to back-end development. It would be beneficial to indicate which roles will find the post helpful. The documentation record in Figure 1 shows the title, a description, audience, and related modules.
Figure 1. Example documentation record
The data-sharing needs in the scenario are fairly simple. You need to pass only the following from the source blog post to the site:
- Title
- Teaser paragraph
- URL
- Audience
- Modules
Title, teaser, and URL are already available in a structured format through RSS. You need to find a solution to pull in the structured data about the audience and modules. For this you'll use microdata. Before starting to work with microdata, though, you need to set up the source and target sites for testing.
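To make the RSS half of that split concrete, here is a minimal Python sketch that reads the three fields out of a feed using only the standard library. It is an illustration, not the code the Feeds module runs: the sample feed, its URL, and the function name `read_feed_items` are assumptions made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed with a single item; the link is a placeholder.
SAMPLE_FEED = '''<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example tutorial blog</title>
    <item>
      <title>Building modules on top of SPARQL Views</title>
      <link>http://example.com/sparql-views-modules</link>
      <description>This video demonstrates how you can build a module.</description>
    </item>
  </channel>
</rss>'''


def read_feed_items(feed_xml):
    """Return the title, URL, and teaser of every item in an RSS 2.0 feed."""
    root = ET.fromstring(feed_xml)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "url": item.findtext("link", default=""),
            "teaser": item.findtext("description", default=""),
        })
    return items


feed_items = read_feed_items(SAMPLE_FEED)
print(feed_items[0]["title"])
```

Nothing in the feed says who the post is for or which modules it covers, which is the gap that microdata fills.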
To parse and process the incoming posts, you will use a Microdata Import module. The module expects a feed URL, so the source must be able to output RSS or Atom.
You can use a CMS like Drupal, which has tools to automate the placement of microdata, or you could use another blogging system (as long as that system doesn't strip microdata attributes). For the Microdata Import module, each item that's imported must correlate to a single feed item, so post each tutorial on its own page.
The scenario uses the hosted blogging platforms Blogger and Drupal Gardens. You can set up your own sources, or use these:
All of the necessary information is directly in the HTML markup, so the tool you use for the source doesn't matter. The microdata in the HTML serves as a standard, read-only API without regard to the back-end code that produced it.
With sources to test, you can start setting up the aggregating site. First, the basic setup:
- Install Drupal 7 and download the following modules:
- Microdata Import
- Feeds
- Ctools
- Job Scheduler
- Libraries
- HTTP Client
- Enable Microdata Import and the Feeds Admin UI. You will be asked to enable four other dependencies.
- Download the MicrodataPHP library to sites/all/libraries/MicrodataPHP/MicrodataPhp.php.
This library takes an HTML page and extracts the microdata.
Configure the import settings to use when you pull in the sources:
- Go to Structure -> Content types and create two content types: one to manage the feeds, and one to hold the tutorials themselves. You can call them Tutorial import and Tutorial. Leave all of the settings at their defaults.
- Go to Structure -> Feeds importers and add an importer.
- Click Settings in the left column in the Basic Settings section. In the Attach
to content type dropdown, select the content type you just created and save it.
In Figure 2, Tutorial import is selected.
Figure 2. Configuring the basic settings of the Feeds Importer
- Next to Parser, click Change. Switch to Microdata Import Parser (from RSS/Atom) and Save.
You will see a confirmation at the top of the screen that says "Changed parser plugin."
- Under Processor, click Settings. Change the Update settings to Update
existing nodes, as in Figure 3, and change the Content type by selecting Tutorial.
Change the Text format to Filtered HTML. Because you're importing content from sites that you don't necessarily trust, you should not use Full HTML. It would make your site vulnerable to cross-site scripting.
Figure 3. Configuring settings for processing the feed item into a node
- Under Processor, click Mapping. This is where you define what bits of the source post to add to the target node, and where they are added. Because you haven't yet added the information about the available microdata content, the only elements listed are those that are exposed in RSS/Atom.
- Map the URL to URL and click Add. Check the Unique Target check box and click Save. This ensures that on subsequent runs you can match the items and replicate any changes from the source to the target.
- Map the Title to Title and click Add.
- Map the Description to Body and click Add, as in Figure 4.
Figure 4. Mappings from source to target
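The effect of marking the URL as the unique target can be sketched in a few lines of Python. This is a simplified model of the behavior described above, not the Feeds implementation: the in-memory `store` dictionary, the function name `import_items`, and the example URLs are all assumptions for illustration.

```python
def import_items(store, feed_items):
    """Create or update stored records, matching on the unique URL."""
    created, updated = 0, 0
    for item in feed_items:
        if item["url"] in store:
            updated += 1
        else:
            created += 1
        # Copy so later changes to the feed item do not alter the store.
        store[item["url"]] = dict(item)
    return created, updated


store = {}
first_run = import_items(store, [
    {"url": "http://example.com/post-1", "title": "Old title", "body": "Teaser"},
])
second_run = import_items(store, [
    {"url": "http://example.com/post-1", "title": "New title", "body": "Teaser"},
    {"url": "http://example.com/post-2", "title": "Another post", "body": "Teaser"},
])
print(first_run, second_run)
```

Because the second run finds the first URL already in the store, it replaces the old title instead of creating a duplicate record.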
To test that you can import content:
- Click Add content and add a new Tutorial import.
- Give it a title of Source 1.
- In the Feed URL field, add the source feed and Save.
- Select the Import tab and click Import. You should get a message that one or more nodes was created, as in Figure 5.
Figure 5. Nodes imported from the source
Go to the home page to see the posts that were imported. It might not be clear why you need to do anything more, since you already imported the whole post. However, because you didn't keep the structure from the original post intact, you cannot filter the posts based on audience or module yet. This is where the microdata comes in.
Figure 6. An imported node
Marking up source content for consumption
Now that you can pull in the feeds properly, start adding the microdata to the markup and bringing it in with the posts. Listing 1 shows the basic markup for a blog post.
Listing 1. Basic HTML markup for a post
<h2>Building modules on top of SPARQL Views</h2>
<div>
<p>This video demonstrates how you can build a module that installs a
View powered by a SPARQL query whenever it is enabled.</p>
<b>Audience:</b> Developer <br />
<b>Modules:</b>
<ul>
<li>Views</li>
<li>SPARQL Views</li>
</ul>
</div>
You'll want to indicate that the content is an article. This scenario uses the Schema.org vocabulary to mark up the articles because Schema.org has terms for most of the things that need to be annotated (see Resources). You could use a different vocabulary if all of the collaborating authors agree to it. [The article "Combine Drupal, HTML5, and microdata" (see Resources) goes into more detail about how to place microdata. It shows you how to add microdata by hand or how to automate the process with the Microdata module.]
You're pulling the title from the RSS feed so you don't have to mark up the title. However,
marking it up makes it easier for other consumers to reuse the data. Use the
name property, as in Listing 2. Since the title is outside of the article
div, you have to add a meta
element, which gives the title, inside the div. Use the
description property for the teaser paragraph, which will give more fine-grained access than the RSS description does.
Listing 2. Adding basic microdata to the post
<h2>Building modules on top of SPARQL Views</h2>
<div itemscope="" itemtype="http://schema.org/Article">
<meta itemprop="name" content="Building modules on top of SPARQL Views" />
<p itemprop="description">This video demonstrates how you can build a module
that installs a View powered by a SPARQL query whenever it is enabled.</p>
...
</div>
Now that it is marked up with microdata, you can pull just the description out of the text. This will exclude the audience and related modules from the description, which is good since you will later pull them into their own fields. Change the mapping to use the microdata description instead of the RSS description.
- Go to Structure -> Feeds importers and edit your importer.
- Under Parser, click Settings. Enter an example source page in the field, as in Figure 7.
The example page will be parsed to see which properties are available, so the example should be as complete as possible. Save the settings.
Figure 7. Providing property paths using an example page
- Under Processor, click Mapping. In the Description row, check Remove and then Save. This will remove the mapping between the RSS description and the body field.
- Click the Select a source drop-down item. The list now includes new source elements, which were determined from the example source.
- Select the new description element (the second description element in the list), map
it to the body, and click Add, as in Figure 8.
Figure 8. Updating the description mapping
- Find your Source 1 Tutorial import, and click the Import tab. Click the
Import button and the nodes will be updated. The Audience and Module text are
no longer combined with the description, as in Figure 9.
Figure 9. Updated node using the microdata description instead of the RSS description
The update did more than just remove the parts of the description you didn't want. It performed a full update of all of the imported information. If you changed the wording of the source post, the new wording would show up here as well. The update function is configured to run on cron, so you don't need to trigger it manually to get regular updates. You can see the power of the Feeds system. It enables the easy, automated synchronizing that's necessary to create an effective network of collaborating sites.
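To see why the microdata description excludes the audience and module text, it can help to look at what a consumer reads from the markup. The following Python sketch collects itemprop values from a page that holds a single, flat item. It is a deliberately small stand-in for a real microdata parser such as MicrodataPHP: the class name `PropertyExtractor` is made up for this example, and it assumes well-formed markup with no nested items.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag.
VOID_TAGS = {"meta", "link", "br", "img", "hr", "input"}

# Listing 2 markup with the unmarked audience and module text restored.
POST_HTML = '''<h2>Building modules on top of SPARQL Views</h2>
<div itemscope="" itemtype="http://schema.org/Article">
<meta itemprop="name" content="Building modules on top of SPARQL Views" />
<p itemprop="description">This video demonstrates how you can build a module
that installs a View powered by a SPARQL query whenever it is enabled.</p>
<b>Audience:</b> Developer <br />
<b>Modules:</b>
<ul><li>Views</li><li>SPARQL Views</li></ul>
</div>'''


class PropertyExtractor(HTMLParser):
    """Collect itemprop values from a page that holds a single flat item."""

    def __init__(self):
        super().__init__()
        self.properties = {}   # property name -> list of values
        self._current = None   # name of the property being read, if any
        self._depth = 0        # open elements inside the current property
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._current is not None:
            if tag not in VOID_TAGS:
                self._depth += 1
            return
        name = attrs.get("itemprop")
        if name is None:
            return
        if tag == "meta":
            # A meta element carries its value in the content attribute.
            self.properties.setdefault(name, []).append(attrs.get("content", ""))
        elif tag not in VOID_TAGS:
            self._current, self._depth, self._text = name, 1, []

    def handle_data(self, data):
        if self._current is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._current is None or tag in VOID_TAGS:
            return
        self._depth -= 1
        if self._depth == 0:
            # Collapse line breaks and runs of spaces in the collected text.
            value = " ".join("".join(self._text).split())
            self.properties.setdefault(self._current, []).append(value)
            self._current = None


extractor = PropertyExtractor()
extractor.feed(POST_HTML)
description = extractor.properties["description"][0]
print(description)
```

Only the text inside the element that carries the description property is returned, so the audience and module lines that follow it stay out of the body field.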
Thus far, you've marked up bits of information that are common to all articles: title and description. In this section, you will move on to information that is specific to the scenario. It's important to work with the collaborating group to learn how they naturally think of the content they're providing. The information should reflect the mental model of the collaborators (rather than conforming to an external ideal).
The Drupal community already has a well defined set of roles, expressed in the division of tracks at conferences. The Drupal Skill Map project defines the roles as:
- System Architect
- Developer
- Themer
- Site Builder
- Content Editor
- Design/UX
- Project Manager
- Drupal Marketer
You need to indicate that the audience for the article is one or more of the eight groups. Unfortunately, the Schema.org
vocabulary doesn't have a concept of audience, which leaves you with two options for handling
the itemprop:
- Extend Schema.org using their documented extension mechanism by taking an existing property and adding /audience to the end. For example, you could extend the keywords property to be keywords/audience, as follows.
  Audience: <span itemprop="keywords/audience">Developer</span>
- Use a term from another vocabulary or create your own vocabulary. For example, if a Tutorial vocabulary had an audience property (and if you could use strings as values of that property), you could use that alternate property. Because you use the http://schema.org/Article itemtype, you have to reference the Tutorial vocabulary property by its full URL instead of by the short property names you've been using. The full URL would be something like http://tutorial-vocabulary.org/audience. The exact URL would be specified in the vocabulary documentation. The URL would be placed in the itemprop attribute, as follows.
  Audience: <span itemprop="http://tutorial-vocabulary.org/audience">Developer</span>
For this scenario, go with the first option and extend Schema.org. If you are
placing the microdata by hand, copy and paste Listing 3 into your body. If you are using Drupal to automatically place microdata on the source site,
you can create a List (text) field
on the post, which gives you check boxes on the post for selecting the audiences.
By adding the keywords/audience property in the field settings, the proper
microdata will be automatically output.
To bring the audience element into the content on the consuming site, you need to create a field for it on the content type and then create the mapping for that field.
- Go to Structure -> Content types and click Manage fields for the Tutorial content type.
- Add an audience textfield.
- Go to Structure -> Feeds importers and edit your importer.
- Under Processor, click Mapping.
- Map the keywords/audience element to the new Audience field and click Add, as in Figure 10.
Figure 10. Adding the keywords/audience property to the mapping
- Find your Source 1 Tutorial import again and click Import. Go to an updated Tutorial's page, and you'll see the Audience field is populated, as in Figure 11.
Figure 11. Updated node with audience (in full node view)
Arguably, using strings is adequate for categorizing posts by their audience. There are only a handful of audience roles, and the roles don't change much over time. It's not an extraordinary coordination challenge to get people to update their audience field settings if there is a change, such as adding a new role.
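With so few roles, a consumer can cheaply check incoming audience strings against the agreed list. The Python sketch below shows one way to do that. The role names come from the list above, but the function name `check_audiences` and the choice to match case-insensitively and trim whitespace are assumptions for this example, not something the Feeds or Microdata Import modules are known to do.

```python
# The eight roles agreed on by the collaborating group.
DRUPAL_ROLES = (
    "System Architect", "Developer", "Themer", "Site Builder",
    "Content Editor", "Design/UX", "Project Manager", "Drupal Marketer",
)


def check_audiences(values):
    """Split imported audience strings into recognized roles and unknown ones."""
    by_lowercase = {role.lower(): role for role in DRUPAL_ROLES}
    known, unknown = [], []
    for value in values:
        role = by_lowercase.get(value.strip().lower())
        if role is None:
            unknown.append(value)
        elif role not in known:
            known.append(role)
    return known, unknown


known, unknown = check_audiences(["developer", "Site Builder ", "Dev"])
print(known, unknown)
```

A check like this stays manageable only because the list is short and stable, which is exactly what does not hold for module names.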
Categorizing by related modules is a different case, however. Drupal has 14,000 modules, most of which have multiple versions. At the very least, this means 14,000 different tags. The format of the tags can vary a lot. In addition, the thousands of module maintainers can change the name of their modules at any time.
Something more stable than strings is needed to refer to the module. One identifier that can't change arbitrarily is the module URL on Drupal.org (for example, http://drupal.org/project/views for the Views module). You can use this as a consistent identifier for modules.
To add the version of the module, you could add a property of the module item.
However, for this scenario, it's easier to have a different ID for each version. For
example, to identify Views 7.x-3.x, you would use the URL http://drupal.org/project/views/7/3. While currently that URL does
not display a page, it's easy to imagine a page at that location that displays all of the tutorials for Views 7.x-3.x and provides a download of the release.
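The per-version identifier can be derived mechanically from a project's machine name and its release branch. Here is a small Python sketch of that derivation. The URL pattern is the one proposed in this scenario, not an existing Drupal.org convention, and the function name `module_version_id` is made up for illustration.

```python
import re


def module_version_id(project, release_branch):
    """Build the per-version module URL used as an identifier in this scenario.

    For example, "views" and "7.x-3.x" give http://drupal.org/project/views/7/3.
    """
    match = re.fullmatch(r"(\d+)\.x-(\d+)\.x", release_branch)
    if match is None:
        raise ValueError("unexpected branch name: " + release_branch)
    core, major = match.groups()
    return "http://drupal.org/project/{}/{}/{}".format(project, core, major)


views_id = module_version_id("views", "7.x-3.x")
print(views_id)
```

Unlike a display name, this identifier changes only if the project's machine name or version changes.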
To use an ID instead of a string for the value, use microdata's itemid attribute. The itemid gets placed
in the same tag as the itemscope and itemtype attributes. Use a Google-specific Schema.org term, http://schema.org/SoftwareApplication, for the itemtype. Use the about property to say that the Article is about the module.
The visible content will still be the name string. You won't use it for
consumption, but it might make it easier for other consumers to work with your data.
Expose it as the name property of the module, as in Listing 3.
Listing 3. Adding microdata for related modules
<b>Modules:</b>
<ul>
<li itemprop="about" itemscope=""
itemtype="http://schema.org/SoftwareApplication"
itemid="http://drupal.org/project/views/7/3">
<span itemprop="name">Views</span>
</li>
<li itemprop="about" itemscope=""
itemtype="http://schema.org/SoftwareApplication"
itemid="http://drupal.org/project/sparql_views/7/2">
<span itemprop="name">SPARQL Views</span>
</li>
</ul>
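To show how a consumer might read the module identifiers back out of this markup, here is a Python sketch of a minimal microdata reader that handles nested items and itemid. It is an assumption-laden illustration rather than the MicrodataPHP algorithm: the class name `MicrodataParser` is made up, it expects well-formed markup, and it reads only text content and meta content values.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag.
VOID_TAGS = {"meta", "link", "br", "img", "hr", "input"}

# An article item wrapping the two module items from Listing 3.
POST_HTML = '''<div itemscope="" itemtype="http://schema.org/Article">
<meta itemprop="name" content="Building modules on top of SPARQL Views" />
<b>Modules:</b>
<ul>
<li itemprop="about" itemscope=""
 itemtype="http://schema.org/SoftwareApplication"
 itemid="http://drupal.org/project/views/7/3">
<span itemprop="name">Views</span>
</li>
<li itemprop="about" itemscope=""
 itemtype="http://schema.org/SoftwareApplication"
 itemid="http://drupal.org/project/sparql_views/7/2">
<span itemprop="name">SPARQL Views</span>
</li>
</ul>
</div>'''


class MicrodataParser(HTMLParser):
    """Minimal microdata reader: nested items, text values, and itemid."""

    def __init__(self):
        super().__init__()
        self.items = []        # top-level items found in the page
        self._open_items = []  # items whose itemscope element is still open
        self._elements = []    # one frame per open non-void element

    def _add(self, name, value):
        self._open_items[-1]["properties"].setdefault(name, []).append(value)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        name = attrs.get("itemprop")
        frame = {"item": None, "prop": None, "text": []}
        if "itemscope" in attrs and tag not in VOID_TAGS:
            item = {"type": attrs.get("itemtype"),
                    "id": attrs.get("itemid"),
                    "properties": {}}
            if name and self._open_items:
                self._add(name, item)    # nested item is a property value
            else:
                self.items.append(item)  # top-level item
            self._open_items.append(item)
            frame["item"] = item
        elif name and self._open_items:
            if tag == "meta":
                self._add(name, attrs.get("content", ""))
            else:
                frame["prop"] = name
        if tag not in VOID_TAGS:
            self._elements.append(frame)

    def handle_data(self, data):
        for frame in self._elements:
            if frame["prop"]:
                frame["text"].append(data)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._elements:
            return
        frame = self._elements.pop()  # assumes tags are properly nested
        if frame["item"] is not None:
            self._open_items.pop()
        elif frame["prop"]:
            self._add(frame["prop"], " ".join("".join(frame["text"]).split()))


parser = MicrodataParser()
parser.feed(POST_HTML)
article = parser.items[0]
module_ids = [module["id"] for module in article["properties"]["about"]]
print(module_ids)
```

The consumer ends up with stable URLs for the related modules, while the visible names remain available through each module item's name property.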
Adding all of this by hand isn't trivial. If possible, you'll want tools that can help. If you use Drupal for the source, you can use the Web Taxonomy module to help content authors tag their posts. With Web Taxonomy, the autocomplete results come from a taxonomy defined on the web. When you choose a term, it will be imported to your site. This means you have access to the tens of thousands of terms in the Drupal Projects vocabulary without having them stored in your database. Whenever a new tag is added or a tag is changed, your autocomplete field will have access to it without any action on your part.
You'll also use Web Taxonomy to consume the related modules, so you need to configure it on the target site as well.
Configuring Web Taxonomy for related modules
Download and enable Web Taxonomy. To configure Web Taxonomy, you also need a module that defines which external taxonomy to use and how it can be accessed. The module for Drupal Full Projects is available at http://drupal.org/sandbox/linclark/1363774. When you enable that module, a new Drupal Full Projects vocabulary is added to your site.
Configure Web Taxonomy in the same way on both the target and the source site:
- Go to Structure -> Content types and manage fields on the Tutorial content type.
- Add the Related Modules field, as in Figure 12. Select the Taxonomy Term Reference field type and the Web Taxonomy autocomplete widget.
Figure 12. Adding the Web Taxonomy field
- Choose the Drupal Full Projects vocabulary and Save the field settings.
- Change the number of values to Unlimited and Save the settings.
Now you can test the field by editing a tutorial and typing in a module name. The autocomplete field will provide you with suggestions, as in Figure 13. If you select one and save the tutorial, you'll see that the tag shows up when you view it. If you click through to the term page, the URL that you're using for an ID shows up on the term.
Figure 13. Web Taxonomy autocomplete for Drupal Full Projects
If you are configuring the field on the source, change the itemid that's assigned to the term by downloading and enabling the
contributed Token module. Go to Structure -> Taxonomy and edit the
Drupal Full Projects vocabulary. In the field with the token to use for the itemid, change the value to [term:web_tid]. The global Web Term ID for the term will be used instead of the local path.
At this point, you have the Web Taxonomy field available and have tested it. You can start importing to it now.
- Go to Structure -> Feeds importers and edit your importer.
- Map about:itemid to Related Modules: Web Term ID and click Add.
- Find your Source 1 Tutorial import again and Import. You'll see that the related modules have been added, as in Figure 14.
Figure 14. Updated node with related modules
Now that you have fully configured the import settings, you can create multiple Tutorial Import nodes and import data from multiple sites. (Create a Tutorial import node and add the Source 2 feed). All of the imported data is now structured in a way that Drupal understands. You can easily set up a user interface that lets you browse the whole collection of tutorials by facets.
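Browsing by facets amounts to indexing the imported tutorials by the values of a structured field. The Python sketch below shows the idea with plain dictionaries. The two tutorial records, the second title, and the function name `build_facet` are assumptions for illustration; on the Drupal site this job would fall to the site's own listing and search tools.

```python
from collections import defaultdict

# Hypothetical imported tutorials, reduced to the structured fields.
TUTORIALS = [
    {
        "title": "Building modules on top of SPARQL Views",
        "audience": ["Developer"],
        "modules": ["http://drupal.org/project/views/7/3",
                    "http://drupal.org/project/sparql_views/7/2"],
    },
    {
        "title": "Theming a View",
        "audience": ["Themer", "Site Builder"],
        "modules": ["http://drupal.org/project/views/7/3"],
    },
]


def build_facet(tutorials, field):
    """Map each value of a multi-valued field to the titles that carry it."""
    index = defaultdict(list)
    for tutorial in tutorials:
        for value in tutorial.get(field, []):
            index[value].append(tutorial["title"])
    return dict(index)


by_module = build_facet(TUTORIALS, "modules")
by_audience = build_facet(TUTORIALS, "audience")
print(by_module["http://drupal.org/project/views/7/3"])
```

Because both source sites identify Views 7.x-3.x by the same URL, their tutorials land under the same facet value regardless of how each author spells the module name.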
Microdata Import can help share knowledge across organizational boundaries. However, sometimes it isn't quite enough. Microdata Import assumes there is a one-to-one correlation between the page that you're importing and the page on your site. This works for the scenario above because you only want information about the tutorial to be provided from the tutorial page itself.
Sometimes, though, you might want different people who are publishing on different sites to be able to add information about the same item. For example, if a professor at a university has a joint appointment in two departments, each department should be able to add information about that professor to its own site without coordinating with the other department. Although you could probably configure your sources and feed importers to support this with Microdata Import, there are easier ways to do it.
One approach is to convert the microdata to RDF. The HTML Data Task Force is currently finalizing a draft specification of a mapping from microdata to RDF. The mapping will enable parsers, such as the MicrodataPHP library, to reliably generate RDF from pages that are marked up with microdata. RDF distiller, a tool developed by Gregg Kellogg (who is taking the lead in specifying the mapping), already implements this mapping. The RDF distiller is available as a Ruby gem. Kellogg also has an API available on his site.
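The reason RDF suits the joint-appointment case is that statements about the same identifier can simply be pooled, whichever site they came from. The Python sketch below illustrates that idea with plain tuples. It is a loose simplification and not the draft microdata-to-RDF mapping: real RDF uses full URIs for properties, and the professor's URL, the property names, and the function name `item_to_statements` are all made up for this example.

```python
def item_to_statements(item):
    """Flatten one item into (subject, property, value) statements."""
    return {
        (item["id"], prop, value)
        for prop, values in item["properties"].items()
        for value in values
    }


# Hypothetical data: two department sites describe the same professor,
# identified by the same made-up itemid URL.
computer_science = {
    "id": "http://example.edu/people/jane-doe",
    "properties": {"name": ["Jane Doe"], "affiliation": ["Computer Science"]},
}
linguistics = {
    "id": "http://example.edu/people/jane-doe",
    "properties": {"name": ["Jane Doe"], "affiliation": ["Linguistics"]},
}

# A set union merges the two descriptions and drops the duplicate name.
merged = item_to_statements(computer_science) | item_to_statements(linguistics)
affiliations = sorted(value for _, prop, value in merged if prop == "affiliation")
print(affiliations)
```

Neither department has to know about the other's page. The shared identifier is the only thing they need to agree on.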
Figuring out how to enable collaborative web content authoring across organizational and technical boundaries is a major challenge in IT. By embedding the structure of the content into the HTML itself, microdata helps groups of loosely coupled people and organizations coordinate on joint projects. Most importantly, the collaborators can contribute to the common product without losing the freedom to choose their own framework, even if that framework is just hand-coded HTML.
Learn
- Make HTML5 microdata useful, Part 1: Using jQuery on top of microdata (Lin Clark, developerWorks, November 2011): Read the first part of this series to learn to write a snippet of HTML both to give you an interactive event map and to enable Google, Bing, and Yahoo to display your page better in search results with Rich Snippets.
- Combine Drupal, HTML5, and microdata (Lin Clark, developerWorks, November 2011): Read more about adding microdata to pages in Drupal.
- Schema.org: Learn more about this collection of schemas, which are HTML tags that webmasters can use to mark up their pages in ways recognized by major search providers.
- Getting started with schema.org: In these tutorials, learn how to place schema.org terms on the Schema.org site.
- The Semantic Web, Linked Data and Drupal, Part 1: Expose your data using RDF (Lin Clark, developerWorks, April 2011): Make your web data more interoperable and your data sharing more efficient. An example shows how to use Drupal 7 to publish Linked Data by exposing content with RDF.
- The Semantic Web, Linked Data and Drupal, Part 2: Combine linked datasets with Drupal 7 and SPARQL Views (Stéphane Corlosquet and Lin Clark, developerWorks, May 2011): Learn to use the existing Linked Data available today on the web of data, and how to enrich a Drupal 7 site with data coming from different endpoints.
- Scientific American article on the Semantic Web: Read this seminal article by Tim Berners-Lee, James Hendler, and Ora Lassila.
- Linked Data: Read the ReadWriteWeb interview about linked data with Tim Berners-Lee.
- Linked Data Design Issues: Learn more about linked data from Tim Berners-Lee.
- Rich snippets (microdata, microformats, and RDFa): Read more about rich snippets and structured data on Google.
- Implement Semantic Web standards in your Web site (Rob Crowther, developerWorks, May 2008): Create a simple social networking site using PHP and MySQL, which implements Semantic web standards such as hCard and Friend of a Friend (FOAF) as part of a semantic Uniform Resource Identifier (URI) scheme.
- FOAF Vocabulary Specification 0.98: Explore the FOAF language, defined as a dictionary of named properties and classes using W3C's RDF technology.
- Dublin Core Metadata Initiative (DCMI): Learn about this open organization engaged in the development of interoperable metadata standards that support a broad range of purposes and business models.
- SIOC (Semantically-Interlinked Online Communities) Core Ontology Specification: Learn the main concepts and properties required to describe information from online communities (such as message boards, wikis, or weblogs) on the Semantic web.
- SPARQL Explorer for http://dbpedia.org/sparql: Try a demonstration query interface available on the web.
- New to XML? Get the resources you need to learn XML.
- developerWorks Web development zone: Find articles covering various web-based solutions.
- XML area on developerWorks: Find the resources you need to advance your skills in the XML arena, including DTDs, schemas, and XSLT. See the XML technical library for a wide range of technical articles and tips, tutorials, standards, and IBM Redbooks.
- IBM XML certification: Find out how you can become an IBM-Certified Developer in XML and related technologies.
- developerWorks technical events and webcasts: Stay current with technology in these sessions.
- developerWorks on-demand demos: Watch demos ranging from product installation and setup demos for beginners to advanced functionality for experienced developers.
- developerWorks on Twitter: Join today to follow developerWorks tweets.
- developerWorks podcasts: Listen to interesting interviews and discussions for software developers.
Get products and technologies
- RDF Distiller: Test the microdata to RDF mapping.
- Google's Rich Snippets Testing Tool: Test for Rich Snippets.
- Live Microdata testing tool: Get another tool, created by Opera developer Philip Jägenstedt, for testing microdata.
- IBM product evaluation versions: Download or explore the online trials in the IBM SOA Sandbox and get your hands on application development tools and middleware products from DB2®, Lotus®, Rational®, Tivoli®, and WebSphere®.
Discuss
- developerWorks profile: Create your profile today and set up a watchlist.
- XML zone discussion forums: Participate in any of several XML-related discussions.
- The developerWorks community: Connect with other developerWorks users while exploring the developer-driven blogs, forums, groups, and wikis.

Lin Clark is a Drupal developer specializing in Linked Data. She is the maintainer of multiple Drupal modules, such as Microdata and SPARQL Views, and is an active participant in the W3C’s HTML Data Task Force and Drupal's HTML5 initiative. She attended Carnegie Mellon University and is finishing a research masters degree at the Digital Enterprise Research Institute at NUI Galway. More information is available at lin-clark.com.




