The World Wide Web empowers human beings like never before. The sheer amount and diversity of information you encounter on the Web is staggering. You can find recipes and sports scores; you share calendars and contact information; you read news stories and restaurant reviews. You can constantly consume data on the Web that's presented in a variety of appealing ways: charts and tables, diagrams and figures, paragraphs and pictures.
Yet this content-rich, human-friendly world has a shadowy underworld. It's a world in which machines attempt to benefit from this wealth of data that's so easily accessible to humans. It's the world of aggregators and agents, reasoners and visualizations, all striving to improve the productivity of their human masters. But the machines often struggle to interpret the mounds of information intended for human consumption.
The story is not all bleak, however. Even if you were unaware of this human-computer conflict, there is no need to worry. By the end of this series, you'll have enough knowledge to choose intelligently among the myriad possible paths for bridging the data-presentation needs of machine and human data consumers.
In the early 1990s, Tim Berners-Lee invented HTML, HTTP, and the World Wide Web. He initially designed the Web to be an information space for more than just human-to-human interactions. He intended the Web to be a semantically rich network of data that could be browsed by humans and acted upon by computer programs. This vision of the Web is still referred to as the Semantic Web.
However, by the very nature of its inhabitants, the Web grew exponentially and gave priority to content consumable mostly by humans and not machines. Steadily, users' lives became more reliant on the Web, and we transitioned from a Web of personal and academic homepages to a Web of e-commerce and business-to-business transactions. Even as more and more of the world's most vital information flowed through its links, most Web-enabled interactions still required human interpretation. And, predictably, the rise of Internet-connected devices in people's lives has only increased our dependence on those devices' software being able to understand data on the Web.
Clearly, machines were interacting with each other long before the Web existed. And if the Web has come so far so quickly while primarily targeted at human consumption, it's natural to wonder what's to be gained by developing techniques for machines to share the Web with humans as an information channel. To explore this question, imagine what the current Web would look like if machines did not understand it at all.
The top three Web sites (as of this article's writing) according to Alexa traffic rankings are Yahoo!, MSN, and Google -- all search engines. Each of these sites is powered by an army of software-driven Web crawlers that apply various techniques to index human-generated Web content and make it amenable to text searches. Without these companies' vast arrays of algorithmic techniques for consuming the Web, your Web-navigation experiences would be limited to following explicitly declared hypertext links.
Next, consider the fifth most trafficked site on Alexa's list: eBay. People commonly think of eBay as one of the best examples of humans interacting on the Web. However, machines play a significant role in eBay's popularity. Approximately 47% of eBay's listings are created by software agents rather than through the human-driven Web forms. During the last quarter of 2005, the machine-oriented eBay platform handled eight billion service requests, and in 2005 the number of eBay transactions made through its Web Services APIs grew 84% year over year. It's clear that without the services eBay provides to let software agents participate equally with humans, the online auction business would not be nearly as manageable for anyone dealing with a significant number of sales or purchases.
For a third example, we turn to Web feeds. Content-syndication formats such as Atom and RSS have empowered a new generation of news-reading software that frees you from the tedious, repetitive, and inefficient reliance on bookmarked Web sites and Web browsers to stay in touch with news of interest. Without the machine-understandable content representation embodied by RSS and Atom, these news readers could not exist.
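To make the point concrete, here is a minimal sketch, in Python with a hypothetical Atom feed, of the kind of processing a feed reader performs once it has machine-understandable content in hand: it parses the XML and lifts out each entry's title and link.

```python
# Parse a (hypothetical) Atom feed and list its entries -- the core
# chore of any feed reader. The feed content and URLs are invented
# for illustration; a real reader would fetch the XML over HTTP.
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"

feed_xml = """<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example News</title>
  <entry>
    <title>First story</title>
    <link href="http://example.org/stories/1"/>
    <updated>2006-06-01T12:00:00Z</updated>
  </entry>
  <entry>
    <title>Second story</title>
    <link href="http://example.org/stories/2"/>
    <updated>2006-06-02T12:00:00Z</updated>
  </entry>
</feed>"""

def list_entries(xml_text):
    """Return (title, href) pairs for each entry in an Atom feed."""
    root = ET.fromstring(xml_text)
    entries = []
    for entry in root.findall(ATOM_NS + "entry"):
        title = entry.findtext(ATOM_NS + "title")
        link = entry.find(ATOM_NS + "link").get("href")
        entries.append((title, link))
    return entries

print(list_entries(feed_xml))
```

Because the feed is machine-understandable by design, there is no guesswork here: the reader simply walks a well-defined structure.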
In short, imagine a World Wide Web where a Web site could only contain content authored by humans exclusively for that site. Content could not be shared, remixed, and reused between Web sites. To intelligently aggregate, combine, and act on Web-based content, agents, crawlers, readers, and other devices must be able to read and understand that content. This is why it's necessary to take an in-depth look at the different mechanisms available today to improve the interactions between machines and human-generated content in Web applications.
Consider a scenario from the page of the Semantic Web activity at the W3C. Most people have some personal information that can be accessed on the Web. You can see your bank statements, access your online calendar applications, and post photos online through different photo-sharing services. But can you see your photos in your calendar to remind yourself of the place and purpose of those photos? Can you see your bank-statement line items displayed in your calendar too?
Creating new data integrations of this sort requires that the software driving the integration be able to understand and interpret the data on particular Web pages. This software must be able to retrieve the Web pages that display your photos from Flickr and discover the dates, times, and descriptions of your photos. It also needs to understand how to interpret the transactions from your online bank statement. The same software must be able to understand various views of your online calendar (daily, weekly, and monthly), and figure out which parts of the Web page represent which dates and times.
The example in Figure 1 shows how embedded metadata might benefit your end-user applications. You begin with your data stored in several places. Flickr hosts your photographs, Citibank provides access to your banking transactions, and Google Calendar manages your daily schedule. You wish to experience all of this data in a single calendar-based interface (missMASH), such that the photos from your Sunday at the State Park appear in the same weekly view as your credit card transaction from Wednesday's grocery shopping. To do this, the software that powers missMASH must have some way to understand the data from your Flickr, Citibank, and Google Calendar accounts in order to remix the data in an integrated environment.
Figure 1. A mashup of banking and photos in a calendar view
A full spectrum of technologies gives application authors the ability to perform such integrations. Some of the technologies are well established, while others are still fledgling and not as well understood. The barriers to entry vary, and some of the technologies provide a higher level of utility than others.
In this series, we'll examine how you might implement the scenario discussed above using the different mechanisms available for human-computer coexistence on the Web. We will introduce and explain each technology, then show how the technologies might be used to integrate bank statements, photos, and calendars. We will also evaluate the strengths and weaknesses of each technology, and hopefully make it easier for you to decide between the options.
When you embark on a comparison of technologies, it is helpful to first outline the criteria to evaluate the technologies. The list below describes properties desirable of methods that facilitate a Web that is both human-friendly and machine-readable. We don't expect to find a single technology that succeeds in all these facets, nor should we. In the end, we hope to build a matrix that will aid you in choosing the right tool for the job. Here are the criteria we will use:
- Authoritative data
- Whatever the particulars are for a technique, you end up with machine-readable data that supposedly is equivalent to what shows up on the corresponding human-friendly Web page. You want the techniques that get you to that point to ensure the fidelity of this relationship; you want to be able to trust that the data really corresponds to what you read on the Web page.
The authority of the data is one axis along which to measure this trust. We consider one representation of data to be authoritative if the representation is published by the owner of the data. A data representation might be non-authoritative if it is derived from a different representation of the data by a third party.
- Expressivity and extensibility
- If you use one technique to create a Web page with both a human-readable and a machine-readable version of directions to your home, you'd rather not need a different technique to add the weather forecast to the same Web page. We hope that this criterion will help minimize the number of software components involved in any particular application, which in turn increases the robustness and maintainability of the application.
Along these lines, we appreciate techniques that accommodate new data in an elegant manner and that are expressive enough to instill confidence that we can represent previously unforeseen data in the future.
- Don't repeat yourself (DRY)
- If the same datum is referenced twice on a Web page, you don't want to write it down twice. Repetition leads to inconsistencies when changes are required, and you don't want to jump through too many hoops to create Web pages that both humans and computers can understand.
This does not necessarily apply to repetition within multiple data representations, if those representations are generated from a single data store.
- Data locality
- The machine-readable and human-readable representations of the same data within the same document should be in as self-contained a unit of the document as possible. For example, if one paragraph within a large Web page is the only part of the page that deals with the ingredients of a recipe, then you want a technique to confine all of the machine-readable data about those ingredients to that paragraph.
In addition to being easy to author and to read, data locality allows visitors to the Web page to copy and paste both the human and machine representations of the data together, rather than requiring that multiple discontinuous segments of the Web page be copied to fully capture the data of interest. This, in turn, promotes wider adoption and reuse of the techniques and the data. (Just ask anyone who has ever learned HTML with the generous use of the View Source command.)
- Existing content fidelity
- You want the techniques to work without requiring authors to rewrite their Web sites. The more that a technique can make use of existing clues about the intended (machine-consumable) meaning of a Web page, the better. But a caveat: techniques shouldn't be so liberal with their interpretations that they ascribe incorrect meanings to existing Web pages.
For example, a new technique that prescribed that text enclosed in the HTML u tag represent the name of a publication might lead to correct semantics in some cases, but might also license many incorrect interpretations. While the markup in this example might be authoritative (because it originates from the owner of the data), it is still incorrect because the Web page author did not intend to use the u HTML tag in this manner.
- Standards compliance
- You should be able to use techniques without losing the ability to adhere to accepted Web standards such as HTML (HTML 4 or XHTML 1), CSS, XML, XSLT, and more.
- Available tooling
- Creating a Web of data that humans can read and machines can process is of little value if no tools understand the techniques used. We prefer techniques that have a large base of tools already available. Failing that, we prefer techniques for which one can easily implement new tools. Tools should be available both to help author Web pages that make use of a technique and to consume the machine-readable data as specified by the technique.
- Overall complexity
- The Web has been around for a while now, and it's only recently that the need to share data between humans and machines has begun to receive serious attention. The vast landscapes of content available on the Web are authored and maintained by a wide variety of people, and it is important that whatever techniques we promote be easily understandable and adoptable by as many Web authors as possible.
The best techniques are as worthless as no technique at all if they are too complex to adopt. The most desirable techniques will have a low barrier to entry and a shallow learning curve. They should not require extensive coding to implement, nor painstaking efforts to maintain once implemented.
This section provides a brief introduction to the major techniques used today to enable machine-human coexistence on the Web. Subsequent articles in this series will explore these techniques in detail.
In this world view, the data is represented on the Web at (at least) two addresses (URLs): one address holds a human-consumable format, and the other a machine-consumable format. Technologies that enable the parallel Web include the HTML link element and HTTP content negotiation. Those involved in the creation of the HTML specifications saw the need for two linking elements in HTML: the a element, which is visible and can appear only in the body of a Web page, and the link element, which is invisible and can appear only in the head of a Web page. The HTML specification designers reasoned that agents, depending on their purpose and target audience, would interpret the links in the head based on their rel (relationship) attribute and perform interesting functions with them.
For example, Web feeds and feed readers have empowered humans to keep up with the vast amount of information being published today. When you use a feed reader, you initialize it with the address (URL) of an XML file -- usually an RSS or Atom file. In most cases, the machine-consumable data within such a feed has a parallel URL on the Web, where you can find a human-readable representation of the same content. There are a variety of techniques for achieving this parallel Web in a useful and maintainable fashion. Part 2 of this series will discuss the parallel Web in detail, including the benefits and drawbacks of having the same data available at more than one Web address. Future installments of this series will cover techniques that allow multiple data representations to be contained within a single Web address.
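The autodiscovery half of this arrangement can be sketched in a few lines of Python: given the human-readable page, a feed reader scans the invisible link elements in the page's head for a machine-consumable alternate. The page markup and feed URL below are hypothetical.

```python
# Feed autodiscovery: find <link rel="alternate"> elements in a page's
# head, which point at parallel machine-consumable representations.
# The page below is invented for illustration.
from html.parser import HTMLParser

class AlternateFinder(HTMLParser):
    """Collects (type, href) pairs of <link rel="alternate"> elements."""
    def __init__(self):
        super().__init__()
        self.alternates = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "link" and a.get("rel") == "alternate":
            self.alternates.append((a.get("type"), a.get("href")))

page = """<html><head>
  <title>My Weblog</title>
  <link rel="alternate" type="application/atom+xml"
        href="http://example.org/blog/feed.atom"/>
</head><body><p>Human-readable posts go here.</p></body></html>"""

finder = AlternateFinder()
finder.feed(page)
print(finder.alternates)
```

The same page thus advertises its machine-consumable twin without disturbing anything a human reader sees.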
Algorithmic approaches encompass software that produces machine-consumable data from human-readable Web pages by application of an arbitrary algorithm. In general, the algorithms tend to fall into two categories:
- Scrapers, which extract data by examining the structure and layout of a Web page
- Natural-language processors, which attempt to read and understand a Web page's content in order to generate data
These techniques are designed for situations where the structure or content of a Web page is highly predictable and unlikely to change. The algorithms are usually developed by the person seeking to consume the data, and as such they are not governed by any standards organization. Often, these algorithms are an integrator's only option when faced with accessing data whose owner does not publish a machine-readable representation of the data. Stay tuned for details on the algorithmic approach in Part 3 of this series.
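As a taste of what Part 3 will cover, here is a minimal scraper sketch in Python. Note the hard-coded assumption, invented for this example, that each transaction amount lives in a td element with class "amount"; a real scraper embodies many such assumptions, and all of them break when the page layout changes.

```python
# A structure-based scraper: extract dollar amounts from a (hypothetical)
# bank-statement page by exploiting knowledge of its markup. Fragile by
# nature -- any redesign of the page silently invalidates the logic.
from html.parser import HTMLParser

class AmountScraper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_amount = False
        self.amounts = []

    def handle_starttag(self, tag, attrs):
        # Assumption baked in: amounts live in <td class="amount">.
        if tag == "td" and dict(attrs).get("class") == "amount":
            self.in_amount = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_amount = False

    def handle_data(self, data):
        if self.in_amount:
            self.amounts.append(float(data.strip().lstrip("$")))

statement = """<table>
  <tr><td>Groceries</td><td class="amount">$54.20</td></tr>
  <tr><td>Gasoline</td><td class="amount">$31.00</td></tr>
</table>"""

scraper = AmountScraper()
scraper.feed(statement)
print(scraper.amounts)
```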
Microformats are a series of data formats that use
existing (X)HTML and CSS constructs to embed raw data within the
markup of a human-targeted Web page. Microformats are guided by a
set of design principles and are developed by community consensus.
The Microformats community's goal is to add semantics to the
class attribute, originally intended
mostly for presentation.
As with the algorithmic approach, microformats differ from many of the other techniques in this series because they are not part of a standards process in organizations such as the W3C or IETF. Instead, their principles focus on specific problems and leverage current behaviors and usage patterns on the Web. This has given microformats a great start toward their goal of improving the publishing of Web microcontent (blogs, for example) in general. The main examples of microformat success are the hCard and hCalendar specifications, which allow microcontent publishers to easily embed attributes in their HTML content so that machines can pick out small nuggets of information, such as business cards or event details, from microcontent Web sites.
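To illustrate how little machinery a consumer needs, here is a sketch in Python that lifts two hCard properties (fn and org) out of ordinary HTML by watching the class attribute. The card itself is invented, and a real hCard parser handles many more properties and nesting rules.

```python
# Extract a business card from hCard markup: the class attribute values
# ("vcard", "fn", "org") carry the semantics. The card is hypothetical,
# and this sketch handles only flat fn/org properties.
from html.parser import HTMLParser

class HCardParser(HTMLParser):
    """Extracts fn (formatted name) and org values from hCard markup."""
    def __init__(self):
        super().__init__()
        self.current = None
        self.card = {}

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if "fn" in classes:
            self.current = "fn"
        elif "org" in classes:
            self.current = "org"

    def handle_data(self, data):
        if self.current:
            self.card[self.current] = data.strip()
            self.current = None

page = """<div class="vcard">
  <span class="fn">Jane Example</span>,
  <span class="org">Example Widgets, Inc.</span>
</div>"""

parser = HCardParser()
parser.feed(page)
print(parser.card)
```

Crucially, the same markup still renders as an ordinary sentence for human readers; the machine-readable layer costs the author only a few class attributes.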
Gleaning Resource Descriptions from Dialects of Languages (GRDDL) allows Web page publishers to associate their XHTML or XML documents with transformations that take the Web page as input and then output machine-consumable data. GRDDL can use XSL transformations to extract specific vocabularies from a Web page. GRDDL also allows the use of profile documents, which in turn reference the appropriate transformation algorithms for a particular class of Web pages and data vocabularies.
GRDDL has great potential for bridging the gap between humans and machines by enabling authoritative on-the-fly transformations of content. While this is similar to the parallel Web, there are significant differences. GRDDL provides a general mechanism for machines to transform content on demand, and GRDDL does not create permanent versions of alternative data representations. The W3C has recently chartered a GRDDL working group to produce a recommended specification for GRDDL.
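The linkage itself is simple enough to sketch in Python: a GRDDL source document can name its transformation with a link element whose rel attribute is "transformation", alongside the GRDDL data-view profile. The transformation URL below is hypothetical, and this sketch covers only the discovery step, not a full GRDDL processor that would fetch and apply the XSLT.

```python
# Discover the transformation a GRDDL source document declares for
# itself via <link rel="transformation">. The XSLT URL is invented;
# a real processor would also honor profile and namespace documents,
# then apply the fetched transformation to produce RDF.
from html.parser import HTMLParser

class TransformationFinder(HTMLParser):
    """Collects href values of <link rel="transformation"> elements."""
    def __init__(self):
        super().__init__()
        self.transformations = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        rel_tokens = (a.get("rel") or "").split()
        if tag == "link" and "transformation" in rel_tokens:
            self.transformations.append(a.get("href"))

doc = """<html><head profile="http://www.w3.org/2003/g/data-view">
  <title>Upcoming Events</title>
  <link rel="transformation"
        href="http://example.org/xsl/extract-events.xsl"/>
</head><body><p>Human-readable event listings...</p></body></html>"""

finder = TransformationFinder()
finder.feed(doc)
print(finder.transformations)
```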
Embedded RDF is a technique for embedding RDF data within XHTML documents using existing elements and attributes. eRDF attempts to balance ease of markup with extensibility and expressivity. Along with RDFa and, to a lesser extent, GRDDL, it explicitly makes use of the Resource Description Framework (RDF) to model the machine-consumable data that it encodes. eRDF shares with microformats the principle of reusing existing vocabularies for the purpose of embedding metadata within XHTML documents. eRDF seeks to scale beyond a small set of formats and vocabularies by using namespaces and the arbitrary RDF graph data model.
Embedded RDF is not currently developed by a standards body. Similar to microformats, eRDF is capable of encoding data within Web pages to help machines extract contact, event, and location information (and other types of data) to enable powerful software agents.
RDFa, formerly known as RDF/A, is another mechanism for including RDF data directly within XHTML. RDFa uses a fixed set of existing and new XHTML elements and attributes to allow a Web page to contain an arbitrary amount and complexity of machine-readable semantic data, alongside the standard XHTML content that is displayed to humans. RDFa is currently developed by the W3C RDF-in-XHTML task force, a joint product of the XHTML and Semantic Web Deployment working groups.
As with eRDF, RDFa takes advantage of namespaces and the RDF graph data model to enable the representation of many data structures and vocabularies within a single Web page. RDFa seeks to be a general-purpose solution to the inclusion of arbitrary machine-readable data within a Web page.
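A drastically simplified sketch of the idea in Python: because XHTML is XML, a consumer can walk the document tree and collect statements from elements that carry a property attribute. The snippet and its vocabulary prefix are hypothetical, and a real RDFa processor resolves prefixes against namespaces and handles attributes such as about, resource, and typeof as well.

```python
# Collect RDFa-style (property, text) statements from an XHTML fragment.
# The fragment and the dc: prefix are invented for illustration; a real
# RDFa processor builds full RDF triples, not bare pairs.
import xml.etree.ElementTree as ET

xhtml = """<div xmlns="http://www.w3.org/1999/xhtml">
  <h2 property="dc:title">Sunday at the State Park</h2>
  <p>Taken on <span property="dc:date">2006-06-04</span>.</p>
</div>"""

def rdfa_properties(doc):
    """Return (property, text) pairs for elements bearing @property."""
    root = ET.fromstring(doc)
    pairs = []
    for el in root.iter():  # document order
        prop = el.get("property")
        if prop is not None:
            pairs.append((prop, el.text))
    return pairs

print(rdfa_properties(xhtml))
```

The human-visible rendering is untouched; the machine-readable statements ride along in attributes.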
This article motivated and explained the challenge of creating a World Wide Web that is accessible to both humans and machines. We developed an example integration scenario that could be enabled by any of a myriad of coexistence mechanisms. We also discussed the criteria with which to compare and evaluate the techniques that we will cover in more detail in the rest of this series.
Stay tuned for Part 2, which will explore in detail the widely used parallel Web technique.
- eBay Web Services: Explore this resource for usage statistics.
- Citibank: Check out the functions for online banking, investing, loans, and more.
- GRDDL: Read the Gleaning Resource Descriptions from Dialects of Languages specification about a complementary mechanism for the RDF/XML syntax.
- RDF in XHTML: Learn more from this Task Force of the W3C.
- Microformat hCard Specification: Try this simple, open, distributed format for representing people, companies, organizations, and places.
- Microformat hCalendar Specification: Explore a simple, open, distributed calendaring and events format, based on the iCalendar standard.
- Tim Berners-Lee: Visit the Web site for the Director of the World Wide Web Consortium, Senior Researcher at MIT's CSAIL, and Professor of Computer Science at Southampton ECS.
- Links: Explore the HTML specification for more on hypertext and interactive documents.
- Top 500 list from Alexa: Look at the traffic rankings for sites in the United States.
- W3C: Dig into a multitude of technologies at the World Wide Web Consortium.
- IETF: Visit the Internet Engineering Task Force.
- RDF: Review the specification for the Resource Description Framework.
- Semantic Web Activity: Learn more about the semantic Web at this homepage on the W3C.
- Atom 1.0 extensions (James Snell, developerWorks, October 2005): Read more about a number of proposed extensions to the Atom 1.0 Syndication Format in this two-part series.
- developerWorks Web development zone: Expand your site development skills with articles and tutorials that specialize in Web technologies.
- developerWorks technical events and Webcasts: Stay current in your field with these technology sessions.
Get products and technologies
- Google Calendar: Try this online calendar to organize your schedule and share events with friends and family.
- eBay: Check out this popular online auction Web site.
- Flickr: Visit this popular online photo-sharing application and Web 2.0 site.
- Microformats: Look into a set of simple and open formats designed for humans first and machines second.
- RDFa: Dig into a specification designed to represent all of the concepts in RDF in XHTML.
- Embedded RDF: Check out this specification that allows you to embed a subset of RDF in HTML pages.
- developerWorks RSS and Atom feeds: Find out more and build your own.
- AtomEnabled.org: Visit this excellent site with information on Atom, a content syndication format and protocol.
- IBM trial software: Build your next development project with software available for download directly from developerWorks.
- developerWorks blogs: Get involved in the developer community.
- developerWorks forums: Participate in any of several Web-centered forums.
Lee Feigenbaum is an advisory software engineer at IBM's Internet Technology Group in Cambridge, MA. His current work focuses on researching and developing strategies and software for leveraging Semantic-Web technologies within enterprises. Past research and development topics have included instant-messaging software, structured annotation systems, and DHTML client runtimes. He writes regularly about the Semantic Web and other topics in his blog, TechnicaLee Speaking.
Elias Torres is a senior software engineer at IBM's WebAhead Lab in the CIO organization in Cambridge, MA. His current research focus areas include the Semantic Web, Web 2.0, and Social Software. He earlier developed, deployed, and evangelized two key collaborative services inside IBM called Blog Central and Wiki Central based on Open Source software. He is an active committer at the Apache Software Foundation open source projects, and participates in the development of Semantic Web standards at the W3C. He maintains a blog at http://torrez.us.