
XML Watch: Describe open source projects with XML, Part 1

Keep project information up-to-date with the DOAP vocabulary

Edd Dumbill (edd@usefulinc.com), Chair, XTech Conference
Edd Dumbill is managing editor of XML.com and the editor and publisher of the XML developer news site XMLhack. He is co-author of O'Reilly's Programming Web Services with XML-RPC, and co-founder and adviser to the Pharmalicensing life sciences intellectual property exchange. Edd is also program chair of the XML Europe conference. You can contact him at edd@xml.com

Summary:  In this installment, Edd Dumbill starts the development of a vocabulary to describe open source software projects, setting goals and deciding among XML and RDF schema technologies.

Date:  26 Feb 2004
Level:  Intermediate

One of the great things about open source software is its essential democracy: Anyone can easily start their own project, and they often do! Unfortunately, it can be difficult for users to locate software that suits their purposes. This need has been met over time by different software registries. Perhaps the best known and longest-running of these is Freshmeat, but there are many more, often meeting more specialized needs: for example, the Free Software Foundation's FSF/UNESCO Free Software Directory, the GNOME Software Map, and the BioInformatics Software Map (see Resources for links to all of these).

So many registries now exist that keeping them up to date has become a real problem. The release cycle for diligent software maintainers often involves visits to several Web sites to keep the information up to date, not to mention updating their own Web sites. However, such maintainers are few and far between, and it's not uncommon to find out-of-date information in a registry. That this data gets out of date is unsurprising when you consider the aspects that many modern software projects involve: mailing lists, IRC channels, Web sites, wikis, CVS repositories, and so on.

This article starts the development of a solution that meets the need of keeping software project information up-to-date: a vocabulary that can be used in an XML document for the Web-wide exchange of project details. In this first installment, I will outline the scope of the project, make implementation technology choices, and look at relevant existing work.

Goals, scope, and strategy

Every project needs a name. I have chosen to take inspiration from FOAF (Friend-of-a-friend) and christen the project DOAP, an abbreviation for "description of a project". Now that 90% of the difficult choices have been made, on to the rest!

A project such as this can easily get out of hand, with adverse results. If you create something that takes as much effort to implement as the status quo, or more, you are unlikely to succeed, whatever benefit your XML vocabulary may bring. The Web is littered with failed projects that tried to do too much. It's worth limiting yourself to a realistically small set of goals.

The limited requirements for the first iteration of the vocabulary will include the following:

  • Internationalizable description of a software project and its associated resources, including participants and Web resources
  • Basic tools to enable the easy creation and consumption of such descriptions
  • Interoperability with other popular Web metadata projects (RSS, FOAF, Dublin Core)
  • The ability to extend the vocabulary for specialist purposes

Specifically not in scope for the first iteration is the description of software releases. Work on this can be investigated as a follow-up initiative. Additionally, planning data internal to the project such as task assignments or milestones is out of scope. You don't want to go so far as to reinvent Microsoft Project!

Use cases for project descriptions include:

  • Easy importing of projects into software directories
  • Data exchange between software directories
  • Automatic configuration for resources such as shared CVS repositories or bug trackers
  • Assisting package maintainers who bundle software for distributors

Technology choices

Despite many years of vocabulary development, the choice of technology remains an open question. Various popular vocabularies that have found widespread deployment have employed different ways of specifying the terms. Take a look at some of these to see if you can glean good practice, or any useful warnings. See the Resources for links to all these specifications.

  • Dublin Core Metadata Element Set: This popular library metadata application uses a technology-independent means of expression, with accompanying specifications stating how the elements could be expressed in RDF/XML, HTML meta tags, and W3C XML Schema. Dublin Core has been very successful, but has some ambiguity in the interpretation of the semantics of its terms, leading to some interoperability issues. For example, Creator is specified as "an entity primarily responsible for making the content of the resource. Examples of a Creator include a person, an organisation, or a service. Typically, the name of a Creator should be used to indicate the entity." For computing purposes, the term "name" may have a pretty broad interpretation, and the definition above is only really effective for human consumption of the metadata. Neither is it clear whether the creating entity must be, for example, a human or a collection of humans.
  • RSS (RDF Site Summary/Really Simple Syndication): The many flavors of this specification have chosen different routes: Version 0.91 used an XML DTD with additional prose; version 1.0 used prose plus examples, with an informative RDF schema; and 2.0 is specified as prose with examples. Underspecification has been a persistent problem in RSS interoperability.
  • ebXML: Vocabularies from this electronic business project typically use a large amount of prose in their formal specification; examples, XML DTDs, and schemas are also provided.
  • HTML: Enormously successful by any standard, HTML proliferated largely on the back of ad-hoc examples. By the time it was more formally specified, it was too late. The cleanup operation has taken years. Poor interoperability has cost dearly in terms of time wasted on testing Web sites in multiple browsers, and indeed in supporting legacy behavior in the browsers themselves.

As DOAP is primarily intended for computer consumption, it seems plain that some kind of machine-readable schema for the vocabulary will be required. On the other hand, as humans will create the data, it is equally important that the human-readable information is sufficient to avoid interoperability problems through underspecification. One of DOAP's explicit goals is an interchange vocabulary, so it's important to minimize data loss through incompatible use of the terms. If you've ever tried to synchronize vCard data between devices you will know that, despite the data ostensibly conforming to the specification, each implementation has its own quirks that need workarounds.


XML or RDF?

One of the really attractive aspects of Dublin Core (DC) is its mapping to a variety of expressions in RDF, XML, and HTML. Such generality warms the heart of any software developer. That notwithstanding, it is probably fair to say that the majority of DC deployment on the Web has been within RDF.

The example of ebXML demonstrates that where interoperability and interchange are critical, a well-defined serialization is a must. This presents a thorny choice -- whether to choose an RDF or XML representation. For metadata applications, RDF is generally considered the first-choice language. RDF, unfortunately and undeservedly, has a reputation as a bit of a bogeyman due to its additional constraints over XML: You can't just write a tag soup with RDF and expect it to work, and you don't get the full benefit of using RDF unless you use RDF-aware processing tools. Many battles were fought over this in the development of RSS 1.0, which as a result tries to hide away its RDF-ness as much as possible.

A straight XML serialization has its difficulties, too. You have a choice of schema languages with which to define your document structure, each with different levels of expressivity and tool support. DTDs, while arguably still most widespread, do not offer a very expressive means of defining a document, and are generally held to be yesterday's technology. W3C XML Schema (WXS) is more flexible, but is a heavyweight solution whose acceptance is highest in the commercial software world -- it is certainly not human readable. RELAX NG is a promising newcomer, perhaps more understandable than WXS, and boasts easy conversion to WXS. It also has a human-readable compact syntax, making it more easily written by hand. Should an XML route be taken, RELAX NG seems the best bet, as it is readily converted to the other two, and easier to understand.

XML-only serialization presents some difficulties in the specification area. Whereas XML defines the syntax of a document well, it says nothing about the semantics of the elements. RDF Schema (and its bigger brother OWL, the W3C Web Ontology Language) allows you to say that a software project maintainer is a specialization (in RDF Schema terms, a sub-property) of the Dublin Core term "creator". Any RDF application that knew how to handle Dublin Core could then make at least basic use of DOAP data. By contrast, a straight XML document would have no meaning for an application that didn't have explicit code to process the DOAP namespace, even if it had the corresponding schema.
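
To make the maintainer-and-creator point concrete, here is a minimal sketch in plain Python, with triples modeled as tuples rather than handled by a real RDF toolkit. The DOAP namespace URI and the "maintainer" term are placeholders of my own, since the vocabulary has not been defined yet; only the Dublin Core and RDF Schema URIs are real. The relationship is written with rdfs:subPropertyOf, which is how RDF Schema relates one property to another.

```python
# Real namespace URIs.
DC = "http://purl.org/dc/elements/1.1/"
RDFS_SUBPROPERTY = "http://www.w3.org/2000/01/rdf-schema#subPropertyOf"

# ASSUMPTION: placeholder namespace and term; DOAP is not yet defined.
DOAP = "http://example.org/doap#"

# Schema statement: doap:maintainer is a sub-property of dc:creator.
schema = {(DOAP + "maintainer", RDFS_SUBPROPERTY, DC + "creator")}

# Instance data written purely in (hypothetical) DOAP terms.
data = {
    ("http://example.org/project/widget", DOAP + "maintainer", "Alice Example"),
}


def infer_superproperties(data, schema):
    """Return data plus every triple implied by rdfs:subPropertyOf."""
    parents = {}
    for s, p, o in schema:
        if p == RDFS_SUBPROPERTY:
            parents.setdefault(s, set()).add(o)

    inferred = set(data)
    queue = list(data)
    while queue:
        s, p, o = queue.pop()
        for parent in parents.get(p, ()):
            triple = (s, parent, o)
            if triple not in inferred:
                inferred.add(triple)
                queue.append(triple)  # follow chains of sub-properties
    return inferred


def creators(triples):
    """What a Dublin Core-only application would look for."""
    return sorted(o for s, p, o in triples if p == DC + "creator")
```

An application that only knows Dublin Core finds nothing in the raw data, but once the schema statement is applied, `creators(infer_superproperties(data, schema))` yields the maintainer without any DOAP-specific code.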

Lastly, a big unsolved problem remains in XML -- the namespace-mixing issue. Given two arbitrary vocabularies from different namespaces, how might these be mixed to create a compound vocabulary? This problem has no general solution, meaning that except where the solution is explicitly stated by means of another schema combining the two, each XML vocabulary remains an island. On the other hand, RDF has a well-specified solution. So, if you count mixing DOAP with other namespaces a priority, RDF may be a preferable solution.
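
As a rough illustration of why mixing is easier on the RDF side, the sketch below models three small graphs as Python sets of triples and merges them with a plain set union. The Dublin Core and FOAF namespace URIs are real; the DOAP namespace, the "maintainer" term, and the example.org resources are placeholders I have assumed for the purpose of the example.

```python
# Real namespace URIs.
DC = "http://purl.org/dc/elements/1.1/"
FOAF = "http://xmlns.com/foaf/0.1/"

# ASSUMPTION: placeholder namespace, term, and resource URIs.
DOAP = "http://example.org/doap#"
project = "http://example.org/project/widget"
person = "http://example.org/people/alice"

# Three descriptions, each written in a different vocabulary.
doap_graph = {(project, DOAP + "maintainer", person)}
foaf_graph = {(person, FOAF + "name", "Alice Example")}
dc_graph = {(project, DC + "title", "Widget")}

# Mixing vocabularies is just a union of statements;
# no combining schema is needed.
merged = doap_graph | foaf_graph | dc_graph


def maintainer_names(graph, project_uri):
    """Follow doap:maintainer to a person, then read that person's foaf:name."""
    people = {
        o for s, p, o in graph
        if s == project_uri and p == DOAP + "maintainer"
    }
    return sorted(
        o for s, p, o in graph
        if s in people and p == FOAF + "name"
    )
```

The query crosses two vocabularies, yet nothing had to declare in advance how they fit together. A real RDF toolkit would also need to take care with blank nodes when merging graphs, which this sketch ignores.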

To summarise: Should you choose straight XML, which may be simpler for people to understand, or RDF, with its flexibility and accompanying constraints?

Regular readers of my columns may already have guessed that my proclivities lie with the use of RDF. That is indeed the way that this project will proceed, since RDF is so well-suited to metadata applications. However, the problems outlined above will not be forgotten, and along the way I will look for ways to mitigate the perceived complexity of using RDF. It will definitely be advantageous if DOAP can be processed using normal XML tools.
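
To give a feel for what "normal XML tools" might mean in practice, here is a minimal sketch that reads an RDF/XML document with Python's standard ElementTree parser and no RDF library at all. The document, the DOAP namespace URI, and the element names (Project, name, homepage) are all hypothetical, invented for illustration; only the RDF namespace URI is real.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
# ASSUMPTION: placeholder namespace; element names below are hypothetical.
DOAP_NS = "http://example.org/doap#"
NS = {"rdf": RDF_NS, "doap": DOAP_NS}

DOC = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:doap="http://example.org/doap#">
  <doap:Project rdf:about="http://example.org/project/widget">
    <doap:name>Widget</doap:name>
    <doap:homepage rdf:resource="http://example.org/widget/"/>
  </doap:Project>
</rdf:RDF>"""


def read_projects(xml_text):
    """Extract project descriptions using only namespace-aware XML lookups."""
    root = ET.fromstring(xml_text)
    projects = []
    for node in root.findall("doap:Project", NS):
        homepage = node.find("doap:homepage", NS)
        projects.append({
            # Attributes in a namespace are addressed as {uri}localname.
            "uri": node.get(f"{{{RDF_NS}}}about"),
            "name": node.findtext("doap:name", default="", namespaces=NS),
            "homepage": (
                homepage.get(f"{{{RDF_NS}}}resource")
                if homepage is not None else None
            ),
        })
    return projects
```

This approach only works if documents stick to one predictable shape, because RDF/XML allows the same statements to be serialized in several equivalent ways. Constraining the serialization is one way such complexity could be mitigated.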

For the purposes of automated consumption, RDF Schema will be used to specify the DOAP vocabulary. It will be augmented by prose as much as possible. The FOAF specification (see Resources) takes this route, with some success.


Existing work

Now that the technology choices have been made, it's important to see what existing work relates to the goals of the project. Having reviewed this work, I'll make a start on the definition of the vocabulary in the next article in this series. Links to this work can be found in Resources, and are recommended reading.

  • Freshmeat XML export: The Freshmeat.net software registry provides an XML export of all its data, updated daily. They also provide a DTD for the XML format used. Leigh Dodds has done some work transforming this export into data using terms from FOAF.
  • Open Source Metadata Framework: This project focuses on metadata for documentation for open source projects, and thus shares some significant goals with DOAP. It is in wide deployment as part of the ScrollKeeper Open Documentation Cataloging Project.
  • PRJ Project Vocabulary: This vocabulary, created by Danny Ayers, is actually aimed at being a general project management vocabulary, irrespective of domain.
  • CPAN2FOAF: CPAN, the Comprehensive Perl Archive Network, is a large repository of Perl software. Dan Brickley has worked on converting authorship metadata into FOAF/RDF.
  • Description of a Software Project: This is the beginning of a DOAP-like vocabulary, created by Max Völkel.
  • RPMFind: This software location service uses RDF descriptions of software packaged using the RPM format. The metadata is very detailed about each software release.

Resources

