In the first article in this series, I introduced the project to build DOAP ("Description of a project"), an RDF/XML vocabulary for describing open source projects. DOAP will meet the needs of project maintainers who find they must register their software at myriad Web sites, and for anyone seeking to exchange such data. That article outlined existing work in this area, and defined the boundaries of the project.
This time, I will distill a set of terms that are candidates for inclusion in this vocabulary and talk about some of the difficulties inherent in specifying it. I will show you that the admirable aim of being able to share DOAP descriptions globally has some consequences for the design of this vocabulary.
Table 1 shows a survey of metadata terms used in various software directory Web sites and also in the open source metadata framework. For some terms, I have made a closest-approximation categorisation -- for example, equating "lead developer" with "maintainer." I have also excluded items relevant to software releases, which are out of scope for this stage of DOAP. The table is useful as a broad-brush survey of terms in circulation.
Table 1. Commonly used metadata terms in open source software directories
|rel lead developer||y||y||y||y|
In addition to the terms found in Table 1, here are a few other items that are commonly used in open source projects (mostly of a more social nature) to consider:
- Wikis: Often used to host development documentation.
- Other kinds of source repositories: Subversion, Arch, and BitKeeper are commonly used in addition to CVS.
- Additional project roles: Include at least "translator" and "tester."
- PGP public key: Many software releases are digitally signed with PGP to give some guarantee of authenticity.
Most of the sites surveyed in Table 1 employ a very simple model where the project forms the sole entity. The metadata items are then simple properties of this entity. Later, you will see that a case can be made for some of those property values to be complex entities themselves. An example of this is properties whose domain (the collection of permissible values for the property) is people. It is clearly desirable to record more than just a name to identify a person. However, it's important to strike a balance between completeness and overcomplicating the vocabulary.
Figure 1 shows a partial entity-relationship diagram for the vocabulary. These diagrams can prove helpful in cases that have multiple interacting entities. An alternative to entity-relationship modeling is to use UML, the Unified Modeling Language. Several other articles have examined in depth the application of UML to creating XML vocabularies, including those using W3C XML Schema (see Resources).
In this case, you have few entities and many attributes. You can probably manage well without constructing a complete set of diagrams. Due to the simple nature of the task, most of the challenges lie not in the modeling itself, but in making DOAP easy to create and process.
Figure 1. A partial entity relationship diagram for the new vocabulary.
So far you have accumulated a set of candidate terms to be included in your vocabulary. Choosing which to use is a matter of design, trial-and-error, and personal preference. In due course, you will need to construct some example usages in order to get a feel for how the vocabulary will work. You'll also need to test out your design ideas. Before then you must consider some of the problems you'll come up against while data modelling.
To efficiently manipulate the data expressed in your vocabulary, you need to nominate at least one property as identifying a project. This is analogous to making a column in a database a primary key. Unlike in the case of a database, however, a locally unique key will not do. The project key must be globally unique if DOAP descriptions are to be shared on the Web. Yet how can you administer this? One of the basic principles of DOAP is that it is decentralized. Descriptions can be created and distributed without registering on a particular Web site.
On the Web, a common way of globally identifying an item is to give it a URI. As every software project has a home page on the Web, it seems sensible to nominate the home page URI as the identifying property for a project. The only other major contender for the identifying property is the project's name. The weakness of using the name lies in the lack of an authority to appeal to in the case of duplicates. It is not uncommon for projects to choose the same name. In such cases, confusion often arises and the conflict may have no happy conclusion. With home page URIs (that is, URLs), the global authority of the Domain Name System (DNS) ensures no name clashes.
Using home page URLs has one obvious disadvantage. In the ideal world, cool URIs don't change (see Resources). In the real world they change all the time. A project maintainer may change ISPs or host institutions. The project might get a new maintainer with different resources. Or it may just be that a Web site is reorganised. Clearly you do not want all DOAP descriptions to be invalidated if such a thing happens.
To solve this problem you need an old home page property. A project can have more than one of these properties, and a new one can be added whenever the site is moved. You can then consider the old home page also to be an identifying property. The constraint is that no other project may ever use the old home page address.
How does this work out? Imagine you have descriptions of the same project contained in two independent DOAP files:
- One refers to a project that gives the home page property as http://example.org/xmlparser
- The second gives the home page property as http://example.org/projects/xml/parser and includes the URL http://example.org/xmlparser as an old home page property
Any processing agent can then figure out that these two descriptions refer to the same project.
This plan has been shown to work well in the FOAF (Friend-of-a-friend) project for expressing personal information and social networks. Find more details under the heading "Merging FOAF descriptions" in my article "Finding friends with XML and RDF."
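The merging step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `ProjectDescription` class, its field names, and the `merge_descriptions` function are hypothetical and not part of DOAP, and each description is assumed to have already been read from its DOAP file into memory.

```python
from dataclasses import dataclass, field


@dataclass
class ProjectDescription:
    """One project description as read from a single DOAP file (hypothetical shape)."""
    name: str
    homepage: str
    old_homepages: set[str] = field(default_factory=set)

    def identifying_uris(self) -> set[str]:
        # Both the current home page and any old home pages identify the project.
        return {self.homepage} | self.old_homepages


def merge_descriptions(descriptions):
    """Group descriptions that share at least one identifying URI."""
    groups = []  # list of (set of URIs, list of descriptions); URI sets stay disjoint
    for desc in descriptions:
        uris = set(desc.identifying_uris())
        members = [desc]
        remaining = []
        for group_uris, group_members in groups:
            if group_uris & uris:
                # Overlap: fold the existing group into the one being built.
                uris |= group_uris
                members = group_members + members
            else:
                remaining.append((group_uris, group_members))
        remaining.append((uris, members))
        groups = remaining
    return [members for _, members in groups]
```

Given the two descriptions from the example above, the function returns a single group containing both, because the second description lists http://example.org/xmlparser as an old home page.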
The desire for DOAP to be decentralized and global raises issues in areas other than the unique identification of projects. The range of values that properties can take must also be predictable in some way in order to perform useful processing over the global collection of DOAP information. To illustrate this, I'll take a look at the license property.
Data design 101. Humans can tell that there is no difference in intended meaning between GPL2, GNU General Public License, Version 2, and even http://www.gnu.org/licenses/gpl.html. Computers obviously cannot. The conventional database-inspired solution to a problem like this is to settle on an agreed-upon set of codes or abbreviations for the various licenses. Additionally, you will need an extension mechanism for when a custom license is used.
Unremarkable so far. Here's where an interesting aspect of using RDF/XML comes into play. In RDF/XML, the property may take two kinds of values: one is a resource, identified by a URI, and the other is a string literal. These literals may be datatyped, so you could define a W3C XML Schema enumeration to govern the permissible values (see Resources). The license property could then be one of, for example, GPL, BSD, Apache, and so on. If it is "Other" then you could add an extra text field to describe the alternative license.
The disadvantage of this approach is that an extra burden is placed upon those who process a DOAP file. They now need to take into account the presence of an extra schema and import the heavy machinery required to do W3C XML Schema validation. Even then, all they get is an opaque string that must be augmented with extra information if it is to be useful to a human observer. Using an enumeration also creates extra overhead for the maintainers of the DOAP vocabulary, as there is now an extra schema to take care of and distribute.
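To make the literal-based approach concrete, here is a minimal Python sketch of what a processor would have to do with an enumerated license code. It imitates the constraint an enumeration would impose; it does not perform W3C XML Schema validation. The `LicenseCode` enum and `parse_license` function are hypothetical names, and the code list is just the example values mentioned above.

```python
from enum import Enum
from typing import Optional


class LicenseCode(Enum):
    """Illustrative closed list of license codes (not a real DOAP list)."""
    GPL = "GPL"
    BSD = "BSD"
    APACHE = "Apache"
    OTHER = "Other"


def parse_license(code: str, other_text: Optional[str] = None):
    """Check a literal license value against the closed list."""
    try:
        license_code = LicenseCode(code)
    except ValueError:
        raise ValueError(f"unknown license code: {code!r}") from None
    if license_code is LicenseCode.OTHER:
        if not other_text:
            raise ValueError('"Other" requires a free-text description')
        return license_code, other_text
    # For known codes the extra text field is ignored.
    return license_code, None
```

Even in this small form, every processor has to carry the code list with it, and a value such as "GPL2" is rejected outright rather than being understood.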
You gain extra flexibility if you use a resource instead of a literal. You can then allocate URIs in space you control to denote licenses. For example, http://example.org/doap/licenses/GPL could be used for the GNU General Public License. (The domain "example.org" is used illustratively here.) You can also put a Web page at that location with further information about the GPL, including its full text. As an additional courtesy, you can publish the complete list of licenses you support at http://example.org/doap/licenses/. This adds no extra overhead for a DOAP processor. It is as easy to look for the string http://example.org/doap/licenses/GPL as it is the string GPL. You can make things even easier for processors if you create an RDF file hosted at the .../doap/licenses/ URL that contains a computer-processible license list, augmented with handy data such as labels and descriptions for each license.
This technique also neatly solves the extensibility problem. Imagine that you, Acme Corp., create your own Acme Open Source License. All you need to do is guarantee that you control a URI similar to http://acme.com/license/AOSL and use that as the value of the license property in DOAP descriptions. And if you're a good citizen you'll put an explanatory Web page at that URI.
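By contrast, a processor that treats license values as opaque URIs needs very little machinery. The following sketch assumes the license URIs have already been extracted from DOAP descriptions as plain strings. The `LICENSE_LABELS` table stands in for the machine-readable list that might be published at the .../doap/licenses/ URL, and the function names and project names are hypothetical.

```python
from collections import Counter

GPL_URI = "http://example.org/doap/licenses/GPL"
AOSL_URI = "http://acme.com/license/AOSL"

# Hypothetical label table, standing in for a published license list.
LICENSE_LABELS = {
    GPL_URI: "GNU General Public License",
}


def license_label(uri: str) -> str:
    """Return a human-readable label, falling back to the URI itself."""
    # An unknown URI is still a valid identifier; we simply have no label for it.
    return LICENSE_LABELS.get(uri, uri)


def count_by_license(projects):
    """Count projects per license URI; projects is a list of (name, license URI)."""
    return Counter(license_uri for _name, license_uri in projects)
```

Note that the Acme license needs no change to the code: its URI is compared and counted like any other, and a human reader can follow it to find out more.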
Using resource URIs in this way has two disadvantages. The first is the simple matter that it's easier to type "GPL" than the full URI suggested above. This is not a large problem and can be ameliorated somewhat by providing shortcut syntaxes or tool support later in the project. In RDF, labels can be used to provide human-readable interpretations of resource URIs. It's certainly less of a problem than either having a free-for-all string or the burden of bringing in schema validation.
The second and more serious disadvantage is the legitimate concern that you, as DOAP's maintainer, might lose control of the URI space http://example.org/doap. While in the short term the URIs could continue to be used without invalidating their status as opaque identifiers, considerable confusion could arise if the content at that URI is changed or removed. Two common means of addressing this are available today:
- Use a service such as purl.org (see Resources) that makes some warranty of longevity for URIs registered with it, or affiliate DOAP with a standards organisation such as OASIS that can make a similar guarantee.
- Use a Uniform Resource Name (URN -- see Resources) rather than a URL.
URNs provide a managed namespace through the Internet Assigned Numbers Authority (IANA). The portion of the URN namespace allotted to DOAP can then be managed through documents submitted to the Internet Engineering Task Force (IETF). Unfortunately, this process is indescribably unwieldy and probably unsuitable for a project such as this.
Finally, be aware that it may not always be best to use a resource to represent such constants. Resources are best suited to situations where extensibility and further investigation are helpful. Occasionally, you may trade this off against the convenience of using short strings.
It is useful to look at the approach taken by the Creative Commons project (see Resources). Creative Commons allows the application of flexible licenses to electronic media, intended to expand the body of creative work available for others to build on and share. They take the approach of denoting a license through a URI as advocated above.
This article has attempted to address some common issues, inspired in part by experience with the FOAF vocabulary. Taking the Web-wide perspective that RDF brings has both advantages and disadvantages. This situation can be characterised generally as a trade-off between verbosity and flexibility. Yet as any programmer knows, it would be a mistake to optimise prematurely. I will follow the Web-wide ethos to its conclusion, and then show you what may be done to lower the barrier to entry for new DOAP users.
The next article in this series will continue the design of the vocabulary to a point where you can start experimenting with tools and test data.
- Get a good overview of the use of UML in creating XML data models with Dave Carlson's article "Modeling XML Vocabularies with UML."
- Will Provost's article "UML For W3C XML Schema Design" goes into more detail about how to use UML with the W3C's XML Schema language.
- Learn more about the development of a W3C XML Schema-based vocabulary for financial reporting in "Design of the XBRL specification," a paper delivered at XML Europe by David Vun Kannon and Yufei Wang. Rational Rose was used as a modeling tool.
- Check out Tim Berners-Lee's "Cool URIs don't change" article as he argues passionately that "URIs don't change: people change them."
- Read Edd Dumbill's article "Finding friends with XML and RDF" (developerWorks, June 2002), which explains how identifying properties can be used to merge independent descriptions.
- Find out how W3C XML Schema data types can be used in conjunction with RDF.
- Discover more about persistent URLs provided as a service on purl.org, which is managed by the Online Computer Library Center (OCLC). OCLC is committed to the longevity of the service, which permits the registration of a Persistent URL with a redirect to its current "real home" on the Web.
- Learn about URNs as defined by a series of specifications submitted to the IETF.
- Take a look at the Creative Commons project, which has a framework for describing and applying licenses to digital media. They use URIs to denote a license, and also use RDF to express the rights attached to a media item. Uche Ogbuji examines Creative Commons in his Thinking XML column, "The commons of creativity" (developerWorks, May 2003).
- Review other articles in this series: part 1 introduces the DOAP project (developerWorks, February 2004), while part 3 presents a schema for the new vocabulary and example project descriptions (developerWorks, June 2004).
- Find more XML resources on the developerWorks XML zone. Read previous installments in the XML Watch column series.
- Browse for books on these and other technical topics.
- Find out how you can become an IBM Certified Developer in XML and related technologies.
Edd Dumbill is managing editor of XML.com and the editor and publisher of the XML developer news site XMLhack. He is co-author of O'Reilly's Programming Web Services with XML-RPC, and co-founder and adviser to the Pharmalicensing life sciences intellectual property exchange. Edd is also program chair of the XML Europe conference. You can contact him at email@example.com.