Technical standards in education, Part 3: Open repositories for scholarly communication

Enhancing access to research

Universities and research institutions use open repositories to enhance how they manage the outputs of their research activities, and make that research available to a worldwide audience. This article outlines the history and challenges of scholarly communication in today's open environment. It describes some of the different standards and technical challenges relating to collecting, storing, preserving, transferring, and providing access to research using open repositories.


Stuart Lewis, Information Systems Professional

Stuart Lewis has worked with open repositories in various roles over the past six years. Currently, he is the Digital Development Manager at The University of Auckland Library in New Zealand. Also, he is the Community Manager of the SWORD project, which continues to develop the SWORD repository deposit standard. Stuart is one of the core developers and committers for the DSpace open source repository platform. He maintains the EasyDeposit SWORD client creation toolkit, and the Repository66 mashup map of open repositories. Prior to working in Auckland, Stuart worked in a UK university where he led a technical team that undertook funded research into open repositories, including open access and data repositories. He was a key player in the creation of the UK's Repository Support Project (RSP), a support and guidance service to higher education institutions with respect to open repositories. Stuart blogs at

19 January 2011



Research is the endeavor of seeking new knowledge and improvements in life and society. It covers subjects as diverse as physics or history, genetics or kinetics, the desire to create new construction materials or to understand the human mind. The world of scholarly communication sits at the heart of research. It allows researchers to disseminate their findings to others through the use of published journal articles, books, conferences, or new media formats. Without this dissemination, the value of the research could be unrealized.

These methods of dissemination have powered research and research institutions such as universities, government laboratories, and charitable research trusts for hundreds of years. Royal Societies were some of the first to publish collections of research, followed by academic societies.

With over $300 billion spent on research and development annually in the U.S. alone (see Resources), the effective dissemination of research and research data is of increasing importance to those who fund the research.

One important element of the scholarly communication system is the use of peer-review. Before a paper is published, it is often subjected to peer-review where it is independently and anonymously scrutinized by a panel of experts in the field. This ensures the research findings are rigorous and relevant. Research outputs that have been subjected to peer-review are naturally held in higher esteem than those that are not.

But is the world of scholarly communication performing as well as it could?


Open repositories, shared standards, and common interoperability protocols are required to collect, preserve, and provide access to scholarly research. The key word in all of these aspects is open. Through the use of open standards and open systems, the scholarly communication world can operate online in more powerful ways than ever before, allowing research to become more accessible and reusable.

One of the buzzwords of the past few years is open:

  • open platforms
  • open architectures
  • open standards
  • open source

It is the same with scholarly communication. A plethora of open philosophies, systems, and standards has evolved that many people believe can improve the way the system works.

Open access

Although it has existed for many years, the open access movement has taken off over the past 15 years. Just as the introduction of the printing press allowed knowledge to spread more rapidly than at any time previously, the rapid development and uptake of the World Wide Web holds the possibility of a similarly large impact on the way research findings are disseminated. However, many people in the field of research believe that this potential is not being realized because research is often locked up inside publications.

Research is often published through journals or periodicals that require subscription to access and read the contents. Libraries within research institutions spend large amounts of money on subscriptions to journals so that their researchers can have access to the published research findings. Therefore the research institution pays twice to access the research:

  1. The institution pays the researcher to undertake and to publish the research.
  2. The institution pays the publisher to access the published copy of the research.

As much of the research is funded either by the taxpayer via grants from government-funded research programs or by charitable research trusts, many believe that it is unfair, or even wrong, that the research must be paid for twice, with commercial publishers making profits from the system, and the taxpayer not having free access.

This belief spawned the notion of open access publishing. Research can either be self-archived for free by researchers online in open repositories, or a fee can be paid to the publisher to make the research outputs freely available without the need for a subscription.

Understandably there are disagreements within the research and publishing communities about open access and whether it is a good thing, how it should be funded, and whether it should be compulsory. However, many research funders think that funded research should be available for free and require the outputs of any research that they fund to be available in this way (see Resources).

Whatever your beliefs in this area, there are interesting technical challenges relating to collecting, storing, preserving, transferring, and providing access to research in an open fashion. Furthermore, over the past decade interest has grown in unlocking the research that is often hidden in unpublished works such as electronic theses and dissertations, data sets, and the so-called gray literature. Placing these works online unlocks the research for access and use.

Open repositories

Open repositories were born from a desire to make research materials available online. In essence, an open repository is simply a database-driven website of files and descriptive metadata. To perform effectively, however, a repository usually needs to be able to:

  • Ingest and store materials.
  • Accurately describe contained materials.
  • Manage materials and their descriptions.
  • Preserve materials over a long period of time.
  • Disseminate stored materials.

These abilities are summarized in the Reference Model for an Open Archival Information System (OAIS) ISO standard (see Resources). In this model, items are created by producers, used by consumers, and managed by the system. Open repositories provide these three core functions.

To describe their contents, repositories make use of metadata, that is, data about the data contained. There are several types of metadata used for different purposes. The most obvious is descriptive metadata that provides a description of the item itself. This typically includes fields such as title, creator, date, description, or publisher. Metadata needs to be encoded within an open standard to ensure that it can be read and understood. Commonly-used standards for descriptive metadata include Dublin Core and MODS (Metadata Objects Description Schema) (see Resources).
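As a rough illustration of descriptive metadata encoded in an open standard, the following sketch builds a flat Dublin Core record with Python's standard library. The field values are taken from the example record in Listing 1; the wrapper element name `record` and the helper `build_dc_record` are assumptions made for this example and are not part of any repository platform.

```python
import xml.etree.ElementTree as ET

# Namespace of the 15 unqualified Dublin Core elements.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def build_dc_record(fields):
    """Build a flat Dublin Core record from (element, value) pairs.

    Elements may repeat, for example one dc:creator per author.
    """
    root = ET.Element("record")  # hypothetical wrapper element
    for name, value in fields:
        child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        child.text = value
    return root


record = build_dc_record([
    ("title", "If SWORD is the answer, what is the question?"),
    ("creator", "Lewis, Stuart"),
    ("creator", "Hayes, Leonie"),
    ("date", "2009"),
    ("type", "Journal Article"),
])
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

Because every element lives in the shared Dublin Core namespace, any system that understands the standard can read the title, creators, and date without knowing which repository produced them.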

In addition to descriptive metadata, other forms of metadata can be stored. For example, preservation metadata is intended to assist with preserving the item over a long period of time. Preservation metadata might include the file format of the item, the version of software required to view the item, the size of the item, the checksum of the item's files, and the software used to create the item.

As well as requiring an accurate description of an item, preservation depends on other activities that need to be undertaken by a repository. Some of these activities are simple, such as the regular checking of the file checksums of items to ensure that they are not suffering from bit-rot. Other activities may be more complex, such as migrating file formats when old formats become obsolete, or performing automatic file identification to classify file types.
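The regular checksum check described above can be sketched in a few lines. This is a minimal illustration, assuming the repository stored a size and a SHA-256 checksum for each file at ingest; the function names and the choice of SHA-256 are assumptions made here, not a description of how any particular platform implements it.

```python
import hashlib
import os
import tempfile


def file_checksum(path, algorithm="sha256", chunk_size=65536):
    """Compute a checksum by reading the file in chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def describe_file(path):
    """Collect the preservation facts recorded at ingest time."""
    return {"size": os.path.getsize(path), "checksum": file_checksum(path)}


def fixity_ok(path, stored):
    """Return True if the file still matches its stored description."""
    return describe_file(path) == stored


# Demonstration with a temporary file standing in for a stored item.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "thesis.pdf")
    with open(path, "wb") as handle:
        handle.write(b"original bytes")
    stored = describe_file(path)
    intact_before = fixity_ok(path, stored)

    # Simulate bit-rot: same length, one byte changed.
    with open(path, "wb") as handle:
        handle.write(b"original bytez")
    intact_after = fixity_ok(path, stored)

print(intact_before, intact_after)
```

The size alone would not catch the simulated corruption, since the file length is unchanged; the checksum comparison does.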

For the dissemination of items, an important role played by the repository is the provision of persistent URLs. Persistent URLs are designed to ensure that anyone citing an item by its URL will be able to retrieve the item many years later by using the same URL. The provision of persistent URLs comes from two layers. The first and most important layer is the use of identifiers that uniquely identify a single item in the repository. An optional second layer employed by many repositories is the use of a third party persistent identifier service. For example, the DSpace open source repository platform typically makes use of the Corporation for National Research Initiatives' (CNRI) "Handle" system, although there are alternatives such as the Persistent-URL (PURL) system. These services work by employing an extra level of indirection. The PURL points to a third-party domain that in turn resolves the URL to the repository URL.

For example, redirects users to

The persistent URL is made up of three parts:

  1.— The URL of the persistent URL service
  2. /2292/— The identifier of the repository
  3. 5315— The identifier of the item within the repository

The aim of the persistent URL services is that if the repository changes in any way, for example, the software changes, the domain name changes, or the identifiers change, then the persistent URL handler can be updated to ensure that it redirects users to the new URLs.
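The extra level of indirection can be illustrated with a toy resolver. This sketch is not how the Handle or PURL services are implemented; the class, its methods, and the example.org domains are hypothetical, and only the identifiers 2292 and 5315 come from the example above.

```python
class PersistentUrlResolver:
    """Maps persistent paths such as /2292/5315 to current repository URLs."""

    def __init__(self):
        self._targets = {}

    def register(self, repository_id, base_url):
        """Record (or update) where a repository's items currently live."""
        self._targets[repository_id] = base_url.rstrip("/")

    def resolve(self, persistent_path):
        """Split the path into repository and item identifiers and redirect."""
        repository_id, item_id = persistent_path.strip("/").split("/", 1)
        return f"{self._targets[repository_id]}/{item_id}"


resolver = PersistentUrlResolver()
resolver.register("2292", "https://old-repository.example.org/items/")
before_move = resolver.resolve("/2292/5315")

# The repository moves to a new domain; only the resolver entry changes.
resolver.register("2292", "https://new-repository.example.org/items/")
after_move = resolver.resolve("/2292/5315")

print(before_move)
print(after_move)
```

A citation that uses the persistent path keeps working after the move, because the citing document never contained the repository's own URL.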

The history of open repositories

Arguably, the first and most renowned open repository is the arXiv (pronounced "archive") repository of scientific pre-prints (see Resources). It was created in 1991 and falls within the classification of a subject repository, as it only holds items related to physics, math, computer science, and related subjects. arXiv holds over half a million pre-prints, which are articles that have been written but have not yet been subjected to peer-review or formally accepted for publication in a traditional journal or conference. In fast-moving research environments, researchers want to share their work in this way to circumvent the time it takes for traditional publishing to take place.

There are several other noted subject repositories for different disciplines. These include RePEc (Research Papers in Economics) and E-LIS (Eprints in Library and Information Science). See the Resources section at the end of this article for links.

Following wider interest in creating open repositories, an open source repository platform called EPrints was created in 2000. This software, developed at the University of Southampton School of Electronics and Computer Science, is written in Perl and is still one of the leading platforms used for creating open repositories.

In 2002, a collaboration between Hewlett Packard Research Laboratories and MIT released the open source DSpace repository platform. DSpace was developed in Java and JSPs, while more recent versions also include a Cocoon and XSLT user interface.

Another major player in the open source repository platform world is Flexible Extensible Digital Object Repository Architecture (Fedora), which was originally developed by Cornell University's Digital Library Research Group. Fedora differs from EPrints and DSpace by not having a full end-user interface included with the core platform. Consequently there are several projects or groups that develop and support different user interfaces.

A more recent entrant to the open source repository arena is Microsoft® with their Zentity repository in 2008. The platform is built on the Microsoft technology stack including .Net and SQL Server.

The DSpace, EPrints, Fedora, and Zentity open source repositories are well maintained and backed by foundations or commercial services. In 2007, the DSpace Foundation and Fedora Commons were created as not-for-profit organizations to ensure the ongoing development and sustainability of these platforms. In 2009, the DSpace Foundation and Fedora Commons merged to create the DuraSpace organization, which is seeking ways to enable the two platforms to work closely together. EPrints runs a commercial service providing hosting, development, customization, and integration of its software.

In addition to the open source repository platforms, there are a number of commercial options. The largest provider of a commercial open repository system is BEPress with their Digital Commons product. There are hosted solutions based on the open source platforms, such as BioMed Central's Open Repository based on DSpace, and EPrint Services' hosted EPrints system.

The majority of the EPrints, DSpace, and Fedora installations could be classified as institutional repositories. These are typically provided by a research institute, university, or department for use by their researchers.

Another type of repository that is becoming more popular is the learning object repository. These hold copies of learning objects, or modules of information that can teach a particular skill or subject. Universities put a lot of effort into creating learning materials, and as with research, there are many good arguments for sharing these rather than keeping them locked up in a particular institution. Good examples of learning object or courseware repositories include MIT's OpenCourseWare site and Apple's iTunes U.

There are two websites that track the growth of open repositories:

  • ROAR: Repository of Open Access Repositories
  • OpenDOAR: Open Directory of Open Access Repositories

Figure 1 is a graph from ROAR that shows how the number of repositories has grown from humble beginnings in the 1990s, to a rapid increase over the last decade.

Figure 1. The growth of repositories and the records they contain (from ROAR)

Figure 2 shows the map from the Repository66 website, which is a mashup of data from ROAR and OpenDOAR, and a Google map to display the distribution and type of repositories across the globe. It currently displays over 1,650 repositories, which contain more than 27 million items.

Figure 2. The distribution and types of repositories across the world according to

Open standards

To operate effectively together, open repositories require open standards. There are open standards for each of the common operations that can be undertaken with repositories. Such operations include harvesting, searching, depositing, authentication, and describing contents. Two particular open standards that have become core to repository interoperability are Open Archives Initiative's Protocol for Metadata Harvesting and SWORD.

Open Archives Initiative's Protocol for Metadata Harvesting

The oldest and most widely known of the repository-related open standards is the Open Archives Initiative's Protocol for Metadata Harvesting (OAI-PMH). This protocol allows the contents of open repositories to be harvested by other systems. OAI-PMH provides a method by which larger aggregating search systems can provide a federated search service across multiple repositories. Rather than having to rely on screen-scraping, OAI-PMH allows search providers to harvest the raw structured metadata from repositories. This can yield more powerful search mechanisms as data is harvested in specific fields such as title, creator, abstract, and keywords.

OAI-PMH is an XML-based protocol that is based on a number of verbs. These verbs, along with extra arguments, are used to instruct the repository to describe its contents.

  • Identify: The repository can provide identity information about itself, including attributes such as its name, URL, contact email address and what options the interface supports.
  • ListMetadataFormats: OAI-PMH interfaces can expose the metadata of the contained items in different metadata formats or standards. This verb will list the metadata formats that the repository supports. Requests using other verbs can then specify the metadata format.
  • ListSets: A repository can partition its items into sets. A set can be analogous to a particular collection. This is useful if only a subset of a repository needs to be harvested. The ListSets verb lists all of the sets contained within the repository. Requests using other verbs can state which set to harvest.
  • ListRecords: This verb provides one of the main ways to harvest data. It lists all of the records in the repository that match the arguments passed. The metadataPrefix argument (a value taken from the ListMetadataFormats response) must be given to state which metadata format the records should be expressed in. Optional arguments can refine the harvest: from and until dates, and a particular set. Harvesters typically perform a full harvest first, and then periodic incremental harvests that use the from argument to harvest only recently added items.
  • GetRecord: It is possible to retrieve a single item using OAI-PMH by using this verb and specifying the identifier of the item to retrieve.
  • ListIdentifiers: This verb is identical to the ListRecords verb, except that only the identifiers of matching records are returned rather than their full records. This method is sometimes used if a harvester wants to harvest items individually. They can first get a list of items to harvest, and then retrieve each one individually using the GetRecord verb.
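The verbs above are sent as ordinary HTTP query parameters, so a harvester can construct its requests with simple string handling. The following sketch builds request URLs without contacting any server; the base URL, the set name, the date, and the record identifier are hypothetical, while the verb and argument names are the ones described in the list.

```python
from urllib.parse import urlencode

# Hypothetical OAI-PMH endpoint; a real repository publishes its own base URL.
BASE_URL = "https://repository.example.org/oai/request"


def oai_request_url(base_url, verb, **arguments):
    """Build an OAI-PMH request URL for a verb and its arguments.

    "from" is a reserved word in Python, so callers pass from_ instead
    and the trailing underscore is removed here.
    """
    query = {"verb": verb}
    for name, value in arguments.items():
        query[name.rstrip("_")] = value
    return f"{base_url}?{urlencode(query)}"


# A full harvest of every record in unqualified Dublin Core.
full_harvest = oai_request_url(BASE_URL, "ListRecords", metadataPrefix="oai_dc")

# An incremental harvest of one set, limited to recently changed records.
incremental = oai_request_url(
    BASE_URL, "ListRecords",
    metadataPrefix="oai_dc", from_="2011-01-01", set="theses",
)

# A single record, fetched by its identifier.
single_record = oai_request_url(
    BASE_URL, "GetRecord",
    metadataPrefix="oai_dc",
    identifier="oai:repository.example.org:2292/5315",
)

print(full_harvest)
print(incremental)
print(single_record)
```

A harvester that prefers to fetch items one at a time would issue the same kind of request with the ListIdentifiers verb, then call GetRecord for each identifier returned.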

Listing 1 shows an example response to an OAI-PMH GetRecord request.

The code shows the two sections of a typical OAI-PMH response. The first is the header, which echoes back details of the request and the time that it was made. The second section shows the response to the requested action. In this case it is a single record, split into its header that describes the item (its identifier, when it was last modified, and which sets it belongs to) and the metadata in the requested oai_dc format.

Listing 1. An example OAI-PMH response to a GetRecord request
<?xml version="1.0" encoding="UTF-8" ?>
<OAI-PMH xmlns=""
  <request identifier="" metadataPrefix="oai_dc" verb="GetRecord">

  <oai_dc:dc xmlns:oai_dc=""
    <dc:title>If SWORD is the answer, what is the question? Use of the Simple
      Web-service Offering Repository Deposit protocol</dc:title>
    <dc:creator>Lewis, Stuart</dc:creator>
    <dc:creator>Hayes, Leonie</dc:creator>
    <dc:creator>Newton-Wade, Vanessa</dc:creator>
    <dc:creator>Corfield, Antony</dc:creator>
    <dc:creator>Davis, Richard</dc:creator>
    <dc:creator>Wilson, Scott</dc:creator>
    <dc:description>Purpose - To describe the repository deposit protocol,
      Simple Web-service Offering Repository Deposit (SWORD), its development
      iteration, and some of its potential use cases. In addition, seven case
      studies of institutional use of SWORD are provided.  Approach - The paper
      describes the recent development cycle of the SWORD standard, with issues
      being identified and overcome with a subsequent version. Use cases and
      case studies of the new standard in action are included to demonstrate
      the wide range of practical uses of the SWORD standard. </dc:description>
    <dc:type>Journal Article</dc:type>
    <dc:identifier>Program: electronic library and information systems 43 (4),
      407-418. (2009)</dc:identifier>
    <dc:identifier>
    <dc:relation>Program: electronic library and information
    <dc:rights>Items in ResearchSpace are protected by copyright, with all
      rights reserved, unless otherwise indicated. Previously published items
      are made available in accordance with the copyright policy of the

A useful feature of OAI-PMH from a development perspective is that every repository must be able to return metadata for all items in the oai_dc format. This format returns unqualified Dublin Core metadata. Dublin Core is a relatively simple metadata schema made up of 15 elements including title, creator, description, and date. Because support for this format is mandatory, a harvester can harvest content from any repository.
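Once a response has been retrieved, a harvester needs to separate the record header from the Dublin Core fields. The following sketch parses a shortened sample response with Python's standard library. The sample is an illustration written for this example rather than a copy of Listing 1: the namespaces are those defined by OAI-PMH 2.0 and Dublin Core, but the identifier, datestamp, and set name are hypothetical.

```python
import xml.etree.ElementTree as ET

NAMESPACES = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Shortened, illustrative GetRecord response (hypothetical identifier and set).
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2010-10-04T18:00:00Z</responseDate>
  <request verb="GetRecord" metadataPrefix="oai_dc">https://repository.example.org/oai/request</request>
  <GetRecord>
    <record>
      <header>
        <identifier>oai:repository.example.org:2292/5315</identifier>
        <datestamp>2010-10-04T18:00:00Z</datestamp>
        <setSpec>articles</setSpec>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>If SWORD is the answer, what is the question?</dc:title>
          <dc:creator>Lewis, Stuart</dc:creator>
          <dc:creator>Hayes, Leonie</dc:creator>
          <dc:type>Journal Article</dc:type>
        </oai_dc:dc>
      </metadata>
    </record>
  </GetRecord>
</OAI-PMH>"""


def parse_get_record(xml_text):
    """Return the header details and Dublin Core fields of a GetRecord response."""
    root = ET.fromstring(xml_text)
    record = root.find("oai:GetRecord/oai:record", NAMESPACES)
    header = record.find("oai:header", NAMESPACES)
    dublin_core = record.find("oai:metadata/oai_dc:dc", NAMESPACES)

    fields = {}
    for element in dublin_core:
        name = element.tag.split("}", 1)[1]  # drop the {namespace} prefix
        fields.setdefault(name, []).append((element.text or "").strip())

    return {
        "identifier": header.findtext("oai:identifier", namespaces=NAMESPACES),
        "datestamp": header.findtext("oai:datestamp", namespaces=NAMESPACES),
        "sets": [s.text for s in header.findall("oai:setSpec", NAMESPACES)],
        "fields": fields,
    }


parsed = parse_get_record(SAMPLE_RESPONSE)
print(parsed["identifier"], parsed["fields"]["creator"])
```

Each Dublin Core element is collected into a list because elements such as creator can legitimately appear more than once in a record.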


SWORD

While OAI-PMH offers a standardized way to harvest the contents of repositories, SWORD offers a standardized way to deposit resources into repositories (see Resources for more information). SWORD is an acronym that stands for Simple Web-service Offering Repository Deposit. The standard was first developed in 2007 by a consortium of UK universities with funding from the UK's Joint Information Systems Committee (JISC).

SWORD is a specialized profile of the AtomPub standard (see Resources) that provides a common protocol for creating web resources. The SWORD specification adds new extensions that allow it to fit with the requirements of repositories. These include the ability to perform a mediated deposit on behalf of another user, and to specify not only the MIME type of the file being deposited, but also the packaging format used to create the file being deposited.

AtomPub and SWORD interfaces provide two common elements in order to facilitate deposit:

  • Service Document: Each repository or AtomPub endpoint publishes a service document that describes to a user or client tool which areas of the repository or website they can deposit into, what the policies of that collection are, and the URL required to perform deposits.
  • Deposit URL: The deposit URLs described in the service document are used to accept deposits into the repository. Deposits may be accepted automatically or may be subject to administrative workflow. Responses to deposits are returned in the form of an Atom Document.

AtomPub is built around HTTP verbs, with GET being used to retrieve a service document, POST for creating new resources, PUT to update existing resources, and DELETE to remove resources.

Requests for service documents and deposits of new resources are typically controlled by an authentication mechanism. This ensures that the service document only lists the collections into which a user can deposit items, and that the repository deposit URL knows who is making the deposit and ensures that they have the authorization to do so. SWORD interfaces typically use HTTP basic authentication.
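A deposit is therefore an authenticated HTTP POST. The following sketch prepares such a request with Python's standard library but does not send it. The URL, credentials, and user names are hypothetical. The X-On-Behalf-Of and X-Packaging header names are assumed here to follow the SWORD 1.3 profile for mediated deposit and packaging; check them against the version of the specification that a given repository implements.

```python
import base64
import urllib.request


def build_deposit_request(deposit_url, package, username, password,
                          packaging=None, on_behalf_of=None):
    """Prepare (but do not send) a SWORD-style deposit request."""
    credentials = f"{username}:{password}".encode("utf-8")
    headers = {
        "Content-Type": "application/zip",
        "User-Agent": "SWORD client XYZ",
        # HTTP basic authentication: base64 of "username:password".
        "Authorization": "Basic " + base64.b64encode(credentials).decode("ascii"),
    }
    if packaging:
        headers["X-Packaging"] = packaging  # assumed SWORD 1.3 header name
    if on_behalf_of:
        headers["X-On-Behalf-Of"] = on_behalf_of  # assumed SWORD 1.3 header name
    return urllib.request.Request(
        deposit_url, data=package, headers=headers, method="POST"
    )


deposit = build_deposit_request(
    "https://repository.example.org/sword/deposit-bio-images",  # hypothetical
    b"zip bytes would go here",
    username="depositor",
    password="secret",
    on_behalf_of="researcher1",
)

# urllib stores header names with only the first letter capitalized;
# HTTP header names are case-insensitive, so this does not affect the server.
print(deposit.get_method(), deposit.full_url)
print(deposit.header_items())
```

Passing the prepared request to `urllib.request.urlopen` would perform the deposit. Basic authentication sends credentials in an easily decoded form, so a real client would use it only over HTTPS.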

In contrast to AtomPub where a deposit may be a simple file such as an image posted into a blog entry, repositories typically require a more complex deposit package containing descriptive metadata along with the file(s) to deposit. Presently, while many packaging formats exist they tend to be specific to each repository platform or to particular resource genres. There is no specific packaging format that all SWORD end-points must accept, which is sometimes cited as a barrier to the use of SWORD.
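One package layout, a ZIP file that holds a METS manifest alongside the content files, can be assembled in memory as sketched below. The manifest file name `mets.xml`, the helper name, and the near-empty manifest are assumptions for illustration; a real manifest would describe each file and carry the descriptive metadata, and each repository documents the exact layout it accepts.

```python
import io
import zipfile

# Placeholder manifest: only the METS root element, with no content described.
PLACEHOLDER_MANIFEST = '<mets xmlns="http://www.loc.gov/METS/"></mets>'


def build_package(manifest_xml, files, manifest_name="mets.xml"):
    """Return the bytes of a ZIP package holding a manifest plus content files.

    files maps file names to their byte content.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(manifest_name, manifest_xml)
        for name, content in files.items():
            archive.writestr(name, content)
    return buffer.getvalue()


package = build_package(PLACEHOLDER_MANIFEST, {"image.jpg": b"fake image bytes"})

# Re-open the package to confirm what a repository would receive.
with zipfile.ZipFile(io.BytesIO(package)) as archive:
    package_contents = archive.namelist()

print(len(package), package_contents)
```

The resulting bytes are what a client would send as the body of the deposit request, with a Content-Type of application/zip.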

Just as users require web browsers to interact with web servers, SWORD clients are required to interact with repository SWORD end-points. SWORD clients are usually either custom built for a specific purpose or repository, or generic enough for use with any repository. Specific-purpose clients may be developed for very specialized tasks, such as allowing automated laboratory equipment to deposit data files into a repository. Examples of more generic clients include a Facebook client for depositing from within Facebook and posting details of the deposit onto a user's news feed.

Repositories may be quite specific about the types of resources that they will accept for deposit, and these requirements are described by the service document. Listing 2 shows an example response to the request for a service document. In this example, there is only one collection available into which the user may deposit, and that collection will only accept deposits in the form of packages made up of a ZIP file containing a METS metadata manifest along with the files.

Listing 2. An example SWORD service document
<?xml version="1.0" encoding='utf-8'?>
<service xmlns=""
  <atom:title>Main Site</atom:title>
  <atom:title>Biological image library</atom:title>
  <dcterms:abstract>This is a collection that allows deposits into the
    collection of biological images.</dcterms:abstract>
  <sword:treatment>Images deposited into this collection will be converted
    into JPEG2000 format upon ingest.</sword:treatment>
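A client reads the service document to discover where it may deposit and what each collection accepts. The sketch below parses a complete sample document modeled on Listing 2. The element names and namespaces are assumed to follow AtomPub and the SWORD 1.3 profile; the deposit URL and the packaging identifier in the sample are illustrative values, not taken from the listing.

```python
import xml.etree.ElementTree as ET

NS = {
    "app": "http://www.w3.org/2007/app",
    "atom": "http://www.w3.org/2005/Atom",
    "sword": "http://purl.org/net/sword/",
    "dcterms": "http://purl.org/dc/terms/",
}

# Illustrative service document with one workspace and one collection.
SAMPLE_SERVICE_DOCUMENT = """<service xmlns="http://www.w3.org/2007/app"
    xmlns:atom="http://www.w3.org/2005/Atom"
    xmlns:sword="http://purl.org/net/sword/"
    xmlns:dcterms="http://purl.org/dc/terms/">
  <workspace>
    <atom:title>Main Site</atom:title>
    <collection href="https://repository.example.org/sword/deposit-bio-images">
      <atom:title>Biological image library</atom:title>
      <accept>application/zip</accept>
      <sword:acceptPackaging>http://purl.org/net/sword-types/METSDSpaceSIP</sword:acceptPackaging>
      <sword:treatment>Images deposited into this collection will be converted into JPEG2000 format upon ingest.</sword:treatment>
    </collection>
  </workspace>
</service>"""


def list_collections(xml_text):
    """Return the deposit URL, title, and accepted formats of each collection."""
    root = ET.fromstring(xml_text)
    collections = []
    for collection in root.findall("app:workspace/app:collection", NS):
        collections.append({
            "href": collection.get("href"),
            "title": collection.findtext("atom:title", namespaces=NS),
            "accepts": [a.text for a in collection.findall("app:accept", NS)],
            "packaging": [
                p.text for p in collection.findall("sword:acceptPackaging", NS)
            ],
            "treatment": collection.findtext("sword:treatment", namespaces=NS),
        })
    return collections


collections = list_collections(SAMPLE_SERVICE_DOCUMENT)
for entry in collections:
    print(entry["title"], "->", entry["href"], entry["accepts"])
```

A generic client would typically present these collections to the user and refuse to send a package whose MIME type or packaging format is not in the accepted lists.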

Listing 3 and Listing 4 show typical deposit requests and responses. The request is to deposit a package into a specific collection, and the response details what the identifier of the created item is and echoes back details about the item.

Listing 3. An example SWORD deposit HTTP request header
POST /sword/deposit-bio-images HTTP/1.1 
Content-Type: application/zip 
User-Agent: SWORD client XYZ 
Authorization: Basic
Content-Length: 47423

Listing 4 shows an example of a SWORD deposit response.

Listing 4. An example of a SWORD deposit response
HTTP/1.1 201 Created
Date: Mon, 4 October 2010 18:00:00
Content-Length: 2434 
Content-Type: application/atom+xml; charset="utf-8" 

<?xml version="1.0"?>
     <entry xmlns=""       
          <title>My Deposit</title>
               <name>Stuart Lewis</name>
          <content type="text/html"         
          <link rel="edit-media" href="http://"/>    
          <link rel="edit"
href="" /> 
          <sword:userAgent>SWORD client XYZ</sword:userAgent>           
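After a successful deposit, a client usually wants the links that the repository returned for the new item. The sketch below extracts them from a complete sample Atom entry modeled on Listing 4. The URLs, the identifier, and the helper name are hypothetical; the Atom namespace and the `edit` and `edit-media` link relations come from the Atom and AtomPub standards.

```python
import xml.etree.ElementTree as ET

NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "sword": "http://purl.org/net/sword/",
}

# Illustrative deposit response with hypothetical URLs.
SAMPLE_ENTRY = """<entry xmlns="http://www.w3.org/2005/Atom"
    xmlns:sword="http://purl.org/net/sword/">
  <title>My Deposit</title>
  <id>https://repository.example.org/items/1234</id>
  <author><name>Stuart Lewis</name></author>
  <link rel="edit-media" href="https://repository.example.org/items/1234/package.zip"/>
  <link rel="edit" href="https://repository.example.org/items/1234.atom"/>
  <sword:userAgent>SWORD client XYZ</sword:userAgent>
</entry>"""


def parse_deposit_response(xml_text):
    """Return the identifier, title, and links of a deposited item."""
    entry = ET.fromstring(xml_text)
    links = {
        link.get("rel"): link.get("href")
        for link in entry.findall("atom:link", NS)
    }
    return {
        "id": entry.findtext("atom:id", namespaces=NS),
        "title": entry.findtext("atom:title", namespaces=NS),
        "author": entry.findtext("atom:author/atom:name", namespaces=NS),
        "links": links,
    }


receipt = parse_deposit_response(SAMPLE_ENTRY)
print(receipt["id"])
print(receipt["links"])
```

In AtomPub, the `edit` link addresses the entry describing the item and the `edit-media` link addresses the deposited file itself, which is what allows later PUT or DELETE requests against the same resource.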

The SWORD protocol has evolved through several versions since its inception. The current version of the standard is 1.3, and in 2010 further funding from JISC enabled the start of an initiative to develop a new major version of the standard.


Conclusion

Open repositories are starting to make an impact on the world of scholarly communication. Through the use of open standards for interoperability, new tools and systems are being created that allow researchers to get their research into open repositories, giving their work more visibility than ever before. Researchers and non-researchers alike are able to more easily find and access articles about their chosen subject. The open access movement is seeking to allow the taxpayer to have free and immediate access to the results of the research that their taxes have funded.

Change takes time, and in an environment such as scholarly communication that has roots going back hundreds of years, it is understandable that scholars may be wary of these changes. Managers of open repositories have had mixed success in populating their repositories. Some, such as arXiv, are a resounding success, while others have found it difficult to persuade their researchers to deposit research articles. However, the growth data and repository maps show that open repositories are now mainstream. Commercial publishers are having to create open-access-friendly policies, and there are stable and mature software packages to provide repository platforms.

Such openness of research and research data can only be a good thing, speeding up the discovery of new and world-changing technologies.

John Casey (Digitalinsite) and Gareth Waller (AGW Software) developed the original outlines for the series of articles.


