Research is the endeavor of seeking new knowledge and improvements in life and society. It covers subjects as diverse as physics or history, genetics or kinetics, the desire to create new construction materials or to understand the human mind. The world of scholarly communication sits at the heart of research. It allows researchers to disseminate their findings to others through the use of published journal articles, books, conferences, or new media formats. Without this dissemination, the value of the research could be unrealized.
These methods of dissemination have powered research and research institutions such as universities, government laboratories, and charitable research trusts for hundreds of years. Royal Societies were some of the first to publish collections of research, followed by academic societies.
With over $300 billion spent on research and development annually in the U.S. alone (see Resources), the effective dissemination of research and research data is of increasing importance to those who fund the research.
One important element of the scholarly communication system is the use of peer-review. Before a paper is published, it is often subjected to peer-review where it is independently and anonymously scrutinized by a panel of experts in the field. This ensures the research findings are rigorous and relevant. Research outputs that have been subjected to peer-review are naturally held in higher esteem than those that are not.
But is the world of scholarly communication performing as well as it could?
Open repositories, shared standards, and common interoperability protocols are required to collect, preserve, and provide access to scholarly research. The key word in all of these aspects is open. Through the use of open standards and open systems, the scholarly communication world can operate online in more powerful ways than ever before, allowing research to become more accessible and reusable.
One of the buzzwords of the past few years is open:
- open platforms
- open architectures
- open standards
- open source
It is the same with scholarly communication. A plethora of open philosophies, systems, and standards has evolved that many people believe can improve the way the system works.
While existing for many years, the open access movement has taken off over the past 15 years. As the introduction of the printing press allowed knowledge to be spread more rapidly than at any time previously, the rapid development and uptake of the World Wide Web holds the possibility of a similarly large impact on the way research findings are disseminated. However, many people in the field of research believe that this potential is not being realized because research is often locked up inside publications.
Research is often published through journals or periodicals that require subscription to access and read the contents. Libraries within research institutions spend large amounts of money on subscriptions to journals so that their researchers can have access to the published research findings. Therefore the research institution pays twice to access the research:
- The institution pays the researcher to undertake and to publish the research.
- The institution pays the publisher to access the published copy of the research.
As much of the research is funded either by the taxpayer via grants from government-funded research programs or by charitable research trusts, many believe that it is unfair, or even wrong, that the research must be paid for twice, with commercial publishers making profits from the system, and the taxpayer not having free access.
This belief spawned the notion of open access publishing. Research can either be self-archived for free by researchers online in open repositories, or a fee can be paid to the publisher to make the research outputs freely available without the need for a subscription.
Understandably there are disagreements within the research and publishing communities about open access and whether it is a good thing, how it should be funded, and whether it should be compulsory. However, many research funders think that funded research should be available for free and require the outputs of any research that they fund to be available in this way (see Resources).
Whatever your beliefs in this area, there are interesting technical challenges relating to collecting, storing, preserving, transferring, and providing access to research in an open fashion. Furthermore, over the past decade, interest has grown in unlocking the research that is often hidden in unpublished works such as electronic theses and dissertations, data sets, and the so-called gray literature. Placing these works online unlocks the research for access and use.
Open repositories were born from a desire to make research materials available online. In essence, an open repository is simply a database-driven website of files and descriptive metadata, but to perform effectively it usually needs to be able to:
- Ingest and store materials.
- Accurately describe contained materials.
- Manage materials and their descriptions.
- Preserve materials over a long period of time.
- Disseminate stored materials.
These abilities are summarized in the Reference Model for an Open Archival Information System (OAIS) ISO standard (see Resources). In this model, items are created by producers, used by consumers, and managed by the system. Open repositories provide these three core functions.
To describe their contents, repositories make use of metadata, that is, data about the data contained. There are several types of metadata used for different purposes. The most obvious is descriptive metadata that provides a description of the item itself. This typically includes fields such as title, creator, date, description, or publisher. Metadata needs to be encoded within an open standard to ensure that it can be read and understood. Commonly-used standards for descriptive metadata include Dublin Core and MODS (Metadata Objects Description Schema) (see Resources).
In addition to descriptive metadata, other forms of metadata can be stored. For example, preservation metadata is intended to assist with preserving the item over a long period of time. Preservation metadata might include the file format of the item, the version of software required to view the item, the size of the item, the checksum of the item's files, and the software used to create the item.
As well as requiring an accurate description of an item, preservation depends on other activities that need to be undertaken by a repository. Some of these activities are simple, such as the regular checking of the file checksums of items to ensure that they are not suffering from bit-rot. Other activities may be more complex, such as migrating file formats when old formats become obsolete, or performing automatic file identification to classify file types.
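The regular checksum check described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular repository platform implements it: the function names, the choice of SHA-256, and the in-memory dictionary standing in for stored preservation metadata are all assumptions.

```python
import hashlib
import tempfile
from pathlib import Path


def file_checksum(path, algorithm="sha256", chunk_size=65536):
    """Return the hex digest of a file, reading it in chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_changed_files(recorded):
    """Return the paths whose current checksum differs from the recorded one.

    `recorded` maps a file path to the checksum kept in the preservation
    metadata (a hypothetical in-memory stand-in for that metadata).
    """
    return [path for path, expected in recorded.items()
            if file_checksum(path) != expected]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        item = Path(tmp) / "thesis.pdf"
        item.write_bytes(b"original bytes")
        recorded = {str(item): file_checksum(item)}
        print(find_changed_files(recorded))   # checked straight after ingest
        item.write_bytes(b"corrupted bytes")  # simulate bit-rot
        print(find_changed_files(recorded))   # the damaged file is reported
```

A real repository would run such a check on a schedule and record each result, so that a damaged file can be restored from a backup copy.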
For the dissemination of items, an important role played by the repository is the provision of persistent URLs. Persistent URLs are designed to ensure that anyone citing an item by its URL will be able to retrieve the item many years later by using the same URL. The provision of persistent URLs comes from two layers. The first and most important layer is the use of identifiers that uniquely identify a single item in the repository. An optional second layer employed by many repositories is the use of a third party persistent identifier service. For example, the DSpace open source repository platform typically makes use of the Corporation for National Research Initiatives' (CNRI) "Handle" system, although there are alternatives such as the Persistent-URL (PURL) system. These services work by employing an extra level of indirection. The PURL points to a third-party domain that in turn resolves the URL to the repository URL.
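For example, the persistent URL http://hdl.handle.net/2292/5315 redirects users to the current URL of the item in the repository.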
The persistent URL is made up of three parts:
- http://hdl.handle.net/: The URL of the persistent URL service
- /2292/: The identifier of the repository
- 5315: The identifier of the item within the repository
The aim of the persistent URL services is that if the repository changes in any way, for example, the software changes, the domain name changes, or the identifiers change, then the persistent URL handler can be updated to ensure that it redirects users to the new URLs.
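The extra level of indirection can be illustrated with a short Python sketch. It splits the example persistent URL into its three parts and looks the handle up in a resolver table. The table and the repository.example.org target URLs are hypothetical stand-ins for what a persistent identifier service stores, not part of the Handle system's actual interface.

```python
from urllib.parse import urlparse


def split_handle_url(persistent_url):
    """Split a Handle-style persistent URL into its three parts."""
    parsed = urlparse(persistent_url)
    service = f"{parsed.scheme}://{parsed.netloc}/"
    repository_id, item_id = parsed.path.strip("/").split("/", 1)
    return service, repository_id, item_id


# Hypothetical resolver table: handle -> current URL of the item.
RESOLVER_TABLE = {
    "2292/5315": "http://repository.example.org/handle/2292/5315",
}


def resolve(persistent_url, table=RESOLVER_TABLE):
    """Return the current repository URL registered for a persistent URL."""
    _, repository_id, item_id = split_handle_url(persistent_url)
    return table[f"{repository_id}/{item_id}"]


print(split_handle_url("http://hdl.handle.net/2292/5315"))
# ('http://hdl.handle.net/', '2292', '5315')
print(resolve("http://hdl.handle.net/2292/5315"))
```

If the repository moves, only the entry in the resolver table changes. Every citation that uses the persistent URL keeps working.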
The history of open repositories
Arguably, the first and most renowned open repository is the arXiv.org (pronounced "archive") repository of scientific pre-prints (see Resources). It was created in 1991 and falls within the classification of a subject repository, as it only holds items related to physics, math, computer science, and related subjects. arXiv.org holds over half a million pre-prints, which are articles that have been written but have not yet been subjected to peer-review or formally accepted for publication in a traditional journal or conference. In fast-moving research environments, researchers want to share their work in this way to circumvent the time it takes for traditional publishing to take place.
There are several other noted subject repositories for different disciplines. These include RePEc (Research Papers in Economics) and E-LIS (Eprints in Library and Information Science). See the Resources section at the end of this article for links.
Following wider interest in creating open repositories, an open source repository platform called EPrints was created in 2000. This software, developed at the University of Southampton School of Electronics and Computer Science, is written in Perl and is still one of the leading platforms used for creating open repositories.
In 2002, a collaboration between Hewlett Packard Research Laboratories and MIT released the open source DSpace repository platform. DSpace was developed in Java and JSPs, while more recent versions also include a Cocoon and XSLT user interface.
Another major player in the open source repository platform world is Flexible Extensible Digital Object Repository Architecture (Fedora), which was originally developed by Cornell University's Digital Library Research Group. Fedora differs from EPrints and DSpace by not having a full end-user interface included with the core platform. Consequently there are several projects or groups that develop and support different user interfaces.
A more recent entrant to the open source repository arena is Microsoft® with their Zentity repository in 2008. The platform is built on the Microsoft technology stack including .Net and SQL Server.
The DSpace, EPrints, Fedora, and Zentity open source repositories are well maintained and backed by foundations or commercial services. In 2007, the DSpace Foundation and Fedora Commons were created as not-for-profit organizations to ensure the ongoing development and sustainability of these platforms. In 2009, the DSpace Foundation and Fedora Commons merged to create the Duraspace organization and are seeking ways to enable the two platforms to work closely together. EPrints runs a commercial service providing hosting, development, customization, and integration of their software.
In addition to the open source repository platforms, there are a number of commercial options. The largest provider of a commercial open repository system is BEPress with their Digital Commons product. There are hosted solutions based on the open source platforms, such as BioMed Central's Open Repository based on DSpace, and EPrint Services' hosted EPrints system.
The majority of the EPrints, DSpace, and Fedora installations could be classified as institutional repositories. These are typically provided by a research institute, university, or department for use by their researchers.
Another type of repository that is becoming more popular is the learning object repository. These hold copies of learning objects, or modules of information that can teach a particular skill or subject. Universities put a lot of effort into creating learning materials, and as with research, there are many good arguments for sharing these rather than keeping them locked up in a particular institution. Good examples of learning object or courseware repositories include MIT's OpenCourseWare site or Apple's iTunes U.
There are two websites that track the growth of open repositories:
- ROAR: Repository of Open Access Repositories
- OpenDOAR: Open Directory of Open Access Repositories
Figure 1 is a graph from ROAR that shows how the number of repositories has grown from humble beginnings in the 1990s, to a rapid increase over the last decade.
Figure 1. The growth of repositories and the records they contain (from ROAR)
Figure 2 shows the map from the Repository66 website, which is a mashup of data from ROAR and OpenDOAR, and a Google map to display the distribution and type of repositories across the globe. It currently displays over 1,650 repositories, which contain more than 27 million items.
Figure 2. The distribution and types of repositories across the world according to repository66.org
To operate effectively together, open repositories require open standards. There are open standards for each of the common operations that can be undertaken with repositories. Such operations include harvesting, searching, depositing, authentication, and describing contents. Two particular open standards that have become core to repository interoperability are Open Archives Initiative's Protocol for Metadata Harvesting and SWORD.
Open Archives Initiative Protocol for Metadata Harvesting
The oldest and most widely known of the repository-related open standards is the Open Archives Initiative's Protocol for Metadata Harvesting (OAI-PMH). This protocol allows the contents of open repositories to be harvested by other systems. OAI-PMH provides a method by which larger aggregating search systems can provide a federated search service across multiple repositories. Rather than having to rely on screen-scraping, OAI-PMH allows search providers to harvest the raw structured metadata from repositories. This can yield more powerful search mechanisms as data is harvested in specific fields such as title, creator, abstract, and keywords.
OAI-PMH is an XML-based protocol that is based on a number of verbs. These verbs, along with extra arguments, are used to instruct the repository to describe its contents.
Identify: The repository can provide identity information about itself, including attributes such as its name, URL, contact email address and what options the interface supports.
ListMetadataFormats: OAI-PMH interfaces can expose the metadata of the contained items in different metadata formats or standards. This verb will list the metadata formats that the repository supports. Requests using other verbs can then specify the metadata format.
ListSets: A repository can partition its items into sets. A set can be analogous to a particular collection. This is useful if only a subset of a repository needs to be harvested. The ListSets verb lists all of the sets contained within the repository. Requests using other verbs can state which set to harvest.
ListRecords: This verb provides one of the main ways to harvest data. It lists all of the records in the repository that match the parameters passed. The metadataPrefix parameter (containing a value from the ListMetadataFormats response) must be given to state which metadata format the metadata should be expressed in. Optional parameters can be used to refine the harvest, including from and until dates, and a particular set. Harvesters will typically perform a full harvest, and then incremental harvests periodically, making use of the from parameter to only harvest recently added items.
GetRecord: It is possible to retrieve a single item using OAI-PMH by using this verb and specifying the identifier of the item to retrieve.
ListIdentifiers: This verb is identical to the ListRecords verb, except that only the identifiers of matching records are returned rather than their full records. This method is sometimes used if a harvester wants to harvest items individually. They can first get a list of items to harvest, and then retrieve each one individually using the GetRecord verb.
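As a rough sketch of how a harvester might combine these verbs and arguments, the following Python builds OAI-PMH request URLs without sending them. The helper name oai_request_url is an assumption made for illustration. The base URL and item identifier are the ones echoed in Listing 1, and the date is an arbitrary example.

```python
from urllib.parse import urlencode

# Base URL of the OAI-PMH interface, as echoed in Listing 1.
BASE_URL = "http://researchspace.auckland.ac.nz/dspace-oai/request"


def oai_request_url(base_url, verb, **arguments):
    """Build an OAI-PMH request URL from a verb and its arguments.

    'from' is a reserved word in Python, so callers pass 'from_' and the
    trailing underscore is removed here.
    """
    params = {"verb": verb}
    for name, value in arguments.items():
        if value is not None:
            params[name.rstrip("_")] = value
    return f"{base_url}?{urlencode(params)}"


# Describe the repository and list the metadata formats it supports.
print(oai_request_url(BASE_URL, "Identify"))
print(oai_request_url(BASE_URL, "ListMetadataFormats"))

# Full harvest, then an incremental harvest of records changed since a date.
print(oai_request_url(BASE_URL, "ListRecords", metadataPrefix="oai_dc"))
print(oai_request_url(BASE_URL, "ListRecords", metadataPrefix="oai_dc",
                      from_="2010-01-01"))

# Retrieve a single record by its identifier.
print(oai_request_url(BASE_URL, "GetRecord", metadataPrefix="oai_dc",
                      identifier="oai:researchspace.auckland.ac.nz:2292/5315"))
```

The sketch leaves out the resumption tokens that repositories use to split large result lists across several responses. A real harvester needs to follow them.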
Listing 1 shows an example response to an OAI-PMH GetRecord request for a single item. The request URL and its arguments are echoed back in the request element of the response.
The listing shows the two sections of a typical OAI-PMH response. The first is the header, which echoes back details of the request and the time that it was made. The second section contains the response to the requested action. In this case it is a single record, split into its header, which describes the item (its identifier, when it was last modified, and which sets it belongs to), and the metadata in the requested oai_dc format.
Listing 1. An example OAI-PMH response to a GetRecord request
<?xml version="1.0" encoding="UTF-8" ?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/
                             http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
  <responseDate>2010-10-09T07:55:12Z</responseDate>
  <request identifier="oai:researchspace.auckland.ac.nz:2292/5315"
           metadataPrefix="oai_dc"
           verb="GetRecord">http://researchspace.auckland.ac.nz/dspace-oai/request</request>
  <GetRecord>
    <record>
      <header>
        <identifier>oai:researchspace.auckland.ac.nz:2292/5315</identifier>
        <datestamp>2009-10-13T11:31:12Z</datestamp>
        <setSpec>hdl_2292_125</setSpec>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/"
                   xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                   xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/
                                       http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
          <dc:title>If SWORD is the answer, what is the question? Use of the
            Simple Web-service Offering Repository Deposit protocol</dc:title>
          <dc:creator>Lewis, Stuart</dc:creator>
          <dc:creator>Hayes, Leonie</dc:creator>
          <dc:creator>Newton-Wade, Vanessa</dc:creator>
          <dc:creator>Corfield, Antony</dc:creator>
          <dc:creator>Davis, Richard</dc:creator>
          <dc:creator>Wilson, Scott</dc:creator>
          <dc:description>Purpose - To describe the repository deposit protocol,
            Simple Web-service Offering Repository Deposit (SWORD), its
            development iteration, and some of its potential use cases. In
            addition, seven case studies of institutional use of SWORD are
            provided. Approach - The paper describes the recent development
            cycle of the SWORD standard, with issues being identified and
            overcome with a subsequent version. Use cases and case studies of
            the new standard in action are included to demonstrate the wide
            range of practical uses of the SWORD standard.
          </dc:description>
          <dc:publisher>Emerald</dc:publisher>
          <dc:date>2009</dc:date>
          <dc:type>Journal Article</dc:type>
          <dc:identifier>Program: electronic library and information systems 43 (4), 407-418. (2009)</dc:identifier>
          <dc:identifier>0033-0337</dc:identifier>
          <dc:identifier>http://hdl.handle.net/2292/5315</dc:identifier>
          <dc:identifier>10.1108/00330330910998057</dc:identifier>
          <dc:language>en</dc:language>
          <dc:relation>Program: electronic library and information systems</dc:relation>
          <dc:rights>Items in ResearchSpace are protected by copyright, with all
            rights reserved, unless otherwise indicated. Previously published
            items are made available in accordance with the copyright policy of
            the publisher.</dc:rights>
          <dc:rights>http://researchspace.auckland.ac.nz/docs/uoa-docs/rights.htm</dc:rights>
        </oai_dc:dc>
      </metadata>
    </record>
  </GetRecord>
</OAI-PMH>
A useful feature for developers harvesting from OAI-PMH interfaces is that every repository must be able to return metadata for all items in the oai_dc format. This format returns unqualified Dublin Core metadata. Dublin Core is a relatively simple metadata schema made up of 15 elements including title, creator, description, and date. Because support for this format is mandatory, a harvester can harvest content from any repository.
While OAI-PMH offers a standardized way to harvest the contents of repositories, SWORD offers a standardized way to perform deposits of resources into repositories (see Resources for more information). SWORD is an acronym that stands for Simple Web-service Offering Repository Deposit. The standard was first developed in 2007 by a consortium of UK universities with funding from the UK's Joint Information Systems Committee (JISC).
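To show what a harvester might do with a response such as Listing 1, the sketch below parses an abbreviated copy of it using Python's standard library. The function name parse_get_record and the shape of the returned dictionary are illustrative choices. The XML is trimmed from Listing 1 to keep the example short.

```python
import xml.etree.ElementTree as ET

# Namespace prefixes used in the XPath-style lookups below.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Abbreviated copy of the response in Listing 1.
RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2010-10-09T07:55:12Z</responseDate>
  <GetRecord>
    <record>
      <header>
        <identifier>oai:researchspace.auckland.ac.nz:2292/5315</identifier>
        <datestamp>2009-10-13T11:31:12Z</datestamp>
        <setSpec>hdl_2292_125</setSpec>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:creator>Lewis, Stuart</dc:creator>
          <dc:creator>Hayes, Leonie</dc:creator>
          <dc:publisher>Emerald</dc:publisher>
          <dc:date>2009</dc:date>
        </oai_dc:dc>
      </metadata>
    </record>
  </GetRecord>
</OAI-PMH>
""".strip()


def parse_get_record(xml_text):
    """Extract the record header and Dublin Core fields from a GetRecord response."""
    root = ET.fromstring(xml_text)
    record = root.find("oai:GetRecord/oai:record", NS)
    header = record.find("oai:header", NS)
    dublin_core = record.find("oai:metadata/oai_dc:dc", NS)
    fields = {}
    for element in dublin_core:
        name = element.tag.split("}", 1)[1]  # drop the namespace part of the tag
        fields.setdefault(name, []).append((element.text or "").strip())
    return {
        "identifier": header.findtext("oai:identifier", namespaces=NS),
        "datestamp": header.findtext("oai:datestamp", namespaces=NS),
        "sets": [s.text for s in header.findall("oai:setSpec", NS)],
        "metadata": fields,
    }


record = parse_get_record(RESPONSE)
print(record["identifier"])
print(record["metadata"]["creator"])
```

Dublin Core elements can repeat, as the creator element does here, so each field is collected into a list rather than a single value.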
SWORD is a specialized profile of the AtomPub standard (see Resources) that provides a common protocol for creating web resources. The SWORD specification adds new extensions that allow it to fit with the requirements of repositories. These include the ability to perform a mediated deposit on behalf of another user, and to specify not only the MIME type of the file being deposited, but also the packaging format used to create the file being deposited.
AtomPub and SWORD interfaces provide two common elements in order to facilitate deposit:
- Service Document: Each repository or AtomPub endpoint publishes a service document that describes to a user or client tool which areas of the repository or website they can deposit into, what the policies of that collection are, and the URL required to perform deposits.
- Deposit URL: The deposit URLs described in the service document are used to accept deposits into the repository. Deposits may be accepted automatically or may be subject to administrative workflow. Responses to deposits are returned in the form of an Atom Document.
AtomPub is built around HTTP verbs, with GET being used to retrieve a service document, POST for creating new resources, PUT to update existing resources, and DELETE to remove them.
Requests for service documents and deposits of new resources are typically controlled by an authentication mechanism. This ensures that the service document only lists the collections into which a user can deposit items, and that the repository deposit URL knows who is making the deposit and ensures that they have the authorization to do so. SWORD interfaces typically use HTTP basic authentication.
In contrast to AtomPub, where a deposit may be a simple file such as an image posted into a blog entry, repositories typically require a more complex deposit package containing descriptive metadata along with the file(s) to deposit. Presently, while many packaging formats exist, they tend to be specific to each repository platform or to particular resource genres. There is no specific packaging format that all SWORD end-points must accept, which is sometimes cited as a barrier to the use of SWORD.
Just as users require web browsers to interact with web servers, SWORD clients are required to interact with repository SWORD end-points. SWORD clients are usually either custom built for a specific purpose or repository, or are more generic for use with any repository. Specific-purpose clients may be developed for very specialized purposes, such as allowing automated laboratory equipment to deposit data files into a repository. Examples of more generic clients include a Facebook client for depositing from within Facebook and posting details of the deposit onto a user's news feed.
Repositories may be quite specific about the types of resources that they will accept for deposit, and these requirements are described by the service document. Listing 2 shows an example response to the request for a service document. In this example, there is only one collection available into which the user may deposit, and that collection will only accept deposits in the form of packages made up of a ZIP file containing a METS metadata manifest along with the files.
Listing 2. An example SWORD service document
<?xml version="1.0" encoding='utf-8'?>
<service xmlns="http://www.w3.org/2007/app"
         xmlns:atom="http://www.w3.org/2005/Atom"
         xmlns:sword="http://purl.org/net/sword/"
         xmlns:dcterms="http://purl.org/dc/terms/">
  <sword:level>1</sword:level>
  <sword:verbose>true</sword:verbose>
  <sword:noOp>true</sword:noOp>
  <workspace>
    <atom:title>Main Site</atom:title>
    <collection href="http://repository.example.com/sword/deposit-bio-images">
      <atom:title>Biological image library</atom:title>
      <accept>application/zip</accept>
      <dcterms:abstract>This is a collection that allows deposits into the
        collection of biological images.</dcterms:abstract>
      <sword:mediation>true</sword:mediation>
      <sword:treatment>Images deposited into this collection will be converted
        into JPEG2000 format upon ingest.</sword:treatment>
      <sword:packaging>
        http://purl.org/net/sword-types/METSDSpaceSIP
      </sword:packaging>
    </collection>
  </workspace>
</service>
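A client can read a service document such as Listing 2 to discover where and what it may deposit. The following Python sketch does so for an abbreviated copy of the listing. The function name list_collections and the dictionary keys are assumptions made for illustration, and the element names simply follow those shown in Listing 2.

```python
import xml.etree.ElementTree as ET

# Namespace prefixes declared in the service document.
NS = {
    "app": "http://www.w3.org/2007/app",
    "atom": "http://www.w3.org/2005/Atom",
    "sword": "http://purl.org/net/sword/",
}

# Abbreviated copy of the service document in Listing 2.
SERVICE_DOCUMENT = """
<service xmlns="http://www.w3.org/2007/app"
         xmlns:atom="http://www.w3.org/2005/Atom"
         xmlns:sword="http://purl.org/net/sword/">
  <workspace>
    <atom:title>Main Site</atom:title>
    <collection href="http://repository.example.com/sword/deposit-bio-images">
      <atom:title>Biological image library</atom:title>
      <accept>application/zip</accept>
      <sword:mediation>true</sword:mediation>
      <sword:packaging>
        http://purl.org/net/sword-types/METSDSpaceSIP
      </sword:packaging>
    </collection>
  </workspace>
</service>
""".strip()


def list_collections(xml_text):
    """Return the deposit URL and acceptance rules of each collection."""
    root = ET.fromstring(xml_text)
    collections = []
    for collection in root.iterfind("app:workspace/app:collection", NS):
        mediation = collection.findtext(
            "sword:mediation", default="false", namespaces=NS)
        collections.append({
            "deposit_url": collection.get("href"),
            "title": collection.findtext("atom:title", namespaces=NS),
            "accepts": [a.text.strip()
                        for a in collection.findall("app:accept", NS)],
            "packaging": [p.text.strip()
                          for p in collection.findall("sword:packaging", NS)],
            "mediation": mediation.strip() == "true",
        })
    return collections


for collection in list_collections(SERVICE_DOCUMENT):
    print(collection["title"], collection["deposit_url"])
    print(collection["accepts"], collection["packaging"])
```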
Listing 3 and Listing 4 show typical deposit requests and responses. The request is to deposit a package into a specific collection, and the response details what the identifier of the created item is and echoes back details about the item.
Listing 3. An example SWORD deposit HTTP request header
POST /sword/deposit-bio-images HTTP/1.1
Host: repository.example.com
Content-Type: application/zip
User-Agent: SWORD client XYZ
Authorization: Basic
Content-Length: 47423
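The request in Listing 3 could be assembled in Python roughly as follows. This sketch only builds the request object and does not send it. The function name, the placeholder credentials, and the package bytes are hypothetical. It sets only the headers shown in Listing 3, so any SWORD-specific headers a real end-point expects are left out.

```python
import base64
import urllib.request


def build_deposit_request(deposit_url, package_bytes, username, password,
                          user_agent="SWORD client XYZ"):
    """Build (but do not send) an HTTP POST that deposits a ZIP package."""
    # HTTP basic authentication: base64 of "username:password".
    token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    return urllib.request.Request(
        deposit_url,
        data=package_bytes,
        method="POST",
        headers={
            "Content-Type": "application/zip",
            "User-Agent": user_agent,
            "Authorization": "Basic " + token.decode("ascii"),
        },
    )


request = build_deposit_request(
    "http://repository.example.com/sword/deposit-bio-images",
    b"placeholder for the bytes of a ZIP package",  # hypothetical payload
    "depositor", "secret",                          # hypothetical credentials
)
print(request.get_method(), request.full_url)
print(request.header_items())
```

Passing the request to urllib.request.urlopen would send it. The Content-Length header is added at that point from the size of the data.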
Listing 4 shows an example of a SWORD deposit response.
Listing 4. An example of a SWORD deposit response
HTTP/1.1 201 Created
Date: Mon, 4 October 2010 18:00:00
Content-Length: 2434
Content-Type: application/atom+xml; charset="utf-8"
Location: http://repository.example.com/sword/deposit-bio-images

<?xml version="1.0"?>
<entry xmlns="http://www.w3.org/2005/Atom"
       xmlns:sword="http://purl.org/net/sword/">
  <title>My Deposit</title>
  <id>info:something:1</id>
  <updated>2008-08-18T14:27:08Z</updated>
  <author>
    <name>Stuart Lewis</name>
  </author>
  <content type="text/html"
           src="http://repository.example.com/sword/deposit-bio-images/167"/>
  <link rel="edit-media"
        href="http://repository.example.com/sword/deposit-bio-images/167/package.zip"/>
  <link rel="edit"
        href="http://www.myrepository.ac.uk/sword/deposit-bio-images/167.atom" />
  <sword:userAgent>SWORD client XYZ</sword:userAgent>
</entry>
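After a successful deposit, a client usually wants the identifier of the new item and the links it can use to manage it later. The sketch below pulls these out of an abbreviated copy of the Atom entry in Listing 4. The function name summarize_deposit and the returned keys are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom namespace in ElementTree form

# Abbreviated copy of the Atom entry returned in Listing 4.
ENTRY = """
<entry xmlns="http://www.w3.org/2005/Atom"
       xmlns:sword="http://purl.org/net/sword/">
  <title>My Deposit</title>
  <id>info:something:1</id>
  <content type="text/html"
           src="http://repository.example.com/sword/deposit-bio-images/167"/>
  <link rel="edit-media"
        href="http://repository.example.com/sword/deposit-bio-images/167/package.zip"/>
  <link rel="edit"
        href="http://www.myrepository.ac.uk/sword/deposit-bio-images/167.atom"/>
</entry>
""".strip()


def summarize_deposit(entry_xml):
    """Return the identifier, title, content URL, and links of a deposit."""
    entry = ET.fromstring(entry_xml)
    links = {link.get("rel"): link.get("href")
             for link in entry.findall(f"{ATOM}link")}
    return {
        "id": entry.findtext(f"{ATOM}id"),
        "title": entry.findtext(f"{ATOM}title"),
        "content_url": entry.find(f"{ATOM}content").get("src"),
        "links": links,
    }


summary = summarize_deposit(ENTRY)
print(summary["id"], summary["content_url"])
print(summary["links"]["edit-media"])
```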
The SWORD protocol has evolved since its inception through successive versions. The current version of the standard is 1.3, and in 2010 further funding from JISC enabled the start of an initiative to develop a new major version of the standard.
Open repositories are starting to make an impact on the world of scholarly communication. Through the use of open standards for interoperability, new tools and systems are being created that allow researchers to get their research into open repositories, giving their work more visibility than ever before. Researchers and non-researchers alike are able to more easily find and access articles about their chosen subject. The open access movement is seeking to allow the taxpayer to have free and immediate access to the results of the research that their taxes have funded.
Change takes time, and in an environment such as scholarly communication that has roots going back hundreds of years, it is understandable that scholars may be wary of these changes. Managers of open repositories have had mixed success in populating their repositories. Some, such as arXiv.org, are a resounding success, while others have found it difficult to persuade their researchers to deposit research articles. However, the growth data and repository maps show that open repositories are now mainstream. Commercial publishers are having to create open-access-friendly policies, and there are stable and mature software packages to provide repository platforms.
Such openness of research and research data can only be a good thing, speeding up the discovery of new and world-changing technologies.
- World Bank data page: Information on the amount of U.S. GDP spent on research and development in 2007 was calculated using data from the World Bank data page.
- SHERPA Juliet database: Visit the SHERPA Juliet database to learn about research funded open access policies.
- Open archival information system: Reference model. ISO 14721:2003 specifies a reference model for an open archival information system (OAIS). The purpose of ISO 14721:2003 is to establish a system for archiving information, both digitized and physical.
- Dublin Core metadata initiative is a commonly used standard for descriptive metadata.
- MODS is another commonly used standard for descriptive metadata.
- EPrints is an open source repository platform.
- DSpace open source software enables open sharing of content that spans organizations, continents and time.
- Fedora is another major player in the open source repository platform space.
- Zentity is a research output repository platform developed by Microsoft Research that provides a suite of building blocks, tools, and services to create and maintain an organization's digital library ecosystem.
- DuraSpace software and services are used worldwide as solutions for institutional repositories, open access publishing, digital libraries, digital archives, digital collections, data curation, virtual research environments, and more.
- BioMed Central is an STM (Science, Technology and Medicine) publisher which has pioneered the open access publishing model.
- The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) provides an application-independent interoperability framework based on metadata harvesting.
- Learn about the Simple Web-service Offering Repository Deposit (SWORD).
- "The Atom Publishing Protocol (AtomPub)", RFC 5023, is an application-level protocol for publishing and editing web resources.
- arXiv.org (pronounced "archive") is a repository of scientific pre-prints.
- Other well-known subject repositories for different disciplines include RePEc (Research Papers in Economics) and E-LIS (Eprints in Library and Information Science).
- Visit MIT's OpenCourseWare site or Apple's iTunes U.
- The aim of the Repository of Open Access Repositories (ROAR) is to promote the development of open access by providing timely information about the growth and status of repositories throughout the world.
- Open Directory of Open Access Repositories (OpenDOAR) is an authoritative directory of academic open access repositories.
- The Repository66 website is a mashup of data from ROAR and OpenDOAR, and a Google map to display the distribution and type of repositories across the globe.