Life Sciences Identifier (LSID) is a new naming standard and data-access protocol being developed by the Interoperable Informatics Infrastructure Consortium (I3C.org), with help from IBM and other technology organizations such as Oracle, Sun Microsystems, and the Massachusetts Institute of Technology. A client application resolves an LSID against a special server, called an authority, to discover data and information about the data (metadata).
The admittedly idealistic goal of LSID Resolution is for all biotech, pharmaceutical, and other life sciences organizations to build LSID Resolution Services in front of their data. With a common standard for data retrieval, scientists across these organizations may then easily share data, facilitating collaboration on such vital projects as drug discovery and disease research. The LSID Server Framework enables this LSID utopia by allowing organizations to provide their data using a service implementation that best matches their data source. Certain data sources will require only mapping from LSID to URL, if each piece of data has a URL that can retrieve it in a standard format. If the data source is a relational database, a more complex service will need to be written.
This article shows how to build Resolution Services using Java 2 Enterprise Edition (J2EE)-based components. We'll look at the LSID Client Stack, which provides LSID connectivity within applications; the LSID Server Framework, which enables rapid development and deployment of LSID Resolution Services; and select resolution service implementations. Finally, we'll see how enterprises might integrate these components to form an Enterprise LSID Resolution Network. The figures in this article illustrate the architecture of individual components as well as the interaction between them. The red text and arrows show the involvement of key Java classes.
The LSID Client Stack is a simple yet crucial component of the LSID Resolution Network. It allows life sciences applications in the network to easily consume data provided via LSID. In addition to Java, the LSID Client Stack has C++/COM and Perl implementations, allowing integration with virtually any application. The three APIs expose the same functionality using programming methodologies and design patterns specific to the host language. Figure 1 shows the LSID Client Stack embedded in an application.
Figure 1. The LSID Client Stack architecture
Given an LSID of the form urn:lsid:authority:namespace:object, the stack resolves the actual host of "authority" first against a local list of authority endpoint URLs and then against DNS if necessary. When requested, the stack retrieves the WSDL document containing the data and metadata locations via the getAvailableOperations Web service call against the resolved authority. It then parses metadata and data locations from the WSDL and uses them for retrieval via HTTP, FTP, or SOAP. These locations do not need to be on the same server as the authority, and in general they will not be. The user may choose a specific location, specify only a certain protocol, or allow the stack to perform the entire selection. This flexibility abstracts the WSDL from the user, but provides the user with more granular control if necessary.
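As an illustration of the first step, the following minimal sketch splits an LSID of the form given above into its authority, namespace, and object parts. The class name ParsedLsid and its fields are hypothetical and are not part of the LSID Client Stack API; the sketch only handles the five-part form described in this article.

```java
/** Hypothetical value class; not part of the LSID Client Stack API. */
final class ParsedLsid {
    final String authority;
    final String namespace;
    final String object;

    private ParsedLsid(String authority, String namespace, String object) {
        this.authority = authority;
        this.namespace = namespace;
        this.object = object;
    }

    /** Splits urn:lsid:authority:namespace:object into its three variable parts. */
    static ParsedLsid parse(String lsid) {
        // A limit of -1 keeps trailing empty parts so they can be rejected below.
        String[] parts = lsid.split(":", -1);
        if (parts.length != 5
                || !parts[0].equalsIgnoreCase("urn")
                || !parts[1].equalsIgnoreCase("lsid")) {
            throw new IllegalArgumentException("Not an LSID of the expected form: " + lsid);
        }
        for (int i = 2; i < parts.length; i++) {
            if (parts[i].isEmpty()) {
                throw new IllegalArgumentException("Empty LSID component in: " + lsid);
            }
        }
        return new ParsedLsid(parts[2], parts[3], parts[4]);
    }
}
```

A caller could then use the authority field as the key for the local endpoint list before falling back to DNS.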
The client stack contains a file-based caching module that drastically improves response times for repeated data, metadata, and WSDL requests. When the user makes a request, the client stack first checks the file cache for a response before going over the network. After a network request has been completed, the stack writes the response to the cache. The lifecycle of a cache item is governed by local policy and response expiration. The local policy defines how long an item may live and the maximum size of a cache directory. In addition, metadata services and authorities may return expiration headers to advise the client of the expiration of metadata and WSDL. The data itself is immutable, per the LSID specification, so cached data never expires but may be removed by local cache policy enforcement.
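The cache lifecycle described above can be sketched roughly as follows. This is not the client stack's implementation: it is an in-memory stand-in for the file-based cache, the class and method names are hypothetical, and the maximum-directory-size part of the local policy is left out. Time is passed in explicitly so the behavior is easy to follow.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory stand-in for the client stack's file cache. */
final class LsidCacheSketch {
    /** Expiration value for LSID data, which is immutable and never expires on its own. */
    static final long NEVER = Long.MAX_VALUE;

    private static final class Entry {
        final byte[] body;
        final long storedAt;   // when the response was cached
        final long expiresAt;  // expiration advised by the server, or NEVER

        Entry(byte[] body, long storedAt, long expiresAt) {
            this.body = body;
            this.storedAt = storedAt;
            this.expiresAt = expiresAt;
        }
    }

    private final Map<String, Entry> entries = new HashMap<>();
    private final long maxAgeMillis; // local policy: how long any item may live

    LsidCacheSketch(long maxAgeMillis) {
        this.maxAgeMillis = maxAgeMillis;
    }

    /** Called after a network request completes. */
    void put(String key, byte[] body, long now, long serverExpiresAt) {
        entries.put(key, new Entry(body, now, serverExpiresAt));
    }

    /** Called before going over the network; returns null on a miss. */
    byte[] get(String key, long now) {
        Entry e = entries.get(key);
        if (e == null) {
            return null;
        }
        boolean expiredByServer = now >= e.expiresAt;
        boolean expiredByPolicy = now - e.storedAt > maxAgeMillis;
        if (expiredByServer || expiredByPolicy) {
            entries.remove(key);
            return null;
        }
        return e.body;
    }
}
```

Note that an entry stored with NEVER can still be evicted by the local age limit, which mirrors the statement that cached data never expires but may be removed by local policy.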
Building LSID Resolution Services from scratch can be difficult and time-consuming. Furthermore, the work of parsing requests and marshalling responses is common to all implementations. The LSID Server Framework facilitates development of LSID services by providing common protocol handling on the server and separating the LSID protocol from service implementation.
The system has three components: the Authority Service, the Data Service, and the Metadata Service. Each component defines a Java interface containing the applicable methods. In the case of the Authority Service, the methods are:
- getKnownURIs returns a list of LSIDs that the authority knows about.
- getAvailableOperations returns a WSDL document describing locations at which data and metadata for the LSID may be retrieved.
Note that the third authority operation, getAuthorityVersion, is not included in this interface, because the authority version refers to the version of the protocol. The protocol version is abstracted from the service implementation, because it is based on hidden details such as HTTP/SOAP headers and SOAP return types. The Data Service and the Metadata Service interfaces contain methods for retrieving data (getData) and metadata (getMetaData).
Each service is driven by a corresponding HTTP servlet: AuthorityServlet, DataServlet, and MetaDataServlet. Each servlet parses the method, arguments, and target LSID (which may be a SOAP parameter or part of the URL) from a request. Using the target LSID, the servlet looks in the service registry to determine which configured service implementation to invoke. The lookup procedure compares the authority and namespace components of the LSID to mappings in the registry. For example, urn:lsid:pdb.org:pdb:1aft might be handled by PDBAuthorityImpl, whereas urn:lsid:swiss-prot.org:swiss-id:hv20_mouse-sprot might be handled by a separate Swiss-Prot authority implementation.
AuthorityServlet exposes its operations only by HTTP SOAP. For getAuthorityVersion, the servlet immediately returns the version of the protocol it is using. In the current release, the protocol version is 3. For getKnownURIs, the servlet invokes getKnownURIs on each authority implementation for which a mapping is registered, to provide a complete list of all LSIDs it knows about. For getAvailableOperations, the servlet uses the LSID argument to look up the authority implementation from which to retrieve a WSDL of data and metadata locations.
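The registry lookup and the getKnownURIs aggregation can be pictured with a small sketch. The interface and class names below are hypothetical stand-ins, not the framework's actual types, and the method signatures are assumptions; the real getAvailableOperations returns WSDL, which is represented here as a plain String.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for the framework's Authority Service interface. */
interface AuthorityServiceSketch {
    List<String> getKnownURIs();
    String getAvailableOperations(String lsid); // WSDL text in the real framework
}

/** Hypothetical registry keyed by the authority and namespace parts of an LSID. */
final class ServiceRegistrySketch {
    private final Map<String, AuthorityServiceSketch> mappings = new LinkedHashMap<>();

    void register(String authority, String namespace, AuthorityServiceSketch impl) {
        mappings.put(authority + ":" + namespace, impl);
    }

    /** Returns the implementation mapped to the LSID, or null if none is registered. */
    AuthorityServiceSketch lookup(String lsid) {
        String[] parts = lsid.split(":", -1);
        if (parts.length < 5) {
            throw new IllegalArgumentException("Not an LSID: " + lsid);
        }
        return mappings.get(parts[2] + ":" + parts[3]);
    }

    /** Mirrors getKnownURIs on the servlet: ask every registered implementation. */
    List<String> allKnownURIs() {
        List<String> all = new ArrayList<>();
        for (AuthorityServiceSketch impl : mappings.values()) {
            all.addAll(impl.getKnownURIs());
        }
        return all;
    }
}
```

Keying on authority and namespace together is what lets one servlet host several unrelated authority implementations side by side.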
DataServlet exposes its single getData operation via HTTP GET and HTTP SOAP. For SOAP, the LSID is contained in the single SOAP parameter. For HTTP GET, the LSID is expected in the query string, for example: http://www.myappserver.com/lsid/data?lsid=urn:lsid:foo:bar. The MetadataServlet works the same way but has an optional argument that the metadata service implementation can use as a hint to retrieve the metadata. For SOAP, this argument is embedded in the URL; for example: http://www.myappserver.com/lsid/metadata/metadata-hint. For HTTP GET, we use two request parameters for the LSID and the hint; for example: http://www.myappserver.com/lsid/metadata?lsid=urn:lsid:foo:bar&hint=metadata-hint.
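For the HTTP GET case, a client could assemble these request URLs along the following lines. The helper class is hypothetical. It percent-encodes the parameter values, so the colons in the LSID appear as %3A rather than literally as in the examples above; a servlet container decodes either form to the same parameter value.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Hypothetical helper for building HTTP GET URLs for the data and metadata servlets. */
final class LsidHttpUrls {
    private LsidHttpUrls() {}

    /** Example base: http://www.myappserver.com/lsid */
    static String dataUrl(String base, String lsid) {
        return base + "/data?lsid=" + encode(lsid);
    }

    /** The hint is optional; pass null to omit it. */
    static String metadataUrl(String base, String lsid, String hint) {
        String url = base + "/metadata?lsid=" + encode(lsid);
        return hint == null ? url : url + "&hint=" + encode(hint);
    }

    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be supported by every Java platform.
            throw new IllegalStateException(e);
        }
    }
}
```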
A given LSID Resolution Service need only consist of an Authority Service. The Data and Metadata Services are, in a sense, only utilities for providing data and metadata endpoints. A given data provider, for example, might already have convenient HTTP locations for the data, which can be referenced directly in the WSDL that the authority returns.
Security in the LSID Resolution Network is handled at the protocol level. For HTTP services, HTTP Basic Authentication is used. For SOAP services, authentication for the underlying transport protocol is used (HTTP, in current implementations). The LSID Client Stack and Server Framework handle these two cases.
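For reference, HTTP Basic Authentication amounts to sending a Base64-encoded "user:password" pair in the Authorization request header. The sketch below shows only how that header value is formed using the standard java.util.Base64 class; the helper class name is hypothetical, and how the LSID Client Stack actually accepts credentials is not covered here.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Hypothetical helper that builds an HTTP Basic Authentication header value. */
final class BasicAuthSketch {
    private BasicAuthSketch() {}

    static String authorizationHeader(String user, String password) {
        String credentials = user + ":" + password;
        byte[] raw = credentials.getBytes(StandardCharsets.UTF_8);
        return "Basic " + Base64.getEncoder().encodeToString(raw);
    }
}
```

Because Base64 is an encoding and not encryption, Basic Authentication is normally paired with a secure transport.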
The following examples of LSID Resolution Services show how the LSID Server Framework can be used.
The Caching Proxy Resolution Service, the LSID analogue of an HTTP proxy, is a server that sits on the edge of a network (lab, department, organization, etc.) and proxies all LSID traffic. Furthermore, the proxy can cache all of its requests so that scientists using the same LSID working set will experience rapid response times. The Caching Proxy Resolution Service also serves as a way to monitor LSID traffic in an organization. Figure 2 shows the Caching Proxy Resolution Service.
Figure 2. The Caching Proxy Resolution Service architecture
The Caching Proxy Resolution Service uses the client stack to proxy requests to other services. The caching functionality of the client enables the proxy to respond quickly to requests it has cached. The caching proxy can process any LSID that is resolvable via DNS, so its list of known URIs is technically the global space of LSIDs. However, to pare this down, we return only LSIDs for which we have WSDL cached when getKnownURIs is invoked.
The Caching Proxy Resolution Service is composed of an Authority Service, a Data Service, and a Metadata Service. For getAvailableOperations, the proxy uses the client stack to call getAvailableOperations itself and builds another WSDL based on the response it receives. This new WSDL contains locations of data and metadata services in the proxy. When the proxy receives a request for getData, it makes a request to an arbitrary data location from the original WSDL, since the data itself is identical across all locations. However, each metadata location may contain different metadata, and so each original location must be exposed through the proxy. We encode the metadata port name in the hint in the URL we return in our WSDL. Thus, when we receive a request for getMetaData, we can relay the call to a specific metadata location.
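The hint mechanism can be sketched as a simple two-way mapping. Everything below is hypothetical (class name, method names, and the assumption that port names are safe to place in a URL path); it shows one instance per original WSDL, with the port name appended to the proxy's metadata URL as the hint and later mapped back to the original location.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical bookkeeping for one LSID's metadata ports as seen by the proxy. */
final class ProxyMetadataHints {
    private final String proxyMetadataBase; // for example, http://proxy.example.org/lsid/metadata
    private final Map<String, String> originalByPort = new LinkedHashMap<>();

    ProxyMetadataHints(String proxyMetadataBase) {
        this.proxyMetadataBase = proxyMetadataBase;
    }

    /** Records a metadata port found in the original WSDL. */
    void remember(String portName, String originalLocation) {
        originalByPort.put(portName, originalLocation);
    }

    /** Locations to advertise in the proxy's own WSDL: the port name becomes the hint. */
    Map<String, String> proxiedLocations() {
        Map<String, String> proxied = new LinkedHashMap<>();
        for (String port : originalByPort.keySet()) {
            proxied.put(port, proxyMetadataBase + "/" + port);
        }
        return proxied;
    }

    /** On getMetaData: maps the hint back to the original location, or null if unknown. */
    String originalLocationFor(String hint) {
        return originalByPort.get(hint);
    }
}
```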
The Gateway Resolution Service provides an XML-based language, the Authority Service Description Language (ASDL), that explicitly describes the behavior of the Authority Service. An ASDL document contains a list of available LSIDs and their corresponding data and metadata locations. Currently, this document should be auto-generated from a relational database or flat-file store. A version is in development that allows mappings to be specified via Java-based regular expressions so that an entire authority may be described in a few lines of hand-written XML.
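To give a feel for the regular-expression idea, here is a minimal sketch using java.util.regex. It does not reflect ASDL syntax, which is not shown in this article; the class name, the pattern, and the example.org URLs are all assumptions made for illustration. A single pattern with a capture group maps every matching LSID to a URL.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical LSID-to-URL mapping driven by one regular expression. */
final class RegexLsidMapping {
    private final Pattern lsidPattern;
    private final String urlTemplate; // may reference capture groups as $1, $2, ...

    RegexLsidMapping(String regex, String urlTemplate) {
        // Anchor the expression so that it must match the whole LSID.
        this.lsidPattern = Pattern.compile("\\A(?:" + regex + ")\\z");
        this.urlTemplate = urlTemplate;
    }

    /** Returns the mapped URL, or null if the LSID does not match the pattern. */
    String map(String lsid) {
        Matcher m = lsidPattern.matcher(lsid);
        if (!m.matches()) {
            return null;
        }
        return m.replaceFirst(urlTemplate);
    }
}
```

For example, the pattern urn:lsid:example\.org:proteins:(\w+) with the template http://data.example.org/proteins/$1.xml would cover every object in that namespace with one rule.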
The gateway has two use cases. The first, illustrated in Figure 3, is to provide an LSID-based view of a local data store. The developer must provide data and metadata service implementations that scrape the local data store. For example, these implementations might utilize JDBC to read the tables of a relational database. The entries in the ASDL file will reference the location of these data and metadata services.
Figure 3. The Local Gateway Resolution Service
The second use case, shown in Figure 4, is perhaps more powerful. If a life sciences data provider exposes data by static or dynamic URLs, as many do, a third-party developer (with permission of course) may create an ASDL document that assigns virtual LSIDs to the data. The data and metadata locations in the ASDL will point to pre-existing URLs.
Figure 4. The Remote Gateway Resolution Service
In practice, a combined approach may be used for providing third-party data. The ASDL file may contain both URLs pointing to the original data source and URLs pointing to a Data Service and/or a Metadata Service. In general, these services will relay the requests to the original data source URLs. This architecture will ensure that the data is provided over both SOAP and HTTP. This might be necessary in case the data provider allows only FTP access. FTP is not likely to be supported by all clients. However, the relays may need to do more than bridge two protocols. For example, the original URLs of the third-party data source may reference formatted HTML pages. The relay services might have to scrape the actual data from these pages in order to provide it in a standard format.
If ASDL is not descriptive enough to completely describe data and metadata mappings, a developer may provide a custom implementation as illustrated in Figure 5.
Figure 5. Architecture of a Custom Resolution Service
The most involved aspect of building an Authority Service is creating the WSDL in response to getAvailableOperations. The server framework provides simpler interfaces via the abstract classes SimpleAuthority and SimpleResolutionService. With SimpleAuthority, the developer need only implement methods that return the locations of metadata and data for a given LSID. This information is used to construct WSDL. In Figure 5, LSIDAuthorityImpl could be written to extend SimpleAuthority. SimpleResolutionService provides a further abstraction for the common use case in which the Metadata Service and Data Service are hosted together with the Authority Service. In that case, the authority and metadata implementations and LSIDDataServiceImpl could be merged into a single class that extends SimpleResolutionService. This derived class is also an Authority Service implementation that directs data and metadata requests to itself.
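The division of labor behind SimpleAuthority resembles the template method pattern, which the following sketch illustrates. It is not the framework's class: the name, the abstract method signatures, and the plain-text output (standing in for generated WSDL) are assumptions made for illustration.

```java
import java.util.List;

/** Hypothetical analogue of SimpleAuthority; the real class generates WSDL. */
abstract class SimpleAuthoritySketch {
    /** Subclasses only say where data for the LSID lives. */
    protected abstract List<String> getDataLocations(String lsid);

    /** Subclasses only say where metadata for the LSID lives. */
    protected abstract List<String> getMetadataLocations(String lsid);

    /** Template method: assembles the response from the subclass's locations. */
    public final String getAvailableOperations(String lsid) {
        StringBuilder description = new StringBuilder();
        description.append("operations for ").append(lsid).append('\n');
        for (String location : getDataLocations(lsid)) {
            description.append("data: ").append(location).append('\n');
        }
        for (String location : getMetadataLocations(lsid)) {
            description.append("metadata: ").append(location).append('\n');
        }
        return description.toString();
    }
}
```

The subclass never touches the response format, so the framework can change how the description is built without affecting service implementations.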
An example Custom Resolution Service is the Protein Data Bank (PDB) Authority. The PDB authority returns a mix of HTTP and FTP data locations, as well as SOAP endpoints that proxy data from those locations. The PDB authority also offers a metadata service that generates comprehensive RDF that relates LSIDs to each other and provides links to external resources.
Figure 6 shows the architecture of the PDB Authority. For convenience, a complete resolution service is often referred to simply as an authority. The WSDL returned by this authority provides both FTP and HTTP direct data locations and SOAP data locations via the PDB Data Service.
Figure 6. The Protein Data Bank Authority
Because many authority implementations may be hosted by a single servlet, a host of hybrid authorities are possible. Consider a Caching Proxy Resolution Service that was also the direct authority for a certain set of LSIDs. From the perspective of a client application in a biology lab, the data discovery and retrieval process will appear uniform and seamless, regardless of whether or not the service had to go outside to resolve an LSID. Such a service could be used as the central authority for an enterprise, such as a pharmaceutical company, where researchers needed to integrate data from internal as well as external sources. As the service handles more requests for external LSIDs, the cache grows, allowing increasingly rapid access to external data.
Figure 7 illustrates this Hybrid Resolution Service. The curved arrows show how the servlets dispatch requests to different service implementations. All LSIDs with authority myauth will be handled by the Local LSID Services. All other LSIDs will be handled by the Caching Proxy.
Figure 7. Hybrid Resolution Service architecture
The LSID utopia described at the beginning of this article could be realized if each organization builds an Enterprise Resolution Network based on the Hybrid Caching Resolution Service with the assumption that the internal services are available to the outside world as well. Consider two research labs, each with such a network. Client applications in both laboratories have the same view of the federated data. In addition to this symmetry, the client applications can access external LSID-based data sources such as PDB and Gateway-based services. Independent external users may also access the resolution network with a client application. These relationships are illustrated in Figure 8.
Figure 8. Enterprise LSID Resolution Network
Utopian visions aside, LSID Resolution will become more useful when more providers use LSIDs to expose their data. LSID Resolution Services exist for many well-known life sciences data sources including PDB, Genbank, Pubmed, Swiss-Prot, Gene Ontology, Locuslink, and Ensembl. With a bit of work or outsourcing, any organization could provide its existing databases via LSID.
The remaining question to be answered is, how do scientists and researchers dynamically assign LSIDs to their personal data, results, papers, lab reports, and other documents? This capability is a prerequisite to fully achieving the level of collaboration and data federation enabled by the LSID Resolution Network. The LSID development team in Cambridge, Mass., is currently working to solve this new problem.
- Download server, client, and sample resolution service code from the LSID project summary page on developerWorks.
- For more information about the LSID project, visit the LSID Web site.
- "Build an LSID authority on Linux" provides detailed instructions for building an LSID Resolution Service.
- For information on Resource Description Framework metadata format, visit RDF.
- For information on Protein Data Bank, visit pdb.org.
- For information on Genbank, Pubmed, and Locuslink, visit the National Center for Biotechnology Information home page.
- For information on Ensembl, visit the Ensembl project page.
- For information on Swiss-Prot, visit the Swiss-Prot page on the ExPASy Molecular Biology Server.
- For information on Gene Ontology, visit the Gene Ontology Consortium home page.
- developerWorks has published two articles on using open source software in the biosciences and in the laboratory in general. Open source in the biosciences looks largely at the bioinformatics landscape, while Open source in the lab looks at Perl, Python, and open source toolkits for data analysis.
- Find more resources on open source projects, Java technology, and Web services on developerWorks.