The task of uniquely naming biologically significant resources with LSIDs has fallen into your hands, since you are in charge of your project's data model. Many basic concepts of data modeling map well into LSIDs. However, there are some other factors that must be carefully examined. These include naming, caching, and the distinction between data and metadata. This article guides you through these aspects and discusses the current best practices in each topic.
The key to an easy-to-maintain system is a good grasp of your data model. While an LSID is defined to be semantically opaque, the author of an LSID resolution service must interpret the encoding to resolve and return the correct data. This section will describe some common data models and provide examples for exposing those using LSIDs.
An LSID consists of three scoping mechanisms: an authority, a namespace, and an identifier. It can also optionally contain a version, specified by a revision identifier. These parts are combined to create an LSID in the following form.
Listing 1. LSID format
urn:lsid:<authority>:<namespace>:<identifier>[:<revision>]
The first step in creating a naming scheme for an LSID service is choosing an authority name. This is normally a top-level domain (TLD) registered to your organization, such as example.com. Since LSID resolution uses SRV records, your TLD does not have to point to the IP of your LSID server. In large organizations or for separate projects, you may wish to use a subdomain for the authority, such as department.example.com or project.example.com. Using separate authorities for projects can minimize the risk of collisions between entities of your data model and your namespace identifiers.
Now that an authority string has been chosen that is unique to your new LSID server, here is one possible mapping of an example data model into the LSID space. Suppose your data model contains the entities ResearchInstitute, Researcher, and Paper. A ResearchInstitute is identified by its ResearchInstitute ID, a Researcher is identified by its Researcher ID, and a Paper is identified by its Paper ID. This model is shown in Figure 1.
Figure 1. Academia entity-relationship model
The tables in a relational database would look similar to Figure 2.
Figure 2. Academia database tables
In this basic example, each entity has one unique identifier. The simplest way to expose this data model in LSID space would be to create one namespace per entity and use the entity's identifier as the LSID identifier.
Listing 2. Sample academia LSIDs
urn:lsid:example.com:researchinstitute:ibmresearch
urn:lsid:example.com:researcher:alice123
urn:lsid:example.com:paper:132187
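The one-namespace-per-entity mapping can be sketched in a few lines of Python. This is a minimal illustration, assuming the entity names from Figure 1 and the example.com authority; the `NAMESPACES` table and the `lsid_for` helper are hypothetical names and are not part of any LSID toolkit.

```python
AUTHORITY = "example.com"  # authority string chosen for this example

# One namespace per entity in the academia data model (lowercase names assumed).
NAMESPACES = {
    "ResearchInstitute": "researchinstitute",
    "Researcher": "researcher",
    "Paper": "paper",
}


def lsid_for(entity: str, entity_id: str) -> str:
    """Build an LSID from an entity name and that entity's unique identifier."""
    if entity not in NAMESPACES:
        raise KeyError(f"no namespace defined for entity {entity!r}")
    if not entity_id or ":" in entity_id:
        # ':' separates the parts of an LSID, so it cannot appear in the identifier.
        raise ValueError(f"unusable identifier: {entity_id!r}")
    return f"urn:lsid:{AUTHORITY}:{NAMESPACES[entity]}:{entity_id}"
```

Calling `lsid_for("Paper", "132187")` yields the third sample LSID from Listing 2.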
Notice that the ResearcherPaper table is merely a many-to-many link and is not an entity in an entity-relationship diagram. This table adds no new information to the model, and the links it creates can be added to the metadata for Paper instances, Researcher instances, or both.
A revision is an optional identifier that marks that a change has occurred. Suppose our researcher is named Alice. Alice has revised a research paper she had previously published. We can update the LSID revision to reflect this change, as shown in Listing 3.
Listing 3. Use of revision
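As a rough Python sketch of how a revision attaches to an LSID, the helpers below split an LSID into its parts and rebuild it with a new revision. The `Lsid` tuple, the helper names, and the revision value "2" used in the usage note are assumptions made for illustration; the specification does not prescribe how revision values are chosen.

```python
from typing import NamedTuple, Optional


class Lsid(NamedTuple):
    authority: str
    namespace: str
    identifier: str
    revision: Optional[str] = None  # the revision part is optional


def parse_lsid(text: str) -> Lsid:
    """Split 'urn:lsid:authority:namespace:identifier[:revision]' into parts."""
    parts = text.split(":")
    if len(parts) not in (5, 6) or [p.lower() for p in parts[:2]] != ["urn", "lsid"]:
        raise ValueError(f"not an LSID: {text!r}")
    if "" in parts[2:]:
        raise ValueError(f"empty component in LSID: {text!r}")
    return Lsid(*parts[2:])


def format_lsid(lsid: Lsid) -> str:
    """Join the parts back into an LSID string, omitting an absent revision."""
    parts = ["urn", "lsid", lsid.authority, lsid.namespace, lsid.identifier]
    if lsid.revision is not None:
        parts.append(lsid.revision)
    return ":".join(parts)


def with_revision(text: str, revision: str) -> str:
    """Return the same LSID carrying the given revision identifier."""
    return format_lsid(parse_lsid(text)._replace(revision=revision))
```

For Alice's revised paper, `with_revision("urn:lsid:example.com:paper:132187", "2")` produces a new LSID that names the revised bytes, while the original LSID keeps naming the original ones.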
A more complex example involves entities that do not have just one unique identifier. Suppose there is a group of several ResearchInstitutes, and each ResearchInstitute assigns the Researcher a SerialNumber unique only within the ResearchInstitute. A SerialNumber might therefore not be unique across all researchers in different ResearchInstitutes. In your data model, a Researcher must be identified by the ResearchInstitute and SerialNumber. This model is shown in Figure 3.
Figure 3. Complex academia entity-relationship model
Now you want to map this entity model into LSID space. There are several common solutions to map multiple identifiers. The first solution is to include both identifiers in the identifier portion of the LSID and use a separator character. This is shown in Listing 4.
Listing 4. Complex identifiers, encoding
In your authority implementation, you would break apart the LSID identifier and retrieve the correct researcher record from your database. You must ensure that the character you choose as a separator cannot be present in the first entity identifier. If this can be ensured, this is the preferred method.
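The separator approach can be sketched as follows in Python. The hyphen separator, the function names, and the assumption that a SerialNumber is a non-negative integer are all illustrative choices; the check in `encode_identifier` enforces the rule that the separator must not occur in the first identifier.

```python
from typing import Tuple

SEPARATOR = "-"  # assumed separator; must never occur in a ResearchInstitute ID


def encode_identifier(institute_id: str, serial_number: int) -> str:
    """Combine both keys into the identifier portion of an LSID."""
    if not institute_id or SEPARATOR in institute_id:
        raise ValueError(f"institute ID is empty or contains {SEPARATOR!r}: {institute_id!r}")
    if serial_number < 0:
        raise ValueError("serial number must be non-negative")
    return f"{institute_id}{SEPARATOR}{serial_number}"


def decode_identifier(identifier: str) -> Tuple[str, int]:
    """Break the identifier apart again inside the authority implementation."""
    # Splitting at the first separator is safe because the institute ID cannot contain it.
    institute_id, found, serial = identifier.partition(SEPARATOR)
    if not found or not institute_id or not serial.isdigit():
        raise ValueError(f"malformed composite identifier: {identifier!r}")
    return institute_id, int(serial)
```

The authority would then use the decoded pair as the lookup key for the researcher record in its back-end store.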
A second solution is to give each researcher a unique ID in your data model. This can be done transparently from the user's perspective. Since LSIDs are opaque semantically, a user will not need to know you are using a unique ID and can work with your data as normal. This unique ID could be placed as a sequence or autonumber column directly with your data in the back-end store. Another option is to maintain a separate mapping that only your LSID authority uses. In this case, you would need to ensure that changes to the data are propagated to your mapping.
Listing 5. Complex identifiers, mapping
urn:lsid:example.com:researcher:a23b32o21wq

Mapping table:
+--------------------------------------------------------------+
| map.lsid_researcher | map.researcher | map.researchinstitute |
+--------------------------------------------------------------+
| a23b32o21wq         | 12345          | "ibmresearch"         |
+--------------------------------------------------------------+
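A separate mapping of this kind could be kept in any store. The Python sketch below uses an in-memory SQLite table purely for illustration; the table name `lsid_map`, the function names, and the use of a random UUID as the opaque ID are assumptions, not something the LSID specification requires.

```python
import sqlite3
import uuid
from typing import Tuple


def open_mapping() -> sqlite3.Connection:
    """Create the mapping table that only the LSID authority uses."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE lsid_map ("
        " lsid_researcher TEXT PRIMARY KEY,"
        " researcher INTEGER NOT NULL,"
        " researchinstitute TEXT NOT NULL,"
        " UNIQUE (researcher, researchinstitute))"
    )
    return conn


def opaque_id_for(conn: sqlite3.Connection, institute: str, serial: int) -> str:
    """Return the opaque ID for a researcher, creating one on first use."""
    row = conn.execute(
        "SELECT lsid_researcher FROM lsid_map"
        " WHERE researcher = ? AND researchinstitute = ?",
        (serial, institute),
    ).fetchone()
    if row is not None:
        return row[0]
    opaque = uuid.uuid4().hex  # carries no meaning, as an opaque identifier should
    conn.execute("INSERT INTO lsid_map VALUES (?, ?, ?)", (opaque, serial, institute))
    return opaque


def resolve(conn: sqlite3.Connection, opaque: str) -> Tuple[str, int]:
    """Map the identifier portion of an LSID back to (institute, serial number)."""
    row = conn.execute(
        "SELECT researchinstitute, researcher FROM lsid_map WHERE lsid_researcher = ?",
        (opaque,),
    ).fetchone()
    if row is None:
        raise KeyError(opaque)
    return row[0], row[1]
```

Because the mapping lives apart from the data, inserts and deletes in the researcher table still have to be propagated to it, as noted above.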
These data model to LSID mappings are only a common practice. You are free to use the authority, namespace, and identifier scoping however they work best for your data.
LSID metadata is normally represented in an RDF serialization. LSIDs may be used in valid RDF syntax. To find out more about RDF, please see the W3C RDF site.
A key benefit of using LSID as a naming convention is the clear separation of data and metadata. This also gives the implementer of an LSID authority the task of determining what is data and what is metadata. The first thing to realize is that while almost every LSID will have associated metadata, many LSIDs may not be associated with data.
Data is defined as a sequence of unchanging bytes. Examples of data are microscope images, a protein sequence, a text file, etc. Metadata is usually information that describes the data either literally (date created, MD5 check sum, size) or contains information describing the relationship between the data and other objects. The main point remains the same: The data of a specific LSID never changes, whereas the metadata may be updated.
So why would a protein sequence be classified as data if it could just as easily be designated by an RDF predicate in the metadata? The answer: It depends. This is a choice the implementer must make based on how she and her users will be using the system. When making your decision, keep in mind that the metadata will most likely be parsed into an RDF model, and data is delivered in a stream of bytes. If you cannot determine what should be data and what should be metadata from your data model, follow this rule of thumb: Large byte sequences are easier to manipulate as data, while short byte sequences can be included as data, metadata, or made available in both forms. This is elaborated on later in this article.
Including additional metadata for your LSID allows your data model to be published into LSID space. This publication creates links between your individual LSIDs and those of others, forming a graph of information capable of being interpreted by a machine. The question arises: What additional metadata information should be included for a specific LSID?
As with naming, the additional metadata for an LSID will most likely be derived from your data model. A data model consists of two types of descriptors: attributes and relationships. Both of these types of descriptors will go into your metadata. Attributes will most likely become RDF literals since they are primitive data types. Relationships will likely be references to other LSIDs if you created LSIDs for each entity. Several reasons for a strict subdivision of metadata by entity boundaries will be explained in the caching section. Let us take a look at the academia entity relationship model again.
Figure 1. Academia entity-relationship model
The metadata for ResearchInstitute instances contains three literal values: the institute's ID, name, and location. Each of these can be represented as RDF literals. A ResearchInstitute instance can also have many researchers. Links to the LSID of the Researcher instances would be included in the metadata as RDF resource links. The Researcher metadata is similar, containing the researcher's ID, fname, and lname. These would become RDF literals. The researcher's institute and papers would be added as a link to an RDF resource.
Notice that links have been added in both directions for each relationship. This may appear redundant to some RDF gurus, but it is necessary for LSID space. Since LSID metadata is distributed, you may only have a partial picture of the universe. For instance, Alice receives an LSID link to a fellow researcher. She would be able to resolve the outgoing LSID links to find her fellow's papers, but would be unable to identify which ResearchInstitute he works for. Metadata that publishes bidirectional links solves this problem. For this reason, it is recommended that links be made in both directions to allow for follow-up resolution.
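The effect of bidirectional links can be pictured with plain subject-predicate-object tuples standing in for RDF triples. In this Python sketch the predicate names under urn:lsid:example.com:predicates: (`employs`, `worksfor`) and the researcher ID `bob456` are hypothetical, and a real authority would emit the triples in an RDF serialization rather than as tuples.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

PREDICATES = "urn:lsid:example.com:predicates:"  # hypothetical predicate namespace


def link_both_ways(a: str, forward: str, b: str, backward: str) -> List[Triple]:
    """Emit one triple per direction so either LSID leads to the other."""
    return [(a, PREDICATES + forward, b), (b, PREDICATES + backward, a)]


def metadata_for(lsid: str, triples: List[Triple]) -> List[Triple]:
    """The triples an authority would return for a single LSID."""
    return [t for t in triples if t[0] == lsid]


institute = "urn:lsid:example.com:researchinstitute:ibmresearch"
researcher = "urn:lsid:example.com:researcher:bob456"
graph = link_both_ways(institute, "employs", researcher, "worksfor")
```

With only the forward triple, `metadata_for(researcher, graph)` would be empty and Alice could not discover the institute; with both directions it contains the `worksfor` link.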
LSID caching is one of the central concepts in the resolution scheme. Working with globally resolvable identifiers doesn't mean resolving them globally at all times. As the volume of resolutions increases, traffic volume and authority server load can be greatly reduced using techniques that take advantage of the temporal and network locality of the requests. Caching involves lookup, storage, and cleaning, and can be done to both data and metadata in three different ways: on the authority server, in a caching proxy, and in the client stack.
Before looking at each type of caching scenario, first examine the contents that will be stored in the caches. It is possible to cache both data and metadata calls for LSIDs. Data caching is very straightforward since it is defined as an unchanging sequence of bytes. Data in the cache is keyed on the entire LSID and can be stored in a cache indefinitely. "Indefinitely" is defined as the time it takes for your cache to reach its cleaning storage threshold.
Caching metadata is a more complex task since it is not guaranteed to be either unique or unchanging. A best-effort system is currently used that keys cached metadata based on the authority who distributed it and the LSID. A timeout value can be placed on the metadata by the issuing authority to give a cache an estimated time of accuracy. If the metadata is in a state of constant flux, the authority can specify that it should not be cached. More complex caching schemes for metadata such as pubsub can and have been deployed, but are out of the scope of this article.
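The best-effort metadata scheme described above might look like the following Python sketch. It assumes the timeout reaches the cache as a number of seconds and that a missing or non-positive timeout means "do not cache"; how the timeout is actually conveyed is not covered here, and the class and method names are illustrative.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class MetadataCache:
    """Best-effort metadata cache keyed on (distributing authority, LSID)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: Dict[Tuple[str, str], Tuple[str, float]] = {}

    def put(self, authority: str, lsid: str, document: str,
            ttl_seconds: Optional[float]) -> None:
        key = (authority, lsid)
        if ttl_seconds is None or ttl_seconds <= 0:
            # Assumed convention: the authority asked that this metadata not be cached.
            self._entries.pop(key, None)
            return
        self._entries[key] = (document, self._clock() + ttl_seconds)

    def get(self, authority: str, lsid: str) -> Optional[str]:
        key = (authority, lsid)
        entry = self._entries.get(key)
        if entry is None:
            return None
        document, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # estimated time of accuracy has passed
            return None
        return document
```

A data cache needs none of this expiry logic: because the bytes for an LSID never change, it can key on the full LSID alone and evict only when storage runs low.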
It is important to think of metadata caching while you design the naming conventions for your data model. In certain instances, you may wish to include additional metadata about related entities in an LSID's resultant RDF metadata document. This is a dangerous practice since the metadata will be cached and subsequent updates to the related entities may not be properly propagated to clients. Since this can lead to metadata inconsistencies, it is important to only place metadata that is primarily about an LSID into its RDF metadata document. Related entities should be referenced with links to their respective LSIDs. With these concepts of data and metadata caching in mind, take a look at the three caching scenarios.
As the maintainer of an LSID authority, you want your server or server cluster to handle as many simultaneous requests as possible. Two likely bottlenecks in the request fulfillment process are the back-end store and converting metadata into RDF. An LSID authority can be backed by any conceivable storage device. Some common stores are a relational database management system (RDBMS) or Web services. While many programmers work tirelessly to improve performance of RDBMSes and Web services, they are still slow compared to all other aspects of the authority's task. As an authority creator interested in speed, you want to touch the back end as little as possible.
The best way to decrease back-end delay and increase speed of execution is using authority caching. This functionality is already included in the reference LSID server implementation. After a normal lookup has occurred, it is possible to cache the results to disk using a directory and file structure for keys. These cached copies can be used in subsequent data and metadata requests.
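To make the idea of a directory and file structure for keys concrete, here is a small Python sketch. It is not the layout used by the reference server, which is not described here; the hashed two-level directory scheme, the default cache location, and the function names are all assumptions chosen so that arbitrary LSID characters never end up in file names.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Optional

# Assumed default location; a real deployment would configure this.
DEFAULT_ROOT = Path(tempfile.gettempdir()) / "lsid-authority-cache"


def cache_path(lsid: str, kind: str, root: Path = DEFAULT_ROOT) -> Path:
    """Derive the file that holds the cached 'data' or 'metadata' for an LSID."""
    if kind not in ("data", "metadata"):
        raise ValueError(f"unknown cache kind: {kind!r}")
    digest = hashlib.sha256(lsid.encode("utf-8")).hexdigest()
    # The first two hex digits pick a subdirectory so no directory grows too large.
    return root / kind / digest[:2] / digest[2:]


def store(lsid: str, kind: str, payload: bytes, root: Path = DEFAULT_ROOT) -> None:
    """Write the result of a back-end lookup to disk."""
    path = cache_path(lsid, kind, root)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(payload)


def load(lsid: str, kind: str, root: Path = DEFAULT_ROOT) -> Optional[bytes]:
    """Return the cached bytes, or None if the back end must be consulted."""
    path = cache_path(lsid, kind, root)
    return path.read_bytes() if path.is_file() else None
```

An authority would call `load` first and touch the back-end store only on a miss, then `store` the result. Cached metadata would additionally need an expiry policy, which this sketch leaves out.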
When working with LSIDs from different authorities, you and your co-workers will most likely be resolving a similar group of LSIDs. There are billions of LSIDs, and only a small portion will be requested. You will also most likely be repeatedly resolving the same LSID over a period. Knowing these facts, it is possible to set up a caching proxy for LSIDs for your workgroup or organization to improve the resolution process. A proxy can improve the quality of service since it will be very close to the end user geographically, reducing extranet bandwidth use and lowering total time to resolution completion. It also helps reduce the load on the main authority server. This helps eliminate upstream authority congestion that can affect your other resolutions.
The client cache exists for the same reasons as the caching proxy. It further lowers the time to resolution completion and reduces load on the upstream caching proxies and authority server. It also allows offline access to the data and metadata of previously accessed LSIDs. There is one distinction between the client and proxy caches worth noting: The proxy cache returns the same metadata document that would have been retrieved from the authority. The client cache may store this document, or it may store the metadata in a persistent RDF model, such as Jena. In this scenario, the client is responsible for ensuring the freshness of the metadata. This is another situation where an LSID that includes nonprimary metadata can become cached and updates to related LSIDs may not propagate correctly. It is important that implementers of authorities only include primary metadata to ensure proper handling.
A foreign authority is an LSID authority service that points to metadata for LSIDs for which it is not the actual authority as defined by the "authority string" in any LSID. These metadata services are provided independently and generally in addition to those already provided by the actual authority. An example foreign authority might return metadata service endpoints that contain additional metadata (for example annotations) about LSIDs for which it is not the authority.
In the Life Science Identifier specification, you now have a means by which you can uniquely name a data object, as well as a protocol through which client software can retrieve that data object or metadata information about that data object. You can also use the same protocol to retrieve metadata information about that data object from third parties called foreign authorities.
Many people now ask how LSID client software can dynamically discover and retrieve information stored about LSIDs collected by third parties (not the authority), given that there is generally no link stored between the authority for any LSID named object and a random third party that wishes to offer their own metadata about that named data object, for example third-party annotations on existing information.
The current specification leaves this capability undefined, and we are forced to talk about perhaps relying on "Bio-Google" type universal crawler/indexing services to find all mentions of any particular LSID or somehow "magically" knowing to also query against the LSID resolution services provided by collaborators or other well-known metadata sources as these emerge. Neither of these solutions adequately addresses the issue.
The foreign authority notification (FAN) framework provides the formal means by which third parties might provide pointers to their own information that would be retrievable along with the metadata provided by the authority for a Life Science Identifier.
If an LSID authority service wishes to implement FAN as well, it would provide two additional methods:
Listing 6. FAN methods
notifyForeignAuthority(String lsid, String authorityName)
revokeNotificationForeignAuthority(String lsid, String authorityName)
The optional notify method registers, with the authority for any LSID, the details of the foreign authority service that also knows something about that LSID. The acceptance and storage of notifications is left entirely to the discretion of that LSID's authority. The implementation of these mappings is also left to the implementer of the authority service.
However, all published mappings for an LSID must be returned by some metadata service pointed to by the actual authority and returned along with any other metadata in the getMetaData() method call. The metadata language proposed is RDF. We recommend that an RDF predicate such as the following be returned as metadata for the LSID concerned to indicate a foreign authority: urn:lsid:i3c.org:predicates:foreignauthority. What should sit on the right-hand side of this predicate is not yet settled. In an initial implementation, you might return the authority string. Richer implementations, however, might return an LSID representing the authority, which would link to semantic information about the foreign authority. The revokeNotificationForeignAuthority method removes the mapping if it exists.
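The notify, revoke, and publish cycle can be sketched as an in-memory registry. This Python illustration assumes the initial implementation described above, in which the plain authority string sits on the right-hand side of the predicate; the class name, the snake_case method names, the tuple representation of triples, and the sample foreign authority `annotations.example.org` are assumptions, and storage is left to the implementer in any case.

```python
from typing import Dict, List, Set, Tuple

FOREIGN_AUTHORITY = "urn:lsid:i3c.org:predicates:foreignauthority"


class FanRegistry:
    """In-memory store of foreign authority notifications, keyed by LSID."""

    def __init__(self) -> None:
        self._foreign: Dict[str, Set[str]] = {}

    def notify_foreign_authority(self, lsid: str, authority_name: str) -> None:
        """Record that authority_name also knows something about lsid."""
        self._foreign.setdefault(lsid, set()).add(authority_name)

    def revoke_notification_foreign_authority(self, lsid: str, authority_name: str) -> None:
        """Remove the mapping if it exists; unknown mappings are ignored."""
        self._foreign.get(lsid, set()).discard(authority_name)

    def foreign_authority_triples(self, lsid: str) -> List[Tuple[str, str, str]]:
        """Triples to merge into the metadata returned for lsid."""
        return [(lsid, FOREIGN_AUTHORITY, name)
                for name in sorted(self._foreign.get(lsid, set()))]
```

A metadata service would append the result of `foreign_authority_triples` to whatever else it returns for the LSID, so clients need no new call to discover foreign authorities.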
Use the metadata to return the foreign authorities because it allows the use of existing infrastructure and in particular the getMetaData() method call. In addition, rich metadata on the foreign authority may be included to provide trust or context information about the foreign authority. LSID authority names are returned because it allows the client to know the source of the foreign information for trust decisions. Users of third-party metadata for any LSID would need to be made aware that some of the information they are seeing was retrieved from the authority for that LSID, and some was returned from additional third parties. All of these parties, including the original authority, would most likely have different levels of trust in the eyes of the user.
Other candidates were actual authority endpoints and actual metadata/data service endpoints. Authority endpoints were ruled out because the endpoint may change while the authority name is constant. Actual service endpoints were ruled out because there would be no qualification of these endpoints. Furthermore, a foreign authority might add services, and this would require re-registration. Finally, the authority name is the most general and flexible means of identifying a source of information about an LSID.
A FAN service notifyForeignAuthority call contains no built-in means of access restriction. Authentication, web of trust, or other means could be used for access control; the choice is left to the implementer of a FAN service. Following is a recommended approach for discussion.
When a foreign authority is published, the implementation can actually try to resolve the foreign authority and check whether it has any metadata services for the given LSID. This does not solve the problem of foreign authorities presenting bogus metadata or of parties creating notifications without the permission of the third party, but it does make sure that all registered authorities have the metadata available that they claim to have. A second security approach would be to secure the publish service itself so that only authorized parties could publish mappings.
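Both safeguards can be combined in a thin wrapper around the notification call. In the Python sketch below, the `publisher` argument, the `has_metadata_service` callback (which stands in for actually resolving the foreign authority), and the class name are all assumptions for illustration; the FAN methods themselves carry no caller identity.

```python
from typing import Callable, Dict, Set


class GuardedFanService:
    """Accept a notification only from an authorized party and only if it checks out."""

    def __init__(self, has_metadata_service: Callable[[str, str], bool],
                 authorized_publishers: Set[str]) -> None:
        # has_metadata_service(authority_name, lsid) is a stand-in for resolving the
        # foreign authority and asking whether it offers metadata for the LSID.
        self._has_metadata_service = has_metadata_service
        self._authorized = set(authorized_publishers)
        self._foreign: Dict[str, Set[str]] = {}

    def notify_foreign_authority(self, publisher: str, lsid: str,
                                 authority_name: str) -> bool:
        """Return True if the notification was accepted and stored."""
        if publisher not in self._authorized:
            return False  # second approach: only authorized parties may publish
        if not self._has_metadata_service(authority_name, lsid):
            return False  # first approach: the claim could not be verified
        self._foreign.setdefault(lsid, set()).add(authority_name)
        return True

    def foreign_authorities(self, lsid: str) -> Set[str]:
        return set(self._foreign.get(lsid, set()))
```

As noted above, passing the verification step says nothing about the quality of the foreign metadata; it only confirms that the foreign authority answers for the LSID.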
Before presenting best practices for providing data over LSID, review the properties of LSID data that will guide this discussion. Primarily, LSID data is simply a sequence of bytes. The fact that this data represents a text document or was written by a certain author is generally part of the metadata. Such metadata may often be encoded in the data, just as a Microsoft Word document stores a document author's name. However, LSID data is always treated as an opaque set of bytes, and only specific applications need the ability to understand the data. In contrast, RDF metadata may be understood by a variety of general-purpose LSID applications. Second, recall that LSID data is immutable. Once a specific LSID has been assigned to a sequence of bytes, those bytes may not change. If the data has been updated, a new LSID should be issued with a new revision.
Since the data for a given LSID is only a sequence of bytes, it is usually straightforward to determine which bytes a data service should return for an LSID. For example, the LSID might name a JPEG image, so the bytes should be exactly the bytes of the JPEG file. However, some cases do exist where it is not entirely clear what data, if any, should be returned for an LSID. As a first example, consider an LSID that represents the short genetic sequence GATTACA. The data for this LSID should be some representation of the sequence. One choice is to return a text document containing the string GATTACA. Another choice is to return the sequence in standard FASTA ("fast A") format. Among other characteristics, FASTA files contain a header line marked with '>'. Our data might look like Listing 7.
Listing 7. Sample data bytes in FASTA format
>gene1 This is a gene we found after hours of research
GATTACA
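The two choices differ only in the bytes the data service returns. The Python sketch below produces and reads back a single-record FASTA payload like the one in Listing 7; it assumes ASCII text and one record per payload, and the function names are illustrative.

```python
from typing import Tuple


def to_fasta(header: str, sequence: str) -> bytes:
    """Render one sequence as the exact bytes a data service would return."""
    if "\n" in header or "\n" in sequence:
        raise ValueError("header and sequence must each fit on one line in this sketch")
    return f">{header}\n{sequence}\n".encode("ascii")


def parse_fasta(data: bytes) -> Tuple[str, str]:
    """Split a single-record FASTA payload into (header, sequence)."""
    lines = data.decode("ascii").splitlines()
    if not lines or not lines[0].startswith(">"):
        raise ValueError("missing '>' header line")
    # Sequence lines after the header are joined in case the sequence is wrapped.
    return lines[0][1:], "".join(lines[1:])
```

Whichever representation is chosen, the bytes returned for a given LSID must stay identical on every request, so even a change in line wrapping calls for a new LSID or revision.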
Finally, note that some LSIDs may contain no data at all. Such LSIDs are usually abstract or concept LSIDs that contain only metadata. This metadata may reference LSIDs that do contain data. The next section details how to use metadata to indicate format and provide connections between abstract and concrete LSIDs.
For a client application to consume LSID data, it must understand the format of the data. LSID data format helps an application process an LSID the same way MIME types help a browser dispatch data to the proper application. Because LSID is transport-independent, you do not want to rely on transport-specific properties, such as MIME type, to determine the data format. Some HTTP data services might choose to include a MIME type for convenience, or some SOAP data services might want to include a SOAP header. However, the application should not rely on such constructs. Instead, RDF metadata should be used to describe the data format. As discussed in the metadata section, defer to existing ontologies where possible in lieu of inventing a new set of RDF predicates. In this case, you use http://purl.org/dc/elements/1.1/#format, denoted dc:format.
Listing 8. dc:format example
<rdf:Description rdf:about="urn:lsid:ncbi.nlm.nih.gov.lsid.biopathways.org:genbank_gi:30350027">
  <dc:format rdf:resource="urn:lsid:lsid.biopathways.org:formats:fasta" />
</rdf:Description>
Listing 8 indicates that the Genbank nucleotide may be retrieved as a FASTA genetic sequence. RDF resources do not yet exist for every data format. Where one exists, service implementers should use an existing URL for the object of the dc:format statement. In this case, no such URL exists, so we (the authors) invented a reasonable LSID to represent FASTA format.
The data behind a particular concept may have been generated using several algorithms or processes. Each derivation should be assigned a unique LSID, but should also be tied to the underlying concept. For example, a protein sequence could be rendered as a wire frame or as a filled-out structure, with both stored in the same JPEG format.
The data bytes behind a concept might exist in multiple data formats or derivations. One approach using a single LSID would be to append all the different instances together, using some token to separate the different formats. This solution is poor for many reasons, primarily because the client must download all formats. The best approach is to create a different LSID for each data format or derivation and connect them with a single abstract LSID.
The benefit of using an abstract scheme is that it allows for LSIDs that do not name actual data bytes but instead provide only metadata documents. These LSIDs can be used to represent abstract notions, such as a gene or protein, which may have many concrete representations. The metadata documents associated with these abstract LSIDs can contain multiple relationships pointing to LSIDs that name data bytes.
In this way, researchers can use a series of LSIDs to create an interconnected metadata graph to name objects that may have many different representations. The abstract LSID provides the anchor point for software and users to explore the metadata and obtain further pointers to all the concrete LSID references that contain data, along with the data's exact relationship to the abstract concept. This level of indirection is very powerful.
The details of this approach are best illustrated with an example. Consider again the Genbank nucleotide from the discussion above and suppose that it is also stored as a string of base pairs with no FASTA headers or substitutions. Create two additional LSIDs, one for each data format, and link them with the predicates seen in Listing 9.
Listing 9. LSIDs with data
urn:lsid:ncbi.nlm.nih.gov.lsid.biopathways.org:genbank_gi:30350027-fasta
urn:lsid:ncbi.nlm.nih.gov.lsid.biopathways.org:genbank_gi:30350027-string
urn:lsid:lsid.biopathways.org:predicates:storedas (denoted bpw:storedas)
The suffixes -string and -fasta are used only for naming convenience. Recall that LSIDs are by definition opaque. Applications should not attempt to infer data format from the name. Instead, the two data LSIDs will contain the following metadata:
Listing 10. Metadata for concrete data LSIDs
<rdf:Description rdf:about="urn:lsid:ncbi.nlm.nih.gov.lsid.biopathways.org:genbank_gi:30350027-fasta">
  <dc:format rdf:resource="urn:lsid:lsid.biopathways.org:formats:fasta" />
</rdf:Description>

and

<rdf:Description rdf:about="urn:lsid:ncbi.nlm.nih.gov.lsid.biopathways.org:genbank_gi:30350027-string">
  <dc:format rdf:resource="urn:lsid:lsid.biopathways.org:formats:string" />
</rdf:Description>
To connect all of these, create an abstract LSID, urn:lsid:ncbi.nlm.nih.gov.lsid.biopathways.org:genbank_gi:30350027, that has no data associated. It does have metadata linking it to the concrete LSIDs that do have data.
Listing 11. Abstract LSID with metadata
<rdf:Description rdf:about="urn:lsid:ncbi.nlm.nih.gov.lsid.biopathways.org:genbank_gi:30350027">
  <bpw:storedas rdf:resource="urn:lsid:ncbi.nlm.nih.gov.lsid.biopathways.org:genbank_gi:30350027-string" />
  <bpw:storedas rdf:resource="urn:lsid:ncbi.nlm.nih.gov.lsid.biopathways.org:genbank_gi:30350027-fasta" />
</rdf:Description>
For efficiency purposes, the abstract LSID might contain the dc:format triples of the concrete LSIDs so that the client may easily choose a concrete LSID to resolve based on format without having to resolve all of them first. This is useful for programs that need to resolve the LSID in the specific context and format that the application requires.
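A client-side selection step along these lines could look like the Python sketch below. It assumes the abstract LSID's metadata has already been parsed into subject-predicate-object tuples and that it includes the dc:format triples of the concrete LSIDs; the function name and the tuple representation are illustrative, while the predicate and format URIs are the ones used in the listings.

```python
from typing import Iterable, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

STORED_AS = "urn:lsid:lsid.biopathways.org:predicates:storedas"
DC_FORMAT = "http://purl.org/dc/elements/1.1/#format"


def concrete_lsid_for(triples: Iterable[Triple], abstract_lsid: str,
                      wanted_format: str) -> Optional[str]:
    """Pick the concrete LSID stored in the wanted format, if the metadata names one."""
    known = list(triples)
    candidates = [obj for subj, pred, obj in known
                  if subj == abstract_lsid and pred == STORED_AS]
    for candidate in candidates:
        # The format comes from metadata, never from the '-fasta' style suffix.
        if (candidate, DC_FORMAT, wanted_format) in known:
            return candidate
    return None
```

If the abstract LSID did not carry the dc:format triples, the client would instead have to fetch the metadata of each candidate before choosing.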
Abstract LSIDs allow for updates to the data to be easily recognized. Suppose one of these links in the abstract LSID were a "Freshest Version" relationship. For example, a protein might be expressed as a FASTA sequence, an mmCIF data block, a JPEG image in a particular resolution, or a series of publication references. Each of these formats might have multiple versions as data is corrected over time. When the LSID containing the corrected data is given a new version, the abstract LSID can be updated to point to this revision. To reference an instance of the data, the abstract LSID would be resolved and its metadata retrieved. This metadata would be used to discover the LSID for the most recent version of the data, which would then be resolved in turn and its data retrieved.
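The two-step resolution just described can be sketched as follows in Python. The predicate URI urn:lsid:example.com:predicates:freshestversion is hypothetical, since no predicate is named for the "Freshest Version" relationship, and `get_metadata` and `get_data` are stand-ins for whatever client calls perform the actual LSID resolution.

```python
from typing import Callable, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Hypothetical predicate for the "Freshest Version" relationship.
FRESHEST_VERSION = "urn:lsid:example.com:predicates:freshestversion"


def fetch_freshest_data(abstract_lsid: str,
                        get_metadata: Callable[[str], List[Triple]],
                        get_data: Callable[[str], bytes]) -> Tuple[str, bytes]:
    """Resolve the abstract LSID, follow its freshest-version link, then fetch the data."""
    for subject, predicate, obj in get_metadata(abstract_lsid):
        if subject == abstract_lsid and predicate == FRESHEST_VERSION:
            return obj, get_data(obj)
    raise LookupError(f"no freshest-version link for {abstract_lsid}")
```

Because the link lives in metadata, which may change, a client that caches the abstract LSID's metadata sees a new revision only after its cached copy expires; the data of each concrete revision remains cacheable indefinitely.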
Thoughtful implementation of LSIDs is a prerequisite to achieving the full potential of collaboration and data federation that the LSID resolution network provides. Data models come in all shapes and sizes, and these best practices may not fit every situation. Even so, if you carefully consider the points laid out in this article when implementing your LSID system, your organization will be able to benefit from the natural data sharing capabilities of LSID.
- Get the latest news on the LSID Resolution Protocol Project and view LSID bug reports at IBM Life Science Identifier Project.
- Download the latest LSID file releases at IBM Life Science Identifier Software Releases.
Dan Smith is a software engineer with the IBM Advanced Internet Technology Group in Cambridge, Mass. His main research interests are distributed systems, data modeling, and agent-based systems on the Semantic Web. He also works on the LSID client and server design, and its implementation in the Java™ language. You can contact him at email@example.com.
Ben Szekely is a software engineer with IBM Internet Technology in Cambridge, Mass. He is the lead developer for the LSID Java Toolkit and is an author of the OMG LSID specification. His related research interests are Semantic Web applications and Semantic Web development tooling. You can contact him at firstname.lastname@example.org.