Part 1. Linked Data introduction
Background: The web of documents
Everyone who has ever used a browser is familiar with the World Wide Web that we've been enjoying for many years. This Web — really a web of documents — has provided a foundation for us to share previously unimaginable amounts of information, yet it has some key implementation details that ultimately impose a limit on its usefulness.
The Web represents information as text on pages. It was designed to allow humans to read, filter out redundant information, and infer meaning based on the natural language used, the context of the information, and the existing knowledge of the reader. In other words, we humans glean data from the Web pages that we read. Furthermore, the meaning of relationships between different pieces of information in the Web is never explicit. Again, we humans infer that meaning. HTML just isn't expressive enough to represent typed relationships between defined entities.
The document-centric Web does contain a lot of useful data, though. The problem is, due to the way that it is represented, we can't do as much as we'd like with that data. And just as problematic, if not more so, is that a lot of the data that we could use to help us answer our questions simply isn't published.
When we search the Web, we rely on algorithms employed by search engine indexers to provide links to documents that the indexer believes are relevant, but those might or might not contain the information that we seek. We trust the algorithms used not to exclude useful and relevant information from the result set, and we're expected to filter out any remaining irrelevant information and to combine information that is represented on different pages to try to arrive at answers to our questions.
Imagine being able to perform a search that allows you to expect accurate and relevant answers to even the most complex of the questions that you might ask. Does that sound good? It's what you can expect when, rather than searching a web of documents, you search a web of structured data.
Linked Data (see Resources) refers to a set of best practices for establishing a web of data; that is, publishing and connecting data on the Web. Linked Data can be read by machines, has explicitly defined meaning, can be linked to other data, and can, in turn, be linked to from other data.
When we describe meaning in the context of Linked Data, we are referring to describing data in a form that is understandable by computers.
Linked Data has four main supporting principles, defined by Tim Berners-Lee:
- Use URIs to identify things.
- Use HTTP URIs so that people can look up those names.
- When someone looks up a URI, provide some useful information about that thing.
- Include links to other URIs so that additional, related things can be discovered.
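As a small sketch of the second and third principles, a Linked Data client looks up an HTTP URI and uses standard content negotiation to ask for a machine-readable representation rather than an HTML page. The DBpedia URI below is only an illustration of the pattern; the code constructs the request without actually sending it over the network.

```python
from urllib.request import Request

# Principle 2: the identifier is an HTTP URI, so it can be looked up.
uri = "http://dbpedia.org/resource/HTML"

# Principle 3: when looking it up, ask for structured data (Turtle)
# instead of an HTML page, using standard HTTP content negotiation.
req = Request(uri, headers={"Accept": "text/turtle"})

print(req.full_url)              # the thing being identified
print(req.get_header("Accept"))  # the representation we want back
```

A server that follows the Linked Data principles would respond to this request with RDF describing the resource, including links to further URIs (principle 4).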
Just as the Web relies on Uniform Resource Identifiers (URIs) and the Hypertext Transfer Protocol (HTTP) to provide a hugely scalable architecture for linking documents (HTML pages) regardless of where those documents are physically located, Linked Data uses the same underpinning technology stack to provide the same scalability for linking structured data, regardless of where that data is located.
Representing data in the Resource Description Framework (RDF) enables machines to interpret the data and how it relates to other data. Using RDF, we can build a web of data by describing data and relationships as triples, each of which consists of a subject, a predicate (verb), and an object.
You can think of a triple as like the structure of a simple sentence. For example, let's take the sentence "Tim Berners-Lee created HTML."
In this sentence, the subject is "Tim Berners-Lee," the object is "HTML," and the predicate (verb) is "created," which describes the relationship between the subject and the object. Remember that we cannot easily describe these typed relationships between entities with the web of documents.
With RDF, subjects are always URIs. Objects can be either URIs of related resources or simple literals, such as a string, date, or number; in the sentence "HTML was invented in 1990," for example, the object would be the numeric literal 1990. Predicates are also identified by URIs, which are collected in vocabularies.
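To make the subject-predicate-object structure concrete, here is a minimal sketch in plain Python that models the two example statements as tuples. The example.org URIs are illustrative stand-ins, not official identifiers.

```python
# A triple: (subject, predicate, object). Subjects and predicates are
# URIs; objects may be URIs of related resources or plain literals.
triples = [
    # "Tim Berners-Lee created HTML" -- the object is another resource
    ("http://example.org/TimBernersLee",
     "http://example.org/vocab/created",
     "http://example.org/HTML"),
    # "HTML was invented in 1990" -- the object is a numeric literal
    ("http://example.org/HTML",
     "http://example.org/vocab/inventedIn",
     1990),
]

# Everything asserted about HTML as a subject:
facts = [(p, o) for s, p, o in triples
         if s == "http://example.org/HTML"]
print(facts)  # [('http://example.org/vocab/inventedIn', 1990)]
```

Note that "HTML" appears once as an object and once as a subject; it is this reuse of URIs across triples that knits individual statements into a web of data.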
Different vocabularies are used to group predicates that describe types of relationships between data for a given domain. Vocabularies help you understand data and its properties more quickly. For example, OSLC (Open Services for Lifecycle Collaboration) has a vocabulary that describes properties of lifecycle resources and typical relationships between them. A set of common vocabularies has been established. Dublin Core and Friend of a Friend (FOAF) are two examples.
When publishing Linked Data, it is best practice to check whether your data can be represented by using terms from existing vocabularies.
Listing 1. Example of a triple serialized in N3/Turtle
@prefix dc: <http://purl.org/dc/elements/1.1/> .

<http://en.wikipedia.org/wiki/HTML>
    dc:title "HTML" ;
    dc:creator <http://www.w3.org/People/Berners-Lee/card> .
One key benefit of RDF is that it can represent any kind of data. With RDF, we never need to be concerned that we won't be able to represent some unforeseen data. In other words, traditional concerns about data models (as encountered with relational databases) aren't a worry to us.
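This flexibility can be sketched in a few lines of Python. Because a triple store is simply a growing set of statements, accommodating a previously unforeseen kind of data means adding triples, not migrating a schema (contrast this with adding a column to a relational table). All of the names below are made up for illustration.

```python
# A toy triple store: just a set of (subject, predicate, object) tuples.
store = set()

def add(s, p, o):
    store.add((s, p, o))

# Initial data: a requirement with a title.
add("ex:req1", "dcterms:title", "Requirement 1")

# Later, an unforeseen property arrives -- say, a safety rating.
# No schema change is needed; we simply assert another triple.
add("ex:req1", "ex:safetyRating", "ASIL-B")

print(sorted(store))
```

In a relational design, the new property would have forced a table alteration and changes to any code that loads the data; here, the shape of the store never changes.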
Linking Open Data
The best-known example of Linked Data is the Linking Open Data (LOD) project. It was created to identify data sets that exist in the public domain and to publish and link them, using the Linked Data principles, to create a publicly accessible web of data.
As of March 2012, this web of open data comprised more than 52 billion RDF triples. Still, these numbers are relatively small when we consider just how much data is estimated to be stored in all of the world's databases (over 3 petabytes). Although a lot of this data will be private, huge amounts could be published to the web of open data. As the web of open data grows, we are able to answer more and more questions that we previously could not, or at least not as easily.
Figure 1. Illustration of the Linked Open Data cloud
Part 2. Linked Lifecycle Data
The traditional approach to data integration
Traditionally, much of the data that we create and manage during the lifecycle of the systems and software projects that we are involved in is not open. Historically, this data has been available only to users of the tool that manages that data. In such closed environments, we're prevented from answering complex questions that rely on our understanding the relationship between data managed by different tools. Furthermore, the tools in such closed environments have no common recognition of the types of data and relationships that they are managing.
This problem is not new, and people have attempted to address it in several different ways in the past. Using a single repository to manage all of the data is unrealistic, because it implies a single-vendor tie-in and prevents both adoption of the best solutions and integration of data in existing repositories. Point-to-point integrations between multiple tools are brittle and prone to breaking when tool versions are updated. Adoption of universal metadata standards has been unsuccessful due to the conflicting priorities and motivations of multiple vendors; this approach also provides no flexibility for individual teams to customize.
A better way: Linked Lifecycle Data
By adopting a loosely coupled approach to integration based on Linked Data principles in systems and software development, we can truly make the most of the data that's being created, without the downsides of approaches that have failed in the past. The foundation of Linked Data, and what makes it work, is the URI: URIs ensure that data remains identifiable even outside of its original context.
Despite the more limited scope of the data (typically managed within a few tools on a private network), we're still able to derive huge benefits from applying Linked Data principles to it. By "opening" the data managed by all of these tools, we can create a well-defined system of tools so that users working within any of the tools in that system can answer complex questions about the projects that they are working on — questions that rely on the union of data held in disparate repositories.
Listing 2 is a Linked Data representation of some development lifecycle data. This set of triples (again in Turtle) shows a linked requirement, test case, and work item.
Listing 2. Linked Lifecycle Data example
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix oslc_cm: <http://open-services.net/ns/cm#> .
@prefix oslc_qm: <http://open-services.net/ns/qm#> .

<https://rqmtest.ibm.com:9443/rm/requirement.1>
    dcterms:title "Requirement 1" .

<https://rqmtest.ibm.com:9443/qm/testcase.1>
    dcterms:title "Test case 1" ;
    dcterms:description "The quick fox jumps over ... " ;
    oslc_qm:validatesRequirement <https://rqmtest.ibm.com:9443/rm/requirement.1> .

<https://rqmtest.ibm.com:9443/cm/defect:645>
    dcterms:title "Defect XYZ" ;
    oslc_cm:testedByTestCase <https://rqmtest.ibm.com:9443/qm/testcase.1> ;
    oslc_cm:implementsRequirement <https://rqmtest.ibm.com:9443/rm/requirement.1> .
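To show how such links can be traversed, here is a small sketch in plain Python. The Listing 2 triples are abbreviated to prefixed names purely for readability, and the lookup helper is a made-up illustration, not a real query API.

```python
# The Listing 2 relationships, abbreviated with prefixed-name strings.
triples = [
    ("rm:requirement.1", "dcterms:title", "Requirement 1"),
    ("qm:testcase.1", "dcterms:title", "Test case 1"),
    ("qm:testcase.1", "oslc_qm:validatesRequirement", "rm:requirement.1"),
    ("cm:defect.645", "oslc_cm:testedByTestCase", "qm:testcase.1"),
    ("cm:defect.645", "oslc_cm:implementsRequirement", "rm:requirement.1"),
]

def objects(subject, predicate):
    """All objects linked from `subject` via `predicate`."""
    return [o for s, p, o in triples if s == subject and p == predicate]

# "Which requirements does this test case validate?"
validated = objects("qm:testcase.1", "oslc_qm:validatesRequirement")
print(validated)  # ['rm:requirement.1']
```

Because the predicates come from shared OSLC vocabularies, any tool that understands those vocabularies can follow the same links, regardless of which tool originally created each artifact.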
Part 3. Get more from your data through Lifecycle Query
Understanding data and relationships between pieces of data is critical in working on and managing systems and software projects successfully.
During the course of systems and software development projects, team members create huge amounts of data, such as requirements, tests, design models, code, work items, and so on. Typically, very few of these artifacts exist in isolation. Instead, they are related to one another through links with a specific and well-understood meaning.
Team members and managers need to be able to quickly find and understand data and its relationships to effectively support a given role. That might be, for example, a test practitioner who needs to know which requirements are tested by tests that failed on their last execution or a manager who needs to know how the number of open defects on a given project is changing from week to week.
Problems arise because, traditionally, all of this information has been created and managed in silos — multiple different, often geographically disparate repositories. These repositories are typically provided by different vendors, are deployed on different platforms using different technologies, and expose their data in proprietary formats.
Relational database approaches
With business systems, traditional relational database approaches have been used to try to work around these issues. Periodic snapshots of operational data can be taken by extracting, transforming, and loading (ETL) data from multiple sources into a data warehouse that supports historical, trend-based reporting. Data marts or data warehouses can be created that support the pre-computation of metrics for faster execution of analytical queries.
With a data warehouse, we have a single source of truth and can efficiently run large and complex queries and reports that combine data created in multiple different tools. Data warehousing brings many benefits, but it has downsides, too: All of the business logic for mapping data to the correct relational database tables, as well as the definitions of keys for related tables that represent links, is contained in the ETL jobs. If we change the structure of our source data or load data that we haven't loaded previously, we are forced to change the data warehouse schemas and edit or write new ETL jobs.
Perhaps the single biggest problem, though, is that the data in the warehouse is most often stale. Typically, ETL jobs run nightly, which can mean that the data contained within reports run from the warehouse is up to 24 hours old. In many reporting scenarios this is acceptable, but in others, reporting against live data in near real time (5 to 15 minutes at most) is essential. One common example is a project manager who needs to create a report showing the work item backlog for a team meeting.
The Linked Data way
Linked Data is natively stored in repositories called triple stores. As the name suggests, they store RDF triples. Recall that RDF frees us from concerns about data models and lets us assume that we can represent any type of data that our systems and software engineers are creating. Now we have the opportunity to create one or more big triple stores that contain data from all of our systems and software development tools, as well as tools that manage related data, such as CAD drawings, bills of materials (BOMs), financial data, and so on. Such indices can be updated in near real time and can act as our web of data for visualization and reporting. They provide a live cache of important lifecycle data, thereby reflecting the truth of the underlying data sources that own and manage the data.
When working against such an index, we can expect to create and run the type of live, large, complex, cross-product, cross-domain queries that we previously could not (at least not within an acceptable window of time). Without an index, to execute the same query from live data, we would have to make many requests to several repositories that are, potentially, geographically scattered. Using a relational data warehouse offers the same advantages in producing complex, cross-product queries and reports, but the data is stale by an amount of time equal to the delta between report execution and the last ETL job.
Experience also shows that most products (data sources) are optimized for operational APIs and very seldom for query and reporting. As with using a data warehouse for querying when staleness of the data is not an important consideration, bringing the data into a reporting index (triple store) gives users access to an interface and language optimized for querying: SPARQL, the query language for RDF. The data sources themselves also benefit, because they have to service far fewer nonoperational requests.
It is these types of queries and reports, which combine data from multiple domains, that allow us to answer the types of questions that can give us real (and real-time) insight into our development projects. Linked Lifecycle Data with common vocabularies (such as those defined in OSLC) also allows us to answer questions without caring about the origin of the data. Questions such as "Which requirements are not covered by tests?" can now be answered without knowing or caring whether the requirements data originates from IBM® Rational® DOORS® or Rational Requirements Composer, or whether the test case data comes from Rational Quality Manager or somewhere else. We have a name for the ability to answer these complex, cross-domain questions about live data: Lifecycle Query.
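As a sketch of such a cross-domain question, "Which requirements are not covered by tests?" reduces to a set difference over the index. The example below uses plain Python over illustrative triples rather than SPARQL, and the tool prefixes and identifiers are made up; the point is that the query never needs to know which tool each requirement came from.

```python
triples = [
    # Requirements originating from two different (hypothetical) tools.
    ("doors:req.100", "rdf:type", "oslc_rm:Requirement"),
    ("rrc:req.200",   "rdf:type", "oslc_rm:Requirement"),
    # One test case validates only the first requirement.
    ("qm:testcase.1", "oslc_qm:validatesRequirement", "doors:req.100"),
]

# All requirements in the index, regardless of source tool.
all_reqs = {s for s, p, o in triples
            if p == "rdf:type" and o == "oslc_rm:Requirement"}

# Requirements that some test case validates.
covered = {o for s, p, o in triples
           if p == "oslc_qm:validatesRequirement"}

uncovered = all_reqs - covered
print(uncovered)  # {'rrc:req.200'}
```

In a real deployment, the same question would be a SPARQL query against the triple store, with the common OSLC vocabulary playing the role that the shared predicate strings play here.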
Rational Engineering Lifecycle Manager
IBM® Rational® Engineering Lifecycle Manager (RELM) is an extension to the Rational solution for systems and software engineering. It builds on a central index and uses Lifecycle Query to provide a set of capabilities to visualize, analyze, and organize engineering lifecycle data.
Figure 2. Rational Engineering Lifecycle Manager
You can use the visualization capabilities to set up custom views that are populated with near real-time data. These provide new perspectives that help your team make more accurate and timely engineering and business decisions or perform tasks more effectively and efficiently.
For example, an automotive safety engineer might want to see a view of cross-lifecycle data in the context of the structure of the ISO 26262 functional safety standard, or an aerospace engineer might want to see cross-lifecycle data overlaid on an illustration of the aircraft, with data shown over the relevant parts of the aircraft.
The difficulty of finding complete information from across the engineering lifecycle to aid analysis and decision-making should not be underestimated. Rational Engineering Lifecycle Manager supports lifecycle-wide free-text searches, as well as construction of more intricate queries to answer specific questions. So you can say "Show me all of the engineering artifacts that contain the phrase 'fuel pump'" or focus as closely as "How many requirements for the heart monitor are related to tests that failed on their last execution runs?"
Team-oriented impact analysis features help identify dependencies with downstream impacts that the team might otherwise miss. For example, if a supplier can no longer supply a particular electronic component, you need to determine the effect of switching to a similar component from another supplier. You need to understand the potential impact on the components that it interfaces with and the software that controls it. This would typically mean asking multiple teams to perform impact analyses in different tools and assembling the data manually, which is often a slow, laborious, and error-prone process.
With Rational Engineering Lifecycle Manager, you can find the component in your system design and generate interactive impact analysis views that show you all of the other systems and software design elements, requirements, tests, and work items associated with the component. You can work with the view to filter out any artifacts that are not relevant and assign relevant branches to the appropriate engineering disciplines to analyze.
Rational Engineering Lifecycle Manager also offers the potential for better reporting from this wealth of data. You can get real-time insight into the status of projects and the development of products and systems, and support readiness reviews at key milestones. And you can generate documents that contain data from across the lifecycle to support audits, prove compliance, and demonstrate delivery of the specifications that form the basis of contractual agreements across the supply chain.
The visualization and analysis capabilities of Rational Engineering Lifecycle Manager are made even more powerful by the ability to organize all of the indexed data to add essential context to it. You can reflect product or system structure that exists in the underlying data and use this in your visualization and analysis. This enables you to easily perform tasks, such as finding all of the requirements, tests, and design artifacts related to the entertainment unit of the Japanese variant of the 2014 model of a vehicle.
Conquer unprecedented complexity by making better use of your engineering data
For engineers, increasingly complex environments mean that day-to-day tasks can end up consuming an inordinate amount of time that could be better spent being creative and productive.
By building on an architectural platform that leverages Linked Data and Lifecycle Query, using Rational Engineering Lifecycle Manager, systems and software engineers get powerful visualization, analysis, and organization functions that help them better understand the large, complex data and relationship sets that they are working with. This understanding and visibility can greatly improve collaboration and make it easier for practitioners to follow processes and implement practices, plus make it easier to manage projects and programs effectively.
Tailored views can be created easily at any level of abstraction to better support the tasks of a user in a particular role. These perspectives and insights, along with the ability to more easily understand the impact of change across the engineering lifecycle, mean that you can make key engineering and business decisions more quickly and accurately. Generating documents, whether for audits or compliance or to cross contractual boundaries, becomes a much simpler and less error-prone task.
Ultimately, all of this can help organizations develop and deliver systems and software more quickly, with improved productivity, increased efficiency, and lower risk — with the added benefit of protecting investments in existing domain tools.
- Explore the Rational Engineering Lifecycle Manager introduction page and read the overview and the announcement in the Invisible Thread blog.
- Find out more about the Rational solution for systems and software engineering.
- Standards and technologies mentioned in this article:
- Explore the Rational software area on developerWorks for technical resources, best practices, and information about Rational collaborative and integrated solutions for software and systems delivery.
- Subscribe to the developerWorks weekly email newsletter, and choose the topics to follow.
- Stay current with developerWorks technical events and webcasts focused on a variety of IBM products and IT industry topics.
- Attend a free developerWorks Live! briefing to get up-to-speed quickly on IBM products and tools, as well as IT industry trends.
- Watch developerWorks on-demand demos, ranging from product installation and setup demos for beginners to advanced functionality for experienced developers.
Get products and technologies
- Download a free trial version of Rational software.
- Evaluate other IBM software in the way that suits you best: Download it for a trial, try it online, use it in a cloud environment, or spend a few hours in the SOA Sandbox learning how to implement service-oriented architecture efficiently.
- Join the Rational software forums to ask questions and participate in discussions.
- Ask and answer questions and increase your expertise when you get involved in the Rational forums, cafés, and wikis.
- Join the Rational community to share your Rational software expertise and get connected with your peers.
- Rate or review Rational software. It's quick and easy.