The core Semantic Web technology is RDF, a W3C standard that reduces all data to three-part statements known as triples. If your data fits into the triple data model and is stored in one of the specialized databases known as triplestores, the advantages of Semantic Web technology are obvious. This doesn't mean, though, that the technology has nothing to offer you if your data is in more traditional formats such as relational databases and spreadsheets. Open source and commercial tools are available to convert data in these formats to triples, giving you an easy way to combine data from multiple sources using different formats. Temporary conversion to triples is a great way to do ad-hoc data integration if you want to cross-reference between disparate sources or enhance data from one source with data from another.
As more data becomes publicly available as Linked Data (see Resources), you can use it to enhance locally stored data to make it richer and more interesting. As an example, this article shows how you can augment an Excel® spreadsheet of (fake) analyst buy/sell/hold recommendations with publicly available data about the listed companies to make an attractive, informative report. The input, output, and scripts for this application, all of which use free software, are available for download.
The parts of a triple are officially known as the subject, predicate, and object. If you're from a more traditional database background, you can think of them as a resource identifier, an attribute name, and an attribute value. For example, if a relational database or spreadsheet says that employee 94321 has a hire-date value of 2007-10-14, it would be easy to express this as a triple.
A triple's subject and predicate must be expressed as URIs to make them completely unambiguous, and standard schemas and best practices for doing this with popular domains are gaining maturity. The extra context provided by a unique identifier means that schemas to specify data structures in a collection of triples—although useful for adding metadata that can enable inferencing and constraint checking—are optional. When you consider that the greatest difficulty in combining multiple sets of relational or XML data is lining up the corresponding parts of their schemas, you can see that the lack of a need for RDF schemas makes combining multiple sets of RDF data much simpler—often as simple as concatenating files.
To implement application logic by extracting subsets of data and sorting and rearranging this data, relational databases have SQL, and XML has XSLT and XQuery. RDF has the companion W3C SPARQL standard for querying triples, which is especially handy after you combine a few sets of triples, as you'll see in this article.
Simple spreadsheet in, nice report out
Figure 1 shows the first few rows of the analystRecs.xls Excel file (see Download):
Figure 1. First few lines of Excel spreadsheet used as input data
The data in each row of Figure 1 consists of:
- A fake analyst name
- A company's ticker symbol
- The company name
- A (random) buy/sell/hold recommendation for that company's stock
- The date and time of the recommendation
- An analysis of the company consisting mostly of "lorem ipsum" text traditionally used by publishers to test the appearance of a layout
Figure 2 shows the augmented version of the third company listed on the spreadsheet (IBM) as it appears in the final report:
Figure 2. Final report's enhanced version of data about IBM
In addition to the recommendation, analyst name, company analysis, and recommendation date from the Excel file, Figure 2 includes the stock price that was current when the report-generation script was run, some financial data from last year, and a description of the company. The stock quote data comes the free finance.yahoo.com web service, which returns data as CSV, and the 2009 numbers and company descriptions come from Wikipedia.
Figure 3 shows the general architecture of the application that created the report from these inputs:
Figure 3. Architecture diagram for application
As Figure 3 shows:
- Data from the spreadsheet, the Yahoo! web service, and Wikipedia are converted to RDF.
- The RDF data is sorted, cross-referenced, and converted to XML by a SPARQL engine.
- An XSLT engine converts the XML to HTML to generate the final report.
I used free software to implement this entire architecture, with a 10-line (not counting white space and comments) batch file named build.bat driving the whole process. You could implement the same architecture with commercial tools such as TopQuadrant's TopBraid Composer, which automates several of the development steps and requires less coding. More important than the specific implementation, however, is that you can now choose among an increasing array of tools to implement architectures for flexible, ad-hoc data integration using Semantic Web data standards.
I chose to use Python for converting the Excel data to RDF. The free Lingfo Python library does the difficult job of reading binary Excel files, so with little effort I wrote a short Python script that produces an RDF triple for each cell read in.
The finance.yahoo.com website offers a web service that returns a comma-separated list of recent stock ticker data when you give it the right URL. For example, the following URL retrieves data about IBM, Nokia, and Honda:
http://download.finance.yahoo.com/d/quotes.csv?f=sl1d1t1ohgv&e=.csv&s=IBM,NOK,HMC |
The returned data has a line for each requested company, with the data fields specified by the f= parameter in the URL requesting the data.
Because the web service does the work of retrieving the stock ticker data and storing it in a simple data structure, I only needed to convert that data to RDF. Perl's built-in split() function makes it easy to split apart delimited text, so I wrote a short Perl script named yahooCSV2RDF.pl to do that. (I could have done the CSV-to-RDF conversion using Python instead, but I've written so many similar scripts in Perl before that I knew that I could crank out this one quickly.) A big part of this script's logic consists of just combining the date and time fields, such as 7/22/2010 and 4:00pm, to a more ISO 8601-compliant format like 2010-07-22T16:00:00, which is easier to sort and reuse in other systems.
Conversion of the Wikipedia data required no extra work on my part. The DBpedia project has created and stored over a billion triples from the fielded "infobox" data you often see in a gray rectangle at the right of a Wikipedia page. For example, if you look at the Wikipedia page for Nokia (http://en.wikipedia.org/wiki/Nokia), you'll see the infobox; if you look at the DBpedia page for Nokia (http://dbpedia.org/page/Nokia), you'll see Nokia's infobox information and more, ready to be queried with the SPARQL query language.
DBpedia's website includes a page where you can enter SPARQL queries about the data (http://dbpedia.org/snorql/). Entering the following query there requests all the DBpedia triples about Vodafone:
CONSTRUCT { <http://dbpedia.org/resource/Vodafone> ?p ?o }
WHERE { <http://dbpedia.org/resource/Vodafone> ?p ?o }
|
If you store this query in a URI (after doing the appropriate character escaping, which is a single function call in most programming languages), you can use that URI to retrieve the Vodafone triples with any program that can send a URI and retrieve the results. The open source cURL utility is one such program. For example, if you enter the following on a Windows® or Linux® command line, cURL downloads a copy of Google's home page and stores it in the googleHomepage.html file:
curl http://www.google.com > googleHomepage.html |
You can call cURL from an application driven by a batch file or shell script. To retrieve company data from DBpedia, build.bat calls another batch file named getDbpediaData.bat, in which each line calls cURL to retrieve data for a particular company. The URL provided as a parameter to each of these calls contains an escaped version of a SPARQL CONSTRUCT query similar to the one above.
Combining and querying the data
After gathering RDF versions of all the data, I use another SPARQL query, in Listing 1, to extract the data for the final report:
Listing 1. SPARQL query to extract data for the final report
# PickDataForReport.spq: select data from the combination of RDF sources
# used to create the fake analyst report. Bob DuCharme 2010 no warrantee
# expressed or implied.
PREFIX fn: <http://www.w3.org/2005/xpath-functions#>
PREFIX xs: <http://www.w3.org/2001/XMLSchema#>
PREFIX ar: <http://www.snee.com/ns/analystRatings#>
PREFIX sq: <http://www.rdfdata.org/2009/12/stockquotes#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?tickerSymbol ?coName ?analyst ?description ?recommendation
?recDateTime ?lastPrice ?quoteDateTime ?dayHigh ?dayLow ?openingPrice
?volume ?revenue ?netIncome ?abstract
WHERE {
?analystData ar:analyst ?analyst ;
ar:company ?coName;
ar:description ?description ;
ar:tickersymbol ?tickerSymbol ;
ar:recommendation ?recommendation ;
ar:date-time ?recDateTime .
?quoteData sq:tickerSymbol ?tickerSymbol ;
sq:lastPrice ?lastPrice ;
sq:dateTime ?quoteDateTime ;
sq:dayHigh ?dayHigh ;
sq:dayLow ?dayLow ;
sq:lastPrice ?lastPrice ;
sq:openingPrice ?openingPrice ;
sq:volume ?volume .
# Next line is why we added language tags to coName: so we
# could compare it to ?dbURI rdfs:label properly.
?dbpURI rdfs:label ?coName .
?dbpURI <http://dbpedia.org/ontology/revenue> ?revenue .
OPTIONAL {
?dbpURI <http://dbpedia.org/ontology/netIncome> ?netIncome .
} .
OPTIONAL {
?dbpURI <http://dbpedia.org/ontology/abstract> ?abstract .
FILTER (lang(?abstract) = "en") .
} .
}
|
This SPARQL query includes a bit of logic to identify which data from the various files goes together, but that's fairly simple. If the query looks a little long, that's only because it lists many fields to extract from the data.
To run this local SPARQL query, I use the arq command-line utility, which is part of the open source Java™-based Jena framework (see Resources). The build.bat line that calls arq provides three kinds of information with each call:
- The source (or in this case, sources) of the input data
- The name of the file with the SPARQL query to run
- An indication that I want the results in SPARQL Query Results XML Format
The result is saved to a file called reportData.xml.
The command uses the --data switch to tell arq which data to use as input. Although the architecture diagram in Figure 3 shows data from the three sources getting combined, I needed no special utility to combine them. The batch file just adds four of these --data parameters to the command line that calls arq (the fourth being for a file of company names created by the application for easier cross-referencing). Listing 2 shows this line of the batch file, split here for space reasons, as the Java program is called with the various parameters and the results are redirected to the reportData.xml file:
Listing 2.
arq command specifying data sources, location of SPARQL query, and results formatjava -cp %CP% arq.arq --results=XML --query=PickDataForReport.spq --data=analystRecs.rdf --data=coNamesWithLangTags.n3 --data=quoteData.rdf --data=dbpedia.n3 > reportData.xml |
What is SPARQL Query Results XML Format? Most SPARQL discussions focus on the query language and the protocol that provide the Q and P in its name. The Query Results XML format is a separate specification that defines an XML structure for data returned by a SPARQL query processor (see Resources). This format is straightforward enough to be processed by a relatively simple XSLT stylesheet, which brings me to the last step of the batch file: creation of the final HTML report.
The most difficult part of the project—pulling data from three sources and
combining it into a format that typical tools can use—is now finished, and it
wasn't so difficult. All that's left is to convert the XML generated by the last step
into HTML, which the batch file does with the XSLT stylesheet named
SPARQLXML2HTML.xsl. The XML returned by the query engine has a regular, table-like
structure, which makes it pretty easy for the stylesheet to find what it needs when it
creates the analystReport.html report. (I used the xsltproc
XSLT processor included with the open source libxslt C
library to apply the stylesheet. To use Xalan or Saxon instead, you only need to
change the order of the parameters on the command line.)
A query result conforming to the SPARQL Query Results XML Format specification is one big sparql element with header information in a head element and the actual results in a results element. The results element holds a result element for each row in the result, and each result element has a binding element for each value in that row.
Listing 3 shows the XSLT stylesheet template from SPARQLXML2HTML.xsl to handle the result elements:
Listing 3. XSLT stylesheet template for handling
result elements
<xsl:template match="s:result">
<tr><td colspan="3"><hr/></td></tr>
<tr>
<td><xsl:apply-templates select="s:binding[@name='coName']"/></td>
<td colspan="3"><xsl:apply-templates select="s:binding[@name='tickerSymbol']"/>
$<xsl:apply-templates select="s:binding[@name='lastPrice']"/>
at <xsl:apply-templates select="s:binding[@name='quoteDateTime']"/>
</td> </tr>
<tr>
<td><b>2009 figures:</b></td>
<td>
<b>Revenue </b>
<xsl:apply-templates select="s:binding[@name='revenue']"/>
</td>
<td>
<b>Net Income </b>
<xsl:apply-templates select="s:binding[@name='netIncome']"/>
</td>
</tr>
<tr>
<td colspan="3">
<xsl:apply-templates select="s:binding[@name='abstract']"/>
</td>
</tr>
<tr class="rec">
<td>
<b>Analyst </b>
<xsl:apply-templates select="s:binding[@name='analyst']"/>
</td>
<td>
<b>Recommendation </b>
<xsl:apply-templates select="s:binding[@name='recommendation']"/>
</td>
<td>
<b>Date </b>
<xsl:apply-templates select="s:binding[@name='recDateTime']"/>
</td>
</tr>
<tr>
<td colspan="3">
<xsl:apply-templates select="s:binding[@name='description']"/>
</td>
</tr>
</xsl:template>
|
The template in Listing 3 creates a single HTML table row in which each cell holds a piece of returned information or a label for that information.
The build.bat batch file ties all this together in 10 steps:
-
Set the Java
CLASSPATHto include thearqutility. This line assumes that theARQROOTenvironment variable has been set to the home directory of the utility. (See Resources for documentation onarq.) -
Using the xls2rdf.py Python script, read the data from the analystRecs.xls spreadsheet and save it as RDF in the analystRecs.rdf file.
-
With
arqand a SPARQL query stored in the augmentAnalystRecs.spq file, read the company names from analystRecs.rdf and create a file called coNamesWithLangTags.n3 that includes a language tag ofenwith each company name. This makes it easier to cross-reference the data with DBpedia data. -
Apply the MakeGetTickerInfo.xsl XSLT stylesheet to analystRecs.rdf, outputting a batch file called getTickerInfo.bat that retrieves stock quote data from the Yahoo! service.
-
Run getTickerInfo.bat, which retrieves the Yahoo! data as a CSV file called quoteData.csv.
-
Run the yahooCSV2RDF.pl Perl script to convert quoteData.csv to the quoteData.rdf RDF file.
-
Apply the MakeGetDbpediaData.xsl XSLT stylesheet to analystRecs.rdf to create the getDbpediaData.bat batch file, which retrieves company data from DBpedia. Each line of this batch file calls cURL, passing it a parameter of the URI that has an embedded SPARQL query to retrieve the data for one company.
-
Run the getDbpediaData.bat batch file and put the output in the dbpedia.n3 file.
-
Run the SPARQL query in PickDataForReport.spq against the four data files created earlier (analystRecs.rdf, quoteData.rdf, coNamesWithLangTags.n3, and dbpedia.n3), and put an XML version of the result in reportData.xml.
-
Apply the SPARQLXML2HTML.xsl XSLT stylesheet to the reportData.xml file to create the final report file, analystReport.html.
Other tools might implement the same architecture more simply by consolidating and automating some of these steps, but the basic work to do would remain the same.
Combining data from three sources to create a report on nine companies is a pretty small-scale use of the techniques I've demonstrated. The open source D2RQ interface lets you treat relational data as a collection of triples so that this data can be incorporated into such an application. For example, picture a financial institution that stores data about its customers' stock portfolios in a relational database system such as MySQL, DB2, or Oracle. A small addition to the application described in this article can make it generate customized reports similar to analystReport.html for each customer, listing only that customer's holdings, including—by multiplying the stock price from Yahoo! by number of shares owned—their current value.
This entire application has no need for the RDF Schema or OWL ontology languages. This kind of schema metadata is also stored as triples, though, so if you did want to incorporate it, it's one more class of input that can enrich the data you're combining to create your report. Storing rules, equivalencies, and other relationship metadata using these schema standards instead of storing them in compiled code adds the advantage of easier maintenance and portability of your business logic.
The growing world of Linked Data sources provides more and more data for you to incorporate into your applications, and Semantic Web standards bring new kinds of flexibility to mix and match this data with other data sources. Many enterprises are also finding tools that conform to these standards valuable for combining data from multiple silos within their own organizations, before adding in any publicly available data. New tools are available to make all of this easier, but as you've seen here, even some familiar, trusted older tools and techniques can play an important role in bringing this data together in useful new ways.
| Description | Name | Size | Download method |
|---|---|---|---|
| Batch file, scripts that it calls, and sample data | comboPubPrivData.zip | 86KB | HTTP |
Information about download methods
Learn
- SPARQL Query Results XML Format: Learn more about this format from the W3C specification.
- Downloading Yahoo data: See all the options for the finance.yahoo.com stock quote service.
- About the Linking Open Data dataset cloud: Explore the Linked Data cloud of public linked data sources.
arq: See documentation for thearqcommand-line applications.- Search RDF data with SPARQL (Philip McCarthy, developerWorks, May 2005): Take a tour of SPARQL's features and learn how to use SPARQL queries from Java applications with the Jena Semantic Web Toolkit.
- Build Wikipedia query forms with semantic technology (Bob DuCharme, developerWorks, July 2009): Find out how your applications can include Linked Data by using public SPARQL endpoints.
- XML area on developerWorks: Get the resources you need to advance your skills in the XML arena.
- My developerWorks: Personalize your developerWorks experience.
- IBM XML certification: Find out how you can become an IBM-Certified Developer in XML and related technologies.
- XML technical library: See the developerWorks XML Zone for a wide range of technical articles and tips, tutorials, standards, and IBM Redbooks. Also, read more XML tips.
- developerWorks technical events and webcasts: Stay current with technology in these sessions.
- developerWorks on Twitter: Join today to follow developerWorks tweets.
- developerWorks podcasts: Listen to interesting interviews and discussions for software developers.
- developerWorks on-demand demos: Watch demos ranging from product installation and setup for beginners to advanced functionality for experienced developers.
Get products and technologies
- Jena - A Semantic Web Framework for Java: Get
arqand more Semantic Web development tools from the Jena project. -
cURL: Download the cURL utility to retrieve web
resources from the command line. cURL is installed by default on most UNIX® and Linux systems; a Windows version is available for download.
- Lingfo: Get the Lingfo Python library for reading Excel files.
- IBM product evaluation versions: Download or explore the online trials in the IBM SOA Sandbox and get your hands on application development tools and middleware products from DB2®, Lotus®, Rational®, Tivoli®, and WebSphere®.
Discuss
- XML zone discussion forums: Participate in any of several XML-related discussions.
- developerWorks blogs: Check out these blogs and get involved.

Bob DuCharme is a solutions architect at TopQuadrant, a provider of software for modeling, developing, and deploying Semantic Web applications. He's written four books and nearly one hundred online and print articles about information technology without using the word "functionality" in any of them. See http://www.snee.com/bobdc.blog for his weblog.



