Integrate disparate data sources with Semantic Web technology

Combine data in multiple formats into a single report, using free tools to automate their conversion to RDF

Different sets of RDF data are much easier to combine than different sets of data in other common formats. You can easily convert disparate non-RDF data sets to RDF and then combine them to create new content. In this article, learn how to integrate spreadsheet data, CSV data from a web service, and fielded data from a website into a single report.


Bob DuCharme, Solutions Architect, TopQuadrant

Bob DuCharme is a solutions architect at TopQuadrant, a provider of software for modeling, developing, and deploying Semantic Web applications. He's written four books and nearly one hundred online and print articles about information technology without using the word "functionality" in any of them. See http://www.snee.com/bobdc.blog for his weblog.



28 September 2010


The core Semantic Web technology is RDF, a W3C standard that reduces all data to three-part statements known as triples. If your data fits into the triple data model and is stored in one of the specialized databases known as triplestores, the advantages of Semantic Web technology are obvious. This doesn't mean, though, that the technology has nothing to offer you if your data is in more traditional formats such as relational databases and spreadsheets. Open source and commercial tools are available to convert data in these formats to triples, giving you an easy way to combine data from multiple sources using different formats. Temporary conversion to triples is a great way to do ad-hoc data integration if you want to cross-reference between disparate sources or enhance data from one source with data from another.

Frequently used acronyms

  • CSV: Comma-separated value
  • HTML: HyperText Markup Language
  • RDF: Resource Description Framework
  • SPARQL: SPARQL Protocol and RDF Query Language
  • URI: Uniform Resource Identifier
  • URL: Uniform Resource Locator
  • W3C: World Wide Web Consortium
  • XML: Extensible Markup Language
  • XSLT: Extensible Stylesheet Language Transformations

As more data becomes publicly available as Linked Data (see Resources), you can use it to enhance locally stored data to make it richer and more interesting. As an example, this article shows how you can augment an Excel® spreadsheet of (fake) analyst buy/sell/hold recommendations with publicly available data about the listed companies to make an attractive, informative report. The input, output, and scripts for this application, all of which use free software, are available for download.

Working with triples

The parts of a triple are officially known as the subject, predicate, and object. If you're from a more traditional database background, you can think of them as a resource identifier, an attribute name, and an attribute value. For example, if a relational database or spreadsheet says that employee 94321 has a hire-date value of 2007-10-14, it would be easy to express this as a triple.
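
To make this concrete, here's a minimal sketch using the Python rdflib library (not part of this article's download; the employee URI, namespace, and hire-date property name are made up for illustration) that builds that triple and prints it in N-Triples syntax:

from rdflib import Graph, Literal, Namespace, URIRef, XSD

EX = Namespace("http://example.com/hr#")              # hypothetical vocabulary
g = Graph()
g.add((URIRef("http://example.com/employee/94321"),   # subject: the employee
       EX.hireDate,                                    # predicate: the property
       Literal("2007-10-14", datatype=XSD.date)))      # object: the value
print(g.serialize(format="nt"))                        # one N-Triples statement

The single line of output is the triple itself, ready to be concatenated with triples from any other source.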

A triple's subject and predicate must be expressed as URIs to make them completely unambiguous, and standard schemas and best practices for doing this with popular domains are gaining maturity. The extra context provided by a unique identifier means that schemas to specify data structures in a collection of triples—although useful for adding metadata that can enable inferencing and constraint checking—are optional. When you consider that the greatest difficulty in combining multiple sets of relational or XML data is lining up the corresponding parts of their schemas, you can see that the lack of a need for RDF schemas makes combining multiple sets of RDF data much simpler—often as simple as concatenating files.

To implement application logic by extracting subsets of data and sorting and rearranging this data, relational databases have SQL, and XML has XSLT and XQuery. RDF has the companion W3C SPARQL standard for querying triples, which is especially handy after you combine a few sets of triples, as you'll see in this article.

Simple spreadsheet in, nice report out

Figure 1 shows the first few rows of the analystRecs.xls Excel file (see Download):

Figure 1. First few lines of Excel spreadsheet used as input data
Screen capture of first three rows of data in analystRecs.xls.

The data in each row of Figure 1 consists of:

  • A fake analyst name
  • A company's ticker symbol
  • The company name
  • A (random) buy/sell/hold recommendation for that company's stock
  • The date and time of the recommendation
  • An analysis of the company consisting mostly of "lorem ipsum" text traditionally used by publishers to test the appearance of a layout

Figure 2 shows the augmented version of the third company listed on the spreadsheet (IBM) as it appears in the final report:

Figure 2. Final report's enhanced version of data about IBM
Screen capture of the enhanced IBM data as it appears in the final report

In addition to the recommendation, analyst name, company analysis, and recommendation date from the Excel file, Figure 2 includes the stock price that was current when the report-generation script was run, some financial data from last year, and a description of the company. The stock quote data comes from the free finance.yahoo.com web service, which returns data as CSV, and the 2009 numbers and company descriptions come from Wikipedia.

Figure 3 shows the general architecture of the application that created the report from these inputs:

Figure 3. Architecture diagram for application
Diagram of application architecture

As Figure 3 shows:

  1. Data from the spreadsheet, the Yahoo! web service, and Wikipedia are converted to RDF.
  2. The RDF data is sorted, cross-referenced, and converted to XML by a SPARQL engine.
  3. An XSLT engine converts the XML to HTML to generate the final report.

I used free software to implement this entire architecture, with a 10-line (not counting white space and comments) batch file named build.bat driving the whole process. You could implement the same architecture with commercial tools such as TopQuadrant's TopBraid Composer, which automates several of the development steps and requires less coding. More important than the specific implementation, however, is that you can now choose among an increasing array of tools to implement architectures for flexible, ad-hoc data integration using Semantic Web data standards.


Converting the data

Excel to RDF the hard way

Many free utilities and features of commercial products can convert spreadsheets to RDF. The existing free utilities, however, tend to be complex because they try to handle a variety of spreadsheet layouts, and poorly documented because they're often academic projects, which makes them difficult to use. Also, to most of these tools, "reading a spreadsheet" means reading a CSV file exported from a spreadsheet program, not an actual binary spreadsheet file.

I chose to use Python for converting the Excel data to RDF. The free Lingfo Python library does the difficult job of reading binary Excel files, so with little effort I wrote a short Python script that produces an RDF triple for each cell read in.
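
The following is a sketch of that idea rather than the actual xls2rdf.py script: it assumes that the spreadsheet's first row supplies the column names, it writes N-Triples instead of the RDF/XML that the real script produces, and it makes up the row URIs, but the core loop, one triple per cell, is the same (the import assumes Lingfo's xlrd module):

import xlrd   # the reader module distributed by the Lingfo project

AR = "http://www.snee.com/ns/analystRatings#"    # the ar: namespace used in Listing 1
book = xlrd.open_workbook("analystRecs.xls")
sheet = book.sheet_by_index(0)
headers = [str(h) for h in sheet.row_values(0)]  # column names become predicate names

with open("analystRecs.nt", "w") as out:
    for rownum in range(1, sheet.nrows):
        subject = "<http://www.snee.com/data/analystRecs#row%d>" % rownum
        for colnum, header in enumerate(headers):
            value = sheet.cell_value(rownum, colnum)
            out.write('%s <%s%s> "%s" .\n' % (subject, AR, header, value))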

The finance.yahoo.com website offers a web service that returns a comma-separated list of recent stock ticker data when you give it the right URL. For example, the following URL retrieves data about IBM, Nokia, and Honda:

http://download.finance.yahoo.com/d/quotes.csv?f=sl1d1t1ohgv&e=.csv&s=IBM,NOK,HMC

The returned data has a line for each requested company, with the data fields specified by the f= parameter in the URL requesting the data.

Because the web service does the work of retrieving the stock ticker data and storing it in a simple data structure, I only needed to convert that data to RDF. Perl's built-in split() function makes it easy to split apart delimited text, so I wrote a short Perl script named yahooCSV2RDF.pl to do that. (I could have done the CSV-to-RDF conversion in Python instead, but I've written so many similar scripts in Perl that I knew I could crank this one out quickly.) A big part of this script's logic consists of combining the date and time fields, such as 7/22/2010 and 4:00pm, into a more ISO 8601-compliant format like 2010-07-22T16:00:00, which is easier to sort and reuse in other systems.
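
The actual script is Perl, but the same idea takes only a few lines in any scripting language. Here's a rough Python equivalent; it assumes that the f=sl1d1t1ohgv flags shown earlier return the symbol, last price, date, time, opening price, day's high, day's low, and volume in that order, it writes N-Triples rather than the RDF/XML that the real script emits, it reuses the sq: property names that the query in Listing 1 expects, and its subject URIs are made up:

import csv
from datetime import datetime

SQ = "http://www.rdfdata.org/2009/12/stockquotes#"

with open("quoteData.csv") as infile, open("quoteData.nt", "w") as out:
    for sym, last, date, time, opening, high, low, volume in csv.reader(infile):
        # Normalize "7/22/2010" plus "4:00pm" into "2010-07-22T16:00:00".
        when = datetime.strptime(date + " " + time, "%m/%d/%Y %I:%M%p")
        subj = "<http://www.rdfdata.org/2009/12/stockquotes/data#%s>" % sym
        fields = {"tickerSymbol": sym, "lastPrice": last,
                  "dateTime": when.strftime("%Y-%m-%dT%H:%M:%S"),
                  "openingPrice": opening, "dayHigh": high,
                  "dayLow": low, "volume": volume}
        for name, value in fields.items():
            out.write('%s <%s%s> "%s" .\n' % (subj, SQ, name, value))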

Conversion of the Wikipedia data required no extra work on my part. The DBpedia project has created and stored over a billion triples from the fielded "infobox" data you often see in a gray rectangle at the right of a Wikipedia page. For example, if you look at the Wikipedia page for Nokia (http://en.wikipedia.org/wiki/Nokia), you'll see the infobox; if you look at the DBpedia page for Nokia (http://dbpedia.org/page/Nokia), you'll see Nokia's infobox information and more, ready to be queried with the SPARQL query language.

DBpedia's website includes a page where you can enter SPARQL queries about the data (http://dbpedia.org/snorql/). Entering the following query there requests all the DBpedia triples about Vodafone:

CONSTRUCT { <http://dbpedia.org/resource/Vodafone> ?p ?o } 
WHERE { <http://dbpedia.org/resource/Vodafone> ?p ?o }

If you store this query in a URI (after doing the appropriate character escaping, which is a single function call in most programming languages), you can use that URI to retrieve the Vodafone triples with any program that can send a URI and retrieve the results. The open source cURL utility is one such program. For example, if you enter the following on a Windows® or Linux® command line, cURL downloads a copy of Google's home page and stores it in the googleHomepage.html file:

curl http://www.google.com > googleHomepage.html

You can call cURL from an application driven by a batch file or shell script. To retrieve company data from DBpedia, build.bat calls another batch file named getDbpediaData.bat, in which each line calls cURL to retrieve data for a particular company. The URL provided as a parameter to each of these calls contains an escaped version of a SPARQL CONSTRUCT query similar to the one above.
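
If you prefer a script to a batch file full of cURL calls, the same retrieval is just as short in Python. This sketch assumes DBpedia's public SPARQL endpoint at http://dbpedia.org/sparql and that the endpoint honors the Accept header when choosing a result serialization; the generated batch file does the equivalent with an escaped URL for each company:

from urllib.parse import quote
from urllib.request import Request, urlopen

company = "Vodafone"
query = ("CONSTRUCT { <http://dbpedia.org/resource/%s> ?p ?o } "
         "WHERE { <http://dbpedia.org/resource/%s> ?p ?o }") % (company, company)
url = "http://dbpedia.org/sparql?query=" + quote(query)

req = Request(url, headers={"Accept": "text/turtle"})   # ask for Turtle/N3 output
with urlopen(req) as response, open("dbpedia.n3", "ab") as out:
    out.write(response.read())                          # append this company's triples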


Combining and querying the data

After gathering RDF versions of all the data, I use another SPARQL query, in Listing 1, to extract the data for the final report:

Listing 1. SPARQL query to extract data for the final report
# PickDataForReport.spq: select data from the combination of RDF sources
# used to create the fake analyst report. Bob DuCharme 2010; no warranty
# expressed or implied.
PREFIX fn: <http://www.w3.org/2005/xpath-functions#>
PREFIX xs: <http://www.w3.org/2001/XMLSchema#>
PREFIX ar: <http://www.snee.com/ns/analystRatings#>
PREFIX sq: <http://www.rdfdata.org/2009/12/stockquotes#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT ?tickerSymbol ?coName ?analyst ?description ?recommendation 
  ?recDateTime ?lastPrice ?quoteDateTime ?dayHigh ?dayLow ?openingPrice 
  ?volume ?revenue ?netIncome ?abstract
WHERE {
    ?analystData ar:analyst ?analyst ;
                 ar:company ?coName;
                 ar:description ?description ;
                 ar:tickersymbol ?tickerSymbol ;
                 ar:recommendation ?recommendation ;
                 ar:date-time ?recDateTime .

    ?quoteData sq:tickerSymbol ?tickerSymbol ;
               sq:lastPrice ?lastPrice ;
               sq:dateTime ?quoteDateTime ;
               sq:dayHigh ?dayHigh ;
               sq:dayLow ?dayLow ;
               sq:openingPrice ?openingPrice ;
               sq:volume ?volume .

    # Next line is why we added language tags to coName: so we
    # could compare it to ?dbpURI's rdfs:label value properly.
    ?dbpURI rdfs:label ?coName . 
    ?dbpURI <http://dbpedia.org/ontology/revenue> ?revenue .
    OPTIONAL {
        ?dbpURI <http://dbpedia.org/ontology/netIncome> ?netIncome .
    } .
    OPTIONAL {
        ?dbpURI <http://dbpedia.org/ontology/abstract> ?abstract .
        FILTER (lang(?abstract) = "en") .
    } .
}

This SPARQL query includes a bit of logic to identify which data from the various files goes together, but that's fairly simple. If the query looks a little long, that's only because it lists many fields to extract from the data.

To run this local SPARQL query, I use the arq command-line utility, which is part of the open source Java™-based Jena framework (see Resources). The build.bat line that calls arq provides three kinds of information with each call:

  • The source (or in this case, sources) of the input data
  • The name of the file with the SPARQL query to run
  • An indication that I want the results in SPARQL Query Results XML Format

The result is saved to a file called reportData.xml.

The command uses the --data switch to tell arq which data to use as input. Although the architecture diagram in Figure 3 shows data from the three sources getting combined, I needed no special utility to combine them. The batch file just adds four of these --data parameters to the command line that calls arq (the fourth being for a file of company names created by the application for easier cross-referencing). Listing 2 shows this line of the batch file (split here for space reasons), which calls the Java program with the various parameters and redirects the results to the reportData.xml file:

Listing 2. arq command specifying data sources, location of SPARQL query, and results format
java -cp %CP% arq.arq --results=XML --query=PickDataForReport.spq 
  --data=analystRecs.rdf --data=coNamesWithLangTags.n3 --data=quoteData.rdf 
  --data=dbpedia.n3 > reportData.xml

What is SPARQL Query Results XML Format? Most SPARQL discussions focus on the query language and the protocol that provide the Q and P in its name. The Query Results XML format is a separate specification that defines an XML structure for data returned by a SPARQL query processor (see Resources). This format is straightforward enough to be processed by a relatively simple XSLT stylesheet, which brings me to the last step of the batch file: creation of the final HTML report.


Generating the report

XML's versatility

XML is what flows through the pipelines of modern electronic publishing systems, and you can do a lot more with this project's XML than just convert it to HTML. You can write XSLT stylesheets to create PDF versions using the W3C XSL-FO standard; you can create epub electronic books suitable for reading on Sony ebook readers, iPhones, and iPads; or, you can store the XML in an XML-based database to query later with XQuery.

The most difficult part of the project—pulling data from three sources and combining it into a format that typical tools can use—is now finished, and it wasn't so difficult. All that's left is to convert the XML generated by the last step into HTML, which the batch file does with the XSLT stylesheet named SPARQLXML2HTML.xsl. The XML returned by the query engine has a regular, table-like structure, which makes it pretty easy for the stylesheet to find what it needs when it creates the analystReport.html report. (I used the xsltproc XSLT processor included with the open source libxslt C library to apply the stylesheet. To use Xalan or Saxon instead, you only need to change the order of the parameters on the command line.)

A query result conforming to the SPARQL Query Results XML Format specification is one big sparql element with header information in a head element and the actual results in a results element. The results element holds a result element for each row in the result, and each result element has a binding element for each value in that row.
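
If you'd like to poke at reportData.xml from a script before writing any XSLT, the structure just described is easy to walk with Python's standard ElementTree module. This minimal sketch (not part of the article's toolchain) prints a few values from each result row:

import xml.etree.ElementTree as ET

NS = {"s": "http://www.w3.org/2005/sparql-results#"}   # the format's namespace
tree = ET.parse("reportData.xml")

for result in tree.findall(".//s:results/s:result", NS):
    # Each binding element wraps a uri or literal element holding one value.
    row = {b.get("name"): "".join(b.itertext()).strip()
           for b in result.findall("s:binding", NS)}
    print(row.get("coName"), row.get("tickerSymbol"), row.get("lastPrice"))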

Listing 3 shows the XSLT stylesheet template from SPARQLXML2HTML.xsl to handle the result elements:

Listing 3. XSLT stylesheet template for handling result elements
<xsl:template match="s:result">

  <tr><td colspan="3"><hr/></td></tr>

  <tr>
    <td><xsl:apply-templates select="s:binding[@name='coName']"/></td>
    <td colspan="3"><xsl:apply-templates select="s:binding[@name='tickerSymbol']"/>
    $<xsl:apply-templates select="s:binding[@name='lastPrice']"/>
    at <xsl:apply-templates select="s:binding[@name='quoteDateTime']"/>
    </td>
  </tr>

  <tr>
    <td><b>2009 figures:</b></td>
    <td>
      <b>Revenue </b>
      <xsl:apply-templates select="s:binding[@name='revenue']"/>
    </td>
    <td>
      <b>Net Income </b>
      <xsl:apply-templates select="s:binding[@name='netIncome']"/>
    </td>
  </tr>

  <tr>
    <td colspan="3">
      <xsl:apply-templates select="s:binding[@name='abstract']"/>
    </td>
  </tr>


  <tr class="rec">
    <td>
      <b>Analyst </b>
      <xsl:apply-templates select="s:binding[@name='analyst']"/>
    </td>
    <td>
      <b>Recommendation </b>
      <xsl:apply-templates select="s:binding[@name='recommendation']"/>
    </td>
    <td>
      <b>Date </b>
      <xsl:apply-templates select="s:binding[@name='recDateTime']"/>
    </td>
  </tr>

  <tr>
    <td colspan="3">
      <xsl:apply-templates select="s:binding[@name='description']"/>
    </td>
  </tr>

</xsl:template>

The template in Listing 3 creates a set of HTML table rows for each query result, in which each cell holds a piece of the returned information or a label for that information.


The driver batch file

The build.bat batch file ties all this together in 10 steps:

  1. Set the Java CLASSPATH to include the arq utility. This line assumes that the ARQROOT environment variable has been set to the home directory of the utility. (See Resources for documentation on arq.)

  2. Using the xls2rdf.py Python script, read the data from the analystRecs.xls spreadsheet and save it as RDF in the analystRecs.rdf file.

  3. With arq and a SPARQL query stored in the augmentAnalystRecs.spq file, read the company names from analystRecs.rdf and create a file called coNamesWithLangTags.n3 that includes a language tag of en with each company name. This makes it easier to cross-reference the data with DBpedia data.

  4. Apply the MakeGetTickerInfo.xsl XSLT stylesheet to analystRecs.rdf, outputting a batch file called getTickerInfo.bat that retrieves stock quote data from the Yahoo! service.

  5. Run getTickerInfo.bat, which retrieves the Yahoo! data as a CSV file called quoteData.csv.

  6. Run the yahooCSV2RDF.pl Perl script to convert quoteData.csv to the quoteData.rdf RDF file.

  7. Apply the MakeGetDbpediaData.xsl XSLT stylesheet to analystRecs.rdf to create the getDbpediaData.bat batch file, which retrieves company data from DBpedia. Each line of this batch file calls cURL, passing it a parameter of the URI that has an embedded SPARQL query to retrieve the data for one company.

  8. Run the getDbpediaData.bat batch file and put the output in the dbpedia.n3 file.

  9. Run the SPARQL query in PickDataForReport.spq against the four data files created earlier (analystRecs.rdf, quoteData.rdf, coNamesWithLangTags.n3, and dbpedia.n3), and put an XML version of the result in reportData.xml.

  10. Apply the SPARQLXML2HTML.xsl XSLT stylesheet to the reportData.xml file to create the final report file, analystReport.html.

Other tools might implement the same architecture more simply by consolidating and automating some of these steps, but the basic work to do would remain the same.


Expanding the application

Combining data from three sources to create a report on nine companies is a pretty small-scale use of the techniques I've demonstrated. The open source D2RQ interface lets you treat relational data as a collection of triples so that this data can be incorporated into such an application. For example, picture a financial institution that stores data about its customers' stock portfolios in a relational database system such as MySQL, DB2, or Oracle. A small addition to the application described in this article could generate a customized report similar to analystReport.html for each customer, listing only that customer's holdings and including their current value, computed by multiplying the stock price from Yahoo! by the number of shares owned.
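
To make that concrete, here is a purely hypothetical sketch using the rdflib Python library's SPARQL support. The customerHoldings.n3 file and its h: vocabulary stand in for whatever D2RQ (or a direct export) would expose from the relational database; the sq: terms are the ones the quote data already uses:

from rdflib import Graph

g = Graph()
g.parse("quoteData.rdf")                      # quote triples from step 6 of build.bat
g.parse("customerHoldings.n3", format="n3")   # hypothetical per-customer share counts

query = """
PREFIX sq: <http://www.rdfdata.org/2009/12/stockquotes#>
PREFIX h:  <http://example.com/ns/holdings#>
SELECT ?tickerSymbol ?shares ?lastPrice
WHERE {
  ?holding h:tickerSymbol ?tickerSymbol ;
           h:shares        ?shares .
  ?quote   sq:tickerSymbol ?tickerSymbol ;
           sq:lastPrice    ?lastPrice .
}
"""
for row in g.query(query):
    value = float(row.shares) * float(row.lastPrice)
    print("%s: %s shares, current value $%.2f" % (row.tickerSymbol, row.shares, value))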

This entire application has no need for the RDF Schema or OWL ontology languages. This kind of schema metadata is also stored as triples, though, so if you did want to incorporate it, it's one more class of input that can enrich the data you're combining to create your report. Storing rules, equivalencies, and other relationship metadata using these schema standards instead of storing them in compiled code adds the advantage of easier maintenance and portability of your business logic.

The growing world of Linked Data sources provides more and more data for you to incorporate into your applications, and Semantic Web standards bring new kinds of flexibility to mix and match this data with other data sources. Many enterprises are also finding tools that conform to these standards valuable for combining data from multiple silos within their own organizations, before adding in any publicly available data. New tools are available to make all of this easier, but as you've seen here, even some familiar, trusted older tools and techniques can play an important role in bringing this data together in useful new ways.


Download

Description: Batch file, scripts that it calls, and sample data
Name: comboPubPrivData.zip
Size: 86KB

Resources

