Skip to main content

Tip: Data scoring: Convert data with XQuery

Do quality analysis on conversion results

Geert Josten, Consultant/Content Engineer, Daidalos BV
Photo of Geert Josten
Geert Josten has been a content engineer at Daidalos for nearly 10 years, applying his knowledge of XSLT and XQuery as well as other, proprietary transformation languages. He also works as a Web and Java developer at Daidalos and has consulted for dozens of customers in a wide range of areas.

Summary:  The process of converting data is one of migrating information from an unsuitable source or format to a suitable one—often not an exact science. Data scoring is a way to measure the accuracy of your conversion. Discover a simple scoring technique in XQuery that you can apply to the result of a small text-to-XML conversion.

View more content in this series

Date:  29 Sep 2009
Level:  Intermediate PDF:  A4 and Letter (601KB | 15 pages)Get Adobe® Reader®
Activity:  3118 views
Comments:  

Frequently used acronyms

  • HTML: Hypertext Markup Language
  • W3C: World Wide Web Consortium
  • URL: Uniform Resource Locator
  • XML: Extensible Markup Language
  • XSLT: Extensible Stylesheet Transformations

Scoring converted data is all about analyzing the quality of the conversion. Quality can mean different things, and converting data from a database carries with it different problems than converting data from documents with more natural language. The technique that this tip presents makes no assumptions: You can apply it to any XML code of interest. To see the technique in practice, you will convert plain text—not comma-separated files, but plain text from news items grabbed from the Internet.

Plain text input

For the source, I grabbed text from the Google news Web site using the URL http://news.google.com/news/section?pz=1&topic=w&ict=ln. Figure 1 shows the resulting page.


Figure 1. Example news item on Google news
Screen capture of Google news site with multiple news items

These news items have a basic structure: They start with a heading followed by source details, the news message itself, and some additional information from different sources. In this tip, you will extract the headline, location, text, related headline, and the source of the related headline. Figure 2 shows these elements.


Figure 2. Analysis of the text of an example Google news item
Screen capture of Google news page with headline, location, text, source, related headlines labeled

For this example, I selected the top three news items for that moment, copied the text, and removed the lines this tip doesn't discuss, just to save space. I also separated the text into the individual news items to give you a head start.

Line ends

Line ends are always represented by the numeric character 10 in memory when working with XQuery, regardless of the operating system.

You will use the input data in Listing 1. (I added paragraph characters [] to visualize line ends. The entity " represents the double quotation mark character.)


Listing 1. The plain text

let $news-items := (
 "Afghans go to polls under threat of Taliban violence¶
KABUL, Afghanistan (CNN) - Under the menacing threat of violence from the
Taliban, Afghans headed to the polls on Thursday in the war-ravaged nation's
second-ever national election.¶
Afghan people cast votes in hope of better future RT",

  "Security stepped up after Baghdad bombings¶
BAGHDAD, Iraq (CNN) - The Iraqi government implemented new security measures a
day after a string of bombings in Baghdad killed at least 100 people and wounded
hundreds more, an interior ministry official told CNN on Thursday.¶
Questioning security in Baghdad's Green Zone - 19 Aug 09 Al Jazeera",

  "Will Democrats Go At It Alone on the Health Care Bill?¶
This is a rush transcript from "On the Record," August 19, 2009. This copy may not
be in its final form and may be updated.¶
New Health Care Strategy CBS" 
)

To see the scoring technique in practice, you will define patterns to extract information, convert to XML, and apply the technique on the result. The big issue is not finding patterns but converting those patterns to reliable extraction rules. This technique is a useful tool to help you analyze the reliability of your own rules.

Before you start to apply patterns, however, first break down the structure of each news item. Each news item has three lines of data:

  • Headline
  • Message
  • Headline of a related news item

You iterate over all news items and separate the lines by tokenizing on line ends:

let $result :=
  for $news-item in $news-items

  let $lines := tokenize($news-item, '
')

The result variable captures the XML result.


Applying patterns

Analyzing conversion quality is only interesting when information is missing or joined together and must be separated—which this tip covers. Look at each line to search for patterns to separate combined information.

The first line

The first line contains only the headline. Apply normalize-space to trim redundant spaces:

let $headline := normalize-space($lines[1])

The second line

The second line—the one with the message—obviously contains the most information, but has no reliable pattern except for the location of the event. You can find this location at the start of the line, followed by a dash. Use the dash to separate the location from the text:

let $location := normalize-space(substring-before($lines[2], '-'))
  let $text := normalize-space(substring-after($lines[2], '-'))

The third line

The third line is the most challenging: It contains both the headline and the name of the source without a marker to separate them visually. You can't know all the names, so you can't match them literally. Names typically start with a capital letter, which you can accommodate using regular expressions:

let $related-headline :=
    normalize-space(replace($lines[3], '^(.*?) [A-Z].*$', '$1'))
  let $related-headline-source :=
    normalize-space(replace($lines[3], '^.*? ([A-Z].*)$', '$1'))

Basically, you match the string from start (^) to end ($) in these regular expressions. The .*? matches up to the first space-capital combination and should capture the headline text. The [A-Z].* should capture the source of that headline. By putting opening and closing parentheses [( and )] around the part you're interested in and using $1 as the replacement, you should end up with either the headline or the source, depending on where you put the parentheses.


Conversion result

Add the lines in Listing 2 to tag the extracted information.


Listing 2. Converting the extracted information to XML

return

<news-item>{
  if ($headline) then
    <headline>{$headline}</headline>
  else (),

  if ($location) then
    <location>{$location}</location>
  else (),

  if ($text) then
    <text>{$text}</text>
  else (),

  if ($related-headline) then
    <related-headline>{$related-headline}</related-headline>
  else (),

  if ($related-headline-source) then
    <related-headline-source>{
      $related-headline-source
    }</related-headline-source>
  else ()
}</news-item>

The if statements around the tags are to ensure that only those tags are written that contain a non-empty value. You can gather all expressions so far and append the following lines to make it output the first news item:

return
  $result[1]

The lines in Listing 3 show the expected output.


Listing 3. XML output of the first news item

<news-item>
  <headline>Afghans go to polls under threat of Taliban violence</headline>
  <location>KABUL, Afghanistan (CNN)</location>
  <message>Under the menacing threat of violence from the
    Taliban, Afghans headed to the polls on Thursday in the war-ravaged nation's
    second-ever national election.</message>
  <related-headline>Afghan people cast votes in hope of better future</related-headline>
  <related-headline-source>RT</related-headline-source>
</news-item>

Now, convert all three items. Then, you can start to score how well you did.


Element scoring of the result

To analyze the quality of conversion results, you can choose from several methods based on your needs. The technique presented here is basic and you can use it in various ways. It consists of showing statistics at two detail levels:

  • Scoring of each element of interest
  • Scoring of each distinct element value for a particular element of interest

The elements of interest in this conversion result are all elements containing text:

  • headline
  • location
  • text
  • related-headline
  • related-headline-source

The first detail level is merely a matter of counting all elements and calculating the ratio of elements to the total number of items. You can calculate the total using XQuery. (Note that the XML output of the news items was captured in a variable named $result.)

let $element-name = 'headline'
let $element-score := count($result//*[local-name() = $element-name])
let $element-ratio := round(100 * $element-score div count($news-items))

Matching elements on their local name is rather slow but saves code and makes your script more dynamic. In Listing 4, you wrap the calculation results in a small HTML report.


Listing 4. Creating an element scoring result

let $total-number-of-items := count($news-items)

let $elements-of-interest :=
  ('headline', 'location', 'text', 'related-headline',
   'related-headline-source')

return

<html>
  <body>
    <p><b>Total number of items: </b>{$total-number-of-items}</p>

    <table border="1">
      <tr>
        <th>element name</th>
        <th>score</th>
        <th>ratio</th>
      </tr>
      {
      for $element-name in $elements-of-interest

      let $elements :=
        $result//*[local-name() = $element-name]
      let $element-score :=
         count($elements)
      let $element-ratio :=
        round(100 * $element-score div $total-number-of-items)

      return

      <tr>
        <td>{ $element-name }</td>
        <td>{ $element-score }</td>
        <td>{ concat($element-ratio,'%') }</td>
      </tr>
    }</table>
  </body>
</html>

To start, the code in Listing 4 displays the number of items being analyzed to give meaning to the ratios. Then, it loops over all elements of interest, calculates score and ratio, and writes a table row for each. Figure 3 shows the report of the conversion result. (View a text-only version of Figure 3.)


Figure 3. Element scoring of the result
Screen capture     of the result of the element scoring

The scores look high, but there are some drop-outs on location and text. Continue with the scoring of element values and investigate.


Value scoring of the result

Value scoring on large datasets

When you apply this scoring technique on larger datasets, it can result in long calculation times on the reports. Consider testing this code on a small set first, then optimize the code to use the full capabilities of your XQuery processor as soon as the calculation time exceeds one second. If you use an XQuery-capable XML database, you should be able to use indices to make things even quicker.

Scoring the values of elements is almost as straightforward as scoring elements. Just determine the distinct values of each element and calculate a score and ratio for each value.

For additional information, you can calculate the score and ratio of missing elements. Replace the code in Listing 4 with the code in Listing 5.


Listing 5. Creating an element and value scoring report

<html>
  <body>
    <p><b>Total number of items: </b>{$total-number-of-items}</p>
    <table border="1" width="50%">
      <tr>
        <th colspan="2">element name</th>
        <th colspan="2">score</th>
        <th colspan="2">ratio</th>
      </tr>
      {
      for $element-name in $elements-of-interest

      let $elements :=
        $result//*[local-name() = $element-name]
      let $element-score :=
        count($elements)
      let $element-ratio :=
        round(100 * $element-score div $total-number-of-items)

      return (
        <tr>
          <td colspan="2">{ $element-name }</td>
          <td colspan="2">{ $element-score }</td>
          <td colspan="2">{ concat($element-ratio,'%') }</td>
        </tr>,

        let $distinct-values :=
          distinct-values($elements)

        for $distinct-value in $distinct-values

        let $value-score :=
          count($elements[. = $distinct-value])
        let $value-ratio :=
          round(100 * $value-score div $total-number-of-items)
        return

        <tr>
          <td>&#160;</td>
          <td><i>{ $distinct-value }</i></td>
          <td>&#160;</td>
          <td><i>{ $value-score }</i></td>
          <td>&#160;</td>
          <td><i>{ concat($value-ratio,'%') }</i></td>
        </tr>,

        let $missing-score :=
          $total-number-of-items - $element-score
        let $missing-ratio :=
          round(100 * $missing-score div $total-number-of-items)

        where $missing-score > 0
        return

        <tr>
          <td>&#160;</td>
          <td><i><b>(not found)</b></i></td>
          <td>&#160;</td>
          <td><i>{ $missing-score }</i></td>
          <td>&#160;</td>
          <td><i>{ concat($missing-ratio,'%') }</i></td>
        </tr>
      )}
    </table>
  </body>
</html>

The HTML table has six columns in this version of the report, but otherwise, the previous code hasn't changed. Listing 5 only extends the code to perform a sub-loop for each item that iterates over all its distinct values, outputting additional table rows with a score and ratio for each value. It also adds a table row when the appropriate element is missing in at least one of the results.


Analyzing the scores

The scoring report is helpful for at least three use cases:

  • Find drop-outs
  • Inspect value distributions
  • Search for anomalies in values

Drop-outs

In the element score report (Figure 3 or the alternate text version of Figure 3), you saw a low score for location and text. Figure 4 shows the part of the value scoring report for these elements. (View a text-only version of Figure 4.)


Figure 4. Value scoring of location and text
Screen capture of value scoring of location and text

The report shows that one news item in which a location could not be extracted and one in which the text could not be extracted. This absence is likely to concern the same news item. Unfortunately, this report doesn't provide the information necessary to determine why the conversion failed for those items, but it does at least show that it failed. Now, check the content and take a second look at the patterns behind these elements.

Remember that location and text were separated from each other based on the presence of a dash. Return to Listing 1 and notice one news item in which the second line does not begin with a location. There is no dash, so the pattern to separate the items fails.

Value distributions

Apart from spotting drop-outs, you can also use these reports to analyze value distributions. Having converted only the three items, each value occurs only once, as you will see if you run the whole script and look at the report. Use the report on a larger dataset to experiment with this use case.

Anomalies

A third use case is to inspect the element values to spot anomalies. Figure 5 shows the part of the value scoring report that reveals the related-headline and related-headline-source elements. (View a text-only version of Figure 5.)


Figure 5. Value scoring of related headline and source
Screen capture of  the value scoring of related headline

Notice that your extraction rule did succeed but produced incorrect values. The values of the related-headline-source should have been:

  • CBS
  • RT
  • Al Jazeera

Producing correct values here is difficult, but at least this report can help you spot such mistakes. If a lexicon of valid source values is available, you can use it to validate the captured values, then signal for incorrect values or rule them out as false positives.


Conclusion

With simple calculations, you produced small HTML reports of element and value scores on the result of a small text-to-XML conversion. You successfully used these reports to analyze the quality of this conversion, spot drop-outs, look at value distributions, and find anomalies. This scoring technique can be a useful tool for quality analysis of conversion results.



Download

DescriptionNameSizeDownload method
Source file for this tipdata-scoring.xqy.zip2KBHTTP

Information about download methods


Resources

Learn

Get products and technologies

Discuss

About the author

Photo of Geert Josten

Geert Josten has been a content engineer at Daidalos for nearly 10 years, applying his knowledge of XSLT and XQuery as well as other, proprietary transformation languages. He also works as a Web and Java developer at Daidalos and has consulted for dozens of customers in a wide range of areas.

Comments



Trademarks  |  My developerWorks terms and conditions

Help: Update or add to My dW interests

What's this?

This little timesaver lets you update your My developerWorks profile with just one click! The general subject of this content (AIX and UNIX, Information Management, Lotus, Rational, Tivoli, WebSphere, Java, Linux, Open source, SOA and Web services, Web development, or XML) will be added to the interests section of your profile, if it's not there already. You only need to be logged in to My developerWorks.

And what's the point of adding your interests to your profile? That's how you find other users with the same interests as yours, and see what they're reading and contributing to the community. Your interests also help us recommend relevant developerWorks content to you.

View your My developerWorks profile

Return from help

Help: Remove from My dW interests

What's this?

Removing this interest does not alter your profile, but rather removes this piece of content from a list of all content for which you've indicated interest. In a future enhancement to My developerWorks, you'll be able to see a record of that content.

View your My developerWorks profile

Return from help

static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=XML
ArticleID=430372
ArticleTitle=Tip: Data scoring: Convert data with XQuery
publish-date=09292009
author1-email=geert.josten@daidalos.nl
author1-email-cc=dwxed@us.ibm.com

My developerWorks community

Tags

Help
Use the search field to find all types of content in My developerWorks with that tag.

Use the slider bar to see more or fewer tags.

Popular tags shows the top tags for this particular content zone (for example, Java technology, Linux, WebSphere).

My tags shows your tags for this particular content zone (for example, Java technology, Linux, WebSphere).

Use the search field to find all types of content in My developerWorks with that tag. Popular tags shows the top tags for this particular content zone (for example, Java technology, Linux, WebSphere). My tags shows your tags for this particular content zone (for example, Java technology, Linux, WebSphere).

Rate a product. Write a review.

Special offers