New machines, XML, and disambiguation

Help the machine resolve contextual information

Presenting tablet computers with text designed simply for reading by humans lessens the capacity of the machine to help the reader. To move text to a higher level of generality, you need to provide the machine with disambiguated text and the tools to perform more effective searches and analysis. Discover how XML can provide some structure towards this end.

Colin Beckingham, Writer and Researcher, Freelance

Colin Beckingham is a freelance researcher, writer, and programmer who lives in eastern Ontario, Canada. Holding degrees from Queen's University, Kingston, and the University of Windsor, he has worked in a rich variety of fields including banking, horticulture, horse racing, teaching, civil service, retail, and travel and tourism. The author of database applications and numerous newspaper, magazine, and online articles, his research interests include open source programming, VoIP, and voice-control applications on Linux. You can reach Colin at colbec@start.ca.



29 June 2010


Potential not realized

Frequently used acronyms

  • CSS: Cascading Style Sheets
  • HTML: HyperText Markup Language
  • IT: Information technology
  • XML: Extensible Markup Language

Tablet computers have received a lot of attention recently with the introduction of the Amazon Kindle™ and Apple® iPad, to mention just two options. Their portability and ease of use have the potential to draw words away from the standard print media even more than desktop computers have done already. Such tools highlight a weakness in the digital presentation of words: They do a wonderful job of presenting text but only at a superficial level.

A lot of work has gone into visual presentation. CSS has addressed many issues related to window dressing: fonts, spacing, padding, backgrounds, and more—all visual presentation elements. HTML and JavaScript have also made information more intuitive to a human reader.

But machines are capable of more than this, including understanding the sense of words to a much greater depth than humans currently ask. Among the many problems yet to be addressed—at least in the ordinary workaday world of searches and writing—is disambiguation: What is the true context and, more importantly, how do programmers make the machine aware of that context in a format it can use?


The disambiguation problem

Ambiguity is a well-known problem in IT. This set of data might mean this, or it might mean that. Issues arise in deciphering images. Given a set of pixels, what is this a picture of and what does it resemble in terms of human recognition? Although text presents a more tractable problem and has received some attention in studies related to Natural Language Querying and automatic (but rather inaccurate) disambiguation, commonly presented text still brings with it a vagueness that needs to be resolved: What does the writer really mean?

Consider the following bare statement:

The cow jumped over the moon.

To a human reader, this statement recalls (most probably) an earlier time when nonsense rhymes were a major part of life. Milk cows jumping over Earth's natural satellite was a neat idea but highly improbable. But it could also mean other things. Cows jumping makes more sense if the cow is a cow moose and the moon is the Moon River in Ontario (or perhaps a water buffalo cow jumping the river of the same name that is a tributary of the Mekong). What does this string of words really mean? The answer depends on the context, which is missing. A human can sometimes make inferences based on previous knowledge. A machine without that context cannot, and so it is unable to help the human reader: an important opportunity forgone.


Issues in ambiguity

With respect to text, disambiguation is important for a number of reasons. A few reasons might be:

  • Searches. Do the major search engines ask you to disambiguate your request? No, they rely on you to provide sufficient information by adding terms so that they can make assumptions that are frequently wrong. Sometimes, it's your fault for choosing the wrong terms; other times, the source text is vague. Yet again, the retrieval algorithm might not be helpful.
  • Translation. No matter which language you use to express a word, with few exceptions the word resolves to a common root (the concept) that is independent of language. Reference to an absolute root might help make translation more accurate.
  • Lookups. While reading text, a human can look up a word in a dictionary, but a dictionary merely presents a list of alternatives. It does not tell you which alternative the author intended. The author might have intended to be vague (but most probably not).
  • Machine analysis. Given precise disambiguated information about a set of texts, a machine can group the set according to common criteria and make recommendations regarding information sources that are similar—a Dewey Decimal System on steroids.

How to disambiguate?

The most obvious way to remove doubt is to provide an up-front, explicit statement including additional specific details to make the context clear. Providing extra information can help the human reader because it appeals to a lifetime of accumulated knowledge that a human can apply to narrow the context. This process can be costly: If you make a reader consume a paragraph to convey the sense of one word, then time has been wasted that might have been better spent elsewhere.

How do you disambiguate for a machine? To show what an idea means, you can:

  • Link the reference to a known absolute point, such as an entry in a fully disambiguated list like WordNet (see Resources for a link).
  • Identify the group to which the idea belongs (a milk cow and a cow moose are both animals).
  • Compare it with a similar concept and note the distance that separates them, that is, a relative reference (a milk cow and a cow moose are similar in this way but different in that way).
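The three approaches in this list can be sketched as plain data. The following is only an illustration under stated assumptions: the array layout, the group and features values, and the function names are hypothetical, and only the synset ID n02403454 comes from the article (Listing 1).

```php
<?php
// Hypothetical sense records. Only the synset ID is taken from the article
// (Listing 1); the group and features values are illustrative assumptions.
$milkCow = [
    'label'    => 'milk cow',
    'synset'   => 'n02403454',                 // absolute reference point
    'group'    => 'animal',                    // group membership
    'features' => ['female', 'bovine', 'domestic'],
];
$cowMoose = [
    'label'    => 'cow moose',
    'synset'   => null,                        // not looked up in this sketch
    'group'    => 'animal',
    'features' => ['female', 'cervine', 'wild'],
];

// Group test: do both senses belong to the same group?
function sameGroup(array $a, array $b): bool
{
    return $a['group'] === $b['group'];
}

// Relative reference: what the two senses share and where they differ.
function compareSenses(array $a, array $b): array
{
    return [
        'shared' => array_values(array_intersect($a['features'], $b['features'])),
        'onlyA'  => array_values(array_diff($a['features'], $b['features'])),
        'onlyB'  => array_values(array_diff($b['features'], $a['features'])),
    ];
}

var_dump(sameGroup($milkCow, $cowMoose));
print_r(compareSenses($milkCow, $cowMoose));
```

A real system would presumably draw the group and the features from a resource such as WordNet rather than from hand-written arrays.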

Basic disambiguation

As a programmer, you are interested in ways of allowing the machine to help the reader. Here's an example of how you can ask the machine to clarify a general idea:

The <span title='moo-cow, milk cow, etc.' 
    style='color: blue'>cow</span> jumped over the moon.

In this code, the HTML <span> element carries a title attribute which, when you hover the mouse over the word in the sentence, provides more specific information. The text appears blue to the reader, a visual cue that more information is available if required. In this way, each time you are uncertain about the meaning of a word in a sentence, you can hover the mouse over the word, and static disambiguating information appears as a pop-up. This function is commonplace in browsers today, together with the <abbr> and <acronym> HTML tags and custom JavaScript attached to the onmouseover event handler. All of this assumes that the reader has a good knowledge of the English language.
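If you would rather not type such markup by hand, a small helper can generate it. This is a minimal sketch, not part of the original example: the function name disambiguatedSpan is hypothetical, and it uses double-quoted attributes rather than the single quotes shown in the markup.

```php
<?php
// Hypothetical helper: wrap a word in a <span> whose title attribute
// carries the disambiguating hint. Both values are HTML-escaped.
function disambiguatedSpan(string $word, string $hint): string
{
    $safeWord = htmlspecialchars($word, ENT_QUOTES, 'UTF-8');
    $safeHint = htmlspecialchars($hint, ENT_QUOTES, 'UTF-8');
    return "<span title=\"$safeHint\" style=\"color: blue\">$safeWord</span>";
}

echo 'The ', disambiguatedSpan('cow', 'moo-cow, milk cow, etc.'),
     " jumped over the moon.\n";
```

Escaping the hint matters because a hint that contains a quote character would otherwise end the title attribute early.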

"How tedious!" you say. "Disambiguating any text would likely take a writer 10 times as long as writing the text!" My response: "Is the ship of clarity to founder on the rocks of insufficient attention to detail?" Possibly, because I have an even more outrageous suggestion that will increase any perceived tedium. But, of course, the rewards go to those who can make light work of tedious tasks. Hang on: There is more light and less tedium to come.


Advanced disambiguation

Think of the word you see on the page as only the tip of an iceberg. The visible tip has been presented to you (as a human) in a form that you recognize and can interpret given the context surrounding that word. However, beneath the tip is a wealth of other information sinking into the page such as links to related sources both within and outside the current document.

Generalization with XML

You can deal with this issue in the following way using XML. Consider the XML in Listing 1.

Listing 1. Advanced generalization with XML
<?xml version="1.0" encoding="UTF-8" ?>
<!-- Let's call this file myxml.xml -->
<doc>
  <section id='1' hr='The'>The</section>
  <section id='2' hr='cow' wnssid='n02403454'>cow2</section>
  <section id='3' hr='jumped' wnssid='v01963942'>jump16</section>
  <section id='4' hr='over the'>over the</section>
  <section id='5' hr='moon' wnssid='n09358358'>moon2</section>
</doc>

Listing 1 has a root element doc; inside that context are a number of section elements. Each section element has attributes that are useful in a particular context and can be accessed by a reader. Note that, at this point, I chose to disambiguate the nouns and the verb only to make a basic point.

Presentation from XML

The bare XML is of no use to the final reader, so Listing 2 is a basic PHP script that presents the content to an unspecified audience.

Listing 2. Bare extraction
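<?php
// Load the XML from Listing 1 into a SimpleXMLElement object
$xml = simplexml_load_file("myxml.xml");
// Print the text content of each section element, separated by spaces
foreach ($xml->section as $sec) {
  echo "$sec ";
}
echo "\nDone\n";
?>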

This PHP code reads the XML data from Listing 1 into an object variable. Then, using a foreach statement, it lists the bare content of the section elements.

This code results in the string The cow2 jump16 over the moon2, where cow2, jump16, and moon2 are arbitrary labels helpful to neither machine nor human and just used here for illustration. (Actually, they are the index numbers returned by my query to my local copy of the WordNet database.) In this form, The cow1 jump3 over the moon1 would mean something entirely different; but to a human, the difference is not clear.

One of the key principles of presentation is to adapt your message to suit the audience. Listing 3 shows a possible alternative for a human reader.

Listing 3. Extraction for a human
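<?php
// Load the XML from Listing 1 into a SimpleXMLElement object
$xml = simplexml_load_file("myxml.xml");
// Print only the hr (human reader) attribute of each section
foreach ($xml->section as $sec) {
  echo $sec['hr']." ";
}
echo "\nDone\n";
?>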

This code differs from the code in Listing 2 by reporting the hr (human reader) attribute only, resulting in the string, The cow jumped over the moon, which is good because it's where I started and shows that this is a special case of a more general presentation. But it's still not very clear.

Finally, Listing 4 shows the code for the machine reader.

Listing 4. Extraction for a machine
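<?php
// Load the XML from Listing 1 into a SimpleXMLElement object
$xml = simplexml_load_file("myxml.xml");
// Print only the wnssid (WordNet synset ID) attribute of each section;
// sections without a wnssid contribute an empty string
foreach ($xml->section as $sec) {
  echo $sec['wnssid']." ";
}
echo "\nDone\n";
?>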

This code is different from Listing 3 in that it pulls the wnssid (the WordNet synsetid) only. Synsets are groups of words that refer to the same thing. They are stored in a database and identified by a unique key in the form annnnnnnn: a letter (here n for a noun, v for a verb) followed by eight digits. The output, n02403454 v01963942 n09358358, informs the machine of the disambiguated content.

The information is far too specific for a human reader, so assume that the instruction echo in this context makes the information available to another routine that examines the content of the synsets and uses the relevant subordinate information to help the reader. In this way, you can relay the relevant rich database information from WordNet back to the user. You might display cow, jumped, and moon in blue and provide the WordNet information in a pop-up when the reader hovers over the word (as I did earlier in the section "Basic disambiguation"). For example, using the information for jumped from the synset whose ID is v01963942, you can display leap, bound, spring.
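Here is a minimal sketch of what such a routine might look like. It is an assumption-laden illustration: the $synonyms array is a hypothetical stand-in for a real WordNet query and holds only the one entry quoted in this article, and the function name renderWithPopups is mine.

```php
<?php
// The sentence from Listing 1, embedded so that the sketch is self-contained.
$xmlText = <<<XML
<doc>
  <section id='1' hr='The'>The</section>
  <section id='2' hr='cow' wnssid='n02403454'>cow2</section>
  <section id='3' hr='jumped' wnssid='v01963942'>jump16</section>
  <section id='4' hr='over the'>over the</section>
  <section id='5' hr='moon' wnssid='n09358358'>moon2</section>
</doc>
XML;

// Hypothetical stand-in for a WordNet lookup, keyed by synset ID.
// Only the entry mentioned in the article is filled in.
$synonyms = ['v01963942' => 'leap, bound, spring'];

// Rebuild the human-readable sentence, adding a pop-up wherever the
// lookup table has information for the section's synset ID.
function renderWithPopups(SimpleXMLElement $doc, array $synonyms): string
{
    $parts = [];
    foreach ($doc->section as $sec) {
        $word = htmlspecialchars((string) $sec['hr'], ENT_QUOTES, 'UTF-8');
        $id   = (string) $sec['wnssid'];   // empty string when absent
        if ($id !== '' && isset($synonyms[$id])) {
            $title   = htmlspecialchars($synonyms[$id], ENT_QUOTES, 'UTF-8');
            $parts[] = "<span title=\"$title\" style=\"color: blue\">$word</span>";
        } else {
            $parts[] = $word;
        }
    }
    return implode(' ', $parts) . '.';
}

echo renderWithPopups(simplexml_load_string($xmlText), $synonyms), "\n";
```

With only one entry in the table, only jumped receives a pop-up; cow and moon are rendered as plain text until their synsets are looked up as well.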

You may have noticed that the original "The" and "over the" don't appear in the output. But they are unimportant since the assumption is that the external routine will reconstruct the sentence in its own way from the information it has, perhaps as pictures with arrows connecting them, or using a language which does not use definite articles. In this context, articles, prepositions, conjunctions, and so on might have no role.
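If you prefer that the machine-oriented extraction skip the undisambiguated sections explicitly, rather than emit empty values for them, one option is an XPath filter. This is a sketch of an alternative, not the method of Listing 4; the function name synsetIds is hypothetical.

```php
<?php
// The sentence from Listing 1, embedded so that the sketch is self-contained.
$xmlText = <<<XML
<doc>
  <section id='1' hr='The'>The</section>
  <section id='2' hr='cow' wnssid='n02403454'>cow2</section>
  <section id='3' hr='jumped' wnssid='v01963942'>jump16</section>
  <section id='4' hr='over the'>over the</section>
  <section id='5' hr='moon' wnssid='n09358358'>moon2</section>
</doc>
XML;

// Collect the synset IDs of only those sections that carry a wnssid attribute.
function synsetIds(SimpleXMLElement $doc): array
{
    $ids = [];
    foreach ($doc->xpath('//section[@wnssid]') as $sec) {
        $ids[] = (string) $sec['wnssid'];
    }
    return $ids;
}

echo implode(' ', synsetIds(simplexml_load_string($xmlText))), "\n";
// prints: n02403454 v01963942 n09358358
```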

In the same way, you can add attributes such as roget='xxx':

Listing 5. Add another attribute
<section id='3' hr='jumped' 
    wnssid='v01963942' 
    roget='309'>jump16</section>

This code makes the disambiguated content of Roget's Thesaurus available to the machine, specifically item 309, which covers the concept of "leap" alone, even though "jump" is itemized in other places in Roget as well. You can also add Wikipedia links or anything else that is relevant in your context for a reader to pull out later, which takes the audience into account.
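Because each audience simply reads a different attribute, Listings 2 through 4 can be folded into one routine. The sketch below assumes a hypothetical function name, extractFor, and an embedded copy of the XML that includes the roget attribute from Listing 5.

```php
<?php
// Listing 1 with the roget attribute from Listing 5 added to section 3.
$xmlText = <<<XML
<doc>
  <section id='1' hr='The'>The</section>
  <section id='2' hr='cow' wnssid='n02403454'>cow2</section>
  <section id='3' hr='jumped' wnssid='v01963942' roget='309'>jump16</section>
  <section id='4' hr='over the'>over the</section>
  <section id='5' hr='moon' wnssid='n09358358'>moon2</section>
</doc>
XML;

// Return the values of one attribute across all sections, skipping sections
// that lack it. Pass null to get the bare element content instead.
function extractFor(SimpleXMLElement $doc, ?string $attribute = null): string
{
    $out = [];
    foreach ($doc->section as $sec) {
        $value = $attribute === null ? (string) $sec : (string) $sec[$attribute];
        if ($value !== '') {
            $out[] = $value;
        }
    }
    return implode(' ', $out);
}

$doc = simplexml_load_string($xmlText);
echo extractFor($doc), "\n";           // bare content, as in Listing 2
echo extractFor($doc, 'hr'), "\n";     // human reader, as in Listing 3
echo extractFor($doc, 'wnssid'), "\n"; // WordNet synset IDs, as in Listing 4
echo extractFor($doc, 'roget'), "\n";  // Roget item number from Listing 5
```

Adding a new audience then means adding a new attribute to the XML, with no change to the extraction code.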


Conclusion

Clearly, you can use XML as a tool to help store and reassemble disambiguated text for different audiences. But this seems like a lot of work. Almost no current literature is presented in a disambiguated manner, so someone will have to expend the effort to make it so unless a means is found to disambiguate on the fly—a process that is far from perfect at this time.

However, it might not be necessary to disambiguate an entire text: a thoroughly representative abstract might be sufficient. Then, a search performed on the disambiguated abstract will, in many cases, provide an accurate set of results and at much less cost, because you can search more documents in less time (assuming an abstract shorter than the text) and with proportionately less power. For an example of a simple search engine that uses this type of disambiguation, see DAMSEL (see Resources for the link).
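To make the idea concrete, here is a toy search over disambiguated abstracts. It is a sketch under stated assumptions and does not describe how DAMSEL works: the document names are hypothetical, the first abstract reuses the synset IDs from Listing 1, and n00000001 is a made-up placeholder ID.

```php
<?php
// Hypothetical index: each document's abstract reduced to a list of synset IDs.
$abstracts = [
    'doc-a' => ['n02403454', 'v01963942', 'n09358358'], // sentence from Listing 1
    'doc-b' => ['n00000001', 'v01963942'],              // n00000001 is a placeholder
];

// Return the names of all documents whose abstract contains the synset ID.
function searchBySynset(array $abstracts, string $synsetId): array
{
    $hits = [];
    foreach ($abstracts as $name => $ids) {
        if (in_array($synsetId, $ids, true)) {
            $hits[] = $name;
        }
    }
    return $hits;
}

print_r(searchBySynset($abstracts, 'n02403454')); // the cow sense from Listing 1
print_r(searchBySynset($abstracts, 'v01963942')); // the jump sense from Listing 1
```

Because the match is on the synset ID rather than on the word, a search for this sense of cow would not return documents about a cow moose, whatever language the original text was written in.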

Is it worth the effort? To some extent, I can present a search with complex terms to reduce ambiguity by specifying "cow water buffalo," if that is what I mean, keeping my fingers crossed that this is how the author expressed the concept originally. However, if the original author wrote in Thai without disambiguation, then searching is that much more difficult.

A popular, if unfounded, saying holds that humans use only 10% of their brain capacity. Whatever the truth for humans, there's no reason to confine machines to the same fate.

Resources

Learn

Get products and technologies

Discuss
