New machines, XML, and disambiguation

Help the machine resolve contextual information

Potential not realized

Tablet computers have received a lot of attention recently with the introduction of the Amazon Kindle™ and Apple® iPad, to mention just two options. Their portability and ease of use have the potential to draw words away from the standard print media even more than desktop computers have done already. Such tools highlight a weakness in the digital presentation of words: They do a wonderful job of presenting text but only at a superficial level.

A lot of work has gone into visual presentation. CSS has addressed many issues related to window dressing: fonts, spacing, padding, backgrounds, and more—all visual presentation elements. HTML and JavaScript have also made information more intuitive to a human reader.

But machines are capable of more than this, including understanding the sense of words at a much greater depth than humans currently ask of them. Among the many problems yet to be addressed, at least in the ordinary workaday world of searches and writing, is disambiguation: What is the true context and, more importantly, how do programmers make the machine aware of that context in a format it can use?

The disambiguation problem

Ambiguity is a well-known problem in IT. This set of data might mean this, or it might mean that. Issues arise in deciphering images. Given a set of pixels, what is this a picture of and what does it resemble in terms of human recognition? Although text presents a more tractable problem and has received some attention in studies related to Natural Language Querying and automatic (but rather inaccurate) disambiguation, commonly presented text still brings with it a vagueness that needs to be resolved: What does the writer really mean?

Consider the following bare statement:

The cow jumped over the moon.

To a human reader, this statement recalls (most probably) an earlier time when nonsense rhymes were a major part of life. A milk cow jumping over Earth's natural satellite was a neat idea but highly improbable. But the statement could also mean other things. A jumping cow makes more sense if the cow is a cow moose and the moon is the Moon River in Ontario (or perhaps a water buffalo cow jumping the river of the same name that is a tributary of the Mekong). What does this string of words really mean? The answer depends on the context, which is missing. A human can sometimes make inferences based on previous knowledge. A machine cannot, and so it is unable to help the human reader: an important opportunity forgone.

Issues in ambiguity

With respect to text, disambiguation is important for a number of reasons. A few reasons might be:

  • Searches. Do the major search engines ask you to disambiguate your request? No, they rely on you to provide sufficient information by adding terms, and they then make assumptions that are frequently wrong. Sometimes, it's your fault for choosing the wrong terms; other times, the source text is vague; at still other times, the retrieval algorithm is simply not helpful.
  • Translation. No matter which language you use to express a word, with few exceptions, it resolves to a common root (the concept) that is independent of language. Reference to an absolute root might help make translation more accurate.
  • Lookups. While reading text, a human can look up a word in a dictionary, but a dictionary merely presents a list of alternatives. It does not tell you which alternative the author intended. The author might have intended to be vague (but most probably not).
  • Machine analysis. Given precise disambiguated information about a set of texts, a machine can group the set according to common criteria and make recommendations regarding information sources that are similar—a Dewey Decimal System on steroids.

How to disambiguate?

The most obvious way to remove doubt is to provide an up-front, explicit statement including additional specific details to make the context clear. Providing extra information can help the human reader because it appeals to a lifetime of accumulated knowledge that a human can apply to narrow the context. This process can be costly: If you make a reader consume a paragraph to convey the sense of one word, then time has been wasted that might have been better spent elsewhere.

How do you disambiguate for a machine? To show what an idea means, you can:

  • Link the reference to a known absolute point, say, an entry in a fully disambiguated list such as WordNet (see Related topics for a link).
  • Identify the group to which the idea belongs (a milk cow and a cow moose are both animals).
  • Compare a similar concept and note the distance that separates them, that is, a relative reference (a milk cow and a cow moose are similar in this way but different in that way).

Basic disambiguation

As a programmer, you are interested in ways of allowing the machine to help the reader. Here's an example of how you can ask the machine to clarify a general idea:

The <span title='moo-cow, milk cow, etc.' 
    style='color: blue'>cow</span> jumped over the moon.

In this code, the HTML <span> element carries a title attribute that, when you hover the mouse over the word in the sentence, displays more specific information. The text appears blue to the reader, a visual cue that more information is available if it's required. In this way, each time you are uncertain about the meaning of a word in a sentence, you can hover the mouse over the word, and static disambiguating information will appear as a pop-up. This behavior is commonplace in browsers today, together with the <abbr> and <acronym> HTML tags (the latter is obsolete in HTML5) and custom JavaScript attached to the onmouseover event. All of this assumes that the reader has a good knowledge of the English language.
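
If many words need the same treatment, the markup can be generated rather than typed by hand. The following is a minimal sketch, not part of the original example: the helper name disambiguatedSpan is hypothetical, and it simply reproduces the <span> shown above while escaping the hint text so that it is safe inside an attribute.

```php
<?php
// Hypothetical helper (an assumption, not from the article): wraps a word
// in the same <span> markup used above, escaping both strings for HTML.
function disambiguatedSpan(string $word, string $hint): string
{
    $safeHint = htmlspecialchars($hint, ENT_QUOTES, 'UTF-8');
    $safeWord = htmlspecialchars($word, ENT_QUOTES, 'UTF-8');
    return "<span title='$safeHint' style='color: blue'>$safeWord</span>";
}

// Rebuild the example sentence with the hint attached to "cow".
$sentence = 'The ' . disambiguatedSpan('cow', 'moo-cow, milk cow, etc.')
          . ' jumped over the moon.';
echo $sentence, "\n";
```

The hint is still static text supplied by the writer; the helper only removes the repetitive typing.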

"How tedious!" you say. "Disambiguating any text would likely take a writer 10 times as long as writing the text!" My response: "Is the ship of clarity to founder on the rocks of insufficient attention to detail?" Possibly, because I have an even more outrageous suggestion that will increase any perceived tedium. But, of course, the rewards go to those who can make light work of tedious tasks. Hang on: There is more light and less tedium to come.

Advanced disambiguation

Think of the word you see on the page as only the tip of an iceberg. The visible tip has been presented to you (as a human) in a form that you recognize and can interpret given the context surrounding that word. However, beneath the tip is a wealth of other information sinking into the page such as links to related sources both within and outside the current document.

Generalization with XML

You can deal with this issue in the following way using XML. Consider the XML in Listing 1.

Listing 1. Advanced generalization with XML
<?xml version="1.0" encoding="UTF-8" ?>
<!-- Let's call this file myxml.xml -->
<doc>
  <section id='1' hr='The'>The</section>
  <section id='2' hr='cow' wnssid='n02403454'>cow2</section>
  <section id='3' hr='jumped' wnssid='v01963942'>jump16</section>
  <section id='4' hr='over the'>over the</section>
  <section id='5' hr='moon' wnssid='n09358358'>moon2</section>
</doc>

Listing 1 has a root element, doc, that contains a number of section elements. Each section element has attributes that are useful in a particular context and can be accessed by a reader. Note that, at this point, I chose to disambiguate only the nouns and the verb, which is enough to make a basic point.
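
A document like Listing 1 does not have to be typed by hand. As a rough sketch, assuming the sentence has already been split into tagged tokens (the $tokens array and its layout are my own illustration, not part of the original example), PHP's SimpleXMLElement class can assemble the same structure:

```php
<?php
// Assumed input: one entry per section as
// [human-readable text, content label, synset ID or null].
$tokens = [
    ['The',      'The',      null],
    ['cow',      'cow2',     'n02403454'],
    ['jumped',   'jump16',   'v01963942'],
    ['over the', 'over the', null],
    ['moon',     'moon2',    'n09358358'],
];

$doc = new SimpleXMLElement('<doc/>');
foreach ($tokens as $i => [$hr, $label, $wnssid]) {
    $sec = $doc->addChild('section', $label);
    $sec->addAttribute('id', (string) ($i + 1));
    $sec->addAttribute('hr', $hr);
    // Only disambiguated words carry a WordNet synset ID.
    if ($wnssid !== null) {
        $sec->addAttribute('wnssid', $wnssid);
    }
}

// Prints the document on a single line, without the indentation of Listing 1.
echo $doc->asXML(), "\n";
```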

Presentation from XML

The bare XML is of no use to the final reader, so Listing 2 is a basic PHP script that presents the content to an unspecified audience.

Listing 2. Bare extraction
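<?php
// Load the XML from Listing 1 into a SimpleXMLElement object.
$xml = simplexml_load_file("myxml.xml");
// Print the bare text content of each section element.
foreach ($xml->section as $sec) {
  echo "$sec ";
}
echo "\nDone\n";
?>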

This PHP code reads the XML data from Listing 1 into an object variable. Then, using a foreach statement, it lists the bare content of the section elements.

This code results in the string The cow2 jump16 over the moon2, where cow2, jump16, and moon2 are arbitrary labels helpful to neither machine nor human and just used here for illustration. (Actually, they are the index numbers returned by my query to my local copy of the WordNet database.) In this form, The cow1 jump3 over the moon1 would mean something entirely different; but to a human, the difference is not clear.

One of the key principles of presentation is to adapt your message to suit the audience. Listing 3 shows a possible alternative for a human reader.

Listing 3. Extraction for a human
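<?php
// Load the XML from Listing 1 into a SimpleXMLElement object.
$xml = simplexml_load_file("myxml.xml");
// Print only the hr (human reader) attribute of each section.
foreach ($xml->section as $sec) {
  echo $sec['hr']." ";
}
echo "\nDone\n";
?>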

This code differs from the code in Listing 2 by reporting the hr (human reader) attribute only, resulting in the string, The cow jumped over the moon, which is good because it's where I started and shows that this is a special case of a more general presentation. But it's still not very clear.

Finally, Listing 4 shows the code for the machine reader.

Listing 4. Extraction for a machine
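<?php
// Load the XML from Listing 1 into a SimpleXMLElement object.
$xml = simplexml_load_file("myxml.xml");
foreach ($xml->section as $sec) {
  // Skip sections such as "The" that carry no synset ID.
  if (isset($sec['wnssid'])) {
    echo $sec['wnssid']." ";
  }
}
echo "\nDone\n";
?>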

This code is different from Listing 3 in that it pulls the wnssid (the WordNet synsetid) only. Synsets are groups of words that refer to the same thing. They are stored in a database and identified by a unique key in the form annnnnnnn: a letter marking the part of speech (n for noun, v for verb) followed by eight digits. The output, n02403454 v01963942 n09358358, informs the machine of the disambiguated content.

The information is far too specific for a human reader, so assume that the instruction echo in this context makes the information available to another routine that examines the content of the synsets and uses the relevant subordinate information to help the reader. In this way, you can relay the relevant rich database information from WordNet back to the user. You might display cow, jumped, and moon in blue and provide the WordNet information in a pop-up when the reader hovers over the word (as I did earlier in the section "Basic disambiguation"). For example, using the information for jumped from the synset whose ID is v01963942, you can display leap, bound, spring.
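
One way such a follow-on routine might look is sketched below. It is only an illustration: the $synsetWords array is a stand-in for a real WordNet query and holds just the one entry quoted above, and the function name renderWithPopups is hypothetical. Words whose synset is not in the table are printed as plain text.

```php
<?php
// Stand-in for a WordNet lookup (an assumption): a real routine would
// query the database. Only the entry quoted in the text is filled in.
$synsetWords = [
    'v01963942' => 'leap, bound, spring',
];

// The sentence from Listing 1, inlined so the sketch is self-contained.
$xmlText = <<<XML
<doc>
  <section id='1' hr='The'>The</section>
  <section id='2' hr='cow' wnssid='n02403454'>cow2</section>
  <section id='3' hr='jumped' wnssid='v01963942'>jump16</section>
  <section id='4' hr='over the'>over the</section>
  <section id='5' hr='moon' wnssid='n09358358'>moon2</section>
</doc>
XML;

// Hypothetical helper: rebuild the sentence for a human reader, adding a
// pop-up to every word whose synset ID is found in the lookup table.
function renderWithPopups(SimpleXMLElement $doc, array $synsetWords): string
{
    $parts = [];
    foreach ($doc->section as $sec) {
        $word = htmlspecialchars((string) $sec['hr'], ENT_QUOTES, 'UTF-8');
        $id   = (string) $sec['wnssid'];   // empty string when absent
        if ($id !== '' && isset($synsetWords[$id])) {
            $hint = htmlspecialchars($synsetWords[$id], ENT_QUOTES, 'UTF-8');
            $parts[] = "<span title='$hint' style='color: blue'>$word</span>";
        } else {
            $parts[] = $word;
        }
    }
    return implode(' ', $parts);
}

$html = renderWithPopups(simplexml_load_string($xmlText), $synsetWords);
echo $html, "\n";
```

Keeping the lookup table separate from the rendering means the same function could later be pointed at the real database without changing the markup it produces.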

You may have noticed that the original "The" and "over the" don't appear in the output. But they are unimportant since the assumption is that the external routine will reconstruct the sentence in its own way from the information it has, perhaps as pictures with arrows connecting them, or using a language which does not use definite articles. In this context, articles, prepositions, conjunctions, and so on might have no role.

In the same way, you can add attributes such as roget='xxx':

Listing 5. Add another attribute
<section id='3' hr='jumped' 
    wnssid='v01963942' 
    roget='309'>jump16</section>

This code makes the disambiguated content of Roget's Thesaurus available to the machine, specifically item 309, which delves into the concept of "leap" alone, even though "jump" is itemized in other places in Roget as well. You can also add Wikipedia links or anything else that is relevant in your context for a reader to pull out later, which takes the audience into account.
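
Because each audience simply reads a different attribute, Listings 3 and 4 can be folded into one routine that takes the attribute name as a parameter. The sketch below is an illustration under that assumption; the function name extractAttribute is hypothetical, and the XML is Listing 1 with the roget attribute from Listing 5 added.

```php
<?php
// Listing 1 with the extra roget attribute from Listing 5, inlined so the
// sketch is self-contained.
$xmlText = <<<XML
<doc>
  <section id='1' hr='The'>The</section>
  <section id='2' hr='cow' wnssid='n02403454'>cow2</section>
  <section id='3' hr='jumped' wnssid='v01963942' roget='309'>jump16</section>
  <section id='4' hr='over the'>over the</section>
  <section id='5' hr='moon' wnssid='n09358358'>moon2</section>
</doc>
XML;

// Hypothetical helper: collect one named attribute from every section
// that actually has it, in document order.
function extractAttribute(SimpleXMLElement $doc, string $name): array
{
    $values = [];
    foreach ($doc->section as $sec) {
        if (isset($sec[$name])) {
            $values[] = (string) $sec[$name];
        }
    }
    return $values;
}

$doc = simplexml_load_string($xmlText);
echo implode(' ', extractAttribute($doc, 'hr')), "\n";      // human reader
echo implode(' ', extractAttribute($doc, 'wnssid')), "\n";  // WordNet-aware machine
echo implode(' ', extractAttribute($doc, 'roget')), "\n";   // Roget-aware machine
```

Adding a new audience then means adding a new attribute to the XML, not a new script.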

Conclusion

Clearly, you can use XML as a tool to help store and reassemble disambiguated text for different audiences. But this seems like a lot of work. Almost no current literature is presented in a disambiguated manner, so someone will have to expend the effort to make it so unless a means is found to disambiguate on the fly—a process that is far from perfect at this time.

However, it might not be necessary to disambiguate an entire text: a thoroughly representative abstract might be sufficient. Then, a search performed on the disambiguated abstract will, in many cases, provide an accurate set of results and at much less cost, because you can search more documents in less time (assuming an abstract shorter than the text) and with proportionately less power. For an example of a simple search engine that uses this type of disambiguation, see DAMSEL (see Related topics for the link).
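
To make the idea concrete, here is a minimal sketch of a search over disambiguated abstracts. It is not how DAMSEL works; it only assumes that each abstract has already been reduced to the list of synset IDs it mentions. The document names are invented for illustration, and the IDs are the ones used in Listing 1.

```php
<?php
// Hypothetical corpus (names invented): each abstract is represented by
// the synset IDs it contains.
$abstracts = [
    'nursery-rhyme.xml' => ['n02403454', 'v01963942', 'n09358358'],
    'dairy-farming.xml' => ['n02403454'],
    'lunar-orbit.xml'   => ['n09358358'],
];

// Return the names of the documents whose abstract contains every
// requested synset ID.
function searchBySynset(array $abstracts, array $wanted): array
{
    $hits = [];
    foreach ($abstracts as $name => $ids) {
        // array_diff is empty when all wanted IDs appear in $ids.
        if (count(array_diff($wanted, $ids)) === 0) {
            $hits[] = $name;
        }
    }
    return $hits;
}

// Documents about the milk cow, then documents about both cow and moon.
print_r(searchBySynset($abstracts, ['n02403454']));
print_r(searchBySynset($abstracts, ['n02403454', 'n09358358']));
```

Because the query is a synset ID and not a word, a search for the milk cow cannot accidentally match a cow moose, whatever language the original text was written in.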

Is it worth the effort? To some extent, I can present a search with complex terms to reduce ambiguity by specifying "cow water buffalo," if that is what I mean, keeping my fingers crossed that this is how the author expressed the concept originally. However, if the original author wrote in Thai without disambiguation, then searching is that much more difficult.

They say that humans only use 10% of their brain capacity. There's no reason for humans to confine machines to the same fate.


Related topics


Published: June 29, 2010