Reification and Trust: Ontology-driven NLP
CraigTrim
Reification generally refers to making something real, bringing it into being, or making something abstract more concrete: regarding or treating an idea or concept as if it had material existence. This definition is about as useful as the classic definition of an Ontology.
I don't think the word reflects the idea.
But who cares. It's a simple and powerful concept in RDF, and easy to understand.
So let’s say we log onto Wikipedia or some other trusted internet source, and we want to understand this paragraph:
"William Shakespeare was an English poet and playwright, widely regarded as the greatest writer in the English language and the world's pre-eminent dramatist. In 1876, he published Hamlet."
We can use NLP to find entities of interest, and annotate them appropriately.
By the way, what does “annotate them appropriately” mean? Annotations are semantic labels that we apply to entities we discover in NLP. Typically, we might have a dictionary of terms, such as a dictionary of authors, or a dictionary of books. When we find a word or phrase in natural English that has an entry in one of these dictionaries, the title of that dictionary becomes the “annotation” for that word or phrase. We refer to the words and phrases we find in NLP as tokens. The annotation for each token typically fits into a wider semantic context. Such a context is shown in the diagram here. This context can be represented in an Ontology, though this is by no means a technical constraint or requirement.
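As a rough illustration of dictionary-based annotation, here is a minimal Python sketch. The dictionary titles ("Playwright", "Play"), their entries, and the rule that any four-digit number is a "PublicationDate" are all assumptions made for this example; a real annotator would tokenize properly rather than rely on substring matching.

```python
import re

# Hypothetical dictionaries: the title of each dictionary is the annotation.
DICTIONARIES = {
    "Playwright": {"william shakespeare"},
    "Play": {"hamlet"},
}

# Crude assumption for this sketch: any four-digit number is a publication date.
YEAR_PATTERN = re.compile(r"\b\d{4}\b")


def annotate(text, dictionaries):
    """Return (token, annotation) pairs in reading order."""
    lowered = text.lower()
    found = []
    for label, entries in dictionaries.items():
        for entry in entries:
            start = lowered.find(entry)  # first occurrence only
            if start != -1:
                # Slice the original text to keep the source casing.
                found.append((start, text[start:start + len(entry)], label))
    for match in YEAR_PATTERN.finditer(text):
        found.append((match.start(), match.group(), "PublicationDate"))
    # Sorting by start offset puts the tokens back in reading order.
    return [(token, label) for _, token, label in sorted(found)]


text = ("William Shakespeare was an English poet and playwright. "
        "In 1876, he published Hamlet.")
annotations = annotate(text, DICTIONARIES)
```

Each token ends up carrying the title of the dictionary (or rule) that matched it, which is all an annotation is at this stage.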
Now that we’ve identified “William Shakespeare”, “Hamlet” and “1876” as tokens of interest, and applied the correct semantic annotation, we can see how these tokens relate to each other in the wider semantic context. The Ontology model tells me every “Play” is a kind of “Book”, and every “Playwright” is a kind of “Author”. And further, the Ontology model informs me that a “Book” can be attributed to an “Author”, and that a “Book” is published on a “PublicationDate”.
Using inference, I know that because a “Playwright” is a kind of “Author” and because a “Play” is a kind of “Book”, then because “William Shakespeare” is a “Playwright” and “Hamlet” is a “Play”, without any further analysis, I can determine that William Shakespeare may be the author of Hamlet. Now, given more advanced NLP, I could certainly extract the linguistic relationships present in the unstructured text, and make this determination with far more confidence. However, given an author in proximity to a book, and given our knowledge that books are attributed to authors, we can make a correlation that is likely to be borne out as truthful given a large enough corpus.
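The inference step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class hierarchy follows the text, but the property names attributedTo and publishedOn are hypothetical labels, and the result is a list of candidate triples, not verified facts.

```python
# Class hierarchy as described in the text: child -> parent.
SUBCLASS_OF = {"Play": "Book", "Playwright": "Author"}

# (domain, property, range). The property names are hypothetical.
PROPERTIES = [
    ("Book", "attributedTo", "Author"),
    ("Book", "publishedOn", "PublicationDate"),
]


def ancestors(cls):
    """Yield cls followed by each of its superclasses."""
    while cls is not None:
        yield cls
        cls = SUBCLASS_OF.get(cls)


def is_a(cls, target):
    """True if cls is target or a (transitive) subclass of target."""
    return target in ancestors(cls)


def candidate_triples(annotations):
    """Propose a triple wherever two co-occurring tokens fit a property."""
    triples = []
    for subj, subj_cls in annotations:
        for obj, obj_cls in annotations:
            if subj == obj:
                continue
            for domain, prop, rng in PROPERTIES:
                if is_a(subj_cls, domain) and is_a(obj_cls, rng):
                    triples.append((subj, prop, obj))
    return triples


annotations = [
    ("William Shakespeare", "Playwright"),
    ("1876", "PublicationDate"),
    ("Hamlet", "Play"),
]
triples = candidate_triples(annotations)
```

Note that nothing here reads the sentence itself. The candidates come purely from co-occurrence plus the Ontology, which is why they are only likely, not certain.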
Before moving along, it is worth noting that if we were to attempt a deeper linguistic understanding of the source text, we can use the Ontology to drive that. If I find an author and a book in the source text, and want to verify their relationship, I can look to the Ontology to tell me all the various ways “pro
Now we have two triples. In isolation, “Hamlet prov
“What did Shakespeare publish in 1876?”
The answer is not only obvious, but more importantly, it is computationally simple.
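To give a sense of how simple, here is a minimal Python sketch. The two triples are written with hypothetical predicate names (attributedTo and publishedOn), chosen for this example only; answering the question amounts to intersecting two lookups.

```python
# Two triples, using hypothetical predicate names for illustration.
TRIPLES = [
    ("Hamlet", "attributedTo", "William Shakespeare"),
    ("Hamlet", "publishedOn", "1876"),
]


def published_by_in(triples, author, year):
    """Return subjects attributed to `author` and published in `year`."""
    attributed = {s for s, p, o in triples
                  if p == "attributedTo" and o == author}
    published = {s for s, p, o in triples
                 if p == "publishedOn" and o == year}
    return sorted(attributed & published)


answer = published_by_in(TRIPLES, "William Shakespeare", "1876")
```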
However, there are some pretty apparent problems here. While the answer is both obvious and simple, Shakespeare didn't publish anything in 1876. The play "Hamlet" (in any given edition) may have been published in 1876, but not by Shakespeare. It appears the information on Wikipedia was wrong.
Reification allows us to make assertions about our triples
Reification ties into notions of trust and confidence. How much do we trust Wikipedia?
What we want to do in a semantic network (triple store) is not only add data, but add our sources for the data. Then we can begin to associate confidence levels with those sources. When building a triple store you could have multiple sources for data: hundreds or thousands of different sources, whatever. Data can come from structured repositories, from unstructured internet-based sources (like forums or user communities), or semi-structured locations like DBpedia. Or you can open up your triple store to a community of users and allow them to add data. So again, some sources are trustworthy, and some aren’t.
It’s up to you to make that distinction.
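Here is a minimal Python sketch of what that could look like. It expands each triple into the standard RDF reification pattern (rdf:Statement with rdf:subject, rdf:predicate and rdf:object) and attaches a source to the statement. Everything in the ex: namespace, the attributedTo and publishedOn predicates, and the confidence numbers are assumptions invented for this example.

```python
# Hypothetical confidence levels per source; the numbers are made up.
SOURCE_CONFIDENCE = {
    "ex:Wikipedia": 0.6,
    "ex:StructuredRepository": 0.9,
}


def reify(statement_id, triple, source):
    """Describe a triple as an rdf:Statement and record where it came from."""
    s, p, o = triple
    return [
        (statement_id, "rdf:type", "rdf:Statement"),
        (statement_id, "rdf:subject", s),
        (statement_id, "rdf:predicate", p),
        (statement_id, "rdf:object", o),
        (statement_id, "ex:source", source),  # hypothetical property
    ]


def trusted_triples(store, confidence, threshold):
    """Rebuild the triples whose source meets the confidence threshold."""
    by_statement = {}
    for s, p, o in store:
        by_statement.setdefault(s, {})[p] = o
    result = []
    for props in by_statement.values():
        if props.get("rdf:type") != "rdf:Statement":
            continue
        # Unknown sources get a confidence of 0.0 and are filtered out.
        if confidence.get(props.get("ex:source"), 0.0) >= threshold:
            result.append((props["rdf:subject"],
                           props["rdf:predicate"],
                           props["rdf:object"]))
    return result


store = reify("ex:stmt1", ("Hamlet", "publishedOn", "1876"), "ex:Wikipedia")
store += reify("ex:stmt2",
               ("Hamlet", "attributedTo", "William Shakespeare"),
               "ex:StructuredRepository")

strict = trusted_triples(store, SOURCE_CONFIDENCE, 0.8)
lenient = trusted_triples(store, SOURCE_CONFIDENCE, 0.5)
```

With a strict threshold, the 1876 claim from the less trusted source drops out and only the attribution survives. Where to set that threshold, and what number each source deserves, is exactly the distinction that is left to you.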