Natural Language Understanding of Unstructured Data
CraigTrim
What do we mean by unstructured data?
We limit this to ASCII text, although the term could also cover scanned documents, images, etc. There is no data about this data – no meta data. The only information we have about this data is contained in the data itself. There are no rows, columns or annotations.
Unstructured Data (or unstructured information) refers to information that either does not have a pre-defined data model or does not fit well into relational tables.
Unstructured information might have some structure (semi-structured) or even be highly structured but in ways that are unanticipated.
Grammar is Metadata
But it's really not fair to say that there is no meta data. Language has an inherent structure. We call that the grammar. Structured languages (like Java and COBOL) have structured grammars. Unstructured languages (like English) also have grammars. The grammar describes how the words in a sentence relate to each other. Do two words combine into a single unit? The grammar will tell you. Or perhaps one word describes an action and several words combine into a phrase describing the recipient of that action. Understanding the grammar of language can help us understand some of the meaning that is being conveyed.
Natural language cannot be fully described by its grammar. So the grammar of English can only go so far in telling us about what was just spoken or in analyzing what was written.
While the grammar (or syntax) describes structure, and acts as an implicit form of meta-data for text, there is also the annotation of text with the meaning of entities (semantic types). For example, if you run across the word “IBM” it will likely function as a Noun and might bear a relationship in that capacity to other words in the sentence. Semantic typing will reveal that IBM is a Company. So what are the kinds (or, types) of entities that your words are referring to? IBM refers to a company. Queen Elizabeth refers to a Dignitary. Christmas Tree refers to either a device that controls the flow of oil and gas, or a Christmas decoration.
Finding the type of a word is referred to as hypernymonization. That is, finding the hypernym (or, above word) for a word. The hypernym of IBM is company. The hypernym of company is organization. Organization is a kind of entity.
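The hypernym chain described above can be sketched as a simple lookup that walks upward until no further hypernym is known. This is only an illustration: the dictionary and the function name hypernym_chain are assumptions made for the example, not part of any particular system.

```python
# Hypothetical toy data: each word maps to its single hypernym ("above word").
HYPERNYMS = {
    "IBM": "company",
    "company": "organization",
    "organization": "entity",
}

def hypernym_chain(word, hypernyms=HYPERNYMS):
    """Return the word followed by each successive hypernym up to the root."""
    chain = [word]
    seen = {word}
    while chain[-1] in hypernyms:
        parent = hypernyms[chain[-1]]
        if parent in seen:  # guard against accidental cycles in the data
            break
        chain.append(parent)
        seen.add(parent)
    return chain

print(hypernym_chain("IBM"))
```

A real model would allow more than one hypernym per word; the single-parent dictionary just keeps the sketch short.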
This is only one facet of Semantic Typing. Other semantic types might involve finding (or understanding) the parts of something (meronyms). A keyboard is a part of a laptop. A DVD-ROM is a part of a laptop. A DVD-ROM has a hypernym of Storage Device. Thus, a storage device is part of a laptop.
Semantic Typing can involve making any relation to any thing. Hypernyms and Meronyms are the most common, but there is no limitation here.
Context is everything
Context must be relevant to the domain. If I was overhearing a conversation among drillers, I would have less confidence that “Christmas Tree” meant an ornament. If I was taking part in a conversation with my wife about putting up Christmas decorations, I would not immediately assume she was referring to a need to control drilling fluids. In the context of a conversation, confidence levels are maintained, often at a subconscious level.
In fact, knowing exactly which entity is being referred to by a certain word is not entirely trivial. It requires a knowledge of the larger context. What is the domain under consideration? If the flow of conversation or the topic was based around holiday gift giving, the understanding that Christmas Tree referred to a decoration would be reasonable. In the Oil & Gas domain, this is not necessarily a given. In fact, in the course of some conversations, it would be entirely unreasonable to assume a seasonal connotation.
As humans, we are usually very good about context. When we initiate a conversation there is a great deal of assumed context and shared context. So we generally perform semantic disambiguation without thinking about it. But one of the challenges of natural language understanding is that everything we do automatically, without thinking about it, the computer doesn't. There is a lot of context necessary to build an understanding that can resemble simple human comprehension.
Understanding Entities and their Relationships
Finding the type of an entity is only half the story for semantics. There are two parts here: understanding the entities involved, and understanding their relationships to each other and to the outside world. For natural language understanding to be successful, we must understand not only the entities that are in play in our domain, but how those entities are related to each other, and possibly to things that lie just outside their domain.
In order to prevent us from modeling the entire universe, we typically draw the line by gathering input as to what a computer system must support. What sorts of questions should the expert system answer? This helps us scope out the industry terminology gathering exercise. It's good to gather input from as many people as possible. This helps set expectations around the expert system as well. The range and variations of questions that a potential user or client stakeholder may expect the system to answer can be very surprising.
How is this information represented once it is gathered? Semantic information should be represented by means of a semantic model. We use a graph store, or more specifically a triple store. You'll hear me refer to the words Ontology and Triple Store somewhat interchangeably throughout this conversation.
To be clear, when I refer to an Ontology I am referring to:
Notice how this definition of an Ontology nicely dovetails into the definition of Natural Language Processing (NLP), that is:
An Ontology Model is physically a collection of triples.
Think about a graph:
In an Ontology:
The relationship of one entity to another is typically referred to as a triple. The first node is the subject, the relationship is the predicate, the last node is the object. Subject, Predicate, Object. Laptop is-a Computer. Keyboard is-a-part-of Laptop. Websphere runs-on Windows. There's no limitation to what you can say, or to how you can express a relationship between two things.
I might express the relationship that Laptop is a Computer, and then I can start to detail various types of Laptops (a Dell, a Lenovo, a Lenovo Thinkpad, a Lenovo Thinkpad T400, etc). Typically, the types of things are what we put into a triple store. But there's no point in making the distinction too clear. An Ontology describes a model, and a Triple Store describes an instantiation. This is a simple definition with a minimum of pedantry.
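To make the subject, predicate, object idea concrete, here is a minimal in-memory sketch. It is not how a production triple store works. The names Triple, match and is_a are invented for the illustration, and the predicate spellings simply follow the examples in the text.

```python
from collections import namedtuple

Triple = namedtuple("Triple", ["subject", "predicate", "object"])

# Triples taken from the examples in the text; a real store would hold many more.
TRIPLES = {
    Triple("Laptop", "is-a", "Computer"),
    Triple("Keyboard", "is-a-part-of", "Laptop"),
    Triple("Websphere", "runs-on", "Windows"),
    Triple("Lenovo Thinkpad T400", "is-a", "Laptop"),
}

def match(triples, subject=None, predicate=None, obj=None):
    """Return triples matching the given fields; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.object == obj)
    )

def is_a(triples, subject, target):
    """Follow is-a edges transitively to test whether subject is a kind of target."""
    frontier, seen = [subject], set()
    while frontier:
        node = frontier.pop()
        if node == target:
            return True
        if node in seen:
            continue
        seen.add(node)
        frontier.extend(t.object for t in match(triples, subject=node, predicate="is-a"))
    return False

print(match(TRIPLES, predicate="is-a-part-of"))
print(is_a(TRIPLES, "Lenovo Thinkpad T400", "Computer"))
```

The second call shows a small inference: nobody asserted that a Thinkpad T400 is a Computer, but it follows from two is-a triples.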
We use Ontologies to drive NLP.
Gathering Industry Terminology
Therefore, when we speak about gathering industry terminology, we're really speaking about creating Ontologies. Or to put this in a far simpler, and less impressive way, we're looking for entities and how they relate to each other.
There are many techniques for doing this. One of the most common, and indeed how we often start, is by performing language modeling on our data. We are looking for common entities (single words that occur with high frequency, or “unigrams”), entities that occur together frequently with other entities (“bigrams”), groups of three (“trigrams”), etc. Take the term “Pum
Not only that, but we found other entities that occurred in the collocative proximity. Like Cuttings Settling in the Wellbore. We can refine this relationship with an SME if we choose, but by initially relying on some degree of frequency, the system can begin to understand that a relationship does exist.
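A frequency count over n-grams is a simple way to start this kind of language modeling. The sketch below assumes whitespace tokenization and a made-up one-sentence corpus; the function name ngram_counts is illustrative only.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Hypothetical miniature corpus; a real run would use the domain documents.
text = "cuttings settling in the wellbore caused cuttings settling again"
tokens = text.lower().split()

unigrams = ngram_counts(tokens, 1)
bigrams = ngram_counts(tokens, 2)
trigrams = ngram_counts(tokens, 3)

# The most frequent bigram is a candidate multi-word entity for an SME to review.
print(bigrams.most_common(1))
```

Raw frequency is only a starting point. In practice the counts would likely be filtered for stop words and compared against a background corpus before anything is proposed as an entity.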
How the Learning Occurs
The process of refining and checking a relationship or an inference is called learning. Inference is really just assuming a relationship exists when one was never explicitly made. Inference is powerful, because it allows us to reach conclusions that were never explicitly modeled in the data. If you populate and maintain a relational database, you're not going to wake up one morning and say: hey! There's a new foreign key that just appeared between two tables. I wonder how that happened? An RDBMS just doesn't work that way. Every relationship is modeled in advance. When you create an RDBMS you assert that you understand the entire domain, and that the domain is relatively static. When you create an Ontology, you assert that the entire domain is quite possibly too large to be understood fully, and that it is not static but very dynamic. The relationships in a semantic repository can and will surprise you.
Learning is the process of providing feedback to a knowledge base about certain relationships. We've described a triple store up until now as a network graph, as a collection of nodes with edges (or connections), and where each of those connections has a semantic label – the relationship (or predicate name). And that's correct. But you can also think of the same network graph as having a collection of numbers associated with each connection. The number represents a probability.
I might come across the word “monitor”. What is this word, and what does it mean? It could be “monitor” in terms of an IT Component – like a computer monitor. Or, it could be monitor in terms of an activity, like monitoring someone's work. In one case the word is used as a noun, and in another case, the word is used as a verb. And again, in one case the word has a semantic type of “IT Component” and in another case it has the semantic type of “Activity”. In each of these cases, we have a triple. Perhaps “monitor has part of speech noun” and “monitor is an IT Component”. Likewise, in each case, we're going to have a confidence in this relationship. We might be very confident that the word has a certain meaning. And the confidence of one relationship might very well impact the confidence we have in another relationship. If we are very confident that monitor is being used as a noun, we can likewise have high confidence that monitor is referring to an IT component. But if we thought there was a strong possibility that monitor was being used as a verb in the sentence, we would have to reserve some level of probability for monitor as an Activity.
This is the incorporation of confidence (or probabilities) into the triple store. These probabilities are used to form calculations – ultimately, how confident are we in a given answer?
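One way to read the “monitor” example is that confidence in the part of speech flows into confidence in the semantic type. The sketch below combines the two with the law of total probability. All of the numbers, and the function name semantic_type_confidence, are assumptions chosen for illustration; the source does not say how its confidences are actually combined.

```python
# Hypothetical confidence that "monitor" is a noun or a verb in some sentence.
pos_confidence = {"noun": 0.9, "verb": 0.1}

# Hypothetical confidence in each semantic type, given the part of speech.
type_given_pos = {
    "noun": {"IT Component": 0.95, "Activity": 0.05},
    "verb": {"IT Component": 0.0, "Activity": 1.0},
}

def semantic_type_confidence(pos_confidence, type_given_pos):
    """Sum P(pos) * P(type | pos) over every part of speech."""
    totals = {}
    for pos, p_pos in pos_confidence.items():
        for sem_type, p_type in type_given_pos[pos].items():
            totals[sem_type] = totals.get(sem_type, 0.0) + p_pos * p_type
    return totals

print(semantic_type_confidence(pos_confidence, type_given_pos))
```

Raising the verb confidence in pos_confidence shifts probability toward Activity, which is the behavior the paragraph above describes.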
But learning itself occurs when we have a given probability, and then have to update that probability in the face of new evidence. New evidence could come in the form of feedback from a user of the system (perhaps a thumbs up or thumbs down), or in the form of new data sources. The acquisition of new data will impact existing probabilities. This is where the learning occurs. In other words, how does the system update its beliefs in the face of new evidence?
When we refer to learned models, we refer not only to the use of numerical probability values, but to the ability to provide this feedback, judge this feedback and adjust this feedback. A good deal of the work that goes into the system (the coding, configuration and modeling) is centered around the learned models. They have to be initialized, adjusted, and trained. And the training doesn't end. The only time the training would conceivably end is if the domain the system was trained in ceased to change. And assuming that data is never constant, a system will need to continuously learn.
But the important thing is, development does end. Development is not an ongoing exercise. Training, by contrast, can be and is provided (for example) by means of feedback mechanisms that are present in the system when the system goes into production.
What about the provenance of the data? There are notions of trust and confidence there. Provenance covers the people, processes and activities that were responsible for creating or influencing a particular piece of data.
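Updating a belief in the face of new evidence can be illustrated with Bayes' rule. This is a sketch of the general idea, not a description of the system's actual update mechanism, which the text does not specify. The likelihood values for a thumbs up or thumbs down are invented for the example.

```python
def update_belief(prior, p_evidence_if_true, p_evidence_if_false):
    """Apply Bayes' rule to one binary hypothesis given one piece of evidence."""
    numerator = prior * p_evidence_if_true
    denominator = numerator + (1.0 - prior) * p_evidence_if_false
    if denominator == 0:
        return prior  # evidence impossible under both hypotheses; leave belief unchanged
    return numerator / denominator

# Hypothetical: belief that a given relationship in the triple store is correct.
belief = 0.5
for feedback in ["up", "up", "down"]:
    if feedback == "up":
        # Assumed: users give a thumbs up 80% of the time when the relationship
        # is correct and 30% of the time when it is not.
        belief = update_belief(belief, 0.8, 0.3)
    else:
        belief = update_belief(belief, 0.2, 0.7)
    print(feedback, round(belief, 3))
```

Each thumbs up pushes the belief higher and each thumbs down pulls it back, so the stored probability keeps tracking the feedback for as long as feedback arrives.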
The provenance of digital objects represents their origins. Provenance records contain descriptions of the entities and activities involved in producing and delivering (and otherwise influencing) a given object. PROV is meant to describe how these objects were created or delivered.
By knowing the provenance of an object, we can make determinations about how to use it.
Provenance can be used for many purposes, such as:
Building and Managing the semantic repository
Well, the processes are only non-traditional in the sense that having a need to use, and finding a way to use, unstructured data is non-traditional. And we might as well say big data, because if we were dealing with little data, we should just model the domain by hand. But if it's big enough and dynamic enough, then we need to move from supervised methods (that is, manual approaches) to semi-supervised or unsupervised methods. That is, get as close to full automation as possible, while realizing that a degree of manual oversight will always be necessary.
Ontologies provide semantic context. Identifying entities in unstructured text is a picture only half complete. Ontology models complete the picture by showing how these entities relate to other entities, whether in the document or in the wider world.
Without a semantic model, the annotations used by the NLP parser lose their link to the real world. Who decides what an annotation should be named? Are they making this decision in coordination with what already exists? What modeling discipline exists? Designing an NLP solution without an Ontology will lead to NLP annotations that, over time, have no link to the real world.
Interactive Query Refinement (IQR)
Most queries suffer from unde
What the system asks, is determined not only by the user intent, but by the answer that the system is returning. If the system is going to return search results, then what the system really wants from the user is terms that will help form really useful queries that can return relevant search results. General terms, or terms that are less than useful, will be discarded.
If the system is attempting to provide an answer, or a ranked list of answers, or perhaps some numerical form of an answer, then the specific parts of the query necessary to return that answer are required. And if they aren't present, the user is asked for them.
In past projects, we have referred to the necessary parts of a query as “Pillars”. That is, the most important things that are necessary to create a perfect query. Pillars matter because users don't always know, and indeed shouldn't really have to know, which data sources to access and which terms make the best queries for each data source.
So IQR does this for us. It's a dynamic process. We don't hard code questions. Instead, questions are dynamically formed based on the need of the knowledge base. Depending on the intent of the user and what the system needs to respond to that intent, a question is formed. Now, there are general types of questions. We have the ability to recognize and formulate over 20 different types of dialog. For example, the system might recognize a procedural question from the user (How do I do X?). Or an imperative statement (tell me how to do Y … ). The system can likewise formulate questions of these types as well. We know what the basics of these questions are. For example, a procedural is often a cause/effect type dialog with an action followed by an object.
When we formulate a question back to the user, the system selects an appropriate type, then plugs in the right values in a grammatically correct manner. This is actually a non-trivial sort of an exercise. For example, even something as simple as knowing when to use the correct determiner in front of a noun can be challenging.
If the user wants to know about their own phone number the system might formulate a response with the phrase:
“... your phone number … ”
But if the user asked about a phone number (not any one in particular, and not one that is associated with them) the system will formulate a response:
“... a phone number … ”
and of course, there is always the response:
“...the phone number...”
Because if the user was talking about a definite phone number, we want the system to respond in turn. If the dialog is good, the user will ignore it and assume the formulation of it was trivial. If the dialog is wrong, it makes the system look stupid. This in turn can prevent the user from forming a trust relationship with the system, and hinder adoption rates.
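The determiner choice in the phone number example can be sketched as a small decision function. The two flags, owned_by_user and definite, are hypothetical stand-ins for whatever the dialog analysis actually produces, and the a/an rule is a crude spelling heuristic rather than a real grammar.

```python
def choose_determiner(noun_phrase, owned_by_user=False, definite=False):
    """Pick a determiner for a noun phrase from two hypothetical dialog flags."""
    if owned_by_user:
        return "your"
    if definite:
        return "the"
    # Rough spelling heuristic for a/an; words like "hour" would need a
    # pronunciation lexicon, which this sketch does not include.
    return "an" if noun_phrase[:1].lower() in ("a", "e", "i", "o", "u") else "a"

def render(noun_phrase, **flags):
    """Return the noun phrase with its chosen determiner in front."""
    return f"{choose_determiner(noun_phrase, **flags)} {noun_phrase}"

print(render("phone number", owned_by_user=True))
print(render("phone number"))
print(render("phone number", definite=True))
```

Even this toy version shows why the problem is non-trivial: the function is easy, but deciding the values of the flags requires understanding what the user was referring to.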
A question might be open-ended, or it might contain a list of possibilities.
In one real world example, a user asked a question that contained the acronym “GOM”. The system was initially trained on a client-provided (enterprise) corpus of acronyms and their expansions. We wrote a parser to ingest the data and we had trained the system against an enterprise search index to find out which expansions were more valuable than others. This exercise resulted in probabilities being associated with various expansions. The “Gulf of Mexico” was the most likely expansion. “Grumpy Old Men” had a 0% probability. It simply didn't occur in the domain (no search results), therefore it was a useless search term.
Because there were other choices that had a degree of probability too significant to ignore, the system chose to formulate a question. In the case of choices with 0 probability, we didn't even put these into the list of possibilities back to the user. But for the rest of the choices, we ranked these from most to least likely. We also set a threshold (which is fully adjustable by those who are maintaining the system) that determined when a question should be asked, and when the choice should simply be inferred.
Now, each time “Gulf of Mexico” is chosen, the system builds up some additional confidence. Eventually, it is entirely possible that this question will no longer be asked. Note that the surrounding context is taken into consideration. So we don't necessarily consider the expansion of GOM => Gulf of Mexico independently of the context it's used in. We do up to a point, but the surrounding tokens and relationships are taken into account too. There is configuration and modeling work required to pull this off.
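The ask-or-infer decision can be sketched as follows. The probabilities, the 0.9 threshold, the function name resolve_expansion and the placeholder second expansion are all assumptions for illustration; only “Gulf of Mexico” and “Grumpy Old Men” come from the example above.

```python
def resolve_expansion(candidates, ask_threshold=0.9):
    """Decide whether to infer the top expansion or ask the user to choose.

    candidates maps each expansion to its probability. Zero-probability
    expansions are dropped and the rest are ranked from most to least likely.
    """
    viable = {name: p for name, p in candidates.items() if p > 0}
    ranked = sorted(viable, key=viable.get, reverse=True)
    if not ranked:
        return ("unknown", [])
    if viable[ranked[0]] >= ask_threshold:
        return ("infer", [ranked[0]])
    return ("ask", ranked)

# Hypothetical probabilities for the acronym "GOM".
gom = {
    "Gulf of Mexico": 0.7,
    "<other expansion>": 0.3,  # placeholder for a competing in-domain expansion
    "Grumpy Old Men": 0.0,
}

print(resolve_expansion(gom))
```

As confidence in the top expansion grows past the threshold, the same function stops returning a question and simply infers the answer, so maintainers can tune how often users are asked by changing ask_threshold alone.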