Rosette Linguistics Platform (RLP)
Basis Technology Corp.
One Alewife Center
Cambridge, MA 02140
Basis Technology provides software solutions for text mining, information retrieval, and name resolution in Asian, European, and Middle Eastern languages.
Our Rosette Linguistics Platform is a widely adopted suite of interoperable components that delivers high-performance results to search, business intelligence, e-discovery, and many other enterprise applications.
A great deal of (if not all) Natural Language Processing is done these days using some kind of Finite State Machine (FSM). It's pretty simple (really, it is indeed simple). As the name implies, there is a fixed number of possible states in some system. The current state is determined by past states of the system. As such, it can be said to record information about the past, i.e., it reflects the input changes from the system start to the present moment.
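To make that concrete, here is a minimal sketch in Python of a two-state machine that counts words by reading one character at a time. The state names and the `count_words` function are my own illustration, not taken from any particular NLP product; the point is only that the current state is all the machine remembers about the past.

```python
from enum import Enum, auto


class State(Enum):
    OUTSIDE = auto()   # not currently inside a word
    IN_WORD = auto()   # currently reading the letters of a word


def count_words(text: str) -> int:
    """Count words with a two-state machine that reads one character at a time."""
    state = State.OUTSIDE
    words = 0
    for ch in text:
        if ch.isalpha():
            if state is State.OUTSIDE:
                words += 1          # transition OUTSIDE -> IN_WORD starts a new word
            state = State.IN_WORD
        else:
            state = State.OUTSIDE   # any non-letter ends the current word
    return words
```

Note that the machine never looks back at earlier characters. Everything it knows about the input so far is summarized in `state`.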
Noam Chomsky claimed in Syntactic Structures (1957) that a human "grammar" (which is now referred to as CHL, for Computation of Human Language; Chomsky 1995) could not be an FSM. This view is, of course, quite challenging to the thousands of men and women who work daily to write NLP software using an underlying FSM. The crux of the argument is
There are some (many) very technical (empirical) arguments against an FSM as the underlying mechanism for CHL - but let's not worry about those (of course, it behooves anyone who wishes to engage in a discussion of these matters to know that a great deal of thought has gone into this, and it is better to be informed of this than not). If we look at the requirement with a simple eye, it really says that for us to create a unique sentence in our language - which we do all the time - our brain cannot "look at" anything except what has come before it. That seems a bit limiting. I mean, it is easy to come up with a million sentences which draw on knowledge of universal ideas, meanings derived from the lexicon, and even what happens next.
I have had a revelation lately: it is not wise to debate this point in my field. But the reason it is not wise is NOT that it is contentious, and will cause people to dislike me (even though, hey, it was Chomsky's idea, not mine). There is a better reason: it doesn't matter. The principles behind Chomsky's argument are valid for the performance of NLP software, regardless of what machine these software systems use!
In the midst of all that fine verbiage is the simple notion that the computational power of your NLP system is a direct function of the amount of information you can provide to it, and, likewise, the performance of your NLP system is directly limited by the amount of information you provide to it.
The revelation, then, is as follows: if we are going to make non-Markovian grammars, that is, grammars that need references to anything outside the immediately previous state, then we will have to find out how to do this with utter efficiency, as it will increase the computational complexity (and therefore reduce the performance) exponentially (in the mathematical sense). I think we can all agree on that.
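One way to see where "exponentially" comes from, as a back-of-the-envelope sketch rather than a formal result: if a model must tell apart every possible sequence of the last few tokens, the number of histories grows as the vocabulary size raised to the history length. The function name and the sample numbers below are hypothetical.

```python
def context_count(vocabulary_size: int, history_length: int) -> int:
    """Number of distinct histories when conditioning on the last `history_length` tokens."""
    return vocabulary_size ** history_length


# Hypothetical vocabulary of 10,000 word forms.
for history_length in (1, 2, 3):
    print(history_length, context_count(10_000, history_length))
```

Each extra token of history multiplies the number of cases by the size of the vocabulary, which is why looking further back has to be done sparingly.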
But we cannot avoid the deficiency in a system which can have no external reference. It will simply not come close to human language. Even FSM advocates must agree on that.
But it doesn't mean we're in trouble. It just means we will have to be clever. It may be possible to leave the "grammar" - i.e. the decision mechanism - as an FSM, requiring the lowest computational power, but combine this with other FSMs, also efficient, that derive other states relevant to that of the "grammar". The end result, then, could be a federation of FSM outputs, also generated by an FSM.
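Here is a rough sketch of what such a federation might look like, in Python. Two small helper machines track negation and quotation, and a third "decision" machine reads their outputs together with the token itself. All of the names, state labels and word lists are hypothetical and chosen only for illustration; this is not how any particular product does it.

```python
from typing import Dict, List, Tuple

Transitions = Dict[Tuple[str, str], str]


def run_fsm(transitions: Transitions, start: str, symbols: List[str]) -> List[str]:
    """Return the state before each symbol, plus the final state as the last item."""
    trace = [start]
    for symbol in symbols:
        # Unlisted (state, symbol) pairs leave the state unchanged.
        trace.append(transitions.get((trace[-1], symbol), trace[-1]))
    return trace


# Helper FSM 1: is the upcoming word negated?
NEGATION: Transitions = {("PLAIN", "neg"): "NEGATED", ("NEGATED", "word"): "PLAIN"}
# Helper FSM 2: are we inside a quotation?
QUOTES: Transitions = {("OUT", "quote"): "IN", ("IN", "quote"): "OUT"}
# Decision FSM: its input symbols are built from the helper FSM states.
DECISION: Transitions = {("OK", "trouble/PLAIN/OUT"): "TROUBLE"}

NEGATORS = {"not", "never", "no"}                 # hypothetical word list
TROUBLE_WORDS = {"broken", "failed", "leaking"}   # hypothetical word list


def detect_trouble(tokens: List[str]) -> str:
    """Run the helper FSMs, then feed their states to the decision FSM."""
    neg = run_fsm(NEGATION, "PLAIN", ["neg" if t in NEGATORS else "word" for t in tokens])
    quo = run_fsm(QUOTES, "OUT", ["quote" if t == '"' else "word" for t in tokens])
    combined = [
        f"{'trouble' if t in TROUBLE_WORDS else 'other'}/{n}/{q}"
        for t, n, q in zip(tokens, neg, quo)
    ]
    return run_fsm(DECISION, "OK", combined)[-1]
```

Each machine stays tiny and only ever consults its own current state, yet the decision machine ends up "knowing" about negation and quotation because those facts arrive as part of its input. The combination is still finite-state, so nothing here escapes the formal limits discussed above; it only spreads the bookkeeping around.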
I recently blogged about TAKMI and how great it is, but I failed to mention its younger sister, MedTakmi! Take a moment out of your busy schedule and read about it:
Note: if you work for IBM, you can access the full text article from the IEEE Digital Library (you will need the userid and password):
This article is a little simpler, and there are no access restrictions.
MedTakmi is currently in use at the National Cancer Center Hospital, in Chuo-ku, Tokyo, Japan, and in the United States, at the Mayo Clinic in Rochester, Minnesota.
Since there is really no need for me to re-write either of these articles here, I will instead point out the shining key that separates MedTakmi from everything else:
Please take a look at this presentation by my new triple store buddy, Craig Trim:
It is especially important for you if you tend to say, "oh, we know ontologies - that's something that's been around for a long time." Let me suggest that the way ontologies are used in semantic relationships has evolved considerably, and that it is probably important for you to at least make sure there's nothing here you don't already know.
This is copied from my esteemed colleague, Craig Trim and "Natural Language Processing for Online Applications"
One approach to NLP is rooted in linguistic analysis of semantics, syntax, pragmatics and context. It is sometimes characterized as "symbolic", because it consists largely of rules for the manipulation of symbols (e.g., grammatical rules that say whether or not a sentence is well formed). Given the heavy reliance of traditional artificial intelligence upon symbolic computation, it has also been characterized informally as "Good old-fashioned AI".
A second approach, which gained wider currency in the 1990s, is rooted in the statistical analysis of language. It is sometimes characterized as "empirical", because it involves deriving language data from relatively large corpora of text, such as news feeds and web pages.
Symbolic NLP vs Empirical NLP
A shorter way to compare the two approaches:
In symbolic NLP, the grammar is formally defined; in empirical NLP, the grammar is whatever is most statistically probable.
Having said that, solutions to real-world problems will combine these two approaches. Watson took a more corpus-based approach; Wolfram|Alpha seems to take more of the former approach, but it's often hard to delineate the two camps.
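The contrast can be sketched in a few lines of Python. The lexicon, the single allowed pattern and the function names below are all made up for illustration: the symbolic half accepts a sentence only if it matches a hand-written rule, while the empirical half simply reports whatever followed a word most often in the text it was given.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional

# Symbolic: a hand-written lexicon and rule decide what is well formed.
LEXICON = {"the": "DET", "a": "DET", "dog": "NOUN", "cat": "NOUN",
           "barks": "VERB", "sleeps": "VERB"}
ALLOWED_PATTERN = ["DET", "NOUN", "VERB"]


def is_well_formed(sentence: str) -> bool:
    """Accept only sentences whose tag sequence matches the hand-written rule."""
    tags = [LEXICON.get(word) for word in sentence.lower().split()]
    return tags == ALLOWED_PATTERN


# Empirical: the "grammar" is whatever the corpus makes most probable.
def train_bigrams(corpus: List[str]) -> Dict[str, Counter]:
    """Count how often each word follows each other word."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts


def most_probable_next(counts: Dict[str, Counter], word: str) -> Optional[str]:
    """Return the most frequently observed follower of `word`, if any."""
    followers = counts.get(word)
    if not followers:
        return None
    return followers.most_common(1)[0][0]
```

The symbolic checker is exact but knows nothing outside its lexicon; the bigram model has no notion of "correct" at all, only of "frequent". Real systems tend to borrow from both sides.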
Go immediately to this link and join me in congratulating my long-time friend and sensei, Tetsuya Nasukawa, for creating one of the most significant inventions at IBM in the past 100 years:
I hope you're not too jaded to feel the power of this, or, to quote the article, the "clout." IBM has had a lot of pretty important inventions. To be named an "icon" among those is no mean feat. TAKMI is at the heart of IBM's flagship text analytics product, IBM Content Analytics (ICA).
I don't know whether it was in an article, or just a personal conversation, but Nasukawa-san once said,
"I didn't invent TAKMI to do something humans could do, better; I wanted TAKMI to do something that humans could not do."
I have watched many people struggle to find the "magic" in the software, and get frustrated. I think if that quote were written at the top of the Text Miner user interface, it might help insight-seekers remember just exactly what insight is - something you didn't/couldn't see on your own.
On a cognitive level, it seems that humans can easily see and usually just as easily describe regular distribution patterns, in everyday events, around the house, in books, etc. We "get" and like patterns. But when it comes to irregularity, all we seem to be able to do is to smell it - to simply note that something is off. However, we can't seem to grasp that irregularity, too, is a pattern. This is one of the strengths of TAKMI. From the above article:
"it is easy for TAKMI to identify irregular distribution of trouble-related expressions"
I might suggest reading that sentence more than once (I did, and I discovered deeper meaning with each reading). Implicit in this statement is that, in life, there is trouble, and that it is easy for humans to see a regular distribution of trouble. For example, after an earthquake, we are not surprised to hear about electrical outages. TAKMI could show us the same. Another example: we can see that organizations tightly associated with Al-Qaeda are implicated in terrorist activities. There is a regular distribution of negative information delineating this relationship, on the internet.
TAKMI goes to the other side, to discover an irregular distribution of trouble - trouble that deviates in ever so slight a fashion from the everyday trouble. These are the kinds of troubles that humans cannot detect. In the words of Scott Spangler, senior technical staff member in text mining and software development at IBM Almaden Research Center and co-author of Mining the Talk: Unlocking the Business Value in Unstructured Information:
“But what unstructured information can tell you is the answer to questions you didn’t even know you needed to worry about. It lets you know what you don’t know.”
Seems like we're harping on that same idea again - don't go looking for what you already know, and especially do not go looking for what you think you know! I often say, "let TAKMI do the talking." This is not easy for many people. The difference in approaches to data analysis is actually the essential difference between good science - which we are in short supply of these days - and bad science. Here's the key:
Be descriptive, not prescriptive.
Without your intervention, TAKMI will find all the patterns there are to find in your data corpus. TAKMI is not trying to find anything - it is simply setting up distribution norms for the entire data set (the particular data set), then distribution norms for subsets of the same data, and finally ratios between those subsets, which may indicate deviation, or not. For example, if your palette consists of all the colors of the spectrum, does green (for example) stand out more than any other color? Now, in a collection of red hues, will green stand out more than, say, magenta? How much green will you need to get it to stand out? Will it require a large patch, or will one tiny green dot stand out like a beacon in a field of reds?
Too abstract? If a large number of cars run off the road and crash into trees, it is pretty intuitive that brakes are at fault. I will wager that this is also the case statistically - that if we analyze all the data in the world related to cars running off the road and crashing into trees, we would find faulty brakes to be the number one cause.
So, duh. Would you pay me $2 million for software that tells you that? Do you think auto manufacturers don't already focus on braking technology in striving to avoid such accidents? Remember: what humans cannot do...
...because you just might never think, nor would there be large, sweeping data trends, to hint that the gas pedal (that's the other pedal, you know, next to the brake) might cause cars to run off the road and crash into trees.
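The idea of comparing a subset's distribution against the norm for the whole data set can be sketched very simply in Python. To be clear, this is my own toy illustration of the general principle, not TAKMI's actual algorithm; the function names are hypothetical, and each document is reduced to a bare set of terms.

```python
from typing import List, Set


def share(documents: List[Set[str]], term: str) -> float:
    """Fraction of documents that mention `term`."""
    if not documents:
        return 0.0
    return sum(term in doc for doc in documents) / len(documents)


def deviation_ratio(corpus: List[Set[str]], subset: List[Set[str]], term: str) -> float:
    """How much more (or less) common `term` is in `subset` than in the whole `corpus`."""
    baseline = share(corpus, term)
    if baseline == 0.0:
        return 0.0   # the term never occurs, so there is nothing to compare
    return share(subset, term) / baseline
```

A ratio near 1 means the subset looks like everything else. A ratio well above 1 is the green dot in the field of reds: a term like "pedal" may be rare overall, which is exactly why nobody goes looking for it, and yet be concentrated in one particular slice of the accident reports.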