This next partner blog entry is from Dr. David Bean of Attensity, a significant industry thought leader in the world of text analytics. Attensity is an IBM partner whose products interface with our Information Integration Suite, and we share a number of customers. David provides some great insight into the world of text analytics.
Text analytics can be a confusing topic because there are so many different things you can do with human language. The field of Natural Language Processing (NLP) covers the notion of doing something computational with language -- everything from speech recognition to language generation (i.e. the computer talks back to you!), with web search, document classification, and text summarization in between. If we define analytics as the process of quantitatively measuring the occurrence of facts, then text analytics becomes, in a simple sense, the process of counting the facts that are described in textual form.
One common approach to working with text computationally is to search it or categorize it. Searching and categorizing, though, don't lend themselves to text analytics. Why? Because while they may recognize what concepts are mentioned in a document, they don't extract facts. At Attensity, we think of facts as actions or relationships that have many attributes, or dimensions. For example, a traditional point-of-sale fact might include the scan code of the item, the price of the item, the quantity of the purchase, the store code where the purchase was made, and the date of the purchase. That's a sale-fact with five dimensions. That kind of fact has historically been extracted from a set of operational system tables and ETL-ed into a data warehouse, but it can also be represented in free form text, e.g. "Yesterday, Jack bought a dozen large eggs for $2.20 at the Smith's in Bountiful, UT." How do you get from that sentence to a fact table using a search engine? You can't.
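As a toy illustration (this is not a sketch of Attensity's actual technology), even a single hand-written pattern can map that example sentence onto a row with the same five-or-so dimensions as the point-of-sale fact. The pattern, field names, and helper function below are all assumptions made up for this example; a real extraction system generalizes far beyond one sentence template:

```python
import re

# A hand-written template for one sentence shape. The field names
# (date, buyer, quantity, item, price, store) are illustrative, chosen
# to mirror the dimensions of the point-of-sale fact described above.
SALE_PATTERN = re.compile(
    r"(?P<date>\w+), (?P<buyer>\w+) bought (?P<quantity>a dozen|\d+) "
    r"(?P<item>[\w ]+?) for \$(?P<price>[\d.]+) at the (?P<store>[\w']+)"
)

def extract_sale_fact(sentence):
    """Return a fact-table-style row (dict), or None if the pattern misses."""
    match = SALE_PATTERN.search(sentence)
    return match.groupdict() if match else None

fact = extract_sale_fact(
    "Yesterday, Jack bought a dozen large eggs for $2.20 at the Smith's "
    "in Bountiful, UT."
)
```

The point of the toy is its fragility: a search engine could find this sentence, but only an extraction step, however implemented, turns it into a row you can load into a fact table and count.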
That's why we believe that text analytics must rely on a different type of technology: technology that can understand language at a linguistic level. By recognizing language as a linguistic phenomenon (not a statistical one), such a technology will understand that "Jack bought eggs" represents a buying action in which eggs were the thing that was purchased and Jack was the thing that did the purchasing. Without an understanding of linguistic roles and relationships, that sentence becomes a bag of words, and while the term "bought" may be a strong indicator of a purchasing action, it's not clear what was purchased. Was it eggs, or was it a jack (i.e. a hydraulic jack which, believe it or not, I can find at my local Smith's grocery-and-everything-else store)?
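The contrast can be sketched in a few lines. The "role" rule below (word before the verb is the buyer, word after it is the thing bought) is a deliberately naive stand-in for real syntactic and semantic parsing, used here only to show what a bag of words throws away:

```python
sentence = "Jack bought eggs"
tokens = sentence.lower().split()

# Bag-of-words view: "bought" signals a purchase, but buyer and item
# collapse into an unordered set of terms -- "Jack bought eggs" and
# "eggs bought Jack" look identical.
bag = set(tokens)

# Naive role view (illustration only): treat the word before the verb as
# the agent (the buyer) and the word after it as the theme (the thing
# purchased). Real linguistic systems derive these roles from parsing,
# not from raw word position.
verb_index = tokens.index("bought")
roles = {"agent": tokens[verb_index - 1], "theme": tokens[verb_index + 1]}
```

Only the role view can answer "what was purchased?" -- which is exactly the question the hydraulic-jack ambiguity above turns on.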
Granted, this example is a simplistic one, but it illustrates why we believe that technologies like keyword indexing, classification, and summarization are appropriate for content management applications, but not for text analytics. The good news is that linguistic-based extraction technologies have benefited from more than a decade of solid academic and industrial laboratory work, so they are among the better understood NLP tasks. Of course I'm going to say that Attensity leads the commercialization of those technologies, but even without that self-serving comment, we believe that text analytics as a market cannot mature without systems that understand language as we (humans) do.