Language-independent entity linking makes sense of any text on the web
Let’s take a hypothetical sports headline: Philadelphia Beats New York in 34-26 Nail Biter! For applications like search or question answering, we’d like today’s computers to analyze this piece of text in context and make sense of its “entities” (people, locations, organizations and so on). In the text above, Philadelphia and New York are the entities, but they are highly ambiguous. Does the string “Philadelphia” refer to the city itself, or to the Phillies, the Flyers, or maybe the Sixers? If we humans read the story, it will be obvious to us that it’s actually the Philadelphia Eagles. But how would a computer know that? I’m working on Natural Language Processing algorithms that are not language dependent and that extract that next layer of context, based on what our computers already know about us and about what we’re reading.
My team at IBM’s Thomas J. Watson Research Center is developing an information extraction method called Entity Linking. We’re teaching computer systems about these “entities” by disambiguating them against large target catalogs of entities, the biggest freely available one being Wikipedia. This research area has a tremendous application in the next generation of search, called “semantic search.” For example, if you search for “Peter Moore” and your browser knows (based on your history) that you mostly search chemistry journals, the first result will be the Wikipedia page of “Peter B. Moore” (the chemist), as opposed to the more obvious choice of the businessman, which most current commercial search engines give us as the first option to click.
But a computer that understands what we’re interested in, and wants to give us more information about it, is just the start. Every year, the National Institute of Standards and Technology (NIST) invites researchers from all over the world to test their NLP systems against a “gold standard” corpus of documents. Those with the highest accuracy are invited to speak about their results at NIST’s annual conference.
I’ve been fortunate to represent IBM at the conference for the last two years. Our unique approach? Language independence.
Using Wikipedia to learn context in any language
NIST’s evaluation challenges even the best systems. Its document collection ranges from news articles, such as those from The New York Times, to political and entertainment discussion forums and blogs. NIST wants to know whether a computer can really resolve complex, ambiguous queries in natural language, including in documents written in Spanish and Chinese.
The task is to extract entities and resolve them to a database called Freebase (a superset of Wikipedia). There are two major challenges here. One is that Freebase is in English, so participants have to resolve the entities in all the languages back to this English catalog! The other is that entities which do not have corresponding Wikipedia pages need to be clustered. NIST does this to determine how accurate the systems are when presented with documents whose entities have no human annotation on websites like Wikipedia.
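To make the clustering requirement concrete, here is a minimal sketch of one naive way to group mentions that could not be linked to any catalog entry. It is only an illustration, not our system: the function name, the `NIL0001`-style cluster ids and the sample mentions are all assumptions made up for this example, and grouping by normalized spelling is a deliberately simple stand-in for real clustering.

```python
from collections import defaultdict

def cluster_nil_mentions(mentions):
    """Group unlinkable mentions whose normalized spelling is identical.

    `mentions` is a list of (document_id, surface_string) pairs.
    Matching on spelling alone is a simplification: it cannot tell two
    different people with the same name apart.
    """
    groups = defaultdict(list)
    for doc_id, surface in mentions:
        # Normalize case and whitespace so trivial variants share a key.
        key = " ".join(surface.casefold().split())
        groups[key].append((doc_id, surface))
    # Hypothetical id scheme: one "NIL" id per group, in sorted key order.
    return {
        f"NIL{i:04d}": members
        for i, (_, members) in enumerate(sorted(groups.items()), start=1)
    }

# Invented mentions of people who have no Wikipedia page.
sample = [("doc1", "Jane Q. Doe"), ("doc2", "jane q.  doe"), ("doc3", "John Roe")]
clusters = cluster_nil_mentions(sample)
```

With this sample, the two spellings of the first name end up in one cluster and the third mention gets a cluster of its own.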
Our system is trained using machine learning on the English version of Wikipedia, which is the largest of all the language editions. Wikipedia is a social knowledge base in which a community of users has already created links for various entities, so we treat these links as our positive examples. Wikipedia is an excellent training ground for NLP systems. It’s a huge body of text. It’s frequently updated and changed. And it exists in more than 200 different languages. Our system is still being tested (internally) on the Watson Developer Cloud and can highlight entities and hyperlink them to their Wikipedia entries. It looks for identifiers in the text and in Wikipedia in order to connect each entity with the correct page. For example, in a paper about big data analysis that mentions Michael I. Jordan, the system will resolve the string to the Wikipedia page of Berkeley’s Pehong Chen Distinguished Professor (not the actor or the retired athlete).
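The idea of matching identifiers in the text against a candidate’s Wikipedia page can be sketched in a few lines. This is a toy illustration under stated assumptions, not our trained model: it swaps machine-learned features for a single word-overlap (cosine) score, and the candidate “page” texts are short invented stand-ins rather than real Wikipedia content.

```python
import re
from collections import Counter
from math import sqrt

def bag_of_words(text):
    """Lower-cased word counts; \\w+ also matches non-English letters."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def link(document, candidate_pages):
    """Return the candidate title whose page text best matches the document."""
    doc_vec = bag_of_words(document)
    return max(
        candidate_pages,
        key=lambda title: cosine(doc_vec, bag_of_words(candidate_pages[title])),
    )

# Invented one-line summaries standing in for full Wikipedia pages.
CANDIDATE_PAGES = {
    "Michael I. Jordan": "professor at Berkeley researching machine learning statistics and data analysis",
    "Michael Jordan": "retired basketball player who won championships with the Chicago Bulls",
    "Michael B. Jordan": "actor known for film and television roles",
}

document = "A paper on big data analysis and machine learning cites Michael Jordan."
best = link(document, CANDIDATE_PAGES)
```

Here `best` is the professor’s page, because the document shares words such as “data,” “analysis” and “learning” with that candidate only. Note that the score only compares the document with each candidate page, and never consults anything specific to English.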
Another question that I’m trying to answer is: will the same system, trained on English, work for other languages? Since the machine learning techniques involved look only at features that compare the document at hand with the corresponding entity’s Wikipedia page, the system can be easily ported to other languages like Spanish or Chinese. So, if a Spanish document mentions “Carlos Irwin Estévez,” we link it to the Spanish Wikipedia page of Charlie Sheen. Since NIST wants the entities to be linked to the English version, we use Wikipedia’s inter-language links, which map a Wikipedia page to the corresponding page in another language (if it exists). This “language-independence” approach has proven to be not only accurate (based on our performance in past evaluations) but also efficient. The system doesn’t waste time training on each language separately. Once trained, it doesn’t have to be re-trained at all! By contrast, other systems trained specifically on other languages have not ported to English datasets in NIST’s past evaluations.
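The hop from a foreign-language page back to the English catalog can be pictured as a simple table lookup. The sketch below is an assumption-laden illustration: the `INTERLANGUAGE_LINKS` table and the `to_english_title` function are hypothetical names invented here, and the table holds a single hand-written entry rather than real inter-language link data.

```python
from typing import Optional

# Hypothetical toy table: (language code, local page title) -> English title.
# Real inter-language links would come from Wikipedia itself.
INTERLANGUAGE_LINKS = {
    ("es", "Charlie Sheen"): "Charlie Sheen",
}

def to_english_title(lang: str, title: str) -> Optional[str]:
    """Map a page in `lang` to its English counterpart, if one exists."""
    if lang == "en":
        return title  # already in the target catalog
    # None signals that no English page exists for this entry.
    return INTERLANGUAGE_LINKS.get((lang, title))

# "Carlos Irwin Estévez" was first linked to the Spanish page "Charlie Sheen".
english = to_english_title("es", "Charlie Sheen")
missing = to_english_title("es", "Una página solo en español")
```

When the lookup returns `None`, the entity has no English page to resolve to.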
All for better man-machine understanding
I recently filed a patent to put this language-independent NLP technology in a browser plugin or a cellphone app. It could help you sort out the news streaming into your social media feeds by looking only at the entities your friends and followers are talking about. Within the story you decide to read, the plugin will automatically provide the deeper context that made the story relevant. The highlighted terms could link to their Wikipedia pages, for example. It could also display the relevant tweets about those entities by using “semantic search” (based on the context) rather than simple “keyword” search.
Going beyond social, this NLP tech could also help clarify ambiguous content in a scientific journal. Medical journals are full of technical and Latin terms that can be misunderstood, even by doctors. Our system could be trained on, for example, all biomedical Wikipedia pages about proteins. It could then highlight all proteins, or only certain ones, depending on what a medical professional needs. That way, he or she won’t miss a key term or its meaning, whether in a single document or across volumes of journals.
I recently presented these findings at the Association for Computational Linguistics (ACL) international conference in Berlin, Germany. You can read more about it in the paper I co-wrote with my IBM Research colleague Radu Florian: One for All: Towards Language Independent Named Entity Linking.