Just as people consult dictionaries to define the meaning and context of words, artificial intelligence (AI) systems rely on good-quality entity dictionaries and, more importantly, on the ability to build up-to-date ones for any given concept. In many Information Extraction (IE) tasks, identifying the key entities of the domain is widely recognized as a powerful building block for any sophisticated extraction. It is an essential process that allows us to teach AI systems faster and more efficiently than previously possible.
Since joining IBM Research in March 2017, I have been focusing on the core operations in IE, such as how to quickly build high-quality dictionaries for various concepts in any given language. I had the opportunity to connect the powerful set of statistical methods in place at the Intelligence Augmentation lab with my background in the Semantic Web. We started working on methods that leverage existing ontologies to obtain initial examples of a concept and expand them according to the particular task at hand, following a human-in-the-loop approach.
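Concretely, the bootstrapping step can be sketched as a query against a public knowledge graph such as DBpedia. This is a minimal, hypothetical illustration: the class URI, the query shape and the `build_seed_query` helper are my own assumptions for this post, not the exact setup used in our experiments.

```python
# Hypothetical sketch: bootstrapping seed instances of a concept
# from a Linked Data source (e.g., DBpedia) via SPARQL.

def build_seed_query(class_uri: str, lang: str = "en", limit: int = 100) -> str:
    """Build a SPARQL query fetching labels of instances of class_uri,
    restricted to one language (which keeps the approach language-independent:
    only the language tag changes)."""
    return f"""\
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?label WHERE {{
  ?instance a <{class_uri}> ;
            rdfs:label ?label .
  FILTER (lang(?label) = "{lang}")
}}
LIMIT {limit}"""

# Illustrative class URI; seeds for Japanese color terms would come
# from the same query with a different language tag.
query = build_seed_query("http://dbpedia.org/ontology/Colour", lang="ja")
print(query)

# To actually run it, one could POST this query to a public SPARQL
# endpoint such as https://dbpedia.org/sparql, requesting
# application/sparql-results+json (omitted to keep the sketch offline).
```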
A dictionary is an ever-evolving artifact
There are several examples in daily life where the need to constantly update concepts is critical. A familiar case is that of online shops, which must integrate new product descriptions provided by vendors on a daily basis. The features and vocabulary used to describe the products continuously evolve, and different vendors provide product descriptions in varied writing styles and standards. Despite these differences, to fully integrate new products (e.g., to provide meaningful comparison-shopping grids), merchants must correctly identify all these instances and establish equivalences among them.
The evolution of dictionaries is not confined to products (or other naturally growing sets). Even concepts we would assume to be simple and stable, such as color names, are constantly evolving. Color names can also evolve quite differently across languages, given the cultural differences in how we express them in different countries. For instance, a new color name, mizu, has recently been proposed for addition to the list of Japanese basic color terms. On a more practical level, capturing the right instances of a concept can also be highly task-dependent: during our experiments, users discovered that “space gray,” “matte black” and “jet black” are all relevant colors for mobile phones, while “white chocolate” or “amber rose” are colors of wall-paint products.
Our goal is to design an AI training technique for concept expansion and maintenance that (i) is completely language-independent, (ii) combines statistical methods with a human-in-the-loop approach and (iii) exploits Linked Data as a bootstrapping source. We carried out experiments on a publicly available medical corpus and on a Twitter dataset and demonstrated that we can achieve comparable performance regardless of the language, domain and style of the text.
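The human-in-the-loop expansion itself can be sketched as a simple loop: propose candidate terms that behave like the current dictionary entries, ask the annotator to accept or reject them, and repeat. The toy corpus, the shared-context candidate generator and the simulated "oracle" annotator below are my own simplifications for illustration; the actual system uses much stronger statistical methods.

```python
# Minimal sketch of a human-in-the-loop dictionary expansion loop.
# An automated "oracle" stands in for the human annotator here.
from collections import Counter

corpus = [
    "the jet black phone sold out",
    "the space gray phone sold out",
    "the matte black phone sold out",
    "the red car drove away",
]

def contexts(term, sentences):
    """Collect (previous word, next word) pairs around each mention of term."""
    ctx = set()
    tlen = len(term.split())
    for s in sentences:
        words = s.split()
        for i in range(len(words) - tlen + 1):
            if " ".join(words[i:i + tlen]) == term:
                prev = words[i - 1] if i > 0 else "<s>"
                nxt = words[i + tlen] if i + tlen < len(words) else "</s>"
                ctx.add((prev, nxt))
    return ctx

def candidates(seeds, sentences):
    """Propose 1- and 2-grams that share a context with any seed term."""
    seed_ctx = set().union(*(contexts(s, sentences) for s in seeds))
    found = Counter()
    for s in sentences:
        words = s.split()
        for n in (1, 2):
            for i in range(len(words) - n + 1):
                term = " ".join(words[i:i + n])
                if term not in seeds and contexts(term, sentences) & seed_ctx:
                    found[term] += 1
    return [t for t, _ in found.most_common()]

def expand(seeds, sentences, accept, rounds=3):
    """Iteratively propose candidates; keep those the annotator accepts."""
    dictionary = set(seeds)
    for _ in range(rounds):
        new = [c for c in candidates(dictionary, sentences) if accept(c)]
        if not new:
            break
        dictionary.update(new)
    return dictionary

# Simulated human judgment for the toy corpus: these are phone colors.
oracle = lambda term: term in {"jet black", "space gray", "matte black"}

result = expand({"jet black"}, corpus, oracle)
print(sorted(result))
```

Note that "red" is never proposed, because in this corpus it only appears in car contexts: the same mechanism that expands the dictionary also keeps it task-specific.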
The full details of our experiments are available in “Multi-lingual Concept Extraction with Linked Data and Human-in-the-loop” (Alfredo Alba, Anni Coden, Anna Lisa Gentile, Daniel Gruhl, Petar Ristoski and Steve Welch), which will be presented at K-CAP 2017 on December 6, 2017. The preliminary ideas of this work were also discussed at ISWC 2017 (October 2017) in “Language Agnostic Dictionary Extraction” (Alfredo Alba, Anni Coden, Anna Lisa Gentile, Daniel Gruhl, Petar Ristoski and Steve Welch).
There is still a lot to be done. We are currently working on reducing the number of user operations needed, by making deeper use of available Linked Data and of noise-reduction techniques.