How dictionaries act as strong foundations for training AI systems


Just as people consult dictionaries to look up the meaning and context of words, artificial intelligence (AI) systems rely on good-quality entity dictionaries or, more importantly, on the ability to build up-to-date ones for any given concept. In many Information Extraction (IE) tasks, identifying the key entities of the domain is a building block for any sophisticated extraction, and it is widely recognized as one of the keys to success. Building these dictionaries quickly is an essential process that allows us to teach AI systems faster and more efficiently than previously possible.

Since joining IBM Research in March 2017, I have focused on the core operations in IE, for example how to quickly build high-quality dictionaries for various concepts in any given language. I had the opportunity to connect the powerful set of statistical methods in place at the Intelligence Augmentation lab with my background in the Semantic Web. We started working on methods that leverage existing ontologies to obtain initial examples of a concept and then expand them according to the task at hand, employing a human-in-the-loop approach.

A dictionary is an ever-evolving artifact

There are several examples in daily life where the need to constantly update concepts is critical. A familiar case is that of online shops that must integrate new product descriptions provided by vendors on a daily basis. The features and vocabulary used to describe the products continuously evolve, and different vendors write their product descriptions in varied styles and to varied standards. Despite these differences, to fully integrate new products (e.g., to provide meaningful comparison shopping grids), merchants must correctly identify all these instances and map equivalent ones to each other.

The evolution of dictionaries is not confined to products (or other naturally growing sets). Even concepts that we would assume to be simple and stable, such as color names, are constantly evolving. The way color names evolve can differ considerably from one language to another, given the cultural differences in how colors are expressed in different countries. For instance, a new color name, mizu, has recently been proposed for addition to the list of Japanese basic color terms. On a more practical level, capturing the right instances for a concept can also be highly task-dependent: as our users discovered during the experiment, “space gray,” “matte black” and “jet black” are all relevant colors for mobile phones, while “white chocolate” or “amber rose” are colors of wall paint products.


This image represents an exemplar dictionary in Spanish. It’s arranged in the shape of a “thumbs up” to symbolize the human-in-the-loop that approves/rejects the system’s moves.

Our goal is to design an AI training technique for concept expansion and maintenance which (i) is completely language independent, (ii) combines statistical methods with a human-in-the-loop and (iii) exploits Linked Data as a bootstrapping source. We carried out experiments on a publicly available medical corpus and on a Twitter dataset, and demonstrated that we can achieve comparable performance regardless of language, domain and style of text.
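To make the three ingredients concrete, here is a minimal sketch of a human-in-the-loop expansion loop. It is not the method from our paper: the scoring is a deliberately simple stand-in (shared left/right neighbour words), the seed set is hard-coded where a real system would pull it from Linked Data, and all names (`collect_contexts`, `expand_dictionary`, `ask_human`) are hypothetical. It works on raw tokens only, so nothing in it depends on a particular language.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Set, Tuple

Context = Tuple[str, str]  # (previous token, next token)


def collect_contexts(sentences: List[List[str]]) -> Dict[str, Set[Context]]:
    """Record, for every token, the (left, right) neighbours it appears between."""
    contexts: Dict[str, Set[Context]] = defaultdict(set)
    for sentence in sentences:
        padded = ["<s>"] + sentence + ["</s>"]
        for i in range(1, len(padded) - 1):
            contexts[padded[i]].add((padded[i - 1], padded[i + 1]))
    return contexts


def expand_dictionary(
    sentences: List[List[str]],
    seeds: Set[str],
    ask_human: Callable[[str], bool],
    rounds: int = 3,
    per_round: int = 2,
) -> Set[str]:
    """Grow a dictionary from seeds; a human approves or rejects each candidate."""
    contexts = collect_contexts(sentences)
    accepted: Set[str] = set(seeds)
    rejected: Set[str] = set()
    for _ in range(rounds):
        # Contexts in which the terms accepted so far have been seen.
        known: Set[Context] = set()
        for term in accepted:
            known |= contexts.get(term, set())
        # Score unseen terms by how many of those contexts they share.
        scores = {
            term: len(ctxs & known)
            for term, ctxs in contexts.items()
            if term not in accepted and term not in rejected and ctxs & known
        }
        if not scores:
            break
        # Show only the top few candidates to the human in each round.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))[:per_round]
        for term in ranked:
            (accepted if ask_human(term) else rejected).add(term)
    return accepted


if __name__ == "__main__":
    corpus = [
        "the phone comes in black finish".split(),
        "the phone comes in silver finish".split(),
        "the case comes in gold finish".split(),
        "the case comes in matte finish".split(),
        "it ships in two days".split(),
    ]
    # Simulated human: approves colors, rejects everything else.
    approve = lambda term: term in {"silver", "gold"}
    print(sorted(expand_dictionary(corpus, {"black"}, approve)))
```

In this toy run the seed "black" pulls in "gold" and "silver", while "matte" is proposed and rejected by the simulated human, and "two" is never proposed because it shares no context with the accepted terms. Rejected terms are remembered so the user is not asked about them twice.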

The full details of our experiments are available in: “Multi-lingual Concept Extraction with Linked Data and Human-in-the-loop” (Alfredo Alba, Anni Coden, Anna Lisa Gentile, Daniel Gruhl, Petar Ristoski and Steve Welch) which will be presented at K-CAP 2017 (on the 6th of December 2017). The preliminary ideas of this work were also discussed at ISWC 2017 (October 2017): “Language Agnostic Dictionary Extraction.” (Alfredo Alba, Anni Coden, Anna Lisa Gentile, Daniel Gruhl, Petar Ristoski and Steve Welch).

There is still a lot to be done. We are currently working on reducing the number of user operations needed, by making more in-depth use of available Linked Data and of noise reduction techniques.

