How dictionaries act as strong foundations for training AI systems

Just as people consult dictionaries for the meaning and context of words, artificial intelligence (AI) systems rely on good-quality entity dictionaries or, more importantly, on the ability to build up-to-date ones for any given concept. In many Information Extraction (IE) tasks, identifying the key entities of the domain is a powerful building block for any sophisticated extraction, and it is widely recognized as one of the keys to success. Being able to build such dictionaries quickly allows us to teach AI systems faster and more efficiently than previously possible.

Since joining IBM Research in March 2017, I have focused on the core operations in IE, for example how to quickly build high-quality dictionaries for various concepts in any given language. I had the opportunity to connect the powerful set of statistical methods in place at the Intelligence Augmentation lab with my background in the Semantic Web. We started working on methods that leverage existing ontologies to obtain initial examples of a particular concept and expand them according to the task at hand, employing a human-in-the-loop approach.

A dictionary is an ever-evolving artifact

There are several examples in daily life where constantly updating concepts is critical. A familiar case is that of online shops, which must integrate new product descriptions provided by vendors on a daily basis. The features and vocabulary used to describe the products continuously evolve, with different vendors providing the product descriptions in varied writing styles and standards. Despite these differences, to fully integrate new products (e.g., to be able to provide meaningful comparison shopping grids), merchants must correctly identify all these instances and assign equivalences among them.

The evolution of dictionaries is not confined to products (or other naturally growing sets). Even concepts that we would assume to be simple and stable, for example color names, are constantly evolving. The way color names evolve in different languages can be quite dissimilar, given the cultural differences in how we express them in different countries. For instance, a new color name, mizu, has recently been proposed for addition to the list of Japanese basic color terms. On a more practical level, capturing the right instances for a concept can also be highly task-dependent: as our users discovered during the experiment, “space gray,” “matte black” and “jet black” are all relevant colors for mobile phones, while “white chocolate” or “amber rose” are colors of wall paint products.
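To make the task-dependence concrete, here is a minimal sketch of how the same text yields different matches depending on which dictionary is applied. The function name `tag_terms`, the two dictionaries and the sample listing are hypothetical illustrations built from the color names above. They are not part of the system described in this post, and the matcher is a plain longest-match lookup on whitespace-separated tokens that ignores punctuation.

```python
def tag_terms(text, dictionary, max_len=3):
    """Return dictionary terms found in text, preferring the longest match.

    Tokens are lowercased and split on whitespace only (an assumption made
    to keep the sketch short; real text would need proper tokenization).
    """
    tokens = text.lower().split()
    found, i = [], 0
    while i < len(tokens):
        # Try the longest phrase starting at position i first.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in dictionary:
                found.append(phrase)
                i += n
                break
        else:
            # No phrase starting here is in the dictionary; move on.
            i += 1
    return found


# Hypothetical task-specific dictionaries, using the examples from the text.
PHONE_COLORS = {"space gray", "matte black", "jet black"}
PAINT_COLORS = {"white chocolate", "amber rose"}

listing = "New phone in jet black with a white chocolate scented case"
phone_hits = tag_terms(listing, PHONE_COLORS)
paint_hits = tag_terms(listing, PAINT_COLORS)
```

With the phone dictionary only “jet black” is recognized, while the paint dictionary picks up “white chocolate” from the same sentence, which is why a dictionary has to be built for the task at hand and not only for the concept in the abstract.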


This image represents an exemplar dictionary in Spanish. It’s arranged in the shape of a “thumbs up” to symbolize the human-in-the-loop that approves/rejects the system’s moves.

Our goal is to design an AI training technique for concept expansion and maintenance which (i) is completely language independent, (ii) combines statistical methods with a human-in-the-loop and (iii) exploits Linked Data as a bootstrapping source. We carried out experiments on a publicly available medical corpus and on a Twitter dataset and demonstrated that we can achieve comparable performance regardless of language, domain and style of text.
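The general shape of such a loop can be sketched as follows. This is an illustrative sketch under stated assumptions, not the method from our paper: the statistical step here is a simple stand-in that proposes words sharing a left/right context with known terms, the `oracle` callback stands in for the human reviewer, and the seed set stands in for examples that would be bootstrapped from Linked Data. All function and variable names are hypothetical.

```python
from collections import Counter


def context_of(tokens, i, window=1):
    """Return the (left, right) neighbours of the token at position i."""
    left = tuple(tokens[max(0, i - window):i])
    right = tuple(tokens[i + 1:i + 1 + window])
    return left, right


def propose_candidates(tokens, dictionary, rejected, window=1, top_k=5):
    """Stand-in statistical step: rank unseen words by shared contexts."""
    # Count the contexts in which already accepted terms occur.
    known_contexts = Counter(
        context_of(tokens, i, window)
        for i, tok in enumerate(tokens) if tok in dictionary
    )
    # Score every other word by how often it appears in those contexts.
    scores = Counter()
    for i, tok in enumerate(tokens):
        if tok in dictionary or tok in rejected:
            continue
        ctx = context_of(tokens, i, window)
        if ctx in known_contexts:
            scores[tok] += known_contexts[ctx]
    return [term for term, _ in scores.most_common(top_k)]


def expand(tokens, seeds, oracle, rounds=3):
    """Grow a dictionary from seeds, asking the oracle about each candidate."""
    dictionary, rejected = set(seeds), set()
    for _ in range(rounds):
        candidates = propose_candidates(tokens, dictionary, rejected)
        if not candidates:
            break  # nothing new to review
        for term in candidates:
            # The human-in-the-loop approves or rejects each proposal.
            (dictionary if oracle(term) else rejected).add(term)
    return dictionary


# Toy corpus and a simulated reviewer (both assumptions for illustration).
corpus = (
    "the phone comes in black finish . the phone comes in silver finish . "
    "the phone comes in gold finish . the phone comes in plastic finish . "
    "the phone ships in two days ."
).split()
approved_by_reviewer = {"silver", "gold"}
result = expand(corpus, seeds={"black"}, oracle=lambda t: t in approved_by_reviewer)
```

Nothing in the sketch depends on the language of the corpus: it only compares token contexts, and rejected terms are remembered so the reviewer is not asked about them again.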

The full details of our experiments are available in “Multi-lingual Concept Extraction with Linked Data and Human-in-the-loop” (Alfredo Alba, Anni Coden, Anna Lisa Gentile, Daniel Gruhl, Petar Ristoski and Steve Welch), which will be presented at K-CAP 2017 on the 6th of December 2017. The preliminary ideas of this work were also discussed at ISWC 2017 (October 2017) in “Language Agnostic Dictionary Extraction” (Alfredo Alba, Anni Coden, Anna Lisa Gentile, Daniel Gruhl, Petar Ristoski and Steve Welch).

There is still a lot to be done. We are currently working on reducing the number of user operations needed by making more in-depth use of the available Linked Data and of noise reduction techniques.

