My team at IBM Research-Almaden has created a novel “human-in-the-loop” approach for AI dictionary expansion. Dictionaries and ontologies are foundational elements of systems that extract knowledge from unstructured text, including natural language processing, information extraction, and many other AI applications. Keeping dictionaries up to date as new content arrives is a crucial operation. We propose a human-in-the-loop (HumL) dictionary expansion approach that couples a lightweight neural language model with tight HumL supervision to assist the user in building and maintaining a domain-specific dictionary from an input text corpus. We describe this approach in our paper “Interactive Dictionary Expansion using Neural Language Models” [1], which will be presented at the second international workshop on Augmenting Intelligence with Humans-in-the-Loop, co-located with ISWC 2018 in Monterey, California, in October 2018.
We previously described how tight collaboration between humans and AI systems results in significantly improved performance [2,3,4]. In many domains, near-perfect performance is required. For example, many biomedical applications have near-zero tolerance for error, even though the datasets are noisy and incomplete. Furthermore, some problems in the medical domain are challenging enough that fully automated models are difficult to apply, or at least raise questions about the quality of the results. Consequently, efficiently including a domain expert as an integral part of the system not only greatly enhances the knowledge discovery pipeline, but can in certain circumstances be legally or ethically required.
The explore/exploit paradigm
Our approach is based on the explore/exploit paradigm: it discovers new instances in the text corpus (explore) and uses the accepted dictionary entries to predict new “unseen” terms that are not currently in the corpus (exploit). We test our approach on a real-world scenario in the healthcare domain, in which we construct a dictionary of adverse drug reactions using user blogs as the input text corpus. The evaluation shows that with our approach a user can easily extend the input dictionary, and that tight HumL integration results in a 216 percent improvement in effectiveness.
In our paper we propose a feature-agnostic approach for dictionary expansion based on lightweight neural language models, such as word2vec [5]. To prevent semantic drift during the dictionary expansion, we keep a human in the loop throughout. Given an input text corpus and a set of seed examples, the proposed approach runs in two phases, explore and exploit, to identify new potential dictionary entries.
During the explore phase, our model tries to identify instances in the input text corpus that are similar to the dictionary entries, using term vectors from the neural language model to calculate a similarity score. During the exploit phase, the model tries to construct more complex multi-term phrases based on the instances already in the input dictionary. Multi-term phrases are a challenge for word2vec-style systems because they need to be “known” before the model is created. To identify multi-term phrases, a simple phrase detection model is most commonly used. It is based on the terms’ co-occurrence score, i.e., terms that often appear together are probably part of the same phrase.
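To make the explore phase concrete, here is a minimal sketch of similarity-based candidate ranking. It is an illustration under stated assumptions, not the implementation from our paper: the three-dimensional vectors are made up and stand in for a trained model, the function names `cosine` and `explore` are hypothetical, and scoring a candidate by its best match against any dictionary entry is just one plausible choice of similarity score.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def explore(vectors, dictionary, top_k=3):
    """Rank corpus terms that are not yet in the dictionary by their
    highest similarity to any current dictionary entry (an assumption)."""
    scored = []
    for term, vec in vectors.items():
        if term in dictionary:
            continue
        score = max(
            (cosine(vec, vectors[e]) for e in dictionary if e in vectors),
            default=0.0,
        )
        scored.append((term, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Made-up toy vectors standing in for term vectors from a trained model.
toy_vectors = {
    "nausea": [0.9, 0.1, 0.0],
    "headache": [0.8, 0.2, 0.1],
    "dizziness": [0.85, 0.15, 0.05],
    "vomiting": [0.88, 0.12, 0.02],
    "tuesday": [0.0, 0.1, 0.9],
}
seed_dictionary = {"nausea", "headache"}

# The two best-scoring candidates are what a user would be asked to review.
candidates = explore(toy_vectors, seed_dictionary, top_k=2)
```

With these toy vectors the two terms closest to the seeds are proposed, while the unrelated term scores far lower and is left out.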
The phrase detection must be done before the model is built and remains unchanged after the model is built. However, depending on the domain and the task, the instances of interest evolve, or the example corpus may not be complete. For example, valid phrase combinations may simply not occur (e.g., “acute joint pain” may appear in the sample corpus, but “chronic hip pain” may not). However, these phrases are likely to occur in future texts from the same source, and thus are important to include in any entity extraction lexicon.
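The co-occurrence-based phrase detection mentioned above can be sketched as follows. We do not specify here which score a given system uses; this sketch assumes the bigram score from the word2vec paper [5], (count(ab) − δ) / (count(a) × count(b)), and the function name `detect_phrases`, the threshold value, and the toy corpus are all illustrative choices.

```python
from collections import Counter

def detect_phrases(sentences, delta=1, threshold=0.1):
    """Return bigrams whose co-occurrence score exceeds the threshold.

    score(a, b) = (count(a b) - delta) / (count(a) * count(b))
    delta discounts bigrams that are seen only rarely.
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    phrases = {}
    for (a, b), n_ab in bigrams.items():
        score = (n_ab - delta) / (unigrams[a] * unigrams[b])
        if score > threshold:
            phrases[(a, b)] = score
    return phrases

# Made-up toy corpus, already tokenized.
toy_corpus = [
    "i had joint pain after the dose".split(),
    "the joint pain got worse".split(),
    "joint pain and the rash".split(),
    "a rash and some pain".split(),
]
phrases = detect_phrases(toy_corpus)
```

In this toy corpus only “joint pain” occurs often enough to be kept. A combination that never occurs in the corpus has no count at all, so no threshold setting can recover it, which is the gap the exploit phase addresses.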
In the exploit phase, the model generates new phrases by analyzing the single terms of the instances in the input dictionary. We use two phrase-generation algorithms: (i) one modifies the phrases by replacing single terms with similar terms from the text corpus, e.g., “abnormal behavior” can be modified to “strange behavior”; (ii) the other extends the instances with terms from the text corpus that are related to the terms in the instance, e.g., “abnormal blood clotting problems” may not appear in a large text corpus, but “abnormal blood count”, “blood clotting” and “clotting problems” appear several times in the corpus and can be used to build the more complex instance. This approach allows us to construct new multi-term instances that don’t appear as such in the text corpus but are supported by statistical evidence to suggest that they might be of interest for the user.
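The two phrase-generation strategies can be illustrated with a small sketch. It simplifies both considerably and is not the algorithm from our paper: `similar_terms` is a hypothetical lookup standing in for nearest neighbors from the language model, `corpus_bigrams` is a hypothetical set of two-term phrases observed in the corpus, and extending by exactly one term per call is an assumption made to keep the example short.

```python
def modify(phrase, similar_terms):
    """(i) Replace one term at a time with a similar term."""
    tokens = phrase.split()
    results = []
    for i, token in enumerate(tokens):
        for alternative in similar_terms.get(token, []):
            results.append(" ".join(tokens[:i] + [alternative] + tokens[i + 1:]))
    return results

def extend(phrase, corpus_bigrams):
    """(ii) Grow the phrase by one term on either side when an observed
    bigram overlaps its first or last term."""
    tokens = phrase.split()
    results = []
    for left, right in sorted(corpus_bigrams):
        if right == tokens[0] and left not in tokens:
            results.append(" ".join([left] + tokens))
        if left == tokens[-1] and right not in tokens:
            results.append(" ".join(tokens + [right]))
    return results

# Hypothetical inputs: neighbors from the model and bigrams seen in the corpus.
similar_terms = {"abnormal": ["strange", "unusual"]}
corpus_bigrams = {
    ("abnormal", "blood"),
    ("blood", "clotting"),
    ("clotting", "problems"),
}

modified = modify("abnormal behavior", similar_terms)
extended = extend("blood clotting", corpus_bigrams)
extended_twice = extend("abnormal blood clotting", corpus_bigrams)
```

Applying `extend` twice builds “abnormal blood clotting problems” from shorter pieces that do occur in the corpus, even though the full phrase does not. Every generated phrase is only a candidate; the user still decides whether it belongs in the dictionary.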
This characteristic is critical to help “future-proof” a lexicon. For a surveillance application (e.g., drug side effects mentioned on Twitter) it reduces how frequently a human needs to “tune up” the lexicon to make sure it is catching all relevant entity instances.
Combining the explore and exploit approaches in an unsupervised fashion (or an infrequently supervised fashion) is not particularly effective. It tends to generate many spurious results that the human expert needs to wade through. Close supervision, however, results in a much more performant system. The evaluation shows that tighter AI/human collaboration results in nearly perfect performance of the system, i.e., nearly all the candidates identified by the system are valid entries in the dictionary.
The AI/human collaboration is essential for our approach to be successful. We first run the algorithm on a small number of entities to ensure consistency, and let humans provide feedback before moving on to larger sets of entries.
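The tight supervision loop can be sketched in a few lines. This is a simplified illustration under assumptions, not our system: `propose` is a hypothetical stand-in for the explore and exploit phases, `review` is a hypothetical stand-in for the human expert's accept/reject decision, and the `neighbors` table is made-up toy data.

```python
def expand_dictionary(seeds, propose, review, max_rounds=5):
    """Alternate between proposing candidates and asking for a decision.

    Accepted candidates join the dictionary and seed the next round;
    rejected candidates are remembered and never proposed again.
    """
    dictionary = set(seeds)
    rejected = set()
    for _ in range(max_rounds):
        candidates = [
            c for c in propose(dictionary)
            if c not in dictionary and c not in rejected
        ]
        if not candidates:
            break  # nothing new to review
        for candidate in candidates:
            if review(candidate):
                dictionary.add(candidate)
            else:
                rejected.add(candidate)
    return dictionary, rejected

# Made-up neighbor table standing in for model-generated candidates.
neighbors = {
    "nausea": ["vomiting", "tuesday"],
    "vomiting": ["dizziness"],
    "tuesday": ["wednesday"],
}

def toy_propose(dictionary):
    """Propose every neighbor of every current dictionary entry."""
    return sorted({n for term in dictionary for n in neighbors.get(term, [])})

valid_terms = {"vomiting", "dizziness"}

def toy_review(candidate):
    """Stand-in for the human expert's decision."""
    return candidate in valid_terms

dictionary, rejected = expand_dictionary({"nausea"}, toy_propose, toy_review)
```

Because the spurious candidate is rejected in the first round, its own neighbors are never proposed. That is the sense in which frequent feedback keeps the expansion from drifting away from the intended concept.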
We believe our AI algorithms can be applied to multiple industries. The algorithm is being tested in IBM’s Call Center for classifying incoming calls. IBM Watson is exploring our approach for extending a list of entities to increase its coverage when using a domain-specific corpus. The insurance industry is also interested in applying our approach to classifying contracts and analyzing reports.
[1] Alba, A., Gruhl, D., Ristoski, P., Welch, S. Interactive Dictionary Expansion using Neural Language Models. Second International Workshop on Augmenting Intelligence with Humans-in-the-Loop (2018).
[2] Coden, A., Gruhl, D., Lewis, N., Tanenblatt, M., Terdiman, J. SPOT the drug! An unsupervised pattern matching method to extract drug names from very large clinical corpora. Proceedings of the 2012 IEEE 2nd Conference on Healthcare Informatics, Imaging and Systems Biology, pp. 33–39 (2012).
[3] Alba, A., Coden, A., Gentile, A.L., Gruhl, D., Ristoski, P., Welch, S. Multi-lingual concept extraction with linked data and human-in-the-loop. Proceedings of the Knowledge Capture Conference, p. 24 (2017).
[4] Clarkson, K., Gentile, A.L., Gruhl, D., Ristoski, P., Terdiman, J., Welch, S. User-Centric Ontology Population. European Semantic Web Conference, pp. 112–127 (2018).
[5] Mikolov, T. et al. Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems (2013).