Automated knowledge base construction is a long-standing challenge in AI. The goal is to abstract concise representations from various sources of knowledge, such as unstructured documents, web data and knowledge bases. The outcome is a knowledge graph that can be used to enhance downstream applications like search engines and business analytics. Highly accurate and extensive knowledge graphs are the prerequisite to enable machine reasoning and decision making in AI.
For example, knowledge base construction was an essential component of the DeepQA system that defeated the human grand champions at Jeopardy!, and it has since been a very active research direction because it enables the adaptation of AI solutions to new domains.
To advance this field, IBM Research established an organization focusing on knowledge induction. I have had the pleasure of leading this team for a few years now. One of our main goals is to develop a powerful automated knowledge extraction engine that allows the ingestion of large corpora and structured data in a specific domain and generates useful insights from them. Our framework, called Socrates, is a web-scale knowledge extraction engine that exploits a mix of deep learning, semantic web technology and natural language processing to understand information in text and integrate knowledge extracted from various sources. This architecture is generally applicable to a wide variety of knowledge extraction and curation tasks, such as automated knowledge base completion, entity linking and dictionary extraction.
Recently, our group collaborated with the team behind the enterprise search product IBM Watson Discovery Service (WDS) to improve knowledge induction and help WDS more quickly adapt to different domains. The product use cases impose strict requirements on the research solutions, such as working with limited training data, developing user-friendly, real-time solutions, dealing with “noisy” document collections and recognizing specific entity types for which pre-existing training data and analytics are not available.
Encouraged by the performance of these solutions, we decided to test Socrates in an open evaluation challenge: the International Semantic Web Conference (ISWC) Semantic Web Challenge 2017. The results exceeded our expectations, as we obtained the best performance in both the Attribute Prediction and Attribute Validation tasks.
As training data, the ISWC organizers provided the publicly available portion of permid.org, a knowledge base distributed by Thomson Reuters that contains details on hundreds of thousands of companies. The challenge is to identify information about the companies contained in the private part of the same knowledge base, such as their foundation dates and telephone numbers. The target companies are located all over the world and are often very small. In many cases, the only information available about them is their websites, written in a variety of languages. To make the task even harder, different companies might have different names and sometimes the same company has different subsidiaries.
We applied the Socrates framework to this task, hybridizing deep learning and automated knowledge base construction solutions. Socrates adopts a distant supervision training methodology: the system is trained using examples of interesting facts about known companies, in the form of triples extracted from knowledge bases, and doesn’t require humans to label mentions of those facts in text. Instead, the facts are matched against the text of the corresponding web pages, and each match automatically becomes a training example for a deep learning system that identifies meaningful patterns in the data. As a result of the training process, the system learns how to recognize new facts on companies’ websites based on those provided during training.
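To make the matching step concrete, here is a minimal sketch of distant supervision in Python. It is not the Socrates implementation: the `Triple` and `Example` types, the `distant_label` function, the exact-string matching and the sample company data are all assumptions made for illustration. A real system would need fuzzier matching (date formats, phone number variants, multiple languages) and negative examples.

```python
import re
from typing import Dict, List, NamedTuple


class Triple(NamedTuple):
    """A known fact from the knowledge base (hypothetical schema)."""
    company: str
    attribute: str
    value: str


class Example(NamedTuple):
    """An automatically labeled training example (hypothetical schema)."""
    company: str
    attribute: str
    context: str  # text window around the matched value
    start: int    # offset of the value inside `context`
    end: int


def distant_label(triples: List[Triple], pages: Dict[str, str],
                  window: int = 40) -> List[Example]:
    """Label web page text by matching known attribute values verbatim."""
    examples = []
    for triple in triples:
        text = pages.get(triple.company)
        if not text:
            continue  # no web page available for this company
        # Assumption: exact string match; real data needs normalization.
        for match in re.finditer(re.escape(triple.value), text):
            left = max(0, match.start() - window)
            right = min(len(text), match.end() + window)
            examples.append(Example(
                company=triple.company,
                attribute=triple.attribute,
                context=text[left:right],
                start=match.start() - left,
                end=match.end() - left,
            ))
    return examples


# Made-up data for illustration only.
triples = [
    Triple("acme", "foundation_year", "1987"),
    Triple("acme", "phone", "+1 555 0100"),
    Triple("globex", "foundation_year", "2001"),
]
pages = {"acme": "Acme Corp was founded in 1987 in Ohio. Call us at +1 555 0100."}

examples = distant_label(triples, pages)
for ex in examples:
    print(ex.attribute, "->", ex.context[ex.start:ex.end])
```

Each resulting example pairs an attribute label with the text surrounding a matched value, which is the kind of input a sequence model could be trained on. No human annotation is involved; the knowledge base itself supplies the labels.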
The winning system can be generalized to fit most use cases where text about unknown entities is provided as an input and a pre-existing knowledge base is available. This is a very common use case. However, our ambition for this project goes beyond the scope of traditional knowledge base population, where a large amount of training data is available. Our aim is to push the boundary of the field focusing on minimally supervised approaches that exploit transfer learning technology to take advantage of pre-existing knowledge in other domains.
In addition, knowledge base population technology should be responsive, allowing the user to dynamically specify new models for the data in new domains. To this end, we envision information retrieval-based solutions for the knowledge induction process that can generate a large number of candidate entity pairs from simple user queries and analyze them in real time using analogical reasoning. Finally, and more importantly, most of the interesting information contained in text is implicit, so identifying it requires reasoning and reading comprehension capabilities. Our next step will be to explore the field in all of these directions.
This effort was led by Michael Glass (IBM Research) and involved Oktie Hassanzadeh, Alfio Massimiliano Gliozzo (both from IBM Research) and Nandana Mihindukulasooriya (IBM Research intern from Universidad Politécnica de Madrid).