Automated knowledge base construction solution wins at ISWC 2017

Automated knowledge base construction is a long-standing challenge in AI. The goal is to abstract concise representations from various sources of knowledge, such as unstructured documents, web data and knowledge bases. The outcome is a knowledge graph that can be used to enhance downstream applications like search engines and business analytics. Highly accurate and extensive knowledge graphs are the prerequisite to enable machine reasoning and decision making in AI.

For example, knowledge base construction was an essential component of the DeepQA system that defeated the human grand-champions at Jeopardy! and has since been a very active research direction as it enables the adaptation of AI solutions to new domains.

To advance this field, IBM Research established an organization focusing on knowledge induction. I have had the pleasure of leading this team for a few years now. One of our main goals is to develop a powerful automated knowledge extraction engine that allows the ingestion of large corpora and structured data in a specific domain and generates useful insights from them. Our framework, called Socrates, is a web-scale knowledge extraction engine that exploits a mix of deep learning, semantic web technology and natural language processing to understand information in text and integrate knowledge extracted from various sources. This architecture is generally applicable to a wide variety of knowledge extraction and curation tasks, such as automated knowledge base completion, entity linking and dictionary extraction.

Recently, our group collaborated with the team behind the enterprise search product IBM Watson Discovery Service (WDS) to improve knowledge induction and help WDS more quickly adapt to different domains. The product use cases impose strict requirements on the research solutions, such as working with limited training data, developing user-friendly, real-time solutions, dealing with “noisy” document collections and recognizing specific entity types for which pre-existing training data and analytics are not available.

Encouraged by this performance, we decided to test Socrates in an open evaluation challenge: the International Semantic Web Conference (ISWC) Semantic Web Challenge 2017. The results exceeded our expectations: we obtained the best performance in both the Attribute Prediction and Attribute Validation tasks.

As training data, the ISWC organizers provided the publicly available portion of permid.org, a knowledge base distributed by Thomson Reuters that contains details on hundreds of thousands of companies. The challenge is to identify information about these companies contained in the private part of the same knowledge base, such as their foundation dates and telephone numbers. The target companies are located all over the world and are often very small; in many cases, the only information available about them is their websites, written in a variety of languages. To make the task even harder, a single company might be referred to by different names, and sometimes the same company has multiple subsidiaries.
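To make the setup concrete, the task can be pictured as follows: the public portion of the knowledge base supplies (entity, attribute, value) triples, and the system must predict attribute values held out in the private portion. This is a minimal sketch with invented company names, attribute labels, and values; the actual permid.org schema differs.

```python
# Hypothetical sketch of the attribute-prediction task. All names,
# attributes, and values below are invented for illustration.

# Public portion of the KB: known facts, given as triples.
train_triples = [
    ("CompanyA", "foundation_date", "1998-03-15"),
    ("CompanyA", "phone_number", "+1-555-0100"),
    ("CompanyB", "foundation_date", "2005-07-01"),
]

# Private portion: at evaluation time only the entity and target
# attribute are given; the value must be extracted from the
# company's website and validated.
test_queries = [
    ("CompanyC", "foundation_date"),
    ("CompanyC", "phone_number"),
]

print(len(train_triples), len(test_queries))
```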

We applied the Socrates framework to this task, hybridizing deep learning and automated knowledge base construction solutions. Socrates adopts a distant supervision training methodology: the system is trained on examples of interesting facts about known companies, in the form of triples extracted from knowledge bases, and doesn't require humans to label their mentions in text. Distant supervision works by matching these facts against their occurrences in the corresponding web pages, automatically generating training data for a deep learning system that identifies meaningful patterns in the data. As a result of the training process, the system learns to recognize new facts on companies' websites based on the examples provided during training.
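The distant supervision step described above can be sketched very simply: match known attribute values from the knowledge base against sentences on each company's pages, and treat matching sentences as positive training examples. This is an illustrative toy version with invented data; a real system would normalize values, handle paraphrases and partial matches, and filter noisy matches.

```python
import re


def distant_supervision_labels(triples, pages):
    """Auto-generate training examples by matching known attribute
    values from the KB against sentences in each entity's web text.
    A deliberately simplified sketch of distant supervision."""
    examples = []
    for entity, attribute, value in triples:
        text = pages.get(entity, "")
        # Naive sentence split on terminal punctuation.
        for sentence in re.split(r"(?<=[.!?])\s+", text):
            if value in sentence:
                # Positive example: this sentence likely expresses the fact.
                examples.append((sentence, attribute, value))
    return examples


# Invented example data.
kb = [("Acme Corp", "foundation_date", "1998")]
pages = {"Acme Corp": "Acme Corp was founded in 1998. We make gadgets."}

print(distant_supervision_labels(kb, pages))
# -> [('Acme Corp was founded in 1998.', 'foundation_date', '1998')]
```

The labeled sentences would then feed a neural extractor that learns textual patterns (e.g. "founded in <year>") and applies them to previously unseen companies.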

The winning system can be generalized to fit most use cases where text about unknown entities is provided as an input and a pre-existing knowledge base is available. This is a very common use case. However, our ambition for this project goes beyond the scope of traditional knowledge base population, where a large amount of training data is available. Our aim is to push the boundary of the field focusing on minimally supervised approaches that exploit transfer learning technology to take advantage of pre-existing knowledge in other domains.

In addition, knowledge base population technology should be responsive enough to let users dynamically specify new models for data in new domains. To this end, we are envisioning information retrieval-based solutions for the knowledge induction process, capable of generating a large number of candidate entity pairs from simple user queries and analyzing them in real time using analogical reasoning. Finally, and most importantly, much of the interesting information contained in text is implicit, requiring reasoning and reading comprehension capabilities to identify it. Our next step will be to explore the field in all of these directions.

This effort was led by Michael Glass (IBM Research) and involved Oktie Hassanzadeh, Alfio Massimiliano Gliozzo (both from IBM Research) and Nandana Mihindukulasooriya (IBM Research intern from Universidad Politécnica de Madrid).
