
DualTKB: A Dual Learning Bridge between Text and Knowledge Base


Capturing and structuring common knowledge from the real world to make it available to computer systems is one of the foundational principles of IBM Research. Real-world information is often naturally organized as graphs (e.g., the World Wide Web, social networks), where knowledge is represented not only by the data content of each node, but also by the way these nodes connect to each other. For example, the information in the previous sentence could be represented as the following graph:

Graph representation of knowledge is a powerful tool for capturing the information around us. It enables the creation of Knowledge Bases (KBs) that store information as Knowledge Graphs (KGs), from which computer systems can learn efficiently.
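As a minimal illustration of this idea, a knowledge graph can be stored as a list of (head, relation, tail) triples and indexed by head entity. The entities and relation names below are invented for this sketch; they are not the graph from the original figure.

```python
from collections import defaultdict

# Illustrative triples only (hypothetical entities and relation names).
triples = [
    ("real-world information", "IsA", "graph"),
    ("graph", "HasA", "node"),
    ("node", "HasA", "data content"),
    ("node", "ConnectedTo", "node"),
]

def build_adjacency(triples):
    """Index triples by head entity so outgoing edges are easy to look up."""
    adjacency = defaultdict(list)
    for head, relation, tail in triples:
        adjacency[head].append((relation, tail))
    return dict(adjacency)

graph = build_adjacency(triples)
print(graph["node"])  # outgoing edges of the "node" entity
```

Each triple is one edge of the graph, so the knowledge lives both in the node labels and in how the nodes are linked.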

Being able to natively process the information embedded within knowledge graphs is part of IBM Research's effort to create the foundations of Trusting AI, where we build and enable AI solutions people can trust.

IBM Research is particularly interested in the task of knowledge transfer from a Knowledge Graph to a more accessible, human-readable modality, such as text. Text is a natural medium for us humans to acquire knowledge by learning new facts, new concepts, new ideas.

IBM Research Introduces DualTKB at EMNLP 2020

In new work being presented at EMNLP’20¹, a team of IBM researchers² explored how to create a learning bridge between text and knowledge bases by enabling a fluid transfer of information from one modality to another. The approach is bi-directional, allowing for a dual learning framework that can generate text from a KB and a KB from text.

¹ The presentation will be held online at the EMNLP’20 Gather Session 5D “Information Extraction” on November 18, 2020 at 1 pm US Eastern Time (UTC-5). A version of our EMNLP’20 paper can be found online at https://arxiv.org/abs/2010.14660

² The team is composed of Pierre Dognin, Igor Melnyk, Inkit Padhi, Cicero Nogueira dos Santos (now at AWS AI), and Payel Das.

Approach to DualTKB

DualTKB is a novel approach for defining a dual learning bridge between text and a Knowledge Base. DualTKB can ingest sentences and generate the corresponding paths in a KB, and vice versa. Because this is a challenging task, we first address the case of generating one path from one sentence, and one sentence from one path.

Since we designed DualTKB to be bi-directional, our approach can natively translate (or transfer) between modalities in both directions. Translation cycles can therefore be defined to enforce consistency in generation. For instance, DualTKB can transfer a sentence to the KB domain, then take the generated path and translate it back to the text domain. This is key to the dual learning process, where translation cycles must return to the original domain with semantic consistency (in our example, we should translate back to a sentence that is either the original sentence or one semantically very close to it).

These consistency translation cycles were originally motivated by the lack of parallel data for cross-domain generation tasks. By relying on these transfer cycles, our approach handles unsupervised settings natively. In the diagram below, we give examples of all the translation cycles that DualTKB can handle. For instance, from the Text domain, we can transfer to the KB domain using T_AB, and then translate back to text using T_ABA. We can also do a same-domain transfer using T_AA.
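The text-to-KB-to-text cycle can be sketched with toy stand-ins for the transfer functions. Everything here is an assumption made for illustration: DualTKB uses learned neural encoders and decoders rather than lookup tables, the function names are hypothetical, and word overlap is only a crude proxy for semantic closeness.

```python
# Toy lookup tables standing in for the learned transfers (hypothetical data).
TEXT_TO_PATH = {"a dog is an animal": ("dog", "IsA", "animal")}
PATH_TO_TEXT = {("dog", "IsA", "animal"): "a dog is a kind of animal"}

def t_ab(sentence):
    """Text -> KB path (plays the role of T_AB)."""
    return TEXT_TO_PATH[sentence]

def t_ba(path):
    """KB path -> text (the reverse transfer)."""
    return PATH_TO_TEXT[path]

def t_aba(sentence):
    """Text -> KB -> text cycle (plays the role of T_ABA)."""
    return t_ba(t_ab(sentence))

def token_overlap(a, b):
    """Jaccard overlap of word sets, a rough stand-in for semantic similarity."""
    set_a, set_b = set(a.split()), set(b.split())
    return len(set_a & set_b) / len(set_a | set_b)

original = "a dog is an animal"
round_trip = t_aba(original)
print(round_trip)
print(token_overlap(original, round_trip))
```

In this toy case the cycle does not return the exact input sentence, but it returns a close paraphrase, which is the kind of consistency the cycle is meant to encourage.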

 

IBM Research’s Model

The proposed model follows an encoder-decoder architecture, where the encoder projects text x_A or a path in a graph x_B to a common high-dimensional representation. This representation is then passed to a specialized decoder (Decoder A or Decoder B), which can either reconstruct the same modality (auto-encoding), producing x_AA or x_BB, or transfer to the other modality, producing x_AB or x_BA.

Traditionally, such systems are trained through seq2seq (sequence-to-sequence) supervised learning, where paired text-path data is used to enable same- and cross-modality generation. Unfortunately, many real-world KB datasets do not have corresponding pairs of text and path, which calls for unsupervised training techniques.

For this purpose, we propose to augment the traditional supervised training, used when parallel data is present, with unsupervised training when no paired data is available. The overall training process is shown in the diagram below.

The training process can be decomposed into a translation stage (L_REC, L_SUP) followed by a back-translation stage (L_BT). The model can therefore natively deal with both supervised and unsupervised data. A more detailed description of the training method can be found in the paper.
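One plausible way to write these three terms is sketched below. It assumes that REC, SUP and BT stand for reconstruction, supervised and back-translation, that each term is a negative log-likelihood, and that p_XY denotes the probability assigned by the decoder of modality Y to an output given an input from modality X. These symbols and the exact form are assumptions for illustration; the authoritative definitions and any weighting between terms are in the paper.

```latex
\begin{align}
\mathcal{L}_{\mathrm{REC}} &= \mathbb{E}_{x_A}\left[-\log p_{AA}(x_A \mid x_A)\right]
                            + \mathbb{E}_{x_B}\left[-\log p_{BB}(x_B \mid x_B)\right] \\
\mathcal{L}_{\mathrm{SUP}} &= \mathbb{E}_{(x_A, x_B)}\left[-\log p_{AB}(x_B \mid x_A)
                            - \log p_{BA}(x_A \mid x_B)\right] \\
\mathcal{L}_{\mathrm{BT}}  &= \mathbb{E}_{x_A}\left[-\log p_{BA}(x_A \mid x_{AB})\right]
                            + \mathbb{E}_{x_B}\left[-\log p_{AB}(x_B \mid x_{BA})\right]
\end{align}
```

Under this reading, the reconstruction and back-translation terms need no paired data, while the supervised term is only evaluated on the text-path pairs that are available.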

IBM Research’s Dataset

Given the lack of a supervised dataset for the task of KB-Text cross-domain translation, one of our contributions is the creation of a dataset based on the widely used ConceptNet KB and the Open Mind Common Sense (OMCS) list of commonsense fact sentences. Since ConceptNet was derived from OMCS, we employed fuzzy matching techniques to create a weakly supervised dataset by mapping ConceptNet to OMCS. This resulted in a parallel dataset of 100K edges for 250K sentences. We are planning to release details and describe the heuristics involved in creating this dataset by fuzzy matching. In the meantime, some information can be found in our GitHub repository, https://github.com/IBM/dualtkb, and in our paper.
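Since the actual heuristics are not described here, the following is only a generic sketch of what fuzzy matching between KB edges and sentences can look like. It uses Python's standard `difflib.SequenceMatcher` as a stand-in similarity measure; the helper names, the example data and the 0.6 threshold are all assumptions, not the authors' method.

```python
from difflib import SequenceMatcher

def edge_to_text(edge):
    """Flatten a (head, relation, tail) edge into a rough pseudo-sentence."""
    head, relation, tail = edge
    return f"{head} {relation} {tail}".lower()

def best_match(edge, sentences, threshold=0.6):
    """Return the sentence most similar to the edge, or None if none is close."""
    if not sentences:
        return None
    query = edge_to_text(edge)
    # Score every candidate sentence by character-level similarity in [0, 1].
    scored = [(SequenceMatcher(None, query, s.lower()).ratio(), s) for s in sentences]
    score, sentence = max(scored)
    return sentence if score >= threshold else None

# Hypothetical edge and candidate sentences.
sentences = ["A dog is an animal", "The sky is blue"]
print(best_match(("dog", "is a", "animal"), sentences))
```

A threshold like this trades coverage for precision: raising it yields fewer but cleaner edge-sentence pairs, which is why the resulting data is best described as weakly supervised.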

Experiments

We compared the performance of DualTKB to published prior-work baselines on the task of link prediction. The table below shows results for MRR (mean reciprocal rank) and HITS, both well-established metrics for evaluating the quality of link completion. DualTKB compares favorably with the competitors, enabling accurate link completion.
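For readers unfamiliar with these metrics, here is a minimal sketch of how they are commonly computed from the rank of the correct entity for each test query. The ranks below are made up for illustration and are unrelated to the reported results.

```python
def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all test queries (ranks are 1-based)."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def hits_at_k(ranks, k):
    """Fraction of queries whose correct entity is ranked in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

# Hypothetical ranks of the correct entity for four link-prediction queries.
ranks = [1, 2, 4, 10]
print(mean_reciprocal_rank(ranks))
print(hits_at_k(ranks, 3))
print(hits_at_k(ranks, 10))
```

With these example ranks, MRR is about 0.4625, and half of the queries have the correct entity in the top 3. Both metrics lie between 0 and 1, and higher is better.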

For a more qualitative evaluation of cross-domain generation, we provide examples of transfer from multiple sentences to paths (composing a graph) on the left, and on the right the reverse operation of sentence generation from a list of given paths.

Future work

Our current research opens the door to multiple exciting future directions:

  1. Our work currently deals with the transfer between a single sentence and a single path. A natural continuation is to extend it to the generation of large multi-path graph structures from short paragraphs of sentences, or to the reverse problem of converting a KB into coherent textual descriptions.
  2. Another direction of investigation is the development of class-conditional generation of long text conditioned on the facts from a Knowledge Base.
  3. In the field of Trusted AI, we are also interested in extending this work to trusted generation of text and to fact-checking of text against a Knowledge Base.


References

[1] Bishan Yang, Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng. 2015. Embedding Entities and Relations for Learning and Inference in Knowledge Bases. In Proceedings of the International Conference on Learning Representations (ICLR).

[2] Théo Trouillon, Johannes Welbl, Sebastian Riedel, Éric Gaussier, and Guillaume Bouchard. 2016. Complex Embeddings for Simple Link Prediction. In ICML, pages 2071–2080.

[3] Tim Dettmers, Pasquale Minervini, Pontus Stenetorp, and Sebastian Riedel. 2018. Convolutional 2D Knowledge Graph Embeddings. In Thirty-Second AAAI Conference on Artificial Intelligence.

[4] Chao Shang, Yun Tang, Jing Huang, Jinbo Bi, Xiaodong He, and Bowen Zhou. 2019. End-to-end Structure-Aware Convolutional Networks for Knowledge Base Completion. In AAAI.

[5] Chaitanya Malaviya, Chandra Bhagavatula, Antoine Bosselut, and Yejin Choi. 2020. Commonsense Knowledge Base Completion with Structural and Semantic Context. In Proceedings of the 34th AAAI Conference on Artificial Intelligence.
