IBM Research at VLDB 2020

VLDB 2020 will be a virtual conference this year and will take place August 31-September 4. IBM Research AI is a gold sponsor and will have a strong presence at VLDB 2020, with technical papers, tutorials, and workshop organization.

The papers address a range of topics, from adding graph analytics to databases, to natural language interfaces to relational data, to HTAP systems. These papers are (co-)authored by researchers working across numerous IBM Research labs around the world.

We hope you will join us at our virtual booth on September 1st at 6 pm PDT for a live chat with our researchers about our latest research, career opportunities, and internships, including the AI Residency Program.

For a full list of our papers, demos, tutorials and workshops, see below.

ATHENA++: Natural Language Querying for Complex Nested SQL Queries

Jaydeep Sen, Chuan Lei, Abdul Quamar, Fatma Özcan, Vasilis Efthymiou, Ayushi Dalmia, Greg Stager, Ashish Mittal, Dipti Saha, and Karthik Sankaranarayanan

In this paper, we present Athena++, an end-to-end NLIDB (Natural Language Interface to DataBase) system that can translate complex analytic queries expressed in natural language into nested SQL queries. In particular, Athena++ combines linguistic patterns from NL queries with deep domain reasoning using ontologies to enable nested query detection and generation. We also introduce a new benchmark data set, FIBEN, which consists of 300 NL queries corresponding to 237 distinct complex SQL queries in a database with 152 tables, conforming to an ontology derived from standard financial ontologies (FIBO and FRO). We conducted extensive experiments comparing Athena++ with two state-of-the-art NLIDB systems, using both FIBEN and the Spider benchmark, and show that Athena++ consistently outperforms both systems across all benchmark data sets with a wide variety of complex queries.

The FIBEN benchmark has been open-sourced at

Many-Core Clique Enumeration with Fast Set Intersections

Jovan Blanuša, Radu I Stoica, Paolo Ienne, Kubilay Atasu

In this paper, we prove that the use of a hash-join-based set-intersection algorithm within Maximal Clique Enumeration (MCE) leads to Pareto-optimal implementations in terms of runtime and memory space compared to those based on merge joins. Building on this theoretical result, we develop a scalable parallel implementation of MCE that exploits both data parallelism, by using SIMD-accelerated hash-join-based set intersections, and task parallelism, by using a shared-memory parallel processing framework that supports dynamic load balancing. Overall, our implementation is an order of magnitude faster than a state-of-the-art manycore MCE algorithm. We also show that careful scheduling of task execution reduces the peak dynamic memory usage by two orders of magnitude.
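To make the distinction between the two set-intersection strategies concrete, here is a minimal plain-Python sketch. It is only an illustration of the general merge-join and hash-join ideas, not the paper's SIMD-accelerated implementation; the function names `intersect_merge` and `intersect_hash` are hypothetical.

```python
def intersect_merge(a, b):
    """Merge-join style: walk two sorted lists in lockstep, O(len(a) + len(b))."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def intersect_hash(probe, table):
    """Hash-join style: probe each element of `probe` in a prebuilt hash set.

    Cost is proportional to len(probe) once `table` exists, which helps when
    one neighbor list is much shorter than the other.
    """
    return [v for v in probe if v in table]


# Neighbor lists of two vertices; their intersection is the common neighbors.
neighbors_u = [1, 3, 5, 7]
neighbors_v = [3, 4, 5, 8]
common_merge = intersect_merge(neighbors_u, neighbors_v)
common_hash = intersect_hash(neighbors_v, set(neighbors_u))
```

In clique enumeration the same adjacency sets are intersected many times, so a hash table built once per vertex can be reused across probes, whereas the merge variant rescans both lists on every call.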

ADnEV: Cross-Domain Schema Matching using Deep Similarity Matrix Adjustment and Evaluation

Roee Shraga, Avigdor Gal, Haggai Roitman

In this paper, we offer a novel post-processing step to schema matching that improves the final matching outcome without human intervention. We present a new mechanism, similarity matrix adjustment, to calibrate a matching result and propose an algorithm (dubbed ADnEV) that uses deep neural networks to manipulate similarity matrices created by state-of-the-art algorithmic matchers. ADnEV learns two models that iteratively adjust and evaluate the original similarity matrix. We show that ADnEV can generalize into new domains without the need to learn the domain terminology, thus allowing cross-domain learning. We also show ADnEV to be a powerful tool in handling schemata whose matching is particularly challenging. Finally, we show the benefit of using ADnEV in a related integration task of ontology alignment.

Distributed Edge Partitioning for Trillion-edge Graphs

Masatoshi Hanai, Toyotaro Suzumura, Wen Jun Tan, Elvis Liu, Georgios Theodoropoulos, Wentong Cai

We propose Distributed Neighbor Expansion (Distributed NE), a parallel and distributed graph partitioning method that can scale to trillion-edge graphs while providing high partitioning quality. Distributed NE is based on a new heuristic, called parallel expansion, where each partition is constructed in parallel by greedily expanding its edge set from a single vertex in such a way that the increase in vertex cuts is locally minimal. We theoretically prove that the proposed method has an upper bound on partitioning quality. The empirical evaluation with various graphs shows that the proposed method produces higher quality partitions than the state-of-the-art distributed graph partitioning algorithms.
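The greedy expansion idea can be illustrated with a small sequential sketch that grows a single edge partition from one seed vertex. This is an assumption-laden simplification and not the authors' distributed algorithm: it runs on one machine, builds only one partition, and the function name `expand_partition`, the `capacity` parameter, and the tie-breaking rule are all hypothetical.

```python
from collections import defaultdict


def expand_partition(edges, seed, capacity):
    """Greedily grow one edge partition from `seed` up to `capacity` edges."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    unassigned = {frozenset(e) for e in edges}
    partition = []     # edges allocated to this partition, in allocation order
    covered = {seed}   # vertices touched by the partition so far
    expanded = set()   # vertices whose incident edges were already processed

    def new_vertices(x):
        # Number of vertices outside the partition that expanding x pulls in.
        return sum(
            1 for n in adj[x]
            if n not in covered and frozenset((x, n)) in unassigned
        )

    while len(partition) < capacity:
        candidates = covered - expanded
        if not candidates:
            break
        # Locally minimal choice: fewest newly introduced vertices, ties by id.
        x = min(candidates, key=lambda c: (new_vertices(c), c))
        expanded.add(x)
        for n in sorted(adj[x]):
            e = frozenset((x, n))
            if e in unassigned and len(partition) < capacity:
                unassigned.remove(e)
                partition.append((x, n))
                covered.add(n)
    return partition


# A triangle (0, 1, 2) with a tail 2-3-4; a 3-edge partition captures the triangle.
graph = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)]
triangle = expand_partition(graph, seed=0, capacity=3)
```

Choosing the boundary vertex that introduces the fewest new vertices keeps the partition's vertex set compact, which is the intuition behind limiting the growth of vertex cuts.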

Db2 Event Store: A Purpose-Built IoT Database Engine

Christian Garcia-Arellano, Adam Storm, David Kalmuk, Hamdi Roumani, Ronald Barber, Yuanyuan Tian, Richard Sidle, Fatma Ozcan, Matthew Spilchen, Josh Tiefenbach, Daniel Zilio, Lan Pham, Kostas Rakopoulos, Alexander Cheung, Darren Pepper, Imran Sayyid, Gidon Gershinsky, Gal Lushi, Hamid Pirahesh

In this paper we present IBM Db2 Event Store, a cloud-native database system designed specifically for IoT workloads, which require extremely high-speed ingest, efficient and open data storage, and near real-time analytics. Additionally, by leveraging the Db2 SQL compiler, optimizer and runtime, developed and refined over the last 30 years, we demonstrate that rearchitecting for the public cloud doesn’t require rewriting all components. Reusing components that have been built out and optimized for decades dramatically reduced the development effort and immediately provided rich SQL support and excellent run-time query performance.

Conversational BI: An Ontology-Driven Conversation System for Business Intelligence Applications

Abdul H Quamar, Fatma Ozcan, Dorian Miller, Robert Moore, Rebecca Niehus, Jeffrey Kreulen

In this paper, we describe an ontology-based framework, termed Conversational BI, for creating a conversation system for BI applications. We create an ontology from a business model underlying the BI application and use this ontology to automatically generate various artifacts of the conversation system. These include the intents, entities, and training samples for each intent. Our approach builds upon our earlier work and exploits common BI access patterns to generate intents and their training examples and to adapt the dialog structure to support typical BI operations. We have implemented our techniques in Health Insights (HI), an IBM Watson Healthcare offering that provides analysis over insurance data on claims. Our user study demonstrates that our system is quite intuitive for gaining business insights from data.

Replication at the Speed of Change – a Fast, Scalable Replication Solution for Near Real-Time HTAP Processing

Dennis Butterstein, Daniel Martin, Knut Stolze, Felix Beier, Jia Zhong, Lingyun Wang

The IBM Db2 Analytics Accelerator (IDAA) is a state-of-the-art hybrid database system that seamlessly extends the strong transactional capabilities of Db2 for z/OS with the very fast column-store processing in Db2 Warehouse. Data can be synchronized at a single point in time with the granularity of a table or one or more of its partitions, or incrementally as rows change, using replication technology. In this paper, we present how Integrated Synchronization improves performance by factors, paving the way for near real-time Hybrid Transactional and Analytical (HTAP) processing.

Improving Reproducibility of Data Science Pipelines through Transparent Provenance Capture

Lukas Rupprecht, James C Davis, Constantine Arnold, Yaniv Gur, Deepavali Bhagwat

In this paper, we present Ursprung, a transparent provenance collection system designed for data science environments. The Ursprung philosophy is to capture provenance and build lineage by integrating with the execution environment to automatically track static and runtime configuration parameters of data science pipelines. Rather than requiring data scientists to make changes to their code, Ursprung records basic provenance information from system-level sources and combines it with provenance from application-level sources (e.g., log files, stdout), which can be accessed and recorded through a domain-specific language. In our evaluation, we show that Ursprung is able to capture sufficient provenance for a variety of use cases and only adds an overhead of up to 4%.

Tutorial: Table Extraction and Understanding for Scientific and Enterprise Applications

Douglas Burdick, Alexandre Evfimievski, Nancy (Xin Ru) Wang, Yannis Katsis, Marina Danilevsky

Valuable high-precision data are often published in the form of tables in both scientific and business documents. While humans can easily identify, interpret and contextualize tables, developing general-purpose automated techniques for extraction of information from tables is difficult due to the wide variety of table formats employed across corpora. Table extraction involves identifying the border and cell structure for each document table, while table understanding provides context by linking cells with semantic information inside and outside the table, such as row and column headers, footnotes, titles, and references in surrounding text. The objective of this tutorial is to provide a detailed synopsis of existing approaches for table extraction and understanding, highlight open research problems, and provide an overview of potential applications.


In addition to the presentations at the main conference, IBM researchers are also co-organizing the following workshops on August 31st:

Accelerating Analytics and Data Management Systems Using Modern Processor and Storage Architectures (ADMS 2020)

Rajesh Bordawekar, Tirthankar Lahiri

The objective of this one-day workshop is to investigate opportunities in accelerating data management systems and analytics workloads using commodity and specialized processors (such as GPUs, FPGAs, and ASICs/SOCs), high-speed networking, and different storage systems, on both on-premises and cloud systems.

Applied AI for Database Systems and Applications (AIDB 2020)

Berthold Reinwald, Bingsheng He, Yingjun Wu

The goal of this workshop is to bring together researchers and practitioners from the AI and database communities to investigate new research opportunities in the design and implementation of database systems and applications with emerging AI technologies.

The 1st Workshop on Distributed Infrastructure, Systems, Programming and AI (DISPA 2020)

Kazuaki Ishizaki, Barzan Mozafari, Matei Zaharia

The goal of the Distributed Infrastructure, Systems, Programming and AI (DISPA) workshop is to bring together researchers and practitioners from the distributed systems, databases, programming languages, and machine learning communities to investigate novel distributed system designs.


In addition to our demos featured at the IBM booth, you can try IBM Research Experiments here.

IBM Db2 Graph — Supporting Synergistic and Retrofittable Graph Queries Inside IBM Db2.

ExBERT – A Visual Tool to Explore BERT: Learn how to uncover insights into what deep transformer models understand about human language by interactively exploring their learned attentions and contextual embeddings.

AutoAI for Time Series — This demo shows time series forecasting using AutoAI, which automatically selects and optimizes statistical and machine learning pipelines.

Explainable Link Prediction — Link Prediction using Graph Neural Networks for Master Data Management. 

Partner: Human-in-the-Loop Entity Name Understanding with Deep Learning.

SystemER: A Human-in-the-Loop System for Explainable Entity Resolution.

Distinguished Research Staff Member, Senior Manager-Hybrid Data, IBM Research
