Self-Learning Data Catalogs – Savior of Data Lakes or Populist Promise?

In an increasingly complicated world you would like to see easy solutions for difficult problems. Is a Data Catalog the solution or an inflated promise when companies meet Big Data challenges?


The development of artificial intelligence (AI) requires high-quality and versatile data. Even excellent data does not help if it is hidden in information systems and documents. In this blog post, I briefly describe the history of data warehousing and analytics leading to Master Data Management (MDM) solutions. Then I review the principles and value proposition of a Data Catalog, visit the year 2044 and finally try to answer the question in the title.

From Data Warehousing to Artificial Intelligence

Companies and the public sector have tried to meet the challenge of data chaos for as long as there have been operational and administrative systems. At the end of the 1990s, relational data warehouses, together with business intelligence tools for reporting and simple analytics, became common. Data mining was the tool for more advanced analytics and the favourite technology of statisticians, actuaries and research scientists.

Data warehouse (DW) projects ran into challenges when data had to be extracted from several sources. The technical synchronisation of data was relatively easy, for example when the name of a customer organisation was 20 characters long in a financial management system and 30 characters long in a delivery system. ETL (Extract, Transform and Load) tools were developed for this synchronisation and for other transformations of data. Compared with the hand coding used previously, ETL tools were flexible and easy to use, even for complicated operations. In a typical DW project, building ETL jobs was a major part of the workload. Many customers have estimated it at up to 80% of the whole project.
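The field-length example above can be sketched in a few lines of Python. This is only an illustration of the kind of transformation an ETL job performs, not the logic of any particular ETL product. The record layout, the function names and the choice to match on the first 20 normalised characters are all assumptions made for the example.

```python
# Assumed field lengths from the example: 20 characters in the financial
# management system, 30 characters in the delivery system.
FINANCE_NAME_LEN = 20


def normalise_name(raw: str) -> str:
    """Transform step: collapse whitespace and upper-case a customer name."""
    return " ".join(raw.split()).upper()


def match_key(raw: str, length: int = FINANCE_NAME_LEN) -> str:
    """Key usable for both sources: the normalised name cut to the shorter length."""
    return normalise_name(raw)[:length]


def merge_customers(finance_rows, delivery_rows):
    """Load step: join both sources on the match key.

    The longer delivery name wins when both systems know the customer.
    Rows are hypothetical dicts with the keys "id" and "name".
    """
    warehouse = {}
    for row in finance_rows:
        warehouse[match_key(row["name"])] = {
            "name": normalise_name(row["name"]),
            "finance_id": row["id"],
            "delivery_id": None,
        }
    for row in delivery_rows:
        record = warehouse.setdefault(
            match_key(row["name"]),
            {"name": None, "finance_id": None, "delivery_id": None},
        )
        record["name"] = normalise_name(row["name"])
        record["delivery_id"] = row["id"]
    return warehouse


finance = [{"id": "F1", "name": "Northern Lights Logi"}]
delivery = [{"id": "D7", "name": "Northern  Lights Logistics Oy"}]
merged = merge_customers(finance, delivery)
```

Matching on a truncated name is deliberately naive: two different organisations sharing the same first 20 characters would be merged, which is exactly the kind of content problem that technical synchronisation alone does not solve.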

It was even more difficult to manage content-related differences: a business term such as customer or product may have a meaning that originates from a source system, and that meaning varies between sources. A single version of the truth was needed, and somehow it had to be fed back to all systems using it. Master Data Management (MDM) solutions were developed for that purpose.

Image: Visualisation of customer networking in an MDM system

MDM development projects met major challenges when various processes and businesses had to be integrated. An example is a private customer's change of address in an enterprise that has both a banking and an insurance business. For data security reasons, the new address of a banking customer can only be processed through a strong identification procedure. The new address can then be loaded into the insurance systems with no major challenges. However, the insurance business requires that its customers have an easy way to change their address, such as a phone call, with no strong identification procedure. This means that an address changed in the insurance systems cannot be copied into the banking systems without compromising data security.
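The asymmetry in this example can be written down as a small propagation rule. The sketch below is hypothetical: the class, the function and the system names are invented for illustration. It also assumes that an insurance-side change made with strong identification may be copied to banking, which the example above does not state.

```python
from dataclasses import dataclass, field


@dataclass
class Customer:
    customer_id: str
    # Maps a system name ("banking", "insurance") to the address stored there.
    addresses: dict = field(default_factory=dict)


def apply_address_change(customer, source, new_address, strongly_identified):
    """Apply an address change and return the list of systems that were updated.

    Invariant: the banking address is only ever changed when the customer
    was strongly identified.
    """
    if source not in ("banking", "insurance"):
        raise ValueError(f"unknown source system: {source}")
    if source == "banking" and not strongly_identified:
        raise PermissionError("banking address changes require strong identification")

    if strongly_identified:
        # Assumption: a strongly identified change is safe to copy everywhere.
        targets = [source] + [s for s in ("banking", "insurance") if s != source]
    else:
        # A phone call to the insurer stays in the insurance systems.
        targets = ["insurance"]

    for system in targets:
        customer.addresses[system] = new_address
    return targets


customer = Customer("C1", {"banking": "Old Street 1", "insurance": "Old Street 1"})
phone_call = apply_address_change(customer, "insurance", "New Road 2", False)
after_phone_call = dict(customer.addresses)
bank_visit = apply_address_change(customer, "banking", "New Road 2", True)
```

After the phone call the two systems hold different addresses for the same person, so the master data has no single agreed value until the customer is strongly identified.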

In addition to big data volumes (Big Data), various data types and sources have become common during the past few years. Alongside the traditional transactional data of operational systems, there is a flow of new data from text, sensor, audio, video and IoT sources. The new data is seldom as structured and easy to analyse as the transactions of core business systems.

Data Lakes and Artificial Intelligence

Data Lakes were developed to integrate various data types better than DWs or even MDM could. The first impression was that an ideal solution had been invented, particularly if you believed that Hadoop technology would do the integration and synchronisation for you. Feedback from users soon revealed that the Data Lake did not deliver the simple and easy solution that was expected. Without proper governance, metadata and user management, a Data Lake becomes a data swamp. However, the Data Lake is a good reference architecture if you remember that disciplined governance is required in direct proportion to data volumes and types.

In addition to new data types and big volumes, the demand for analytics is accelerated by the promise of new technology opportunities. Cognitive solutions imitate the senses and are supported by machine learning. Systems composed of the new technologies can even be combined into a system that seems to be intelligent. An army of ants is a good example of this, as it can be amazingly capable, and its total intelligence seems to be more than the sum of its individuals. This is called swarm intelligence. But if you study artificial intelligence at the code level, you won't find more wisdom in an algorithm than in the head of an ant.

Image: Swarm intelligence power for artificial intelligence processing

Data scientists, a new category of specialists, emerged to exploit the business opportunity based on deep analytics and AI. They should understand statistics, business needs and new tools, including machine learning, cognitive systems and other areas of AI. Because of these high qualifications, it is very difficult to find and recruit a data scientist, and when you finally have one, you soon notice that 80% of his or her time goes to searching for and preparing data instead of productive analysis.

Data Catalog – new promise for saving Data Lakes

The Data Catalog is a new candidate to save Data Lakes and improve the performance of data scientists. In contrast to data warehousing, only information about the data is stored in a Data Catalog. You could compare this to a library catalogue, which stores the titles, categories, editors and so on, but not the content of the books. The catalogue content is called metadata, and it covers both the technical and the business description of the data. The marketing message of a typical Data Catalog promises an easy interface to all essential data. The most advanced products also promise to manage data security.
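To make the library comparison concrete, here is a minimal sketch of what one catalogue entry might hold. The field names, the sample entries and the search function are assumptions for illustration and do not reflect the data model of any real Data Catalog product.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    name: str
    # Technical metadata: where the data lives and what it looks like.
    location: str
    data_format: str
    columns: list
    # Business metadata: what the data means and who is responsible for it.
    business_description: str
    business_terms: list = field(default_factory=list)
    owner: str = ""


def search(catalog, keyword):
    """Return the names of entries whose business metadata mentions the keyword."""
    needle = keyword.lower()
    return [
        entry.name
        for entry in catalog
        if needle in entry.business_description.lower()
        or any(needle == term.lower() for term in entry.business_terms)
    ]


catalog = [
    CatalogEntry(
        name="crm_customers",
        location="warehouse/crm/customers",  # hypothetical location
        data_format="table",
        columns=["customer_id", "name", "address"],
        business_description="Private and corporate customers of the insurance business",
        business_terms=["Customer"],
        owner="Insurance sales",
    ),
    CatalogEntry(
        name="turbine_sensors",
        location="lake/iot/turbines",  # hypothetical location
        data_format="json",
        columns=["sensor_id", "timestamp", "vibration"],
        business_description="Raw vibration readings from production turbines",
        business_terms=["Sensor", "Maintenance"],
    ),
]

customer_hits = search(catalog, "customer")
```

Note that the entries describe the data sets without containing a single customer or sensor reading, just as a library catalogue does not contain the books.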

Is a Data Catalog just one more promise to meet the never-ending challenges of data warehousing, or is it a magic tool that finally integrates all kinds of data? At least the value proposition is similar to that of many tools in data warehousing history that promised to liberate end users, such as ad hoc reporting, Online Analytical Processing (OLAP) and dashboards. Data Catalog solutions have existed for a long time. Why didn't Data Catalogs have their breakthrough until now?

The pioneers in the market, like IBM, first developed solutions to manage technical metadata. A business glossary was planned to be the missing link between the business and technical management of metadata. Despite the business glossary, the products still mirrored the era of data warehousing, because they mostly used structured data sources.

During the past few years, new Data Catalog solutions for end users have popped up. They exploit fresh features such as recommended data sources (as on Netflix), user reviews (as on TripAdvisor) and machine learning to support metadata creation. Recent data sources (social media, IoT, documents etc.) are also covered. The products have a visual interface, and implementation should be easy with no major IT development effort, at least if you listen to the marketing messages of the vendors. They can also be used to analyse text sources such as documents, emails or chatbots.
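The "recommended data sources" idea can be illustrated with a very simple co-usage count: people who used this data set also used these. This is a toy sketch under the assumption that the catalogue records which user has opened which data set. Real products are likely to use richer signals, and the function and variable names are invented here.

```python
from collections import Counter


def recommend(usage_log, dataset, top_n=3):
    """Recommend data sets that are most often used together with `dataset`.

    usage_log maps a user name to the set of data set names that user has used.
    Ties are broken alphabetically so that the result is deterministic.
    """
    counts = Counter()
    for used in usage_log.values():
        if dataset in used:
            counts.update(used - {dataset})
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:top_n]]


usage_log = {
    "anna": {"sales", "customers", "weather"},
    "ben": {"sales", "customers"},
    "carl": {"weather", "sensors"},
}

suggestions = recommend(usage_log, "sales")
```

Here "customers" ranks first because two users of "sales" also used it, while "weather" was used by only one of them.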

While evaluating Data Catalog products you should also remember the opportunities of traditional MDM systems. Integrated hybrid (Data Catalog and MDM) solutions can exploit standardised master data terms to support metadata management. In MDM applications you can also find advanced statistical deduction mechanisms, which can be used to support the self-learning processes of a Data Catalog.

Visit to the future

It is February 2044 and the first virtual company has been listed on the Nasdaq Helsinki stock exchange. Its strategy, operations and goals have been simulated in a data model, the groundwork for which was laid 25 years earlier in industry models for data warehousing and application development. Nonetheless, the accuracy, coverage and flexibility of the new data model have been taken to a completely new level.

In the data model there is an integrated Data Catalog, which can automatically analyse data sources and feed AI algorithms giving instructions to the marketing, production and logistics of the virtual company. Every now and then, analytical bots call optimisation algorithms, which recalculate the parameters of business processes. The whole process is iterative and continuously learning. The new company has outsourced its basic functions and the management of outsourcing vendors has been delegated to its AI.

Human beings are still needed, and the AI calls on key persons of the virtual company to help with questions that often involve value judgments. The management board is typically needed for ethical issues, but also when there are sudden changes in the market. The company performs well, with profitable growth and yearly sales revenue that has just exceeded 300 million euros, but the personnel are a bit worried.

The whole labour force, all five persons, has noticed lately that the AI doesn't need them so often anymore. As a matter of fact, at the beginning of April the AI proposed to the CEO that negotiations be started on reducing the number of employees due to the decreased need. This was soon revealed to be an April Fools' joke from the AI. The CEO wondered, not for the first time, whether it had been a mistake to develop a sense of humour in its advertising algorithm.

Image: AI can even learn to crack a joke

Learn more about Watson AI: How is Watson helping businesses across the globe to build a smarter future?

What was the answer?

In the metadata world, where hard and boring work is required, a self-learning system sounds too good to be true. Did we eventually invent a philosopher's stone, which crawls through all available data and automatically finds the golden nuggets of information? I am sorry to say that the answer is no.

If the question is whether self-learning Data Catalogs can save Data Lakes, the answer is perhaps. They substantially help data scientists and other analysts and accelerate development. Because of this, Data Catalog solutions are worth considering. At best, a Data Catalog quickly leads users and developers to the data they need without compromising data security.

How do IBM solutions support AI development?

I plan to write a follow-up describing the principles of IBM Data Catalog solutions and how they help AI and analytics development.

More information and free trials: IBM Watson Knowledge Catalog

If you have any further questions, please do not hesitate to contact me at:

IBM Analytics Sales
