Self-Learning Data Catalogs – Data Catalogs’ Sequel


This is a sequel on where data catalogs are heading. How can data catalogs enhance and automate data management and analysis while supporting the development of Artificial Intelligence (AI)? And what are the main issues in that development work?


I ended my first blog post about data catalogs (Self-Learning Data Catalogs – Savior of Data Lakes or Populist Promise?) with a futuristic case of an imaginary company in the year 2044. Its management was outsourced to an AI, which asked the board’s advice only on tough decisions. We are still a long way from autonomously managed companies, but what groundwork do they require? How can current data catalogs support companies’ digital transformation and AI-powered solutions?

The Arms Race of AI

Cars, planes and ships will soon travel on their own. The most difficult task is to create full automation that also operates reliably in exceptional traffic situations and other unusual circumstances. The same applies to business. The AI required to run a business is much broader than the one needed to operate a vehicle: company decision-making is significantly more challenging than steering a car, and constantly changing markets make it even harder.

The utilization of AI sets new challenges for companies. In genuinely global markets, the prize for the first mover is enormous. The skills required of key people keep rising. Previously, mastering your own field of expertise was enough. Now you also need to master the modern tools that support AI development and the principles of data utilization.

These new tools include, for instance, cognitive systems, machine learning and self-learning neural networks. They show us how to build ever more versatile solutions based on narrow AI while we wait for the breakthrough of general AI. The new solutions share the same insatiable hunger for data. Not all data is suitable, though: it needs to be of high quality and easily accessible. Preparing and utilizing data can be approached from a technical or an intuitive standpoint.

A technical approach emphasizes categorization, management, shared concepts and quality control. Intuitive, in this context, means that users can easily put data, especially new data, to use.

An Engineer Models the Company

An engineer sees the market as an environment based on defined rules, where business can be modeled precisely. Business games used in business school teaching have shown that business can be perceived as a rule-guided world. In chess terms, managing the business (playing the game) through optimization (finding the best move) corresponds to planning the use of production resources within given restrictions. On this principle the chess player, machine or human, calculates different variations and their consequences from the current position.
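To make the optimization idea concrete, here is a minimal Python sketch that enumerates production plans for two products and keeps the most profitable one that fits the resource limits. The profit figures, resource names and the best_plan function are hypothetical and chosen only for illustration; a real planner would more likely use a dedicated optimization solver than brute force.

```python
from itertools import product


def best_plan(profit, usage, capacity, max_units):
    """Try every integer plan and return (best_profit, best_quantities)."""
    n = len(profit)
    best_value, best_quantities = 0, (0,) * n
    for plan in product(range(max_units + 1), repeat=n):
        # A plan is feasible if it stays within every resource limit.
        feasible = all(
            sum(usage[res][i] * plan[i] for i in range(n)) <= capacity[res]
            for res in capacity
        )
        if not feasible:
            continue
        value = sum(p * q for p, q in zip(profit, plan))
        if value > best_value:
            best_value, best_quantities = value, plan
    return best_value, best_quantities


# Hypothetical figures: profit per unit and resource hours per unit.
profit = [30, 20]
usage = {"machine_hours": [2, 1], "labour_hours": [1, 1]}
capacity = {"machine_hours": 8, "labour_hours": 6}

value, plan = best_plan(profit, usage, capacity, max_units=8)
print(value, plan)  # 140 (2, 4)
```

Like the chess player, the function simply calculates each variation and its consequence. The rules (capacities) are fixed in advance, which is exactly what makes the problem tractable.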

In a rule-based world, the one who can create the best model and organize most efficiently wins. This requires understanding, not only of the company’s processes and competitive advantages but also of the behavior of clients and competitors.

Data management is the key to everything. Ideally, data management models perfectly reflect the needs of the company as well as the requirements of the industry. The business vocabulary defines the key terms unambiguously, and through it the technical management of data can be connected to business concepts. When a user searches for data with the term ‘private client’, he or she sees how the term is defined in the business, along with its technical specifications. These include, for instance, where the refined version of the data can be found (e.g. in a data warehouse), its original data sources, and how the data has been modified along the way. Once organized, data is also easy to maintain, as long as quality control processes have been planned and implemented. See this infographic for what you could do with an elastic, cloud-based data warehouse.
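As a rough illustration of how a business vocabulary can tie a term to its technical metadata, the following Python sketch stores a definition together with the location, sources and transformations of the underlying data. All class names, table names and the definition text are hypothetical; this is not the data model of any particular catalog product.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class TechnicalAsset:
    """Where the refined version of a business term physically lives."""
    location: str               # e.g. a data warehouse table
    sources: list[str]          # original source systems
    transformations: list[str]  # how the data was modified on the way


@dataclass
class BusinessTerm:
    """One entry of the business vocabulary."""
    name: str
    definition: str
    assets: list[TechnicalAsset] = field(default_factory=list)


class Glossary:
    def __init__(self) -> None:
        self._terms: dict[str, BusinessTerm] = {}

    def add(self, term: BusinessTerm) -> None:
        # Store under a lower-cased key so lookups ignore case.
        self._terms[term.name.lower()] = term

    def lookup(self, name: str) -> BusinessTerm | None:
        return self._terms.get(name.strip().lower())


glossary = Glossary()
glossary.add(BusinessTerm(
    name="Private client",
    definition="A natural person who buys for personal use.",  # hypothetical
    assets=[TechnicalAsset(
        location="dwh.dim_customer",                   # hypothetical table
        sources=["crm.contacts", "billing.accounts"],  # hypothetical sources
        transformations=["deduplicate by national ID",
                         "keep rows where customer_type = 'P'"],
    )],
))

term = glossary.lookup("private client")
print(term.definition)
for asset in term.assets:
    print(asset.location, "<-", asset.sources, asset.transformations)
```

One lookup returns both the business definition and the lineage, so the user never has to leave the vocabulary to find out where the data comes from.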

A modeled and standardized world is an excellent field for solutions representing narrow AI. When the business rules and the available data are known, an AI can be instructed to learn and optimize, for example, direct marketing or transport logistics.

AI utilizes organized enterprise data

The game-like framework works best for precisely delimited functions. Making use of new ideas and new data still requires human participation in the development process.

An Artist Trusts Intuition

It is not in an artist’s nature to believe that technique alone can make sense of a complex environment. Artists want to observe the market without regulations and given restrictions. In a game of chess, this would mean that you could buy out opponents (acquisitions), increase the number of your pieces (investment), make your pieces move better than the opponent’s (R&D), contest the opponent’s style of play (patent challenges), hire better players to your own team, and so on. New entrants may show up in the competitive arena and challenge the traditional way of playing chess, as Uber and Airbnb have done in the markets they have entered (transportation and accommodation).

In the world of an artist, the available data is alive, new data resources are being found all the time, and the sky is the limit in utilizing AI. In that case, the data catalog tools must also adapt to constant change.

An intuitive approach without organizing data leads to a one-sided solution. In the worst case, the company’s own data pools come to resemble “data system spaghetti” that even an engineer could not make sense of (picture below).

A classic data system spaghetti


Creative use of data is very challenging from the perspective of narrow AI. With the help of machine learning, however, such an AI could observe how users search for and utilize data, and give recommendations based on that behavior. In addition, users can be asked to rate how well each data source works.
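The following Python sketch shows one simple way such recommendations could work: it counts which data sets are used by the same people and optionally weights the result with user ratings. The usage log, data set names and ratings are invented for illustration, and plain co-usage counting stands in here for the machine learning a real catalog would apply.

```python
from collections import Counter, defaultdict


def build_co_usage(usage_log):
    """Count how often two data sets are used by the same user.

    usage_log is an iterable of (user, dataset) pairs.
    """
    by_user = defaultdict(set)
    for user, dataset in usage_log:
        by_user[user].add(dataset)

    co_usage = defaultdict(Counter)
    for datasets in by_user.values():
        for a in datasets:
            for b in datasets:
                if a != b:
                    co_usage[a][b] += 1
    return co_usage


def recommend(co_usage, dataset, ratings=None, top_n=3):
    """Rank related data sets by co-usage count times average user rating."""
    ratings = ratings or {}
    scores = {
        other: count * ratings.get(other, 1.0)  # unrated data sets weigh 1.0
        for other, count in co_usage.get(dataset, {}).items()
    }
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_n]


# Hypothetical usage log collected by the catalog.
log = [
    ("anna", "sales_orders"), ("anna", "crm_contacts"),
    ("ben", "sales_orders"), ("ben", "crm_contacts"), ("ben", "web_clicks"),
    ("carl", "sales_orders"), ("carl", "web_clicks"),
]
co_usage = build_co_usage(log)

print(recommend(co_usage, "sales_orders"))
# ['crm_contacts', 'web_clicks']
print(recommend(co_usage, "sales_orders",
                ratings={"web_clicks": 4.5, "crm_contacts": 3.0}))
# ['web_clicks', 'crm_contacts']
```

The second call shows how surveyed user ratings can change the order even when the observed behavior is identical.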

A Versatile Data Catalog Is Part of The Whole

The ideal solution would be to combine the optimized rule-based world with artistic creativity. New data sources can be searched and utilized to support development without any restrictions. A winning development team consists of a human and a learning AI. The open-source principle brings the development power of the global community to support the work being done in the company.

IBM provides a holistic solution in which the perspectives of the engineer and the artist merge, drawing on open source communities.

When designing a data catalog, it’s therefore recommended to pay attention to the following points:

  1. An easy and versatile search tool, which can search by both keywords and content.
  2. Management of a hierarchical business vocabulary that describes the company and is integrated with the technical metadata directory. Automatic creation of metadata based on, for instance, machine learning. Data transparency increases users’ trust in data validity and supports the GDPR auditing process.
  3. Data quality control, including data source analysis, data cleansing, monitoring and lifecycle management, as well as categorization and validation of data by users.
  4. Management of different data sources. Besides traditional relational and other structured data, connections to new data sources are needed (social media, IoT, documents, etc.), along with image recognition as part of utilizing and categorizing data pools.
  5. Data integration solutions: ETL (Extract, Transform and Load) and data virtualization.
  6. Platform independence. The solution should work both on-premises and on public cloud platforms.
  7. Automatically showing users only the data they are permitted to see. Permissions integrated into processes and projects.
  8. Support for open metadata interfaces. The possibility to move samples prepared with the data catalog into open source analytics tools.
  9. User guidance. The system automatically observes how the user searches for and utilizes data, and later makes recommendations based on this behavior (as Netflix does). The solution also asks users to review the usefulness of the data.
  10. Master Data Management (MDM), for example, with supporting reference data.
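
Two of the points above, searching by both keywords and content (1) and showing only permissible data (7), can be combined in a few lines. The Python sketch below is a simplified illustration under assumed names: the CatalogEntry structure, the roles and the sample entries are all hypothetical, and a real catalog would rely on a proper search index and access control service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    name: str
    keywords: frozenset        # descriptive tags attached to the data set
    content_sample: str        # e.g. column names or a text extract
    allowed_roles: frozenset   # roles permitted to see this entry


def search(entries, query, user_roles):
    """Return names of permitted entries matching by keyword or content."""
    q = query.lower()
    roles = set(user_roles)
    hits = []
    for entry in entries:
        # Point 7: entries the user may not see are skipped entirely.
        if not (entry.allowed_roles & roles):
            continue
        # Point 1: match either the keywords or the content itself.
        in_keywords = any(q in k.lower() for k in entry.keywords)
        in_content = q in entry.content_sample.lower()
        if in_keywords or in_content:
            hits.append(entry.name)
    return hits


# Hypothetical catalog entries.
entries = [
    CatalogEntry("customer_master", frozenset({"customer", "private client"}),
                 "id,name,customer_type", frozenset({"analyst", "sales"})),
    CatalogEntry("payroll", frozenset({"salary"}),
                 "employee,customer_discount,salary", frozenset({"hr"})),
    CatalogEntry("support_tickets", frozenset({"tickets"}),
                 "ticket_id,customer,issue", frozenset({"analyst"})),
]

print(search(entries, "customer", {"analyst"}))
# ['customer_master', 'support_tickets']
print(search(entries, "customer", {"hr"}))
# ['payroll']
```

Filtering by permission before matching means a user never learns that a restricted data set exists, even when its content would match the query.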

You can read more about how the journey to AI and business-ready data begins with information architecture in this whitepaper.

Final Words

Data catalogs significantly facilitate AI development. At best, they guide users to the correct data, and only to data they are permitted to use, regardless of whether it comes from the company’s own systems or from external data sources.

More information:

IBM Watson Knowledge Catalog:

If you have any further questions, please do not hesitate to contact me at

IBM Analytics Sales
