August 1, 2017 | Written by: Jay Limburn and Thomas Schaeck
Categorized: Data Analytics
The core features comprising Watson Data Platform, Data Science Experience and Data Catalog on IBM Cloud, along with additional embedded AI services, including machine learning and deep learning, are now available in Watson Studio and Watson Knowledge Catalog. Get started for free at https://ibm.co/watsonstudio.
One of the earliest documented catalogs was compiled at the great library of Alexandria in the third century BC, to help scholars manage, understand and access its vast collection of literature. While that cataloging process represented a massive undertaking for the Alexandrian librarians, it pales in comparison to the task of wrangling the volume and variety of data that modern organizations generate.
Nowadays, data is often described as an organization’s most valuable asset, but unless users can easily sift through data artifacts to find the information they need, the value of that data may remain unrealized. Catalogs can solve this problem by providing an indexed set of information about the organization’s data, storing metadata that describes all assets and providing a reference to where they can be found or accessed.
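To make the idea concrete, here is a minimal sketch of a catalog that stores metadata plus a reference to each asset's location, with a simple index over tags. All names here (`CatalogEntry`, `Catalog`, the example locations) are hypothetical and chosen for illustration; this is not how IBM Data Catalog is implemented.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Metadata describing one asset, plus a reference to where it lives."""
    name: str
    asset_type: str   # e.g. "dataset", "model", "connection"
    location: str     # a reference to the asset, not the asset itself
    tags: set = field(default_factory=set)


class Catalog:
    """Hypothetical in-memory catalog with a simple tag index."""

    def __init__(self):
        self._entries = {}    # asset name -> CatalogEntry
        self._tag_index = {}  # tag -> set of asset names

    def register(self, entry):
        """Store the entry and index it under each of its tags."""
        self._entries[entry.name] = entry
        for tag in entry.tags:
            self._tag_index.setdefault(tag, set()).add(entry.name)

    def find_by_tag(self, tag):
        """Return the locations of all assets carrying the tag, sorted."""
        names = self._tag_index.get(tag, set())
        return sorted(self._entries[n].location for n in names)


# Illustrative entries; the locations are made-up placeholders.
catalog = Catalog()
catalog.register(CatalogEntry(
    "sales_2017", "dataset", "s3://example-bucket/sales.csv", {"sales", "finance"}))
catalog.register(CatalogEntry(
    "churn_model", "model", "db://example-host/models/churn", {"sales"}))
```

Calling `catalog.find_by_tag("sales")` returns the locations of both assets without the catalog ever holding the data itself, which is the point: the catalog describes and points, it does not store.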
It’s not just the size and complexity of the data that makes cataloging a tough challenge: organizations also need to be able to perform increasingly complicated operations on that data at high speed, and even in real-time. As a result, technology leaders must continually find better ways to solve today’s version of the same cataloging challenges faced in Alexandria all those years ago.
IBM’s aim with Watson Data Platform is to make data accessible for anyone who uses it. An integral part of Watson Data Platform will be a new intelligent asset catalog, IBM Data Catalog, a solution underpinned by a central repository of metadata describing all the information managed by the platform. Unlike many other catalog solutions on the market, the intelligent asset catalog will also offer full end-to-end capabilities around data lifecycle and governance.
Because all the elements of Watson Data Platform can utilize the same catalog, users will be able to share data with their colleagues more easily, regardless of what the data is, where it is stored, or how they intend to use it. In this way, the intelligent asset catalog will unlock the value held within that data across user groups—helping organizations use this key asset to its full potential.
Breaking down silos
With Watson Data Platform, data engineers, data scientists and other knowledge workers throughout an enterprise can search for, share and leverage assets (including datasets, files, connections, notebooks, data flows, models and more). Assets can be accessed using the Data Science Experience web user interface to analyze data.
To collaborate with colleagues, users can put assets into a Project that acts as a shared sandbox where the whole team can access and utilize them. Once their work is complete, they can submit any resulting content to the catalog for further reuse by other people and groups across the organization.
Rich metadata about each asset makes it easy for knowledge workers to find and access relevant resources. Along with data files, the catalog can also include connections to databases and other data sources, both on- and off-premises, giving users a full 360-degree view of all information relevant to their business, regardless of where or how it is stored.
Managing data over time
It’s important to look at data as an evolving asset, rather than something that stays fixed over time. To help manage and trace this evolution, IBM Data Catalog will keep a complete record of which users have added or modified each asset, so that it is always clear who is responsible for any changes.
Smart catalog capabilities for big data management
The concept of catalogs may be simple, but when they’re being used to make sense of huge amounts of constantly changing data, smart capabilities make all the difference. Here are some of the key smart catalog functionalities that we see as integral to tackling the big data challenge.
Data and asset type awareness
When a user chooses to preview or view an asset of a particular type, the data and asset type awareness feature will automatically launch the data in the best viewer—such as a shaper for a dataset, or a canvas for a data flow. This will save time and boost productivity for users, optimizing discovery and making it easier to work with a variety of data types without switching tools.
Intelligent search and exploration
By combining metadata, machine learning-based algorithms and user interaction data, it is possible to fine-tune search results over time. Because each interaction helps the catalog present users with the most relevant data for their purpose, the solution becomes more useful the more it is used.
Effective use of data throughout your organization is a two-way street: when users discover a useful dataset, it’s important for them to help others find it too. Users can be encouraged to engage by taking advantage of curation features, enabling them to tag, rank and comment on assets within the catalog. By augmenting the metadata for each asset, this can help the catalog’s intelligent search algorithms guide users to the assets that are most relevant to their needs.
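As a rough illustration of how curation signals could feed into search, the sketch below ranks assets by keyword overlap with their name and tags, then boosts the score using the average user rating. This is a hand-written heuristic standing in for the machine learning-based ranking described above; the field names, the 1-to-5 rating scale and the scoring formula are all assumptions made for the example.

```python
def score(asset, query_terms):
    """Keyword overlap with name and tags, boosted by the average rating."""
    words = set(asset["name"].lower().split("_"))
    words |= {tag.lower() for tag in asset["tags"]}
    overlap = len(words & {term.lower() for term in query_terms})
    if overlap == 0:
        return 0.0
    ratings = asset["ratings"]  # assumed to be on a 1-5 scale
    avg_rating = sum(ratings) / len(ratings) if ratings else 0.0
    return overlap * (1.0 + avg_rating / 5.0)


def search(assets, query_terms):
    """Return names of matching assets, best score first (ties by name)."""
    scored = [(score(a, query_terms), a["name"]) for a in assets]
    hits = [(s, n) for s, n in scored if s > 0]
    return [n for s, n in sorted(hits, key=lambda pair: (-pair[0], pair[1]))]


# Made-up assets with tags and ratings contributed by users.
assets = [
    {"name": "sales_raw", "tags": ["sales"], "ratings": [2, 3]},
    {"name": "sales_clean", "tags": ["sales", "curated"], "ratings": [5, 5, 4]},
    {"name": "hr_headcount", "tags": ["hr"], "ratings": [5]},
]
```

With this data, `search(assets, ["sales"])` places `sales_clean` ahead of `sales_raw`: both match the query equally well, but the higher ratings contributed by other users tip the balance.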
If data is incomplete or inaccurate, utilizing it can cause more problems than it solves. On the other hand, if data is accurate but users do not trust it, they might not use it when it could make a real difference. In either scenario, data lineage can help.
Data lineage captures the complete history of an asset in the catalog: from its original source, through all the operations and transformations it has undergone, to its current state. By exploring this lineage, users can be confident they know where assets have come from, how those assets have evolved, and whether they can be trusted.
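A simplified way to picture lineage is as an ordered trail attached to each asset: the original source first, followed by every operation applied and who applied it. The sketch below shows that idea only; `TrackedAsset`, `LineageStep` and the example operations are hypothetical names, not part of any IBM API.

```python
from dataclasses import dataclass, field


@dataclass
class LineageStep:
    operation: str     # what was done to the asset
    performed_by: str  # who did it


@dataclass
class TrackedAsset:
    name: str
    source: str        # where the asset originally came from
    steps: list = field(default_factory=list)

    def apply(self, operation, performed_by):
        """Record a transformation and return self so calls can be chained."""
        self.steps.append(LineageStep(operation, performed_by))
        return self

    def lineage(self):
        """Return the history from original source to current state."""
        trail = [f"source: {self.source}"]
        trail += [f"{s.operation} (by {s.performed_by})" for s in self.steps]
        return trail


# Illustrative asset and operations.
asset = TrackedAsset("orders_clean", "warehouse.orders")
asset.apply("drop null customer ids", "alice").apply("aggregate by month", "bob")
```

Reading `asset.lineage()` from top to bottom answers the questions users care about: where the data came from, what was done to it, and by whom.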
Taking a step back to a higher-level view, monitoring features will help users keep track of overall usage of the catalog. Real-time dashboards help chief data officers and other data professionals monitor how data is being used, and identify ways to increase its usage in different areas of the organization.
We have already mentioned that data needs to be seen as an evolving asset—which means our catalogs must evolve with it. We plan to make it easy for users to augment assets with metadata manually; in the future, it may also be possible to integrate algorithms that can discover assets and capture their metadata automatically.
For many organizations, keeping data secure while ensuring access for authorized users is one of the most significant information management challenges. You can mitigate this challenge with rule-based access control and automatic enforcement of data governance policies.
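Rule-based access control can be sketched as a list of rules that is checked automatically on every access request. The example below assumes rules keyed on a user's role and an asset's classification, with access denied unless a rule explicitly allows it; the role names, classifications and deny-by-default choice are assumptions for illustration, not a description of the product's policy engine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    role: str            # role the rule applies to
    classification: str  # asset classification the rule covers
    allow: bool          # whether access is granted


def is_allowed(rules, role, classification):
    """Deny by default; the first matching rule decides."""
    for rule in rules:
        if rule.role == role and rule.classification == classification:
            return rule.allow
    return False


# Hypothetical governance policy.
RULES = [
    Rule("data_scientist", "public", True),
    Rule("data_scientist", "pii", False),
    Rule("data_steward", "pii", True),
]
```

Because the check runs on every request, a policy change takes effect by editing the rule list rather than by updating permissions on each asset individually.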
Finally, the catalog will enable access to all these capabilities and more through a set of well-defined, RESTful APIs. IBM is committed to offering application developers easy access to additional components of Watson Data Platform, such as persistence stores and data sets. We hope that they can use our services to extend their current suite of data and analytics tools, to innovate and create smart new ways of working with data.
For a deeper dive into the challenges of data governance, take a look at our next blog post in this series, “Data governance – You could be looking at it all wrong”.