A data catalog uses metadata (data that describes or summarizes data) to create an informative and searchable inventory of all data assets in an organization. These assets can include, but are not limited to, the following:
This inventory enables data citizens—data analysts, data scientists, data stewards, and other data professionals with access to corporate data—to search through all of an organization’s available data assets and help themselves to the most appropriate data for their analytical or business purposes.
A data catalog typically includes capabilities for collecting and continually enriching—or curating—the metadata associated with each data asset in order to make each asset easier to identify, evaluate, and use properly. The catalog also provides tools that enable users to do the following:
Building on the brief definition above, metadata is data that describes a data asset or provides information about the asset that makes it easier to locate, evaluate, and understand.
The classic or most commonly used example of metadata is the card catalog or online catalog at a library. In these, each card or listing contains information about a book or publication (e.g., title, author, subject, publication date, edition, location within the library, and summary or synopsis) that makes the publication easier for a reader to find and to evaluate. For example: Is it current or outdated? Does it have the information I’m looking for? Is the author someone I trust or whose work I enjoy?
There are many classes of metadata, but a data catalog deals primarily with three: technical metadata, process metadata, and business metadata.
Technical metadata (also called structural metadata) describes how the data is organized and displayed to users by laying out the structure of the data objects, such as tables, columns, rows, indexes, and connections. It tells data professionals how they will need to work with the data: for example, whether they can work with it as is or need to transform it for analysis or integration.
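As a rough illustration of what harvesting technical metadata can look like, the sketch below reads table, column, and index definitions from a SQLite database using Python's standard `sqlite3` module. The function name `harvest_technical_metadata`, the output layout, and the `customers` demo table are assumptions made for this example; they do not describe how any particular catalog product works.

```python
import sqlite3


def harvest_technical_metadata(conn: sqlite3.Connection) -> dict:
    """Return {table: {"columns": [...], "indexes": [...]}} for every user table.

    Hypothetical helper for illustration only.
    """
    catalog = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
        "ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
        columns = [
            {
                "name": row[1],
                "type": row[2],
                "nullable": not row[3],
                "primary_key": bool(row[5]),
            }
            for row in conn.execute(f'PRAGMA table_info("{table}")')
        ]
        # PRAGMA index_list yields (seq, name, unique, origin, partial).
        indexes = [row[1] for row in conn.execute(f'PRAGMA index_list("{table}")')]
        catalog[table] = {"columns": columns, "indexes": indexes}
    return catalog


# Demo against an in-memory database with one made-up table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers ("
    "id INTEGER PRIMARY KEY, email TEXT NOT NULL, signup_date TEXT)"
)
conn.execute("CREATE INDEX idx_customers_email ON customers (email)")

technical = harvest_technical_metadata(conn)
for column in technical["customers"]["columns"]:
    print(column)
print(technical["customers"]["indexes"])
```

An analyst reading this output could see, for instance, that `signup_date` is stored as text and would probably need to be parsed into a date before time-based analysis.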
Process metadata (also called administrative metadata) describes the circumstances of the data asset’s creation and when, how, and by whom it has been accessed, used, updated, or changed. It should also describe who has permission to access and use the data.
Process metadata provides information about the asset’s history and lineage, which can help an analyst decide whether the asset is recent enough for the task at hand, whether it comes from a reliable source, whether it has been updated by trustworthy individuals, and so on. Process metadata can also be used to troubleshoot queries. Increasingly, it is also mined for information on software users or customers, such as what software they’re using and the level of service they’re experiencing.
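The kind of check an analyst makes with process metadata can be sketched in a few lines of Python. The `ProcessMetadata` fields, the two helper functions, and the sample values below are all hypothetical, chosen only to mirror the questions above (is the asset recent enough, and does it come from a reliable source?).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class ProcessMetadata:
    """Hypothetical record of an asset's creation and update history."""
    created_by: str
    created_at: datetime
    last_updated_by: str
    last_updated_at: datetime
    source_system: str


def is_fresh_enough(meta: ProcessMetadata, max_age: timedelta, now: datetime) -> bool:
    """True if the asset was updated within max_age of the reference time."""
    return now - meta.last_updated_at <= max_age


def is_from_trusted_source(meta: ProcessMetadata, trusted_sources: set) -> bool:
    """True if the asset originates from a system the analyst trusts."""
    return meta.source_system in trusted_sources


# Made-up sample values for illustration.
meta = ProcessMetadata(
    created_by="etl_service",
    created_at=datetime(2023, 1, 10, 8, 0),
    last_updated_by="jdoe",
    last_updated_at=datetime(2023, 6, 1, 12, 0),
    source_system="crm",
)
now = datetime(2023, 6, 15, 12, 0)  # fixed reference time keeps the demo repeatable

print(is_fresh_enough(meta, timedelta(days=30), now))
print(is_fresh_enough(meta, timedelta(days=7), now))
print(is_from_trusted_source(meta, {"crm", "erp"}))
```

Passing the reference time in explicitly, rather than calling `datetime.now()` inside the helper, keeps the check deterministic and easy to test.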
Business metadata (sometimes referred to as external metadata) describes the business aspects of the data asset—the business value it has to the organization, its fitness for a particular purpose or various purposes, information about regulatory compliance, and more. Business metadata is where data professionals and line-of-business users speak the same language about data assets.
At a minimum, a data catalog should make it easy to find (or harvest) and organize all the existing metadata associated with any data asset in your organization. It should also provide tools that enable data experts to curate and enrich that metadata with tags, associations, ratings, annotations, and any other information and context that helps users find data faster and use it with confidence.
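To make the idea of curation concrete, here is a minimal Python sketch in which catalog entries are enriched with tags and ratings and then searched. The `CatalogEntry` class, the `search` function, the ranking by average rating, and the sample entries are assumptions for illustration; a real catalog would offer far richer enrichment and search.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Hypothetical catalog entry that experts can enrich over time."""
    name: str
    description: str = ""
    tags: set = field(default_factory=set)
    ratings: list = field(default_factory=list)

    def average_rating(self) -> float:
        # Unrated entries sort last by treating their average as 0.0.
        return sum(self.ratings) / len(self.ratings) if self.ratings else 0.0


def search(entries, term):
    """Return entries whose name, description, or tags contain term,
    ordered from highest to lowest average rating."""
    term = term.lower()
    hits = [
        entry
        for entry in entries
        if term in entry.name.lower()
        or term in entry.description.lower()
        or any(term in tag.lower() for tag in entry.tags)
    ]
    return sorted(hits, key=lambda entry: entry.average_rating(), reverse=True)


# Made-up entries, as if harvested with only a name and description.
entries = [
    CatalogEntry("sales_2023", "Monthly sales totals"),
    CatalogEntry("crm_customers", "Master records from the CRM"),
    CatalogEntry("web_orders", "Online orders"),
]

# Curation: data experts add tags and ratings.
entries[1].tags.add("customer")
entries[1].ratings.extend([5, 4])
entries[2].tags.update({"customer", "sales"})
entries[2].ratings.append(3)

for entry in search(entries, "customer"):
    print(entry.name, entry.average_rating())
```

Note that `web_orders` is found only because of the tag added during curation; neither its name nor its description mentions customers.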
A data catalog requires a significant investment in software and in data citizens’ time and effort—an investment most organizations only want to make once. When evaluating data catalog solutions, look for the following capabilities (in addition to the metadata management capabilities mentioned above):
When data professionals can help themselves to the data they need—without IT intervention, without having to rely on finding experts or colleagues for advice, without limiting themselves to only the assets they know about, and without having to worry about governance and compliance—the entire organization benefits.
A data catalog can also help your organization meet specific technical and business challenges and objectives. By providing analysts with a single, comprehensive view of their customers, a data catalog can help uncover new opportunities for cross-selling, up-selling, targeted promotions, and more. And by promoting, simplifying, or automating governance, a data catalog can help you implement data lake governance that prevents data swamps. It can also provide the policy framework for designing, deploying, and monitoring AI models with a focus on fairness, accountability, safety, and transparency.
IBM Watson Knowledge Catalog is an open and intelligent data catalog for enterprise data and AI model governance, quality, and collaboration. It helps data citizens quickly discover, curate, categorize, and share data assets, data sets, analytical models, and their relationships with other members of your organization.
Powered by IBM Cloud Pak for Data, Watson Knowledge Catalog serves as a single source of truth that gives data engineers, data stewards, data scientists, and business analysts self-service access to data they can trust. It also delivers data governance, data quality, and active policy management to help your organization protect and govern sensitive data, trace data lineage, manage data lakes, and prepare for your journey to AI.
Activate business-ready data for AI and analytics with intelligent cataloging, backed by active metadata and policy management.
Automate how data across a hybrid data and cloud landscape is discovered, cataloged, and enriched for user relevancy. Provide access to business-ready data to more people.