What is a data catalog?
A data catalog uses metadata—data that describes or summarizes data—to create an informative and searchable inventory of all data assets in an organization. These assets can include, but are not limited to, the following:
- Structured (tabular) data
- Unstructured data, including documents, web pages, email, social media content, mobile data, images, audio, and video
- Reports and query results
- Data visualizations and dashboards
- Machine learning models
- Connections between databases
This inventory enables data citizens—data analysts, data scientists, data stewards, and other data professionals with access to corporate data—to search through all of an organization’s available data assets and help themselves to the most appropriate data for their analytical or business purposes.
A data catalog typically includes capabilities for collecting and continually enriching—or curating—the metadata associated with each data asset in order to make each asset easier to identify, evaluate, and use properly. The catalog also provides tools that enable users to do the following:
- Search the catalog
- Automate the discovery of potentially relevant data for which they didn’t specifically search
- Govern the use of the data in compliance with industry or government regulations
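To make the search capability above concrete, here is a minimal sketch of keyword search over asset metadata. The asset names, types, and tags are hypothetical examples, not from any particular product:

```python
# A toy catalog: each asset carries searchable metadata.
catalog = [
    {"name": "sales_2023", "type": "table", "tags": ["sales", "revenue", "finance"]},
    {"name": "churn_model", "type": "ml_model", "tags": ["customers", "churn"]},
    {"name": "web_logs", "type": "unstructured", "tags": ["clickstream", "web"]},
]

def search(catalog, keyword):
    """Return assets whose name or tags mention the keyword."""
    keyword = keyword.lower()
    return [
        asset for asset in catalog
        if keyword in asset["name"].lower()
        or any(keyword in tag for tag in asset["tags"])
    ]

print([a["name"] for a in search(catalog, "sales")])  # ['sales_2023']
```

A production catalog would add ranking, recommendations, and fuzzy matching on top of this basic metadata lookup.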
What is metadata?
Building on the brief definition above, metadata is data that describes a data asset or provides information about the asset that makes it easier to locate, evaluate, and understand.
The classic or most commonly used example of metadata is the card catalog or online catalog at a library. In these, each card or listing contains information about a book or publication (e.g., title, author, subject, publication date, edition, location within the library, and summary or synopsis) that makes the publication easier for a reader to find and to evaluate. For example: Is it current or outdated? Does it have the information I’m looking for? Is the author someone I trust or whose work I enjoy?
There are many classes of metadata, but a data catalog deals primarily with three: technical metadata, process metadata, and business metadata.
Technical metadata (also called structural metadata) describes how the data is organized and displayed to users by describing the structure of the data objects—such as tables, columns, rows, indexes, and connections. Technical metadata tells data professionals how they will need to work with the data—for example, if they can work with it as is, or if they need to transform it for analysis or integration.
Process metadata (also called administrative metadata) describes the circumstances of the data asset’s creation and when, how, and by whom it has been accessed, used, updated, or changed. It should also describe who has permission to access and use the data.
Process metadata provides information about the asset’s history and lineage, which can help an analyst decide if the asset is recent enough for the task at hand, if it comes from a reliable source, if it has been updated by trustworthy individuals, and so on. Process metadata can also be used to troubleshoot queries. And increasingly, process metadata is mined for information on software users or customers, such as what software they’re using and the level of service they’re experiencing.
Business metadata (sometimes referred to as external metadata) describes the business aspects of the data asset—the business value it has to the organization, its fitness for a particular purpose or various purposes, information about regulatory compliance, and more. Business metadata is where data professionals and line-of-business users speak the same language about data assets.
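The three classes above can be pictured as sections of a single metadata record attached to one data asset. The field names in this sketch are illustrative, not drawn from any specific catalog product:

```python
# Illustrative metadata record for one data asset, grouped by metadata class.
asset_metadata = {
    "technical": {  # structural: how the data is organized
        "format": "table",
        "columns": ["customer_id", "order_date", "amount"],
        "row_count": 1_200_000,
    },
    "process": {  # administrative: history, lineage, and access
        "created_by": "etl_pipeline_v2",
        "created_at": "2023-01-15",
        "last_updated_by": "jdoe",
        "access_roles": ["analyst", "steward"],
    },
    "business": {  # business meaning, value, and compliance context
        "description": "Completed customer orders",
        "owner": "Sales Ops",
        "compliance_tags": ["GDPR"],
    },
}
```

An analyst reads the technical section to decide how to query the asset, the process section to judge its freshness and provenance, and the business section to confirm it fits the task.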
At a minimum, a data catalog should make it easy to find (or harvest) and organize all the existing metadata associated with any data asset in your organization. It should also provide tools that enable data experts to curate and enrich that metadata with tags, associations, ratings, annotations, and any other information and context that helps users find data faster and use it with confidence.
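Curation of the kind described above can be sketched as a small enrichment step: a data steward layers tags, ratings, and annotations onto an asset's harvested metadata. The asset structure and field names here are hypothetical:

```python
# Sketch of metadata enrichment: a steward adds curated context to an asset.
asset = {"name": "sales_2023", "tags": ["finance"], "ratings": []}

def enrich(asset, tags=None, rating=None, annotation=None):
    """Merge curated tags, a rating, and an annotation into an asset in place."""
    if tags:
        asset["tags"] = sorted(set(asset["tags"]) | set(tags))
    if rating is not None:
        asset["ratings"].append(rating)
    if annotation:
        asset.setdefault("annotations", []).append(annotation)
    return asset

enrich(asset, tags=["sales", "verified"], rating=5,
       annotation="Reconciled with the general ledger for FY2023.")
print(asset["tags"])  # ['finance', 'sales', 'verified']
```

Over time these curated layers are what let other users find the asset faster and trust it without asking around.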
Data catalog tools—what to look for
A data catalog requires a significant investment in software and in data citizens’ time and effort—an investment most organizations only want to make once. When evaluating data catalog solutions, look for the following capabilities (in addition to the metadata management capabilities mentioned above):
- An excellent data ‘shopping’ experience that includes data discovery: The goal of a data catalog is to enable all your data citizens to serve themselves to the data they need. You should expect a search experience on par with Netflix, Amazon, or other popular consumer sites, where anyone can quickly find results based on the metadata they search for and also receive relevant recommendations and/or warnings based on ratings and reviews from other users.
- Simplified compliance: Keeping data compliant manually is nearly impossible; at this writing, 107 countries have enacted regulations to protect personal data privacy alone. A data catalog should simplify compliance by profiling data assets, inferring their relevance to specific regulations, and automatically classifying and tagging them for future reference. Machine learning capabilities are powerful work savers here.
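As one illustration of automated classification, a catalog might profile column names and sampled values to flag columns that likely contain personal data. This sketch uses simple pattern matching; the patterns and tag names are hypothetical, and real products typically use machine learning classifiers rather than regexes alone:

```python
import re

# Hypothetical patterns for flagging values that look like personal data.
PII_PATTERNS = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "phone": re.compile(r"\+?\d[\d\-\s]{7,}\d"),
}

def classify_column(name, sample_values):
    """Return PII tags inferred from a column's name and sampled values."""
    tags = set()
    if "email" in name.lower():
        tags.add("email")
    for value in sample_values:
        for tag, pattern in PII_PATTERNS.items():
            if pattern.search(str(value)):
                tags.add(tag)
    return sorted(tags)

print(classify_column("contact_email", ["alice@example.com"]))  # ['email']
```

Once columns are tagged this way, the catalog can map the tags to the specific regulations (GDPR, CCPA, and so on) that govern their use.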
- Connections to a wide variety of data sources: In order to serve as an enterprise-wide data asset inventory, a data catalog needs to connect to all the assets in your enterprise. Look for connections to all the types of assets you have now and a commitment to building out connections going forward. Also look for a catalog you can deploy wherever your data resides—on-premises or in a public, private, hybrid, or hybrid multicloud environment.
- Support for quality and governance that ensures trusted data: A data catalog should integrate seamlessly with any quality and governance programs and tools you have in place, including data quality rules, business glossaries, and workflows.
- Support for ‘explainable AI’: Increasingly, data governance is responsible for managing artificial intelligence (AI) models—not only understanding the data used, but also how different inputs influence decisions and results. Make sure any data catalog you choose helps tag and prepare data assets for optimal use and transparency in your AI models.
Data catalog benefits
When data professionals can help themselves to the data they need—without IT intervention, without having to rely on finding experts or colleagues for advice, without limiting themselves to only the assets they know about, and without having to worry about governance and compliance—the entire organization benefits.
- Better understanding of data through improved context: Analysts can find detailed descriptions of data, including comments from other data citizens, and better understand how data is relevant to the business.
- Increased operational efficiency: A data catalog creates an optimal division of labor between users and IT—data citizens can access and analyze data faster, and IT staff can spend more time focusing on high-priority tasks.
- Reduced risk: Analysts have greater confidence that they’re working with data they’re authorized to use for a given purpose, in compliance with industry and data privacy regulations. They can also quickly review annotations and metadata to spot null fields or incorrect values that can impact analysis.
- Greater success with data management initiatives: The more difficult it is for data analysts to find, access, prepare, and trust data, the less likely it is that business intelligence (BI) initiatives and big data projects will succeed. By making data easy to find, access, prepare, and trust, a data catalog removes these obstacles.
- Better data and better analysis, faster—a competitive advantage: Data professionals can respond rapidly to problems, challenges, and opportunities with analysis and answers based on all of the most appropriate, contextual data within the organization.
A data catalog can also help your organization meet specific technical and business challenges and objectives. By providing analysts with a single, comprehensive view of their customers, a data catalog can help uncover new opportunities for cross-selling, up-selling, targeted promotions, and more. And by promoting, simplifying, or automating governance, a data catalog can help you implement data lake governance that prevents data swamps and provides the policy framework for designing, deploying, and monitoring AI models with a focus on fairness, accountability, safety, and transparency.
Data catalog and IBM Cloud
IBM Watson Knowledge Catalog is an open and intelligent data catalog for enterprise data and AI model governance, quality, and collaboration. It helps data citizens quickly discover, curate, categorize, and share data assets, data sets, analytical models, and their relationships with other members of your organization.
Powered by IBM Cloud Pak for Data, Watson Knowledge Catalog serves as a single source of truth for data engineers, data stewards, data scientists, and business analysts to gain self-service access to data they can trust. It also delivers data governance, data quality, and active policy management to help your organization protect and govern sensitive data, trace data lineage, manage data lakes, and prepare for your journey to AI.