What is IBM Watson Knowledge Catalog?
IBM Watson® Knowledge Catalog is a cloud-based enterprise metadata repository that lets you catalog your knowledge and analytics assets, including machine learning models and structured and unstructured data wherever they reside, so that they can be easily accessed and used to fuel data science and all forms of AI.
For selected source types, Watson™ Knowledge Catalog can automatically discover and register data assets at the provided connection. As assets are added to the catalog, they are automatically indexed and classified, making it easy for users such as data engineers, data scientists, data stewards and business analysts to find, understand, share and use the assets. AI-powered search and recommendations guide users to the most relevant assets in the catalog, based on an understanding of the relationships between assets, how those assets are used, and the social connections between users.
Watson Knowledge Catalog also provides an intelligent and robust governance framework that lets you define and enforce data and access policies to ensure that the right data goes to the right people.
Through the Watson Knowledge Catalog Business Glossary, users can create a common business vocabulary and associate its terms with your assets, policies and rules, providing the bridge between the business domain and your technical assets.
Do I need to move my data into Watson Knowledge Catalog?
No. You can keep your data in its existing repositories. Watson Knowledge Catalog stores the metadata of your assets.
What data sources and asset types are supported?
IBM provides over 30 connectors to cloud or on-premises data-source types that will allow you to connect to your remote data assets. For example, connectors to IBM Db2® in the cloud or on premises, IBM Cloudant®, IBM Cloud™ Object Storage, Oracle, Microsoft SQL Server, Microsoft Azure, Amazon S3, Salesforce.com, Hortonworks HDFS, Sybase and many more are available from IBM.
In addition to assets from remote data sources, Watson Knowledge Catalog supports other asset types, such as structured (row/column), semi-structured and unstructured data. For example, you can add CSV, Microsoft Excel, PDF, Text, Microsoft Word, Jupyter Notebook (IPYNB), image and HTML files, to name a few, to the catalog to profile and share with other users.
What is the maximum number of assets I can have in Watson Knowledge Catalog?
With the Professional plan, there is no limit on the number of assets you can have in Watson Knowledge Catalog. With the Standard and Lite plans, the limits are 500 and 50 assets, respectively.
Does Watson Knowledge Catalog provide governance services?
Yes. Watson Knowledge Catalog includes an automated policy-enforcement engine that determines outcomes based on the policies in effect and the action being taken. You can set up your governance policies within the system to restrict access to data or to transform the data by masking sensitive content.
Can you delete or change the original source of data with a data policy that masks data?
No. When a data-protection policy anonymizes sensitive data in the catalog, only the preview data that is managed by the application is transformed. The original source data is not modified.
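As a rough illustration of that behavior, the Python sketch below masks a column in a copy of the rows used for preview and leaves the source rows untouched. It is not Watson Knowledge Catalog code: `mask_value`, `build_preview` and the sample rows are hypothetical names and data invented for this example, and the masking method shown (replacing characters with "X") is only an assumption about what a masking transform might look like.

```python
import copy


def mask_value(value: str) -> str:
    """Hypothetical masking method: replace every character with 'X'."""
    return "X" * len(value)


def build_preview(source_rows, sensitive_columns):
    """Return a masked copy of the rows; the source rows are never modified."""
    preview = copy.deepcopy(source_rows)  # work on a copy, not the original
    for row in preview:
        for column in sensitive_columns:
            if column in row:
                row[column] = mask_value(str(row[column]))
    return preview


# Invented sample data standing in for a governed data asset.
source = [{"name": "Ada", "email": "ada@example.com"}]
preview = build_preview(source, {"email"})

print(preview[0]["email"])  # masked value shown in the preview
print(source[0]["email"])   # original value, unchanged
```

The key design point is the copy: the transform is applied to what the application displays, so the data in the original repository stays as it was.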
Does Watson Knowledge Catalog provide classification services?
Watson Knowledge Catalog can automatically classify columns in your data assets when they are added to the catalog. Built-in components provide over 160 attribute classifiers, including names, emails, postal addresses, credit card numbers, driver's license numbers, government identification numbers, dates of birth, demographic information, Data Universal Numbering System (DUNS) numbers and more. Catalogs also profile unstructured data assets and extract metadata from content, such as categories, concepts, sentiment and emotion. See Profile data assets.
Are there data-preparation capabilities in Watson Knowledge Catalog?
Yes. Data-preparation capabilities are available through Data Refinery, which is part of Watson Knowledge Catalog. Data Refinery lets you discover, cleanse and transform your data with built-in operations, and it includes profiling and visualization tools, such as charts, graphs and statistics, to help you interact with and understand your data. Data-access and data-transform policies defined in Watson Knowledge Catalog are also enforced in Data Refinery, so that sensitive data originating from governed catalogs remains protected.
Can you set up access groups for people in different lines of business?
Yes. Access groups can be set up through IBM Cloud Identity and Access Management. In the Access Control module of Watson Knowledge Catalog, you can add a collaborator or a user group.
What are capacity unit hours?
Data Refinery flows, the Data Refinery interactive UI, and profiling jobs are charged for the number of whole or fractional capacity units required per hour for each capacity type:
- Data Refinery flows require 1.5 capacity units per hour with a default Spark environment. For other custom environments, the calculation depends on the number of executors and the resources used for the Spark driver and executors.
- Data Refinery interactive UI requires 1.5 capacity units per hour, beginning when the Data Refinery UI starts and ending when it is closed.
- Profiling jobs require 6 capacity units per hour. A minimum charge of 0.96 CUHs (roughly 10 minutes of run time) applies to each job execution.
A set number of free capacity unit hours are included in each plan for the month. For Standard and Professional plans, charges will apply after the plan limit is reached for that month. For a Lite plan, after the plan limit for that month is reached, no Data Refinery flows or profiling jobs can be run until the next month, or until the plan is upgraded to the Standard or Professional plan.
Data Refinery flow examples using default Capacity Type 3:
- One Data Refinery flow runs for 1 hour: 1.5 CUHs
- Two Data Refinery flows run for 1 hour each: 2 hours * 1.5 CUHs = 3 CUHs
- One Data Refinery flow runs for 30 minutes: 0.5 hours * 1.5 CUHs = 0.75 CUHs
- Interactive Data Refinery UI is used for 1 hour: 1.5 CUHs
Profiling examples (profiling jobs can be automatically or manually triggered):
- A Profiling job runs for 30 minutes: 0.5 hours * 6 CUHs = 3 CUHs
- A Profiling job runs for 9 minutes. The minimum charge applies in this scenario: 0.16 hours * 6 CUHs = 0.96 CUHs
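The arithmetic in the examples above can be sketched in a few lines of Python. This is an illustrative calculation only, not an IBM billing API: the function and constant names are hypothetical, the rates (1.5 and 6 capacity units per hour) and the 0.96 CUH profiling minimum are taken from the text above, and actual invoices may round or meter usage differently.

```python
# Illustrative only: names are hypothetical; rates come from the FAQ text above.
FLOW_RATE = 1.5            # capacity units per hour, default Spark environment
INTERACTIVE_UI_RATE = 1.5  # capacity units per hour while the UI is open
PROFILING_RATE = 6.0       # capacity units per hour
PROFILING_MIN_CUH = 0.96   # minimum charge per profiling job execution


def cuh(rate_per_hour: float, minutes: float) -> float:
    """Capacity unit hours = capacity units per hour * run time in hours."""
    return rate_per_hour * minutes / 60


def data_refinery_flow_cuh(minutes: float) -> float:
    return cuh(FLOW_RATE, minutes)


def interactive_ui_cuh(minutes: float) -> float:
    return cuh(INTERACTIVE_UI_RATE, minutes)


def profiling_job_cuh(minutes: float) -> float:
    """Profiling jobs are charged at least the per-job minimum."""
    return max(cuh(PROFILING_RATE, minutes), PROFILING_MIN_CUH)


print(data_refinery_flow_cuh(60))      # 1.5
print(2 * data_refinery_flow_cuh(60))  # 3.0
print(data_refinery_flow_cuh(30))      # 0.75
print(profiling_job_cuh(30))           # 3.0
print(profiling_job_cuh(9))            # 0.96 (minimum charge applies)
```

Note that a 9-minute profiling job would come to 0.9 CUHs at the plain hourly rate, which is why the 0.96 CUH minimum is charged instead.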
After purchasing a Standard or Professional plan, how much setup is required to take full advantage of the product?
Watson Knowledge Catalog is fully self-service, so an administrator can start by creating a catalog and then begin adding and curating assets right away. Additional tasks can include:
- Building a business glossary
- Defining data protection policies to govern access to data
- Inviting users to the catalog
Is this available on IBM Cloud Pak for Data?
Yes. Explore more about IBM's latest integrated data platform: IBM Cloud Pak™ for Data