Discovering assets (Watson Knowledge Catalog)

Discover assets to get insight into the quality and business content of tables and files analyzed from various data connections. You can choose between a quick scan and automated discovery.

When the number and size of assets are unknown and you need a quick view of their data quality, quick scan is the fastest method. After you have a good first-level understanding of the assets, you can take a subset of them, perhaps the most interesting or most useful, and run a deeper analysis by using automated discovery.

Quick scan

Quick scan analyzes a sample of each table or file to quickly provide analysis results, including a data quality score and automatically assigned data classes and business terms. You can review the assets, any data quality information, and the term and data class assignments in the quick scan results. After reviewing, you can publish the assets along with the other information to one or more catalogs.

Quick scan is best suited to getting a fast initial analysis of large numbers of tables and files from data sources that you might not be familiar with. After quick scan completes, you can review the results and decide which data assets warrant a deeper analysis that goes beyond the initial sample.

Automated discovery

Automated discovery provides detailed analysis results of all assets from data sources. Unlike quick scan, automated discovery automatically imports the metadata and analysis results into the default catalog. The analysis results are available for viewing in a project and include a data quality score, automatically assigned data classes and business terms, data types, formats, frequency distributions, and more.

This type of discovery is suitable for smaller numbers of tables and files from data sources, or from subsets (schemas or file paths) of data sources. You can use automated discovery when you already have a general overview of the quality and business content of your data, and you want to see and review the additional details.

Supported connections

The following table lists the data sources from which quick scan or automated discovery can discover assets.

Important:

| Data source | Connection type | Quick scan | Automated discovery | Synchronization¹ |
| --- | --- | --- | --- | --- |
| Amazon DynamoDB | JDBC | Not supported | Connection created through metadata import² | Data assets |
| Amazon S3 (CSV files only) | Amazon S3 | Not supported | Connection created through metadata import² | Not supported. Add asset to the catalog directly. |
| Apache Cassandra | JDBC | Not supported | Connection created through metadata import² | Data assets |
| Apache Kudu | Generic JDBC | Platform-level connection. Include the actual values of any driver configuration options such as SSL options directly in the JDBC URL. Values that you define in the JDBC properties field and add to the JDBC URL as variables are not resolved. | Platform-level connection. Include the actual values of any driver configuration options such as SSL options directly in the JDBC URL. Values that you define in the JDBC properties field and add to the JDBC URL as variables are not resolved. | Not supported. Add asset to the catalog directly. |
| Data Virtualization Manager | JDBC | Not supported | Connection created through metadata import² | Data assets |
| Db2® | Db2 | Platform-level connection | Platform-level connection | Data assets; Connection (native and JDBC) |
| Db2 Warehouse | Db2 Warehouse | Platform-level connection | Platform-level connection | Data assets |
| Google BigQuery | JDBC | Not supported | Connection created through metadata import² | Data assets |
| HDFS | Apache HDFS | Not supported | Platform-level connection or created through metadata import². Additional considerations apply for this type of connection. For more information, see Known issues with Hive or HDFS connections for data discovery. | Data assets |
| Hive | Apache Hive | Platform-level connection or created through metadata import². Additional considerations apply for this type of connection. For more information, see Known issues with Hive or HDFS connections for data discovery. | Platform-level connection or created through metadata import². Additional considerations apply for this type of connection. For more information, see Known issues with Hive or HDFS connections for data discovery. | Data assets |
| Microsoft Azure Data Lake Store | Microsoft Azure Datalake Storage Connector | Not supported | Connection created through metadata import² | Data assets |
| Microsoft™ SQL Server | Third party: Microsoft SQL Server | Platform-level connection | Platform-level connection | Data assets; Connection (JDBC) |
| MongoDB | Third party: MongoDB | Platform-level connection | Platform-level connection | Data assets |
| MySQL (Enterprise Edition) | ODBC | Not supported | Connection created through metadata import² | Data assets |
| Netezza® (PureData System for Analytics) | ODBC or IBM Netezza Connector | Not supported | Connection created through metadata import² | Data assets |
| Oracle | Third party: Oracle | Platform-level connection | Platform-level connection | Data assets; Connections (JDBC) |
| Pivotal Greenplum (Greenplum) | ODBC | Not supported | Connection created through metadata import² | Data assets |
| PostgreSQL | JDBC | Connection created through metadata import². Additional configuration is required for publishing results.³ | Connection created through metadata import² | Data assets |
| SAP HANA | JDBC | Not supported | Connection created through metadata import² | Data assets |
| Snowflake | Generic JDBC | Platform-level connection | Platform-level connection | Data assets |
| Sybase | ODBC | Not supported | Connection created through metadata import² | Data assets |
| Teradata | Generic JDBC | Platform-level connection | Platform-level connection | Data assets; Connections (JDBC) |
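
If you need to check support for many data sources, for example in a planning script, the table can be encoded as a simple lookup. The following is a minimal sketch, not an IBM API: the `SUPPORT` dictionary, its keys, and the `supported_methods` function are hypothetical names, and only a few rows of the table are included for illustration.

```python
from typing import Dict, List, Optional

# Hypothetical encoding of a few rows from the supported connections table.
# None means "Not supported"; a string names the kind of connection required.
SUPPORT: Dict[str, Dict[str, Optional[str]]] = {
    "Amazon DynamoDB": {
        "quick_scan": None,
        "automated_discovery": "metadata import",
    },
    "Db2": {
        "quick_scan": "platform-level",
        "automated_discovery": "platform-level",
    },
    "PostgreSQL": {
        "quick_scan": "metadata import",
        "automated_discovery": "metadata import",
    },
    "Snowflake": {
        "quick_scan": "platform-level",
        "automated_discovery": "platform-level",
    },
}


def supported_methods(data_source: str) -> List[str]:
    """Return the discovery methods that the table lists as supported."""
    row = SUPPORT.get(data_source)
    if row is None:
        raise KeyError(f"No table row encoded for data source: {data_source}")
    return [method for method, connection in row.items() if connection is not None]
```

For example, `supported_methods("Amazon DynamoDB")` returns only `automated_discovery`, because the table lists quick scan as not supported for that data source.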

Table notes:

1) The following information assets are synchronized from the information assets view to the default catalog:

The following data assets are synchronized between the default catalog and information assets view:

For more information, see Information assets view.

2) Metadata import must be enabled. For more information about creating such connections, see Creating metadata import connections for discovery.

3) To publish quick scan results from a PostgreSQL connection created through metadata import, you must define a platform-level connection with exactly the same name (case-sensitive) and select Third party: PostgreSQL as the type of data source. Otherwise, publishing fails with the error Connection not found. Note that this platform-level connection is for publishing purposes only. You cannot use it for discovery.
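
The naming rule in note 3 is easy to get wrong because the match is case-sensitive. The following minimal sketch shows the check that the note implies. It does not call any product API: the function name and the `platform_connections` mapping (connection name to data source type) are hypothetical and stand in for whatever inventory of platform-level connections you keep.

```python
from typing import Dict

# Data source type that note 3 requires for the platform-level connection.
REQUIRED_TYPE = "Third party: PostgreSQL"


def can_publish_quick_scan(import_connection_name: str,
                           platform_connections: Dict[str, str]) -> bool:
    """Check whether a platform-level connection with exactly the same name
    (case-sensitive) and the required data source type exists."""
    # Dictionary lookup is case-sensitive, which mirrors the rule in note 3.
    return platform_connections.get(import_connection_name) == REQUIRED_TYPE
```

With `{"SalesPG": "Third party: PostgreSQL"}` as the inventory, the check passes for `SalesPG` but not for `salespg`, which corresponds to the case in which publishing would fail with Connection not found.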

Required user permissions and data quality project roles

To run automated discovery or quick scan, you need the following user permission and data quality project roles:

To view, cancel, delete, or rerun discovery jobs, you must be the owner of the discovery job, an isadmin user, or have both the Data Operator and the Business Analyst roles in the data quality project referenced in the discovery job.

To review and publish analysis results, you must have both the Data Operator and the Business Analyst roles in the data quality project referenced in the discovery job. To publish assets, you must also be a collaborator with the Admin or the Editor role in the catalog to which you want to publish. For automated discovery, this can be configured differently.

Permissions required for working with discovery jobs and results are as follows:

| Action | Permission | Role |
| --- | --- | --- |
| View discovery jobs | View data quality or Discover assets | Owner of the discovery job, isadmin user, or Data Steward role or both the Data Operator and the Business Analyst roles in the data quality project referenced in the discovery job |
| Cancel discovery jobs | Discover assets | Owner of the discovery job, isadmin user, or Data Operator and Business Analyst roles in the data quality project referenced in the discovery job |
| Delete discovery jobs | Discover assets | Owner of the discovery job, isadmin user, or Data Operator and Business Analyst roles in the data quality project referenced in the discovery job |
| Rerun discovery jobs | Discover assets | Owner of the discovery job, isadmin user, or Data Operator and Business Analyst roles in the data quality project referenced in the discovery job |
| Review and publish analysis results | View data quality or Discover assets | Business Analyst or Data Steward role in the data quality project referenced in the discovery job. To publish quick scan results, you must also be a collaborator with the Admin or the Editor role in the catalog to which you want to publish. For automated discovery results, the required catalog collaborator role depends on the configuration. |
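
The rules in the table can be read as a permission check combined with a role check. The following minimal sketch models that reading. It is not a product API: the `User` class, the action keywords, and the `can_perform` function are hypothetical, the sketch assumes that both the permission and the role condition must hold, and it leaves out the catalog collaborator role that publishing additionally requires.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class User:
    """Hypothetical model of a user in the context of one discovery job."""
    permissions: Set[str] = field(default_factory=set)
    project_roles: Set[str] = field(default_factory=set)  # data quality project roles
    is_owner: bool = False   # owner of the discovery job
    is_admin: bool = False   # isadmin user


def _operator_and_analyst(user: User) -> bool:
    # Both roles are required together where the table lists them with "and".
    return {"Data Operator", "Business Analyst"} <= user.project_roles


def can_perform(action: str, user: User) -> bool:
    """Evaluate the table rows; assumes permission AND role are both needed."""
    view_or_discover = bool({"View data quality", "Discover assets"} & user.permissions)
    if action == "view":
        has_permission = view_or_discover
        has_role = (user.is_owner or user.is_admin
                    or "Data Steward" in user.project_roles
                    or _operator_and_analyst(user))
    elif action in {"cancel", "delete", "rerun"}:
        has_permission = "Discover assets" in user.permissions
        has_role = user.is_owner or user.is_admin or _operator_and_analyst(user)
    elif action == "review":
        has_permission = view_or_discover
        has_role = bool({"Business Analyst", "Data Steward"} & user.project_roles)
    else:
        raise ValueError(f"Unknown action: {action}")
    return has_permission and has_role
```

Under these assumptions, a user with the View data quality permission and the Data Steward role can view jobs and review results but cannot cancel, delete, or rerun jobs, because those actions require the Discover assets permission.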
