Running automated discovery (Watson Knowledge Catalog)
Discover assets, analyze and classify them, assess the data quality, and register those data assets in the default catalog.
When you run a discovery, metadata is imported into the default catalog. You can choose any of the following options:
- Analyze columns - Runs a column analysis on the data assets to identify properties like data class, data type, and format.
- Assign terms - Assigns business terms to imported assets based on name similarity, data classification, and machine learning insights. In the discovery results, you can publish selected term assignments or all of them.
- Analyze data quality - Runs a quality analysis on the data assets to scan for common dimensions of data quality like missing values or data class violations.
- Publish results to catalog - Publishes discovery results to the default catalog to make them available to other users. Results include term assignments, data class assignments, quality score, and other quality-related statistics.
- Use data sampling - Run column analysis or data quality analysis on a data sample. Sampling can be configured globally for all data quality projects, for specific data quality projects, or at the discovery job level. Only users with the Administrator role can modify sampling settings at any of these levels. The global sampling settings are applied unless they are overwritten at the project or job level.
For data quality projects that you create here, the default sampling method is Sequential and the default sampling size is 1,000 records. If you have the required permissions, you can overwrite the sample size.
With sampling enabled, you can specify one of the following sampling techniques:
- Sequential (first n rows)
  The sample includes the first n records, where n is the sample size that you specify. For example, if you have 1,000,000 records and you specify a sample size of 2,000, then the sample includes the first 2,000 records.
- Nth (use every nth row)
  The sample takes every nth record, where n is the interval that you specify, until the sample size is reached. For example, if you have 1,000,000 records and specify a sample size of 2,000 with an interval of 10, then a maximum of 20,000 records are read (2,000*10), with every 10th record selected to retrieve the sample size of 2,000.
- Random (randomized selection of rows)
  The sample randomly selects records until your sample size is reached. The maximum number of records that are read to select the sample is (100/sample_percent)*sample_size*2. The number 2 is used in the formula to ensure that enough records are read to produce a valid random sample size. For example, if you have 1,000,000 records and you specify a sample size of 2,000 and a percent of 5, the sample returns 2,000 records and reads, at most, 80,000 records ((100/5)*2,000*2=80,000).
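The arithmetic behind the sampling techniques can be sketched in a few lines of Python. This is an illustration only, not the product's implementation: the function names are hypothetical, and the Nth sketch assumes that "every 10th record" means records 10, 20, 30, and so on.

```python
from itertools import islice


def sequential_sample(records, sample_size):
    """Return the first sample_size records (Sequential)."""
    return list(islice(records, sample_size))


def nth_sample(records, sample_size, interval):
    """Return every interval-th record until sample_size is reached (Nth).

    Assumption: the first selected record is record number `interval`.
    """
    # At most sample_size * interval records are read.
    window = islice(records, max_records_read_nth(sample_size, interval))
    return list(islice(window, interval - 1, None, interval))


def max_records_read_nth(sample_size, interval):
    """Upper bound on records read for Nth sampling."""
    return sample_size * interval


def max_records_read_random(sample_size, sample_percent):
    """Upper bound on records read for Random sampling:
    (100 / sample_percent) * sample_size * 2
    """
    return int((100 / sample_percent) * sample_size * 2)


records = range(1, 1_000_001)  # stand-in for 1,000,000 records

print(len(sequential_sample(records, 2000)))   # 2000
print(len(nth_sample(records, 2000, 10)))      # 2000
print(max_records_read_nth(2000, 10))          # 20000
print(max_records_read_random(2000, 5))        # 80000
```

The two upper bounds match the worked examples: 2,000*10=20,000 records for Nth sampling and (100/5)*2,000*2=80,000 records for Random sampling.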
Required permissions
To create and run discovery jobs, you must have these user permissions:
- Manage asset discovery
- Manage data quality
In addition, you must have the following project roles:
- To analyze columns and data quality, you must have the Business Analyst and the Data Operator project roles in the data quality project that you select when starting a discovery.
- To publish analysis results, you must have the Business Analyst project role in the data quality project that you select when starting a discovery. Depending on the configuration, you might also need to have the Admin or the Editor role in the default catalog.
Go to Governance > Data discovery > New discovery job > Automated discovery.
Select the data connection that you want to use for the discovery. For a list of supported connections, see Discover assets. Any connection that you created through metadata import appears in the list of existing connections.
Choose from existing connections or add a new one:
Click Select a connection, then select an existing connection or click Find or add a connection.
If you selected an existing connection, continue with step 3.
On the Add existing connection page, select a connection or click New connection to create a new platform-level connection.
If you selected an existing connection, click Next and continue with step 3.
On the New connection page, select a connection type.
Provide the connection details. Enter a name for the new connection. Optionally, provide a description for the new connection. The more detailed the information is, the easier it will be to pick the appropriate connection later.
Provide any additional information as required for the selected connection type. For more information, see Connecting to data sources.
In the Discovery root field, select the asset on which you want to start the discovery. Click Browse and select assets.
Alternatively, type the discovery root manually. For example, for the HDFS connection type, specify a path to the root folder, as in /apps/hive/warehouse. Additionally, you can specify file extensions, like csv or txt. For a database connection type, specify schemas or database tables. You can discover a full database, or individual schemas and database tables. For JDBC connection types, the database name to use might differ from the actual database name, depending on the JDBC driver. For example, for Db2 you might need to use db2, and for Oracle ibm. Provide the value in the following format:
- To discover a full database, leave the field blank.
- To discover a schema, enter
- To discover a database table, enter
- To discover more than one schema or table, separate items with semi-colons, as in
Select the discovery options that you want to run during the discovery process.
Select the data quality project where you want to add the imported data. You can go to the project overview page by clicking the settings icon, and configure the project settings. You can set column analysis parameters, data sample settings, engine settings, or data quality settings. For more information, see the Project settings topic.
Note: Sampling settings that are specified in the data quality project are used only when no sampling settings are specified for the automated discovery job.
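The precedence of sampling settings (job level over project level over global) can be pictured with a short Python sketch. The class and function names are hypothetical and exist only for illustration; they are not part of any product API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SamplingSettings:
    """Hypothetical container for a sampling configuration."""
    method: str        # "Sequential", "Nth", or "Random"
    sample_size: int


def effective_sampling(
    global_settings: SamplingSettings,
    project_settings: Optional[SamplingSettings] = None,
    job_settings: Optional[SamplingSettings] = None,
) -> SamplingSettings:
    """Pick the settings that apply to a discovery job.

    Job-level settings win; project settings apply only when the job
    specifies none; global settings apply when neither is set.
    """
    if job_settings is not None:
        return job_settings
    if project_settings is not None:
        return project_settings
    return global_settings


# Example values; the global default shown here is an assumption.
global_default = SamplingSettings("Sequential", 1000)
project = SamplingSettings("Nth", 5000)
job = SamplingSettings("Random", 2000)

print(effective_sampling(global_default, project, job).method)   # Random
print(effective_sampling(global_default, project).method)        # Nth
print(effective_sampling(global_default).method)                 # Sequential
```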
The discovery process might take a while, depending on the amount of data to import and analyze. Click the Refresh icon until the results are displayed. You can also check the status of the discovery in the Discovery results tab.
If you no longer need to run this discovery, you can cancel it while the analysis phase is in progress. You cannot cancel the discovery during the metadata import. You can cancel the analysis of individual assets or of all assets that are being analyzed. Open the discovery job details in the automated discovery results page, and click the cancel icon, either for all assets or, in the Actions column, for selected assets.