Running a quick scan (Watson Knowledge Catalog)
Run a quick scan when you don’t know your data very well and want to analyze large data assets to get a general overview of the data quality.
When you run a quick scan, only a data sample is analyzed, and assets aren’t added to the default catalog. The sample is by default the first 1000 records of each data asset. You can modify this sample size.
When you review the results, you can edit discovered term assignments. All other results are read-only. After you review the results, you can add selected data assets to a catalog of your choice. You can also publish assets to more than one catalog.
Quick scan runs the following tasks:
- Analyze columns - Examines the properties and characteristics of columns in a data asset and finds matching data classification. Column analysis is run by default and cannot be disabled.
- Analyze data quality - Identifies the common data quality problems and computes a data quality score for data assets and columns. Data quality analysis is optional. If you skip data quality analysis, you can also decide whether you want to run automatic term assignment.
- Assign terms - Assigns business terms to discovered assets based on name similarity and data classification. If you run data quality analysis, term assignment runs automatically and cannot be skipped. In either case, selecting Use machine learning to assign terms is optional. This option enables a machine learning model that can produce more accurate term assignments, but it might slow down the discovery process.
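The dependencies between the three tasks can be sketched in a few lines of Python. This is only an illustration of the rules described above, not a product API: the `QuickScanOptions` class and the `effective_options` function are hypothetical names, and the rule that the machine learning option has no effect when term assignment is skipped is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuickScanOptions:
    """Hypothetical container for the quick scan options (not a product API)."""
    analyze_columns: bool = True        # all options are selected by default
    analyze_data_quality: bool = True
    assign_terms: bool = True
    use_machine_learning: bool = True


def effective_options(requested: QuickScanOptions) -> QuickScanOptions:
    """Apply the documented rules to a requested set of options."""
    # Term assignment is forced whenever data quality analysis runs.
    assign_terms = requested.assign_terms or requested.analyze_data_quality
    return QuickScanOptions(
        analyze_columns=True,  # column analysis cannot be disabled
        analyze_data_quality=requested.analyze_data_quality,
        assign_terms=assign_terms,
        # Assumption: the machine learning option only matters if terms are assigned.
        use_machine_learning=requested.use_machine_learning and assign_terms,
    )
```

For example, a request that disables column analysis and term assignment but keeps data quality analysis would still end up with both column analysis and term assignment enabled.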
Required permissions
To run a quick scan, you must have these user permissions:
- Analyze data quality
- Discover assets
In addition, you must have the Business Analyst and the Data Operator project roles in the project that you select when starting a quick scan.
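As a rough illustration of these prerequisites, the following Python sketch reports which permissions and project roles a user is still missing. The function name `missing_prerequisites` and the idea of passing permissions and roles as plain strings are assumptions made for this example. The product does not necessarily expose them in this form.

```python
# Names taken from the requirements listed above.
REQUIRED_PERMISSIONS = frozenset({"Analyze data quality", "Discover assets"})
REQUIRED_PROJECT_ROLES = frozenset({"Business Analyst", "Data Operator"})


def missing_prerequisites(permissions, project_roles):
    """Return the required permissions and project roles the user lacks.

    Hypothetical helper: both arguments are iterables of display names.
    """
    return {
        "permissions": sorted(REQUIRED_PERMISSIONS - set(permissions)),
        "project_roles": sorted(REQUIRED_PROJECT_ROLES - set(project_roles)),
    }
```

A user can start a quick scan only when both lists in the result are empty.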
Watch the following video to see how to discover assets from an external data source by using quick scan.
This video provides a visual method as an alternative to following the written steps in this documentation.
1. Go to Governance > Data discovery > New discovery job > Quick scan.
2. Select the connection. For a list of supported connections, see Discover assets. Any connection that you created through metadata import appears in the list of existing connections.
   Choose from existing connections or add a new one:
   a. Click Select a connection, then select an existing connection or click Find or add a connection.
      If you selected an existing connection, continue with step 3.
   b. On the Add existing connection page, select a connection or click New connection to create a new platform-level connection.
      If you selected an existing connection, click Next and continue with step 3.
   c. On the New connection page, select a connection type.
   d. Provide the connection details. Enter a name for the new connection. Optionally, provide a description for the new connection. The more detailed the information is, the easier it will be to pick the appropriate connection later.
      Provide any additional information as required for the selected connection type. For more information, see Connecting to data sources.
   e. Click Create.
3. In the Discovery root field, select the asset on which you want to start the discovery. Click Browse and select assets.
   Alternatively, type the discovery root manually. For the JDBC connection types, specify schemas. You can discover a full database, individual schemas, or tables. The database name to use might differ from the actual database name, depending on the JDBC driver. For example, for Db2 you might need to use db2, and for Oracle, ibm. Provide the value in the following format:
   - To discover a full database, leave the field blank.
   - To discover a schema, enter: schema[db_name|schema_name]
   - To discover more than one schema, separate items with semicolons, as in: schema[db_name|schema_name];schema[db_name|schema_name]
   - To discover an individual table, enter: table[db_name|schema_name|table_name]
   - To discover several tables, separate items with semicolons, as in: table[db_name|schema_name|table_name];table[db_name|schema_name|table_name]
4. Specify the project where you want the discovered data assets to be loaded. Select an existing project or create a new one. If you select an existing project and have the required permissions, you can edit the project settings.
   This project serves as a working area. The discovered assets are not visible in the selected project and cannot be accessed there. They can be accessed from the quick scan results only. Only collaborators in the project have access to the quick scan job and the results. The selected project also determines the lifecycle of the quick scan job. When a project is deleted, all quick scan jobs tied to it are also deleted.
   To make the discovered assets available in a data quality project, you must publish them to the default catalog in the review step. The assets are then synced to the Information assets view, from where you can add the assets to any data quality project.
5. Optionally, change the discovery options for this job. All options are selected by default. You cannot disable column analysis or sampling, but you can change the sample size. To skip data quality analysis, clear the Analyze data quality checkbox. In this case, you can also change the settings for automatic term assignment.
6. Click Discover.
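If you type the discovery root manually, as described in the steps above, it can help to build the value programmatically to avoid typos. The following Python sketch assembles the string from the documented format. The helper names `schema_root`, `table_root`, and `discovery_root` are hypothetical, and the sample database, schema, and table names are placeholders.

```python
def schema_root(db_name, schema_name):
    """Format one schema entry as schema[db_name|schema_name]."""
    return f"schema[{db_name}|{schema_name}]"


def table_root(db_name, schema_name, table_name):
    """Format one table entry as table[db_name|schema_name|table_name]."""
    return f"table[{db_name}|{schema_name}|{table_name}]"


def discovery_root(entries):
    """Join entries with semicolons.

    An empty list yields an empty string, which corresponds to leaving
    the field blank to discover the full database.
    """
    return ";".join(entries)


# Placeholder names for illustration only.
two_schemas = discovery_root([schema_root("db2", "SALES"), schema_root("db2", "HR")])
one_table = discovery_root([table_root("db2", "SALES", "ORDERS")])
```

Here `two_schemas` holds the value for two schemas and `one_table` the value for a single table. Paste the resulting string into the Discovery root field.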
After you start a discovery job, you’re taken to the Pending analysis tab of the project selected for the job. Here you can see the jobs that are queued for analysis or in progress. While a job is in progress, you can pause it at any time and resume it from the same point later. Discovery jobs can take some time. You can leave the page at any time and return later to check whether your job completed. Alternatively, you can wait and refresh the page from time to time. After a job completes, it is moved to the Action required tab, regardless of whether it completed successfully or failed.
You can rerun discovery jobs at any time, even failed jobs. Rerun discovery to check whether data changed after you last ran the discovery. Select a job and click Discover again. The discovery job is started again and is moved to the Pending analysis tab. During rediscovery, only new and changed data assets are analyzed. Deletions are not taken into account. The latest glossary information is applied, which means that any additions or updates to terms and data classes are reflected as appropriate. Data assets without changes are skipped during rediscovery. To reanalyze data assets in the discovery scope that haven’t changed since the last discovery run, you must create a new quick scan job with the same settings as the original one.
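The rediscovery rules can be illustrated with a short Python sketch. It assumes that each data asset has some change marker, called a fingerprint here. How the product actually detects changes is not described in this topic, so the fingerprint and the function name `assets_to_reanalyze` are assumptions made for this example.

```python
def assets_to_reanalyze(previous, current):
    """Select the assets that a rediscovery run would analyze.

    previous and current map asset names to a hypothetical fingerprint
    taken during the last run and the current run, respectively.
    """
    selected = []
    for name, fingerprint in current.items():
        # New assets and changed assets are analyzed; unchanged ones are skipped.
        if name not in previous or previous[name] != fingerprint:
            selected.append(name)
    # Assets that exist only in previous were deleted at the source.
    # Deletions are not taken into account, so they are ignored here.
    return sorted(selected)
```

With this logic, an unchanged asset is never selected again, which is why reanalyzing unchanged assets requires a new quick scan job.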
If the discovery job completed successfully, review and work with the results. For more information, see the Reviewing and working with the quick scan results topic.