Running automated discovery
Discover assets to add data to the catalog. During discovery, the data is imported, analyzed, and classified.
Before you begin
- A connection to the InfoSphere® Information Analyzer analysis database (iadb) and a new import area that you create by using InfoSphere Metadata Asset Manager. If the connection is not named iadb, you must follow step 4 to specify the new connection name.
- A project. For more information, see the Working in projects and data assets topic.
- The following security roles:
- Common Metadata Importer or Common Metadata Administrator role to register or import metadata.
- Data Administrator user role to add data assets to a project.
- Data Operator role at the project level to run an analysis.
- Business Analyst role at the project level to publish analysis results.
- InfoSphere DataStage® credentials mapped so that underlying InfoSphere DataStage jobs can run successfully.
About this task
When you run the automated discovery, metadata is imported into the catalog. You can choose any
of the following options:
- Analyze columns - Runs a column analysis on the data assets to identify properties like data class, data type, and format.
- Assign terms - Assigns business terms to imported assets based on name similarity, data classification, and machine learning insights. In the discovery results, you can publish selected term assignments or all of them.
- Analyze data quality - Runs a quality analysis on the data assets to scan for common dimensions of data quality, such as missing values or data class violations. Optionally, as part of the quality analysis, you can specify one of the following sampling techniques:
- Sequential (first n rows)
- The sample includes the first n records, where n is the sample size that you specify. For example, if you have 1,000,000 records and specify a sample size of 2,000, the sample includes the first 2,000 records.
- Nth (use every nth row)
- The sample selects every nth record, where n is the interval that you specify, until the sample size is reached. For example, if you have 1,000,000 records and specify a sample size of 2,000 with an interval of 10, every 10th record is selected and at most 20,000 records (2,000 * 10) are read to retrieve the sample size of 2,000.
- Random (randomized selection of rows)
- The sample randomly selects records until the sample size is reached. The maximum number of records read is (100/sample_percent)*sample_size*2; the factor of 2 ensures that enough records are read to produce a valid random sample. For example, if you have 1,000,000 records and specify a sample size of 2,000 and a percent of 5, the sample returns 2,000 records and reads, at most, 80,000 records ((100/5)*2,000*2=80,000).
- Publish results to catalog - Publishes discovery results to the catalog to make them available to other users. Results include term assignments, data class assignments, quality score, and other quality-related statistics.
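The three sampling techniques above can be illustrated with a short sketch. This is not the product's implementation; it is a minimal Python illustration of the documented behavior, with the function names and the list of integer records chosen for this example only.

```python
import random

def sequential_sample(records, sample_size):
    # Sequential: take the first n records, where n is the sample size.
    return records[:sample_size]

def nth_sample(records, sample_size, interval):
    # Nth: select every nth record until the sample size is reached.
    # At most sample_size * interval records are read.
    max_reads = min(len(records), sample_size * interval)
    return [records[i] for i in range(0, max_reads, interval)]

def random_sample(records, sample_size, sample_percent):
    # Random: select each record with probability sample_percent/100.
    # At most (100 / sample_percent) * sample_size * 2 records are read;
    # the factor of 2 gives headroom so a full sample is usually produced.
    max_reads = int((100 / sample_percent) * sample_size * 2)
    sample = []
    for record in records[:max_reads]:
        if len(sample) == sample_size:
            break
        if random.random() < sample_percent / 100:
            sample.append(record)
    return sample

records = list(range(1_000_000))
print(len(sequential_sample(records, 2000)))  # 2000
print(nth_sample(records, 2000, 10)[:3])      # [0, 10, 20]
print(len(random_sample(records, 2000, 5)))   # up to 2000
```

With 1,000,000 records, a sample size of 2,000, an interval of 10, and a percent of 5, these functions reproduce the worked examples above: the nth sample reads at most 20,000 records, and the random sample reads at most 80,000.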
- Supported connectors and data sources
- You can run automated discovery for Amazon S3, BigQuery, Db2, Greenplum, HDFS, Hive, Microsoft SQL Server, MongoDB, Netezza, Oracle, PostgreSQL, Snowflake, and Teradata data sources, as well as data sources that you connect to through ODBC, JDBC, or File Connector connections. For information about running automated discovery from the command line, see Using automated discovery from the command line.
- Required project roles
- To analyze columns and data quality, you must have the Data Operator project role in the project that you select when starting a discovery.