Running automated discovery

Discover assets to add data to the catalog. During discovery, the data is imported, analyzed, and classified.

Before you begin

To discover data, you must have:
  • A connection to the InfoSphere® Information Analyzer analysis database (iadb) and a new import area that you create by using InfoSphere Metadata Asset Manager. If the connection name is not iadb, then you must follow step 4 to specify the new connection name.
  • A project. For more information, see the Working in projects and data assets topic.
  • The following security roles:
    • Common Metadata Importer or Common Metadata Administrator role to register or import metadata.
    • Data Administrator user role to add data assets to a project.
    • Data Operator role at the project level to run an analysis.
    • Business Analyst role at the project level to publish analysis results.
  • InfoSphere DataStage® credentials mapped so that underlying InfoSphere DataStage jobs can run successfully.

About this task

When you run the automated discovery, metadata is imported into the catalog. You can choose any of the following options:
  • Analyze columns - Runs a column analysis on the data assets to identify properties like data class, data type, and format.
  • Assign terms - Assigns business terms to imported assets based on name similarity, data classification, and machine learning insights. In the discovery results, you can publish selected term assignments or all of them.
  • Analyze data quality - Runs a quality analysis on the data assets to scan for common dimensions of data quality like missing values or data class violations. Optionally, as part of the quality analysis, you can specify one of the following sampling techniques:
    Sequential (first n rows)
    The sample includes the first n records that you specify. For example, if you have 1,000,000 records, and you specify a sample size of 2,000, then the sample includes the first 2,000 records.
    Nth (use every nth row)
    The sample selects every nth record, where n is the interval that you specify, until the sample size is reached. For example, if you have 1,000,000 records and specify a sample size of 2,000 with an interval of 10, then a maximum of 20,000 records are read (2,000*10), and every 10th record is selected to reach the sample size of 2,000.
    Random (randomized selection of rows)
    The sample randomly selects records until your sample size is reached. The maximum number of records that are read to draw the sample is (100/sample_percent)*sample_size*2. The factor of 2 ensures that enough records are read to produce a valid random sample. For example, if you have 1,000,000 records and you specify a sample size of 2,000 and a percent of 5, the sample returns 2,000 records and reads, at most, 80,000 records ((100/5)*2,000*2=80,000).
  • Publish results to catalog - Publishes discovery results to the catalog to make them available to other users. Results include term assignments, data class assignments, quality score, and other quality-related statistics.
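
The arithmetic behind the three sampling techniques can be checked with a short Python sketch. It only reproduces the examples given in the list above and does not call any product API. The function name max_records_read and its parameters are hypothetical, and the sketch assumes that sequential sampling reads no more than the sample size. It also ignores the total number of records in the source.

```python
import math


def max_records_read(technique, sample_size, interval=None, sample_percent=None):
    """Return the upper bound on records read for a sampling technique.

    Hypothetical helper that mirrors the documented examples only.
    """
    if technique == "sequential":
        # Assumption: only the first n records are read.
        return sample_size
    if technique == "nth":
        # Every nth record is kept until the sample size is reached.
        return sample_size * interval
    if technique == "random":
        # (100 / sample_percent) * sample_size * 2, written as a single
        # division to limit floating-point rounding effects.
        return math.ceil(100 * sample_size * 2 / sample_percent)
    raise ValueError(f"unknown sampling technique: {technique}")


print(max_records_read("sequential", 2_000))                # 2000
print(max_records_read("nth", 2_000, interval=10))          # 20000
print(max_records_read("random", 2_000, sample_percent=5))  # 80000
```

The three printed values match the sequential, nth, and random examples in the list: 2,000, 20,000, and 80,000 records.
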
Supported connectors and data sources
You can run automated discovery for Amazon S3, BigQuery, Db2, Greenplum, HDFS, Hive, Microsoft SQL Server, MongoDB, Netezza, Oracle, PostgreSQL, Snowflake, and Teradata data sources, and data sources that you connect to through the ODBC, JDBC, or File Connector connections. For information about running automated discovery from the command line, see Using automated discovery from the command line.
Required project roles
To analyze columns and data quality, you must have the Data Operator project role in the project that you select when starting a discovery.
To publish analysis results, you must have the Business Analyst project role in the project that you select when starting a discovery.

Procedure

  1. Go to Connections > Data discovery > New discovery job > Automated discovery.
  2. Select the data connection that you want to discover data for.
  3. In the Discovery root field, select the asset on which you want to start the discovery. Click Browse and select assets.
    Alternatively, type the discovery root manually. For example, for the HDFS connection type, specify a path to the root folder, as in /apps/hive/warehouse. Additionally, you can specify file extensions, like csv or txt. For a database connection type, specify schemas or database tables. You can discover a full database, or individual schemas and database tables. For JDBC connection types, the database name to use might differ from the actual database name, depending on the JDBC driver. For example, you might need to use db2 for Db2 and ibm for Oracle. Provide the value in the following format:
    • To discover a full database, leave the field blank.
    • To discover a schema, enter schema[db_name|schema_name].
    • To discover a database table, enter table[db_name|schema_name|table_name].
    • To discover more than one schema or table, separate items with semicolons, as in schema[db_name|schema_name];table[db_name|schema_name|table_name].
  4. Select the discovery options that you want to run during the discovery process.
  5. Select the host.
  6. Select the project where you want to add the imported data. Click the settings icon to go to the project settings and modify the analysis settings. You can set column analysis parameters, data sample settings, engine settings, or data quality settings. For more information, see the Project settings topic.
    Note: Sampling settings that are specified in the project are used only when no sampling settings are specified in the Discover assets view.
  7. Click Discover.
    The discovery process might take a while, depending on the amount of data to import and analyze. Click the Refresh icon until the results are displayed.
  8. Optional: If you no longer need to run this discovery, you can cancel it while the analysis phase is in progress. You can't cancel the discovery during the metadata import. You can cancel the analysis of individual assets or of all assets that are being analyzed. Open the discovery job details in Connections > Data discovery > View automated discovery results and click the cancel icon, either for all assets or for selected assets in the Actions column.
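
If you type the discovery root for a database connection manually (step 3), the value can be assembled with a small helper such as the following Python sketch. The function names and the database, schema, and table names are hypothetical and serve only as illustration. The sketch builds the string in the format that step 3 describes and does not validate it against a real connection.

```python
def schema_item(db_name, schema_name):
    """Format one schema entry: schema[db_name|schema_name]."""
    return f"schema[{db_name}|{schema_name}]"


def table_item(db_name, schema_name, table_name):
    """Format one table entry: table[db_name|schema_name|table_name]."""
    return f"table[{db_name}|{schema_name}|{table_name}]"


def discovery_root(*items):
    """Join entries with semicolons.

    With no entries the result is an empty string, which corresponds
    to leaving the field blank to discover the full database.
    """
    return ";".join(items)


# Hypothetical names, for illustration only.
root = discovery_root(
    schema_item("db2", "SALES"),
    table_item("db2", "HR", "EMPLOYEES"),
)
print(root)  # schema[db2|SALES];table[db2|HR|EMPLOYEES]
```

The resulting string is the value to enter in the Discovery root field.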

What to do next

If you selected the option to publish the results to the catalog automatically, you can search for the assets in the catalog.