IBM Cloud Pak® for Data Version 4.7 will reach end of support (EOS) on 31 July, 2025. For more information, see the Discontinuance of service announcement for IBM Cloud Pak for Data Version 4.X.
Upgrade to IBM Software Hub Version 5.1 before IBM Cloud Pak for Data Version 4.7 reaches end of support. For more information, see Upgrading IBM Software Hub in the IBM Software Hub Version 5.1 documentation.
Advanced data profiling (Watson Knowledge Catalog)
Advanced profiling provides more accurate results than regular profiling but takes longer to complete because large amounts of data must be processed.
Advanced data profiling is available starting with Cloud Pak for Data 4.7.2.
To run advanced data profiling on one or more assets:
- Open the metadata enrichment asset.
- On the Assets tab, select assets as required.
- Select Enrich > Run advanced data profiling from the toolbar.
- Optional: Customize settings.
- Select whether you want to write frequency distribution information to a database table and determine how many distinct values you want to capture.
Without an output table, the first 100 distinct values are captured and stored internally. You can view and download that information from the Statistics page of a column profile.
If you choose to write frequency distribution information to a table, enable the External output option. The section is prepopulated with the default enrichment settings. See Advanced profiling settings. You can change the settings as required for this individual advanced profiling run. If you change the output table, you can also set this table as the new default location, thus overwriting the previous default setting.
You can access this table by using standard database queries. Alternatively, you can add the table to your project as a data asset and access it from there. For more information, see Frequency distributions.
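As a rough illustration of querying such an output table with standard SQL, the following sketch uses Python's built-in sqlite3 module and an in-memory database. The table name `freq_dist` and its columns (`column_name`, `value`, `frequency`) are hypothetical stand-ins chosen for this example. The actual layout of the output table is described in Frequency distributions and might differ.

```python
import sqlite3

# Hypothetical stand-in for the external output table.
# The table name and column names are assumptions made for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE freq_dist (column_name TEXT, value TEXT, frequency INTEGER)"
)
conn.executemany(
    "INSERT INTO freq_dist VALUES (?, ?, ?)",
    [
        ("COUNTRY", "DE", 120),
        ("COUNTRY", "US", 340),
        ("COUNTRY", "FR", 75),
        ("STATUS", "ACTIVE", 500),
    ],
)


def top_values(connection, column_name, limit=10):
    """Return the most frequent (value, frequency) pairs for one profiled column."""
    cursor = connection.execute(
        "SELECT value, frequency FROM freq_dist "
        "WHERE column_name = ? ORDER BY frequency DESC, value LIMIT ?",
        (column_name, limit),
    )
    return cursor.fetchall()


country_top = top_values(conn, "COUNTRY", limit=2)
print(country_top)
```

Against a real output table, you would replace the in-memory connection with a connection to the database that holds the table and adjust the table and column names accordingly.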
- Select a sampling type. Basic, Moderate, and Comprehensive sampling work in the same way as for basic profiling. For Custom sampling, the options are slightly different:
  - Choose between sequential and random sampling. With sequential sampling, the first rows of a data set are selected in sequential order. With random sampling, the rows to be included are selected at random. For both methods, the defined sample size determines the maximum number of rows that are selected. Random sampling is available only for data assets from data sources that support this type of sampling.
  - Define the maximum size of the sample. For sequential sampling, you can set a fixed number of rows. For random sampling, you can specify the percentage of rows in the data set to analyze, and you can optionally set the maximum number of rows that the sample can include. You might want to set these values when you don't know the size of the data sets to be analyzed. The share of rows that is actually selected only approximates the specified percentage.
  - Select whether you want a data class to be assigned based on all values in a column or on only the most frequent values in a column. In the latter case, you can specify how many values are taken into account.
To suppress sampling, use custom sampling that is configured with random sampling and a sample size of 100%.
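The difference between the two custom sampling methods can be sketched in a few lines of Python. This is only a conceptual model, not the product's implementation: the function names, the per-row random selection, and the way the row cap is applied are all assumptions made for illustration. The sketch does show why a percentage-based random sample only approximates the requested percentage, and why a sample size of 100% keeps every row.

```python
import random
from itertools import islice


def sequential_sample(rows, max_rows):
    """Take the first max_rows rows in their original order."""
    return list(islice(rows, max_rows))


def random_sample(rows, percent, max_rows=None, seed=None):
    """Keep each row with probability percent/100, then apply an optional row cap.

    Assumption for this sketch: rows are selected independently of each other,
    so the sample size varies around the requested percentage.
    """
    rng = random.Random(seed)
    # A percentage of 100 or more keeps every row, which disables sampling.
    picked = [
        row for row in rows
        if percent >= 100 or rng.random() * 100 < percent
    ]
    if max_rows is not None and len(picked) > max_rows:
        # Thin the sample down to the cap without favoring early rows,
        # and keep the surviving rows in their original order.
        keep = sorted(rng.sample(range(len(picked)), max_rows))
        picked = [picked[i] for i in keep]
    return picked


rows = list(range(1000))
first_ten = sequential_sample(rows, 10)
approx_ten_percent = random_sample(rows, percent=10, seed=42)
capped = random_sample(rows, percent=50, max_rows=100, seed=42)
everything = random_sample(rows, percent=100)
print(len(first_ten), len(approx_ten_percent), len(capped), len(everything))
```

In this model, `approx_ten_percent` contains roughly, but not exactly, 100 rows, while `capped` never exceeds the configured maximum of 100 rows.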
- Click Run. You are notified when the analysis is complete.