Random sampling concepts (IBM Knowledge Catalog)
In general, the sampling types random, row, and block are supported in IBM Knowledge Catalog. Several conditions define how the sample is composed.
For connected data assets, IBM Knowledge Catalog checks whether the connector supports pushing the sampling down to the data source. If the connector supports the requested sampling type, sampling happens at the data source.
If the connector does not support any of these sampling types, the sample is generated as follows:
- If the total number of records in the data asset is available, Bernoulli sampling is used.
  - The percentage of records to be sampled is calculated by using this formula: (requested_sample_size/total_number_of_records)*100
  - Records are read in batches of 10,000 and, by using randomization, records are picked from each batch with the calculated percentage.
  By default, the total record count is not retrieved during profiling. An administrator can enable this option for the Cloud Pak for Data deployment.
- If the total number of records is not available, a percentage for Bernoulli sampling cannot be calculated. A default percentage of 80% is used for selecting the records to be included in the sample.
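The fallback behavior described above can be sketched in Python. This is an illustration only, not the product's implementation: the function names `sampling_percentage` and `bernoulli_sample` are hypothetical, and capping the percentage at 100 and treating a record count of 0 as "not available" are assumptions made for the sketch. The batch size of 10,000 and the default of 80% come from the description above.

```python
import random
from itertools import islice
from typing import Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")

BATCH_SIZE = 10_000         # batch size stated in this topic
DEFAULT_PERCENTAGE = 80.0   # default used when the total record count is unknown


def sampling_percentage(requested_sample_size: int,
                        total_number_of_records: Optional[int]) -> float:
    """Return the Bernoulli sampling percentage (hypothetical helper)."""
    # Assumption: None or 0 means the total record count is not available.
    if not total_number_of_records:
        return DEFAULT_PERCENTAGE
    percentage = (requested_sample_size / total_number_of_records) * 100
    # Assumption: cap at 100 when more records are requested than exist.
    return min(100.0, percentage)


def bernoulli_sample(records: Iterable[T], percentage: float,
                     seed: Optional[int] = None) -> Iterator[T]:
    """Read records in batches and keep each one with the given probability."""
    rng = random.Random(seed)
    probability = percentage / 100
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, BATCH_SIZE))
        if not batch:
            break
        for record in batch:
            # Each record is an independent trial, which is what makes
            # this Bernoulli sampling.
            if rng.random() < probability:
                yield record


if __name__ == "__main__":
    # 1,000 requested out of 50,000 records gives about 2 percent.
    pct = sampling_percentage(1_000, 50_000)
    sample = list(bernoulli_sample(range(50_000), pct, seed=42))
    print(pct, len(sample))
```

Because every record is selected independently, the resulting sample size only approximates the requested size. With the default percentage of 80%, the sample size depends entirely on how many records are read.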