Random sampling concepts
In general, IBM Knowledge Catalog supports the sampling types random, row, and block. Several conditions determine how a sample is composed.
For connected data assets, the service checks whether the connector supports pushing sampling down to the data source. If the connector supports the requested sampling type, sampling happens at the data source.
If the connector does not support any of these sampling types, the sample is generated as follows:
- If the total number of records (actual or approximated) in the data asset is available, Bernoulli sampling is used.
- The percentage of records to be sampled is calculated by using this formula: `(requested_sample_size / total_number_of_records) * 100`
- Records are read in batches of 10,000, and records are randomly picked from each batch at the calculated percentage.
By default, the total record count is not retrieved during profiling. An administrator can enable this option for the Cloud Pak for Data deployment.
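The batch-wise Bernoulli sampling described above can be sketched as follows. This is a minimal illustration of the technique, not the product's actual implementation; the function and parameter names are assumptions.

```python
import random

def bernoulli_sample(records, requested_sample_size, total_number_of_records,
                     batch_size=10_000, seed=None):
    """Illustrative sketch: pick each record independently with probability
    requested_sample_size / total_number_of_records, batch by batch."""
    rng = random.Random(seed)
    # Sampling percentage: (requested_sample_size / total_number_of_records) * 100,
    # used here as a probability in [0, 1].
    probability = requested_sample_size / total_number_of_records
    sample = []
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            # Keep each record in the batch with the calculated probability.
            sample.extend(r for r in batch if rng.random() < probability)
            batch = []
    # Process the final, possibly partial batch.
    sample.extend(r for r in batch if rng.random() < probability)
    return sample
```

Because each record is picked independently, the resulting sample size is close to, but not exactly, the requested size.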
- If the total number of records is not available, a percentage for Bernoulli sampling cannot be calculated. In that case, 80% of the records in each batch of 10,000 read records are picked for the sample until the required sample size is reached.
For example, if you have a table with 10,000,000 records and a random sample of 50,000 records is needed, 80% of the records are picked from each batch of 10,000 records, which makes 8,000 records per batch. So, to get a sample of 50,000 records, about 7 batches of 10,000 records are read (50,000 / 8,000 = 6.25, rounded up to 7).