Random split operator

With the Random Split operator, you can split the data of a virtual table into two parts: A training data set and a test data set.

In the majority of cases, the Random Split operator is used in conjunction with the Predictor operator. For this operator, you typically have an input table with historical data. Therefore, you might want to split this table into the following disjoint data sets:

One data set to train the prediction model. This data set is available via the Training Output port.
One data set to test the prediction model. This data set is available via the Test Output port.

To control the distribution of the data between training and test data set, you specify the percentage value for the test data set.

Stratified Sampling

With Stratified Sampling, you can create an equally distributed training sample partition based on the column that is selected for stratified sampling (Stratified Sampling column). Because this technique tries to sample each value with approximately the same frequency as the Stratified Sampling column, it is called disproportional stratified sampling. In proportional stratified sampling, the original frequency is maintained. Proportional stratified sampling is not supported by Data warehousing in Db2.

Typically, the Stratified Sampling column is the column that is selected as the target column in a classifier. The stratified sampling is performed only for the training partition. The test partition is sampled based on the original data distribution. Otherwise, an independent evaluation of the model is not possible.

Creating a stratified sample is useful if the values of the target column in a classification task are not balanced. For example, in a two-class classification task, the first value occurs with a frequency of 99% and the second value occurs with a frequency of only 1%. Some classifiers might predict always the first value. This results in a very good accuracy of 99%. However, often, this might not be desired because the 1% value, for example, fraudulent customers, is the important value to predict. With stratified sampling, you can give both values the same weight during the training run. If the total number of the value with the lowest frequency is too small, the training data is not representative enough to create a good data mining model.

Example: A data set contains 100 records. The Gender column is specified as the Stratified Sampling column. In 80% of the cases, females occur. In 20% of the cases, men occur. The test partition is set to a size of 40%.

The test partition is sampled from the complete data set to represent the overall distribution. This results in a test partition of a size of 40 with 32 females (80%) and 8 males (20%) . Because random functions are used, the real numbers might be slightly different. The rest of the data set that is available for the training partition is divided into multiple data sets based on the available values for the Stratified Sampling column.

In this example, 2 partitions are created. One partition contains 48 (80%) records for females. The other partition contains 12 (20%) records for men. Now the disproportional stratification takes place by sampling about 12 records from the females to assure an equally distributed training set for all values of the Stratified Sampling column. Therefore 12 females and 12 males are included in the training partition and 32 females and 8 males are included in the test partition.

Recommendation: Applying models that are computed on a stratified sample should be done with care because the confidence values that are computed by the classification model are not correct on a data set that does not have the same characteristics as the stratified sample. Correcting these confidence values requires expert knowledge in data mining. Therefore, it is recommended to use the mining operators on a non-stratified sample. They automatically take into account the differences of the frequencies of the target values.

If you do not specify the Stratified Sample column, the training data set contains the records of the complement of the test data set.

The following mining flow shows how you can split your data into a training and a test data set, create a prediction model from the training data, and use a tester operator and the test data set to test the model.

Figure 1. Example that shows how to split data into a training and a test data set

Example that shows how to split data into a training data set and a test data set

Feedback