Discretization for numeric fields

In a real scenario, the fields "checking account" and "saving account" contain many more distinct balances than the following table shows:
Table 1. The most inflexible table layout - advanced
Group ID | checking account | saving account | credit card | loan      | custody account
Smith    | 122.76           | 25,183.15      | 1           | -         | -
Jackson  | -3,000           | -              | 5           | long term | high activity
Douglas  | 11,877.43        | -              | -           | -         | low activity

The probability that two customers have the same balance is rather low. Therefore, an Associations mining run or a Sequence Rules mining run on these numeric fields might be useless, because the balance is part of the item name: almost every customer generates items that no other customer shares.

To circumvent this problem, Intelligent Miner® provides a discretization mechanism for numeric fields. By default, if a numeric field contains more than 100 different values, the value range is divided into buckets. If a numeric field is discretized, the resulting item names might look like this:
The leftmost interval
[field name < boundary {=1(5)}]
The intervals in the middle
[field name >= lowerBoundary AND < upperBoundary {=2(5)}]
The rightmost interval
[field name >= boundary {=5(5)}]
For example, if there is a table with more than 100 different balances in the field 'checking account', the following items might result from the field 'checking account':
[checking account <-5000 {=1(5)}]
[checking account >=-5000 AND <0 {=2(5)}]
[checking account >=0 AND <5000 {=3(5)}]
[checking account >=5000 AND <10000 {=4(5)}]
[checking account >=10000 {=5(5)}]
Note:
  • The following operators are preceded by blanks:
    • '>'
    • '>='
    • '<'
    • 'AND'
  • 'AND' is language-independent
  • The number format of the boundary values is language-independent. The decimal separator is a dot, for example: '122.76'
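
The naming patterns above can be sketched in a few lines of Python. This is only an illustration of the item-name format described in this section, not Intelligent Miner code; the function names `format_boundary` and `item_names` are hypothetical. The last item follows the rightmost-interval pattern and therefore has no upper boundary.

```python
def format_boundary(value):
    """Render a boundary with a dot as decimal separator and no trailing '.0'."""
    number = float(value)
    return str(int(number)) if number.is_integer() else repr(number)


def item_names(field_name, boundaries):
    """Build one item name per bin from a sorted, non-empty list of boundaries."""
    total = len(boundaries) + 1  # n boundaries delimit n + 1 bins
    bounds = [format_boundary(b) for b in boundaries]
    # Leftmost interval: everything below the first boundary.
    names = [f"[{field_name} <{bounds[0]} {{=1({total})}}]"]
    # Intervals in the middle: lower boundary inclusive, upper boundary exclusive.
    for index in range(1, len(bounds)):
        names.append(
            f"[{field_name} >={bounds[index - 1]} AND <{bounds[index]} "
            f"{{={index + 1}({total})}}]"
        )
    # Rightmost interval: everything from the last boundary upward.
    names.append(f"[{field_name} >={bounds[-1]} {{={total}({total})}}]")
    return names


for name in item_names("checking account", [-5000, 0, 5000, 10000]):
    print(name)
```

Four boundaries yield five items, numbered 1(5) through 5(5).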
Discretization into 5 bins based on mean values and standard-deviation values
By default, numeric item fields of type floating point or integer that contain more than 100 different values are binned into 5 bins. The bin boundaries are determined like this:
  1. Up to 10000 training data records are read. These records provide up to 10000 numeric values of the item field to be binned.
  2. From these values, the mean value (m) and the standard deviation (s) are calculated. Both values are rounded to avoid too many trailing digits. For example, a value of 71.64198 might be rounded to 71.6.
  3. Based on the rounded values, the following bins are created:
    (-∞, m-3s/2)
    The value is far below the average value.
    [m-3s/2, m-s/2)
    The value is below the average value.
    [m-s/2, m+s/2)
    The value is close to the average value.
    [m+s/2, m+3s/2)
    The value is above the average value.
    [m+3s/2, +∞)
    The value is far above the average value.
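
The three steps above can be sketched as follows. This is a minimal Python illustration, not the product's implementation. Two details are assumptions because this section does not specify them: the rounding rule (three significant digits here, which reproduces the 71.64198 to 71.6 example) and the use of the population standard deviation. The names `round_sig` and `default_bin_boundaries` are hypothetical.

```python
import math
import statistics
from itertools import islice


def round_sig(value, digits=3):
    """Round to a number of significant digits (assumed rounding rule)."""
    if value == 0:
        return 0.0
    return round(value, digits - 1 - math.floor(math.log10(abs(value))))


def default_bin_boundaries(values, max_samples=10000):
    """Return the four boundaries that separate the five default bins."""
    # Step 1: read at most max_samples values of the item field.
    sample = list(islice(values, max_samples))
    # Step 2: compute and round the mean and the standard deviation.
    m = round_sig(statistics.fmean(sample))
    s = round_sig(statistics.pstdev(sample))  # population std dev: an assumption
    # Step 3: boundaries at m - 3s/2, m - s/2, m + s/2, m + 3s/2.
    return [m - 1.5 * s, m - 0.5 * s, m + 0.5 * s, m + 1.5 * s]


print(default_bin_boundaries([0, 10, 20, 30, 40]))
```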
Customizing the default discretization mode
You can change the default binning method like this:
  • You can change the default number of records to be read for the determination of the mean value and the standard-deviation value by using the parameter <NumSampleRecordsForBins> in the method DM_setAlgorithm.

    For example, the following command increases the number of records to be read for calculating the mean value and the standard-deviation value for the default algorithm SIDE to 20000:

    DM_RuleSettings..DM_setAlgorithm
       ('SIDE','<NumSampleRecordsForBins>20000</NumSampleRecordsForBins>')
  • You can change the default number of bins for all numeric fields by using the parameter <NumBins> in the method DM_setAlgorithm.

    For example, the following command increases the default number of bins for the default algorithm SIDE to 12 and the number of records to be read to 20000:

    DM_RuleSettings..DM_setAlgorithm
       ('SIDE','<NumBins>12</NumBins>
                <NumSampleRecordsForBins>20000</NumSampleRecordsForBins>')
  • You can change the number of bins for single item fields by using the method DM_setFldNumBins.

    For example, the following command increases the number of bins for the field INCOME to 20:

    DM_LogicalDataSpec..DM_setFldNumBins('INCOME',20)
Depending on the actual number N of bins that is specified for a given item field, the bin widths and the bin boundaries are adjusted like this:
If N < 8
Bin width = standard deviation. Bins are symmetrically centered around mean.
If 7 < N < 15
Bin width = standard deviation/2. Bins are symmetrically centered around mean.
If N > 14
Bin width = standard deviation/4. Bins are symmetrically centered around mean.
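
The width rule above can be expressed as a small Python sketch. The mapping from N to the bin width follows the three cases listed; how the boundaries are placed around the mean is an assumption (N-1 boundaries spaced one bin width apart and symmetric around the mean), chosen because it reproduces the default boundaries m-3s/2, m-s/2, m+s/2, and m+3s/2 for N = 5. The function names are hypothetical.

```python
def bin_width(num_bins, std_dev):
    """Bin width as a function of the number of bins N."""
    if num_bins < 8:
        return std_dev        # N < 8
    if num_bins < 15:
        return std_dev / 2    # 7 < N < 15
    return std_dev / 4        # N > 14


def symmetric_boundaries(mean, std_dev, num_bins):
    """Assumed placement: N - 1 boundaries, symmetric around the mean."""
    width = bin_width(num_bins, std_dev)
    return [mean + (k - num_bins / 2) * width for k in range(1, num_bins)]


# With mean 10 and standard deviation 2, five bins give 7, 9, 11, 13.
print(symmetric_boundaries(10.0, 2.0, 5))
```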
Discretizing into equidistant bins
  • You can specify a lower boundary value and an upper boundary value for equidistant bins by using the command DM_setFldOutlLim.

    For example, if you want to define equidistant bins for the field INCOME between the lower boundary value 10000 and the upper boundary value 70000, you can use the following command:

    DM_LogicalDataSpec..DM_setFldOutlLim
       ('INCOME',10000,70000)

    The command above creates 5 bins. There are 2 outlier bins and 3 equidistant bins between 10000 and 70000:

    • <10000 (outlier bin)
    • [10000, 30000)
    • [30000, 50000)
    • [50000, 70000)
    • ≥70000 (outlier bin)
  • You can combine the commands DM_setFldOutlLim and DM_setFldNumBins to modify the number of equidistant bins.

    For example, the following commands create 6 equidistant bins of width 10000 between 10000 and 70000 plus two outlier bins for the field INCOME:

    DM_LogicalDataSpec..DM_setFldOutlLim('INCOME',10000,70000) 
    DM_LogicalDataSpec..DM_setFldNumBins('INCOME',8)
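
The arithmetic behind both examples can be checked with a short Python sketch. It assumes, as the examples suggest, that two of the requested bins are always reserved for outliers and the remaining bins split the range evenly. The function name `equidistant_boundaries` is hypothetical and is not part of the product API.

```python
def equidistant_boundaries(lower, upper, num_bins=5):
    """Return the boundaries between lower and upper, both included."""
    inner_bins = num_bins - 2  # two bins are reserved for outliers
    if inner_bins < 1:
        raise ValueError("at least 3 bins are needed: 2 outlier bins and 1 inner bin")
    width = (upper - lower) / inner_bins
    return [lower + i * width for i in range(inner_bins + 1)]


# Default of 5 bins: 3 inner bins of width 20000.
print(equidistant_boundaries(10000, 70000))
# 8 bins: 6 inner bins of width 10000.
print(equidistant_boundaries(10000, 70000, 8))
```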
Changing the threshold for discretization
By default, the discretization of numeric fields starts if a numeric field includes more than 100 different values. You can change this value by using the power option parameter -MAX_DISCR_COUNT.

For example, if you use the following command, the discretization of the numeric fields starts if more than 20 different values are found:

DM_RuleSettings..DM_setPowerOptions('-MAX_DISCR_COUNT 20')
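
The effect of the threshold can be illustrated with a minimal Python sketch. It only models the rule stated above (discretize when the number of different values exceeds the threshold). The function name `needs_discretization` is hypothetical, and the way the product actually counts values is not described in this section.

```python
def needs_discretization(values, max_discr_count=100):
    """True if the field has more distinct values than the threshold."""
    return len(set(values)) > max_discr_count


balances = [float(i) for i in range(50)]    # 50 different values
print(needs_discretization(balances))       # default threshold of 100
print(needs_discretization(balances, 20))   # threshold lowered to 20
```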