Discretization for numeric fields
| Group ID | checking account | saving account | credit card | loan | custody account |
|---|---|---|---|---|---|
| Smith | 122.76 | 25,183.15 | 1 | - | - |
| Jackson | -3,000 | - | 5 | long term | high activity |
| Douglas | 11,877.43 | - | - | - | low activity |
The probability that two customers have exactly the same balance is rather low. An Associations mining run or a Sequence Rules mining run on these numeric fields is therefore of little use, because the balance becomes part of the item name and almost every customer generates his or her own items.
- The leftmost interval
- [field name < boundary {=1(5)}]
- The intervals in the middle
- [field name >= lowerBoundary AND < upperBoundary {=2(5)}]
- The rightmost interval
- [field name >= boundary {=5(5)}]
For example, the following items might result from the field 'checking account':
- [checking account <-5000 {=1(5)}]
- [checking account >=-5000 AND <0 {=2(5)}]
- [checking account >=0 AND <5000 {=3(5)}]
- [checking account >=5000 AND <10000 {=4(5)}]
- [checking account >=10000 AND <15000 {=5(5)}]
- The following operators are preceded by blanks:
- '>'
- '>='
- '<'
- 'AND'
- 'AND' is language-independent
- The number format of the boundary values is language-independent. The decimal separator is a dot, for example: '122.76'
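The naming pattern and the formatting rules above can be sketched in a few lines of Python. This is only an illustration: the helper names `format_boundary` and `item_names` are hypothetical and are not part of the product. The sketch follows the general pattern, so the last bin is rendered as the open-ended rightmost interval.

```python
def format_boundary(value):
    """Render a boundary with a dot as decimal separator and no grouping."""
    number = float(value)
    return str(int(number)) if number.is_integer() else repr(number)


def item_names(field, boundaries):
    """Return one item name per bin for the sorted inner `boundaries`."""
    total = len(boundaries) + 1  # n boundaries delimit n + 1 bins
    names = []
    for index in range(1, total + 1):
        if index == 1:
            # leftmost interval: everything below the first boundary
            condition = f"<{format_boundary(boundaries[0])}"
        elif index == total:
            # rightmost interval: everything from the last boundary upward
            condition = f">={format_boundary(boundaries[-1])}"
        else:
            lower = format_boundary(boundaries[index - 2])
            upper = format_boundary(boundaries[index - 1])
            condition = f">={lower} AND <{upper}"
        # operators are preceded by a blank; 'AND' is never translated
        names.append(f"[{field} {condition} {{={index}({total})}}]")
    return names


for name in item_names("checking account", [-5000, 0, 5000, 10000]):
    print(name)
```

Because the bin index and the bin count are part of the name, two customers whose balances fall into the same interval produce the same item.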
- Discretization into 5 bins based on mean values and standard-deviation values
- By default, numeric item fields of type floating point or integer that contain more than 100 different values are binned into 5 bins. The bin boundaries are determined like this:
- Up to 10000 training data records are read. These training data records provide up to 10000 numeric values of the item field to be binned.
- From these values, the mean value (m) and the standard deviation (s) are calculated. Both values are rounded to avoid too many trailing digits. For example, a value of 71.64198 might be rounded to 71.6.
- Based on the rounded values, the following bins are created:
- (-∞, m-3s/2)
- The value is far below the average value.
- [m-3s/2, m-s/2)
- The value is below the average value.
- [m-s/2, m+s/2)
- The value is close to the average value.
- [m+s/2, m+3s/2)
- The value is above the average value.
- [m+3s/2, +∞)
- The value is far above the average value.
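The default five-bin scheme can be reproduced with a short Python sketch. It makes two assumptions that the text does not settle: the population standard deviation is used, and the rounding of m and s is skipped because the rounding rule is not specified. The names `default_boundaries`, `bin_index`, and `LABELS` are hypothetical.

```python
from bisect import bisect_right
from statistics import mean, pstdev

# Short labels for bins 1..5, paraphrasing the descriptions above
LABELS = [
    "far below average",
    "below average",
    "close to average",
    "above average",
    "far above average",
]


def default_boundaries(sample):
    """Return the four boundaries m-3s/2, m-s/2, m+s/2, m+3s/2."""
    m = mean(sample)
    s = pstdev(sample)  # assumption: population standard deviation
    return [m - 1.5 * s, m - 0.5 * s, m + 0.5 * s, m + 1.5 * s]


def bin_index(value, boundaries):
    """Return the 1-based bin number; lower boundaries are inclusive."""
    # bisect_right counts the boundaries that are <= value
    return bisect_right(boundaries, value) + 1


sample = [10, 20, 30, 40, 50]
boundaries = default_boundaries(sample)
for value in (0, 30, 100):
    print(value, LABELS[bin_index(value, boundaries) - 1])
```

In the product, only up to 10000 training records contribute to m and s, so the sample passed to `default_boundaries` would be that subset rather than the full table.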
- Customizing the default discretization mode
- You can change the default binning method like this:
- You can change the default number of records to be read for the determination of the mean value and the standard-deviation value by using the parameter <NumSampleRecordsForBins> in the method DM_setAlgorithm.
  For example, the following command increases the number of records to be read for calculating the mean value and the standard-deviation value for the default algorithm SIDE to 20000:
      DM_RuleSettings..DM_setAlgorithm ('SIDE','<NumSampleRecordsForBins>20000</NumSampleRecordsForBins>')
- You can change the default number of bins for all numeric fields by using the parameter <NumBins> in the method DM_setAlgorithm.
  For example, the following command increases the default number of bins for the default algorithm SIDE to 12 and the number of records to be read to 20000:
      DM_RuleSettings..DM_setAlgorithm ('SIDE','<NumBins>12</NumBins> <NumSampleRecordsForBins>20000</NumSampleRecordsForBins>')
- You can change the number of bins for single item fields by using the method DM_setFldNumBins.
  For example, the following command increases the number of bins for the field INCOME to 20:
      DM_LogicalDataSpec..DM_setFldNumBins('INCOME',20)
  Depending on the actual number N of bins that is specified for a given item field, the bin widths and the bin boundaries are adjusted like this:
  - If N < 8
  - Bin width = standard deviation. Bins are symmetrically centered around the mean.
  - If 7 < N < 15
  - Bin width = standard deviation/2. Bins are symmetrically centered around the mean.
  - If N > 14
  - Bin width = standard deviation/4. Bins are symmetrically centered around the mean.
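The relationship between the number of bins N and the bin width can be sketched as follows. The width rule is taken directly from the list above. The boundary layout is an assumption generalized from the default five-bin case: N bins are delimited by N-1 boundaries placed symmetrically around the mean, with the two outer bins left open-ended. Both function names are hypothetical.

```python
def bin_width(num_bins, std_dev):
    """Return the bin width for a field that uses `num_bins` bins."""
    if num_bins < 8:
        return std_dev
    if num_bins < 15:  # covers 8 through 14
        return std_dev / 2
    return std_dev / 4  # 15 or more bins


def centered_boundaries(num_bins, mean_value, std_dev):
    """Assumed layout: N-1 boundaries spaced by the bin width around the mean."""
    width = bin_width(num_bins, std_dev)
    count = num_bins - 1
    # offsets run from -(count-1)/2 to +(count-1)/2 in steps of one width
    return [mean_value + (i - (count - 1) / 2) * width for i in range(count)]


print(centered_boundaries(5, 10.0, 2.0))
```

For N = 5 this reproduces the default boundaries m-3s/2, m-s/2, m+s/2, and m+3s/2.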
- Discretizing into equidistant bins
- You can specify a lower boundary value and an upper boundary value for equidistant bins by using the command DM_setFldOutlLim.
  For example, if you want to define equidistant bins for the field INCOME between the lower boundary value 10000 and the upper boundary value 70000, you can use the following command:
      DM_LogicalDataSpec..DM_setFldOutlLim ('INCOME',10000,70000)
  The command above creates 5 bins. There are 2 outlier bins and 3 equidistant bins between 10000 and 70000:
- <10000 (outlier bin)
- [10000, 30000)
- [30000, 50000)
- [50000, 70000)
- ≥70000 (outlier bin)
- You can combine the commands DM_setFldOutlLim and DM_setFldNumBins to modify the number of equidistant bins.
  For example, the following commands create 6 equidistant bins of width 10000 between 10000 and 70000 plus two outlier bins for the field INCOME:
      DM_LogicalDataSpec..DM_setFldOutlLim('INCOME',10000,70000)
      DM_LogicalDataSpec..DM_setFldNumBins('INCOME',8)
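The arithmetic behind both equidistant examples can be checked with a small Python sketch. It assumes, as the examples suggest, that two of the requested bins are always reserved for outliers and that the remaining bins split the range evenly. The function name `equidistant_boundaries` is hypothetical.

```python
def equidistant_boundaries(lower, upper, num_bins=5):
    """Return the boundaries between the outlier bins and the inner bins."""
    inner = num_bins - 2  # two bins are reserved for outliers
    if inner < 1:
        raise ValueError("at least 3 bins are needed")
    width = (upper - lower) / inner
    return [lower + i * width for i in range(inner + 1)]


# Default of 5 bins: 3 inner bins of width 20000
print(equidistant_boundaries(10000, 70000))
# 8 bins: 6 inner bins of width 10000
print(equidistant_boundaries(10000, 70000, num_bins=8))
```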
- Changing the threshold for discretization
- By default, the discretization of numeric fields starts if a numeric field includes more than 100 different values. You can change this value by using the power option parameter -MAX_DISCR_COUNT.
  For example, if you use the following command, the discretization of the numeric fields starts if more than 20 different values are found:
      DM_RuleSettings..DM_setPowerOptions('-MAX_DISCR_COUNT 20')
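The threshold test amounts to counting distinct values. The following Python sketch illustrates the rule only; the function name `needs_discretization` is hypothetical, and the product may count distinct values on a sample rather than on all records.

```python
def needs_discretization(values, max_discr_count=100):
    """Return True if the field has more distinct values than the threshold."""
    return len(set(values)) > max_discr_count


# 21 distinct values exceed a threshold of 20 but not the default of 100
print(needs_discretization(range(21), max_discr_count=20))
print(needs_discretization(range(21)))
```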