In a real scenario, the fields "checking account" and "saving account"
contain many more distinct balances than the following table shows:
Table 1. The most inflexible table layout - advanced

| Group ID | checking account | saving account | credit card | loan | custody account |
|---|---|---|---|---|---|
| Smith | 122.76 | 25,183.15 | 1 | - | - |
| Jackson | -3,000 | - | 5 | long term | high activity |
| Douglas | 11,877.43 | - | - | - | low activity |

The probability that two customers have the same balance is rather
low. Therefore an Associations mining run or a Sequence Rules mining
run on these numeric fields might be useless because the balance is
part of the item name. Almost every customer generates a separate set of items.
To circumvent the problem of numeric fields, the Intelligent Miner® provides a discretization mechanism for numeric fields.
By default, if a numeric field contains more than 100 different values,
the value range is divided into buckets. If a numeric field is
discretized, the resulting item names might look like these:
- The leftmost interval
- [field name < boundary {=1(5)}]
- The intervals in the middle
- [field name >= lowerBoundary AND < upperBoundary {=2(5)}]
- The rightmost interval
- [field name >= boundary {=5(5)}]
For example, if there is a table with more than 100 different
balances in the field 'checking account', the following
items might result from the field 'checking account':
[checking account <-5000 {=1(5)}]
[checking account >=-5000 AND <0 {=2(5)}]
[checking account >=0 AND <5000 {=3(5)}]
[checking account >=5000 AND <10000 {=4(5)}]
[checking account >=10000 {=5(5)}]
Note:
- The following operators are preceded by blanks:
- 'AND' is language-independent.
- The number format of the boundary values is language-independent.
The decimal separator is a dot, for example: '122.76'
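The naming patterns above can be sketched in a few lines of Python. This is only an illustration of the three patterns (leftmost, middle, and rightmost interval), not the Intelligent Miner implementation; the function name `item_names` and its parameters are hypothetical.

```python
def item_names(field, boundaries):
    """Build one item name per bin from a sorted, non-empty list of boundaries.

    Hypothetical helper: it only mimics the naming patterns shown above.
    """
    n = len(boundaries) + 1  # total number of bins, including both open-ended bins
    # Leftmost interval: everything below the first boundary.
    names = [f"[{field} <{boundaries[0]} {{=1({n})}}]"]
    # Middle intervals: lower boundary included, upper boundary excluded.
    for i in range(len(boundaries) - 1):
        lo, hi = boundaries[i], boundaries[i + 1]
        names.append(f"[{field} >={lo} AND <{hi} {{={i + 2}({n})}}]")
    # Rightmost interval: everything from the last boundary upward.
    names.append(f"[{field} >={boundaries[-1]} {{={n}({n})}}]")
    return names


names = item_names("checking account", [-5000, 0, 5000, 10000])
for name in names:
    print(name)
```

Four boundaries yield five item names, numbered `{=1(5)}` to `{=5(5)}`.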
- Discretization into 5 bins based on mean values and standard-deviation
values
By default, numeric item fields of type floating point or integer
that contain more than 100 different values are binned into 5 bins.
The bin boundaries are determined like this:
- Up to 10000 training data records are read. These training data
records provide up to 10000 numeric values of the item field to be binned.
- From these values, the mean value (m) and the
standard-deviation value (s) are calculated. Both
values are rounded to avoid too many trailing digits.
For example, a value of 71.64198 might be
rounded to 71.6.
- Based on the rounded values, the following bins are created. Each bin
includes its lower boundary and excludes its upper boundary:
- (-∞, m-3s/2)
- The value is far below the average value.
- [m-3s/2, m-s/2)
- The value is below the average value.
- [m-s/2, m+s/2)
- The value is close to the average value.
- [m+s/2, m+3s/2)
- The value is above the average value.
- [m+3s/2, +∞)
- The value is far above the average value.
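The default boundary calculation can be sketched as follows. This is a minimal illustration under stated assumptions, not the product's code: the function names are hypothetical, the population standard deviation is used because the text does not say which estimator applies, and the rounding step is omitted because the rounding rule is not specified.

```python
import bisect
import statistics


def default_boundaries(values, sample_size=10000):
    """Return the four boundaries m-3s/2, m-s/2, m+s/2, m+3s/2.

    Assumptions: population standard deviation, no rounding of m and s.
    """
    sample = values[:sample_size]  # read at most sample_size records
    m = statistics.mean(sample)
    s = statistics.pstdev(sample)
    return [m - 1.5 * s, m - 0.5 * s, m + 0.5 * s, m + 1.5 * s]


def bin_number(boundaries, value):
    """Return the 1-based bin number; a value equal to a boundary
    falls into the bin that starts at that boundary."""
    return bisect.bisect_right(boundaries, value) + 1


balances = [2, 4, 4, 4, 5, 5, 7, 9]  # mean 5, population standard deviation 2
bounds = default_boundaries(balances)
print(bounds)
print([bin_number(bounds, v) for v in (1, 4, 5, 9)])
```

With real data, the boundaries would rarely be round numbers, which is why the mean and standard deviation are rounded before the bins are built.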
- Customizing the default discretization mode
You can change the default binning method like this:
- You can change the default number of records to be read for the
determination of the mean value and the standard-deviation value by
using the parameter <NumSampleRecordsForBins> in
the method DM_setAlgorithm.
For example, the following command increases the number of records to be
read for calculating the mean value and the standard-deviation value for
the default algorithm SIDE to 20000:
DM_RuleSettings..DM_setAlgorithm('SIDE','<NumSampleRecordsForBins>20000</NumSampleRecordsForBins>')
- You can change the default number of bins for all numeric
fields by using the parameter <NumBins> in
the method DM_setAlgorithm.
For example, the following command increases the default number of bins
for the default algorithm SIDE to 12 and the number of records to be read
to 20000:
DM_RuleSettings..DM_setAlgorithm
('SIDE','<NumBins>12</NumBins>
<NumSampleRecordsForBins>20000</NumSampleRecordsForBins>')
- You can change the number of bins for single item fields by using
the method DM_setFldNumBins.
For example, the following command increases the number of bins for the
field INCOME to 20:
DM_LogicalDataSpec..DM_setFldNumBins('INCOME',20)
Depending on the actual number N of bins
that is specified for a given item field, the bin widths and the bin
boundaries are adjusted like this:
- If N < 8
- Bin width = standard deviation. Bins are symmetrically centered
around the mean.
- If 7 < N < 15
- Bin width = standard deviation/2. Bins are symmetrically centered
around the mean.
- If N > 14
- Bin width = standard deviation/4. Bins are symmetrically centered
around the mean.
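The width rule above can be sketched as follows. The text does not spell out how the boundaries are placed for an arbitrary N, so this sketch assumes that the two outermost bins are open-ended and that the remaining N-2 bins are centered on the mean, which reproduces the default 5-bin layout. The function names are hypothetical.

```python
def bin_width(n_bins, s):
    """Bin width as a function of the number of bins N and the standard deviation s."""
    if n_bins < 8:
        return s
    if n_bins < 15:
        return s / 2
    return s / 4


def centered_boundaries(m, s, n_bins):
    """Boundaries for n_bins bins centered on the mean m.

    Assumption: the first and last bins are open-ended, so only the
    n_bins - 2 inner bins have the width returned by bin_width.
    """
    width = bin_width(n_bins, s)
    inner = n_bins - 2
    start = m - inner * width / 2  # lower boundary of the first inner bin
    return [start + k * width for k in range(inner + 1)]


print(centered_boundaries(5, 2, 5))            # 5 bins of width s
print(len(centered_boundaries(100, 8, 12)))    # 12 bins of width s/2
```

For N = 5 the sketch yields the boundaries m-3s/2, m-s/2, m+s/2, and m+3s/2 of the default binning.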
- Discretizing into equidistant bins
- You can specify a lower boundary value and an upper boundary value
for equidistant bins by using the command DM_setFldOutlLim.
For example, if you want to define equidistant bins for the field INCOME
between the lower boundary value 10000 and the upper boundary value 70000,
you can use the following command:
DM_LogicalDataSpec..DM_setFldOutlLim('INCOME',10000,70000)
The command above creates 5 bins: 2 outlier bins and 3 equidistant bins
between 10000 and 70000. Each equidistant bin includes its lower boundary
and excludes its upper boundary:
- <10000 (outlier bin)
- [10000, 30000)
- [30000, 50000)
- [50000, 70000)
- ≥70000 (outlier bin)
- You can combine the commands DM_setFldOutlLim and DM_setFldNumBins to
modify the number of equidistant bins.
For example, the following commands create 6 equidistant bins of width
10000 between 10000 and 70000 plus two outlier bins for the field INCOME:
DM_LogicalDataSpec..DM_setFldOutlLim('INCOME',10000,70000)
DM_LogicalDataSpec..DM_setFldNumBins('INCOME',8)
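The arithmetic behind both equidistant examples can be checked with a short sketch. The function name and parameters are hypothetical; the sketch assumes that two of the requested bins are always the outlier bins and that the remaining bins share the range between the limits equally.

```python
def equidistant_boundaries(lower, upper, n_bins=5):
    """Boundaries of the equidistant bins between two outlier limits.

    Assumption: n_bins counts the two outlier bins, so n_bins - 2
    bins of equal width lie between lower and upper.
    """
    inner = n_bins - 2
    width = (upper - lower) / inner
    return [lower + k * width for k in range(inner + 1)]


# Limits only: 3 equidistant bins of width 20000 plus 2 outlier bins.
print(equidistant_boundaries(10000, 70000))
# Limits plus 8 bins: 6 equidistant bins of width 10000 plus 2 outlier bins.
print(equidistant_boundaries(10000, 70000, n_bins=8))
```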
- Changing the threshold for discretization
- By default, the discretization of numeric fields starts if a numeric
field includes more than 100 different values. You can change this
value by using the power option parameter -MAX_DISCR_COUNT.
For example, if you use the following command, the discretization of the
numeric fields starts if more than 20 different values are found:
DM_RuleSettings..DM_setPowerOptions('-MAX_DISCR_COUNT 20')
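The threshold check itself amounts to counting distinct values. The following sketch illustrates the rule as described; the function name is hypothetical, and the way Intelligent Miner counts values internally is not documented here.

```python
def needs_discretization(values, max_discr_count=100):
    """Return True if the field has more distinct values than the threshold."""
    return len(set(values)) > max_discr_count


print(needs_discretization(range(100)))      # exactly 100 distinct values
print(needs_discretization(range(101)))      # 101 distinct values
print(needs_discretization(range(50), 20))   # threshold lowered to 20
```

A field with exactly 100 distinct values is not discretized by default, because the rule requires more than 100 values.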