Data mining — Discretization for numeric fields

This section describes the default use of discretization for numeric fields and provides a scenario for when discretization should be used.

In a real scenario, the fields "checking account" and "saving account" in example 4 (Table 6) contain many different balances, and the probability that two customers have the same balance is low. A mining run on those numeric fields might therefore be useless: because the balance is part of the item name, almost every customer generates items that no other customer shares.

To circumvent this problem, Intelligent Miner® provides a discretization mechanism for numeric fields. If a numeric field contains more than 20 different values, the value range is divided into buckets. The number of buckets and the boundaries of the buckets are determined automatically by the mining algorithm. By default, 5 buckets are created. You can modify the default value on the Mining Settings page of the Associations properties or the Sequences properties.
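
The decision rule described above can be sketched in Python. This is only an illustration: the function names are hypothetical, and the equal-width split is an assumption standing in for the boundaries that the mining algorithm determines itself. Only the threshold of 20 different values and the default of 5 buckets are taken from the text.

```python
def needs_discretization(values, max_distinct=20):
    """Return True if the field holds more than max_distinct different values."""
    return len(set(values)) > max_distinct


def equal_width_boundaries(values, buckets=5):
    """Return the buckets - 1 inner boundaries of an equal-width split.

    Assumption: equal-width splitting is a stand-in only. The product
    chooses the bucket boundaries automatically and may do so differently.
    """
    low, high = min(values), max(values)
    width = (high - low) / buckets
    return [low + i * width for i in range(1, buckets)]


# 51 different balances between 0 and 5000, so discretization applies.
balances = [i * 100 for i in range(51)]
if needs_discretization(balances):
    print(equal_width_boundaries(balances))
```

With 5 buckets there are 4 inner boundaries; the leftmost and rightmost buckets are open-ended.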

If a discretization of a numeric field occurs, the resultant item names are as follows:

Leftmost interval: [field name: < boundary]
Middle interval: [field name: >= lower boundary AND < upper boundary]
Rightmost interval: [field name: >= boundary]

For example, if Table 6 of example 4 contained more customer entries, with more than 20 different balances in the field 'checking account', the items resulting from that field might be:

[checking account: < -5000]
[checking account: >= -5000 AND < 0]
[checking account: >= 0 AND < 5000]
[checking account: >= 5000 AND < 10000]
[checking account: >= 10000 AND < 15000]
[checking account: >= 15000 AND < 20000]
[checking account: >= 20000 AND < 25000]
[checking account: >= 25000 AND < 30000]
[checking account: >= 30000 AND < 35000]
[checking account: >= 35000 AND < 40000]
[checking account: >= 40000]
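
As a minimal sketch, the item names listed above can be reproduced from a sorted list of boundary values. The helper item_names is hypothetical and not part of the product; it only applies the three name formats given earlier.

```python
def item_names(field, boundaries):
    """Build one item name per bucket from the given boundary values."""
    bounds = sorted(boundaries)
    # Leftmost interval: everything below the first boundary.
    names = [f"[{field}: < {bounds[0]}]"]
    # Middle intervals: one per pair of neighbouring boundaries.
    for lower, upper in zip(bounds, bounds[1:]):
        names.append(f"[{field}: >= {lower} AND < {upper}]")
    # Rightmost interval: everything from the last boundary upwards.
    names.append(f"[{field}: >= {bounds[-1]}]")
    return names


# Ten boundaries from -5000 to 40000 in steps of 5000 give eleven items.
for name in item_names("checking account", range(-5000, 40001, 5000)):
    print(name)
```

Ten boundaries always yield eleven items, because the two outer intervals are open-ended.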

Note:

The following operators are surrounded by blanks: '>' '>=' '<' 'AND'
'AND' is language-independent
The number format of the boundary values is language-independent; the decimal separator is a dot, for example: '122.76'
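
Because the operators are surrounded by blanks and the number format does not depend on the language, an item name can be split back into its parts with simple string operations. The following sketch assumes the three name formats shown in this section; parse_item is a hypothetical helper, and it handles only the '>=' and '<' operators that appear in those formats.

```python
def parse_item(name):
    """Split a discretized item name into (field, lower, upper).

    A missing bound is returned as None. float() expects a dot as the
    decimal separator regardless of the locale, which matches the
    language-independent number format of the boundary values.
    """
    field, _, condition = name.strip("[]").rpartition(": ")
    lower = upper = None
    for part in condition.split(" AND "):
        operator, value = part.split(" ")
        if operator == ">=":
            lower = float(value)
        elif operator == "<":
            upper = float(value)
        else:
            raise ValueError(f"unexpected operator: {operator}")
    return field, lower, upper


print(parse_item("[checking account: >= -5000 AND < 0]"))
print(parse_item("[checking account: < 122.76]"))
```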