Data mining — Definition of the input table layout

Typically, a training data table for Associations mining runs or for Sequence Rules mining runs consists of one item field or of several item fields. For Sequence Rules mining runs, there must also be a sequence field and a group field. The sequence field contains the transaction group IDs. The group field contains the transaction IDs or the time stamps.

The Association mining function and the Sequence Rules mining function can read groups of items out of tables with different table layouts. The following examples show the table layouts that are supported. They also show how these table layouts are defined in your settings.

The examples are based on a scenario in a bank. In the tables, the items represent the different products that the bank offers to its customers. The list of products a customer uses yields a group of items. In the examples, the following groups are used:

group1(Customer Smith): [checking account] [saving account] [credit card]
group2(Customer Jackson): [checking account] [credit card] [loan] [custody account]
group3(Customer Douglas): [checking account] [custody account]

The most flexible table layout

In the following table, there is a group ID field and an item field. Each record corresponds to one item in a group. This means that each group is represented by as many records as there are items in the group.

In this table, the group ID is mandatory because it is required to identify the items that belong to the same group.

This table shows the most flexible table layout because neither the maximum transaction length nor the number of different items per group are limited after the table format is defined.

Table 1. The most flexible table layout
Group ID	Item
Smith	checking account
Smith	saving account
Smith	credit card
Jackson	checking account
Jackson	credit card
Jackson	loan
Jackson	custody account
Douglas	checking account
Douglas	custody account

The settings for this table look like this:

IDMMX.DM_RuleSettings()..DM_useRuleDataSpec(...)
..DM_setGroup('Group ID')

Table layout with limited transaction length

In the table layout with limited transaction length and in the most inflexible table layout, each group is represented by exactly one record. The group ID is not required to find the items that belong to the same group. Therefore the group ID is optional for these table layouts for Associations. For Sequence Rules, the sequence and the group ID are mandatory.

In the following table, each item field contains the item at a given position in the group, or NULL if there is no such item in this group. The field value identifies the item, however, the field name is not relevant.

The table layout of this table limits the maximum transaction length to the number of item fields after the table is created.

Table 2. Table layout with limited transaction length
Group ID	Item1	Item2	Item3	Item4
Smith	checking account	saving account	credit card	-
Jackson	checking account	credit card	loan	custody account
Douglas	checking account	custody account	-	-

The settings for this table look like this:

IDMMX.DM_RuleSettings()..DM_useRuleDataSpec(...)
..DM_setItemFormat(1)
..DM_setFldUsageType('Group ID', 2)

The command DM_setFldUsageType deactivates the field Group ID. The group field is not required because each transaction corresponds to one row in the data.

The command DM_setItemFormat(1) sets the item format in the generated rules to format ID = 1. This means that only the item field values, for example, saving account, appear as items in the generated rules.

If there are more than one item columns and you do not set the item format to 1, the default value 0 is used. Item format 0 means that the items in the generated rules are compositions of item column name and item value, for example, Item2=saving Account.

The following table shows a variation of the table above. It includes only two item fields, however, a transaction is represented by more than one record.

Table 3. Variation of the table layout with limited transaction length
Group ID	Item1	Item2
Smith	checking account	saving account
Smith	credit card	-
Jackson	checking account	credit card
Jackson	loan	custody account
Douglas	checking account	custody account

If a transaction is represented by more than one record, a group ID must be defined. The settings for this table look like this:

IDMMX.DM_RuleSettings()..DM_useRuleDataSpec(...)
..DM_setItemFormat(1)
..DM_setGroup('Group ID')

Table layout with a multi-value item field

Table 4. Table layout with multi-value field
Group ID	Items
Smith	<item>checking account</item><item>saving account</item><item>credit account</item>
Jackson	<item>checking account</item><item>credit card</item><item>loan</item><item>custody account</item>
Douglas	<item>checking account</item><item>custody account</item>

The settings for this table look like this:

IDMMX.DM_RuleSettings()..DM_useRuleDataSpec(...)
   ..DM_setFldUsageType('Group ID',2)
   ..DM_setItemFormat(101)

The item format 101 specifies that, with exception of the deactivated Group ID field, the remaining fields are considered as item fields of the format 101. Format 101 is the multi-value format.

The most inflexible table layout - basic

In the following table, each item field corresponds to one item. This means that if your table contains N different items, there are N item fields. The field identifies the item, and the field value states if the item is part of the group. In this example, "Y" indicates that the item is a member of the group. NULL, an empty string, or a blank string indicates that it is not a member of the group.

Each string that is not NULL, empty, or blank is considered as 'Yes'-indicator. Therefore you can use any string as 'Yes'-indicator.

If you replace the following 'No' indicators by a non-empty string, see the advanced form of a most inflexible table layout below:

NULL
Empty string
Blank string

Table 5. The most inflexible table layout - basic
Group ID	checking account	saving account	credit card	loan	custody account
Smith	Y	Y	Y	-	-
Jackson	Y	-	Y	Y	Y
Douglas	Y	-	-	-	Y

The table layout above is the basic form of a most inflexible table layout because the maximum transaction length and the number of different items are set when the table is created.

The settings for the table above look like this:

IDMMX.DM_RuleSettings()..DM_useRuleDataSpec(...)
..DM_setItemFormat(2)
..DM_setFldUsageType('Group ID', 2)

The command DM_setItemFormat(2) specifies that all columns except the deactivated column Group ID are item columns of item format 2. Item format 2 means that only the item column names and not the item values are included in the generated rules.

Item format 2 applies only to bi-valued (binary) columns. In columns with format 2, the following values are interpreted as the item does not occur:

SQL NULL
Empty strings
-
false
Strings that start with n or N

Item values other than listed above are interpreted as the item occurs. In the example above, this means that the generated rules include items such as checking account. Also, the rule [checking account] ==> [credit card] has a support value of 67% because 2 of 3 customers support the rule.

The most inflexible table layout - advanced

The following table shows the advanced form of a most inflexible table layout. The field values provide additional information about the items, for example:

The balance of the checking and saving account
The number of credit cards
The type of the loan
A rating of the usage of the custody account

Some of the item fields contain numeric values. There is no restriction to categorical values.

Table 6. The most inflexible table layout - advanced
Group ID	checking account	saving account	credit card	loan	custody account
Smith	122.76	25,183.15	1	-	-
Jackson	-3,000	-	5	long term	high activity
Douglas	11,877.43	-	-	-	low activity

The settings for this table look like this:

IDMMX.DM_RuleSettings()..DM_useRuleDataSpec(...)
..DM_setItemFomat(0)
..DM_setFldUsageType('Group ID', 2)

Joined table layout

If you join two different tables for your input table, some entries might be duplicated in the resulting view. The duplicated items do not cause problems because the mining function ignores duplicated items in a group.

For example, you might want to join the following two tables:

Table 7. Additional customer data
Group ID	Customer access
Smith	Internet
Smith	ATM/Desk
Jackson	Internet
Douglas	Internet

Table 8. The most inflexible table layout - basic
Group ID	checking account	saving account	credit card	loan	custody account
Smith	Y	Y	Y	-	-
Jackson	Y	-	Y	Y	Y
Douglas	Y	-	-	-	Y

If you join the two tables above, the following table is created.

Table 9. Joined table layout
Group ID	Item1	Item2	Item3	Item4	Customer access
Smith	checking account	saving account	credit card	-	Internet
Smith	checking account	saving account	credit card	-	ATM/Desk
Jackson	checking account	credit card	loan	custody account	Internet
Douglas	checking account	custody account	-	-	Internet

The duplication of, for example, [credit card] in group 'Smith', does not affect the result because duplicated items in the group are eliminated for the calculation of the rules.

Item formats

Intelligent Miner® supports different ways how the item names in Association rule models or in Sequence rule models are built from the content and the column names of the training data table. These different ways are called item formats. Some of these item formats are described in Definition of the input table layout.

Intelligent Miner supports the following item formats:

Format 0

Field name = field value

Item names are built from the column name and the column values of the item columns in the training data. In this format, the following column values are ignored:

SQL NULL
Blanks
Empty strings

If there are more than one item fields, format 0 is the default format.

Format 1

Field value

Item names are identical to the column values of the item columns in the training data. In this format, the following column values are ignored:

SQL NULL
Blanks
Empty strings

If there is only one item field, format 1 is the default format.

Format 2

Field name

Item names are identical to the item column names of the training data table. This format is used for tables that contain several binary item columns with yes/no content. In this format, the following column values are considered as No:

SQL NULL
Blanks
Empty strings
-
f
Strings that start with n or N

Column values other than the values in the list above are considered as yes.

Format 10

Field name = field value

This format is another version of the format 0. Format 10 does not have restrictions, it accepts all item values.

Format 11

Field value

This format is another version of format 1. Format 11 does not have restrictions, it accepts all item values.

Format 101

Multi-value

This format is another version of format 1. Format 101 allows several item values to be given in one item column value. The item format looks like this:

<item>itemValue1</item>...<item>itemValueN</item>

To define the item format, you can use one of the following methods:

DM_RuleSettings..DM_setItemFormat( fieldName, formatID ): This method specifies the item format for a single field.
DM_RuleSettings..DM_setItemFormat( formatID ): This method specifies the item format for all item fields.

To retrieve the item format, you can use the method DM_getItemFormat. The method call looks like this:

DM_RuleSettings..DM_getItemFormat( fldName )