Cost matrix

If all misclassifications had equal weights, target values (class labels) that occur less frequently would carry no extra importance. You might then obtain a model that misclassifies these less frequent target values while still achieving a very low overall error rate. To produce better classification decision trees from such skewed data, the Tree heuristic automatically generates an appropriate cost matrix that balances the distribution of class labels when a decision tree is trained. You can also adjust the cost matrix manually.
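One way such balancing could work is to weight errors on a class in inverse proportion to that class's frequency in the training data. The following sketch is an illustration of that idea only, not the product's actual heuristic; the function name and the satisfied/unsatisfied labels are hypothetical.

```python
# Hypothetical sketch: build a cost matrix whose off-diagonal weights are
# inversely proportional to the frequency of the actual class, so that
# errors on rare classes cost more. NOT the product's actual heuristic.
from collections import Counter

def balanced_cost_matrix(labels):
    counts = Counter(labels)
    total = len(labels)
    # cost[actual][predicted]: zero on the diagonal (as the topic requires),
    # otherwise total / (number_of_classes * frequency_of_actual_class).
    return {
        actual: {
            pred: 0.0 if pred == actual
            else total / (len(counts) * counts[actual])
            for pred in counts
        }
        for actual in counts
    }

# Skewed data: 99 satisfied customers, 1 unsatisfied customer.
labels = ["satisfied"] * 99 + ["unsatisfied"]
matrix = balanced_cost_matrix(labels)
print(matrix["unsatisfied"]["satisfied"])  # 50.0: missing the rare class is costly
print(matrix["satisfied"]["unsatisfied"])  # ~0.505: errors on the common class are cheap
```

With these weights, a degenerate tree that ignores the rare class can no longer look good, because each of its few errors carries a large weight.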

A cost matrix (also called an error matrix) is also useful when specific classification errors are more severe than others. The Classification mining function tries to avoid classification errors with a high error weight. The trade-off of avoiding 'expensive' classification errors is an increased number of 'cheap' classification errors: the total number of errors increases while their cost decreases, compared with the same classification without a cost matrix. Each weight that you specify must be greater than or equal to zero. The default weight is 1. The diagonal of the cost matrix must be zero.

You can assign error weights to misclassifications by specifying a cost matrix. The following table shows an example of a cost matrix that uses the class labels High risk, Low risk, and Safe. The error weight for classifying customer data as Safe when it is actually High risk is 7.0. The misclassification of Low risk as Safe has an error weight of only 3.0.
Table 1. Sample cost matrix

Actual value    Predicted value    Error weight
High risk       Safe               7.0
Low risk        Safe               3.0
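To make the effect of these weights concrete, here is a minimal sketch of how a cost matrix changes a classification decision: given a model's class probabilities, the predicted label is the one with the lowest expected misclassification cost rather than simply the highest probability. The probabilities and the function name are illustrative assumptions; unspecified errors keep the default weight 1.0 and the diagonal is zero, as described above.

```python
# Cost matrix from Table 1: COST[actual][predicted].
# Diagonal entries are zero; errors not listed in the table keep
# the default weight 1.0.
COST = {
    "High risk": {"High risk": 0.0, "Low risk": 1.0, "Safe": 7.0},
    "Low risk":  {"High risk": 1.0, "Low risk": 0.0, "Safe": 3.0},
    "Safe":      {"High risk": 1.0, "Low risk": 1.0, "Safe": 0.0},
}

def predict_with_costs(probs):
    """Return the class label with the lowest expected cost.

    probs maps each class label to the model's predicted probability
    that it is the actual class. (Hypothetical helper for illustration.)
    """
    def expected_cost(predicted):
        return sum(p * COST[actual][predicted] for actual, p in probs.items())
    return min(COST, key=expected_cost)

# A customer the model considers probably Safe, but possibly High risk:
probs = {"High risk": 0.15, "Low risk": 0.10, "Safe": 0.75}
print(predict_with_costs(probs))  # High risk
```

Without the cost matrix, the most probable label (Safe) would be predicted. Because misclassifying a High risk customer as Safe carries weight 7.0, the expected cost of predicting Safe (0.15 x 7.0 + 0.10 x 3.0 = 1.35) exceeds that of predicting High risk (0.85), so the 'expensive' error is avoided.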
Example:

Your input data might contain information about customers. 99% of these customers are satisfied, and 1% of these customers are not satisfied. You might want to build a model that predicts whether a customer is satisfied by using only a small training set of data. If you use only a small set of training data, you might obtain a degenerate pruned tree. This tree might consist of only one node that predicts that all of the customers are satisfied. The model seems to be of high quality because its error rate is very low (1%). However, if you want to understand which attribute values describe a customer who is not satisfied, the model must behave differently.

You might want to specify that misclassifying a customer who is not satisfied is ten times as expensive as misclassifying a customer who is satisfied.
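Under such a weighting, the degenerate one-node tree no longer looks acceptable. The following sketch illustrates this with hypothetical labels and a made-up helper function: it scores the "always satisfied" model against a model that catches the unsatisfied customer, counting each missed unsatisfied customer as 10 error units.

```python
# Illustrative sketch: compare two models under a 10x error weight for
# missing an unsatisfied customer. Labels and function are hypothetical.

def weighted_error(actuals, predictions, weight_unsatisfied=10.0):
    """Sum of error weights: 10 per missed 'unsatisfied', 1 per other error."""
    total = 0.0
    for actual, predicted in zip(actuals, predictions):
        if actual != predicted:
            total += weight_unsatisfied if actual == "unsatisfied" else 1.0
    return total

# 99 satisfied customers and 1 unsatisfied customer, as in the example.
actuals = ["satisfied"] * 99 + ["unsatisfied"]

always_satisfied = ["satisfied"] * 100               # degenerate pruned tree
finds_unsatisfied = ["satisfied"] * 99 + ["unsatisfied"]

print(weighted_error(actuals, always_satisfied))   # 10.0 instead of 1.0
print(weighted_error(actuals, finds_unsatisfied))  # 0.0
```

The degenerate tree's single mistake now costs 10 error units, so training with this cost matrix pushes the tree toward splits that actually identify unsatisfied customers.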
