Data mining — Naive Bayes classification

The Naive Bayes classification algorithm is a probabilistic classifier. It is based on probability models that incorporate strong independence assumptions.

The independence assumptions often do not hold in reality, which is why they are considered naive.

You can derive probability models by using Bayes' theorem (credited to Thomas Bayes). Depending on the nature of the probability model, you can train the Naive Bayes algorithm in a supervised learning setting.
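As a brief illustration of how Bayes' theorem and the independence assumption combine, the textbook form of the Naive Bayes model can be written as follows. The symbols are assumptions introduced here for illustration: t is a target field value and x_1, ..., x_n are the input field values of one record. This is the general formulation, not necessarily the exact computation performed by the product.

```latex
% Bayes' theorem applied to a target value t and input values x_1..x_n
P(t \mid x_1, \dots, x_n) = \frac{P(t)\, P(x_1, \dots, x_n \mid t)}{P(x_1, \dots, x_n)}

% Naive independence assumption: inputs are conditionally independent given t
P(x_1, \dots, x_n \mid t) = \prod_{i=1}^{n} P(x_i \mid t)

% Resulting classification rule (the denominator is the same for every t)
\hat{t} = \arg\max_{t} \; P(t) \prod_{i=1}^{n} P(x_i \mid t)
```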

Data mining in InfoSphere™ Warehouse uses maximum likelihood estimation to determine the parameters of Naive Bayes models. The generated Naive Bayes model conforms to the Predictive Model Markup Language (PMML) standard.

A Naive Bayes model consists of a large cube that includes the following dimensions:

Input field name
Input field value for discrete fields, or input field value range for continuous fields. Continuous fields are divided into discrete bins by the Naive Bayes algorithm.
Target field value

This means that a Naive Bayes model records how often a target field value appears together with a value of an input field.
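The cube described above can be pictured as a table of co-occurrence counts. The following Python sketch is only an illustration of that idea, not the product's implementation: the function names, the sample records, and the fixed bin edges for the continuous field are all hypothetical, and the way InfoSphere Warehouse actually chooses its bins is not described here.

```python
from collections import Counter


def bin_label(value, edges):
    """Map a continuous value to a bin label such as '[30, 50)'."""
    lower = float("-inf")
    for upper in edges:
        if value < upper:
            return f"[{lower}, {upper})"
        lower = upper
    return f"[{lower}, inf)"


def build_count_cube(records, target, bin_edges):
    """Count (input field name, value or bin, target value) combinations."""
    cube = Counter()
    for record in records:
        for field, value in record.items():
            if field == target:
                continue
            # Continuous fields (those with bin edges) are discretized first.
            if field in bin_edges:
                value = bin_label(value, bin_edges[field])
            cube[(field, value, record[target])] += 1
    return cube


# Hypothetical training data: one discrete field, one continuous field.
records = [
    {"colour": "red", "age": 25, "buys": "yes"},
    {"colour": "red", "age": 52, "buys": "no"},
    {"colour": "blue", "age": 31, "buys": "yes"},
    {"colour": "red", "age": 38, "buys": "yes"},
]
cube = build_count_cube(records, target="buys", bin_edges={"age": [30, 50]})
```

Each key of `cube` corresponds to one cell of the cube: an input field name, a value or value range, and a target value. A combination that never occurs in the training data, such as `("colour", "blue", "no")` here, simply has a count of zero.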

You can activate the Naive Bayes classification algorithm by using the following command:

 DM_ClasSettings()..DM_setAlgorithm('NaiveBayes')

The Naive Bayes classification algorithm includes the probability-threshold parameter ZeroProba. The value of this parameter is used if a cell of the cube described above is empty. A cell is empty if no training-data record exists with that combination of input-field value and target value.

The default value of the probability-threshold parameter is 0.001. Optionally, you can modify the probability threshold. For example, you can set the value to 0.0002 by using the following command:

DM_ClasSettings()..DM_setAlgorithm('NaiveBayes','<ZeroProba>0.0002</ZeroProba>')
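To see why a probability threshold is needed, consider what happens when a combination of input-field value and target value has a count of zero: without a substitute value, the whole product of probabilities for that target value would become zero. The following Python sketch shows one plausible way such a threshold could be applied. It is an assumption made for illustration, not the documented behavior of the product: the function names and the sample counts are hypothetical, and only the default of 0.001 comes from the text above.

```python
from collections import Counter


def conditional_probability(cube, target_counts, field, value, target,
                            zero_proba=0.001):
    """Estimate P(value | target); use zero_proba when the count is zero."""
    count = cube[(field, value, target)]
    if count == 0:
        # No training record has this combination: fall back to the threshold.
        return zero_proba
    return count / target_counts[target]


def score(cube, target_counts, record, zero_proba=0.001):
    """Return normalized Naive Bayes scores for each target value."""
    total = sum(target_counts.values())
    scores = {}
    for target, n in target_counts.items():
        p = n / total  # prior probability of the target value
        for field, value in record.items():
            p *= conditional_probability(
                cube, target_counts, field, value, target, zero_proba
            )
        scores[target] = p
    norm = sum(scores.values())
    return {t: s / norm for t, s in scores.items()}


# Hypothetical counts: "blue" never occurs together with target "no".
cube = Counter({
    ("colour", "red", "yes"): 2,
    ("colour", "blue", "yes"): 1,
    ("colour", "red", "no"): 1,
})
target_counts = Counter({"yes": 3, "no": 1})

result = score(cube, target_counts, {"colour": "blue"})
```

In this sketch the target value "no" still receives a small nonzero score for a blue record. A smaller threshold, such as 0.0002, penalizes unseen combinations more strongly.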