Examples for creating Naive Bayes models
This example shows how to define a discretization algorithm and the number of bins.
The customer churn data set contains only numeric columns. However, a VARCHAR column is needed to create a Naive Bayes model. The following call shows how to create a view that is based on the customer churn data set and contains a VARCHAR column. Additionally, this view contains the columns that are needed for the predictions.
CREATE VIEW CUSTOMER_CHURN AS SELECT (CASE WHEN SAMPLES.CUSTOMER_CHURN."CENSOR" = '1' THEN 'YES' ELSE 'NO' END) CHURN, IN_B2B_INDUSTRY, TOTAL_BUY, ANNUAL_REVENUE_MIL, CUST_ID AS ID from SAMPLES.CUSTOMER_CHURN;
A table that contains training data is needed for Naive Bayes prediction. The following call shows
how to create this table from the CUSTOMER_CHURN view:
CALL IDAX.SPLIT_DATA('intable=customer_churn, traintable=customer_churn_training, testtable=customer_churn_testing, id=id, fraction=0.65');
The following example shows how to define a discretization algorithm and the number of bins:
CALL IDAX.NAIVEBAYES('intable=customer_churn_training, disc=ew, bins=20, id=id, target=churn, model=cc_nb_ewd');
CALL IDAX.NAIVEBAYES('intable=customer_churn_training, disc=ef, bins=20, id=id, target=churn, model=cc_nb_efd');
CALL IDAX.NAIVEBAYES('intable=customer_churn_training, disc=ewn, id=id, target=churn, model=cc_nb_ewnd');
The created model is represented by a table that contains one row for each attribute-value-class combination and the following columns:
- attribute
- The attribute
- val
- The value
- class
- The class
- classvalcount
- The number of occurrences of the corresponding value of the corresponding attribute within the class in the training set
- classcount
- The number of occurrences of the class in the training set
- attrclasscount
- The number of not-null occurrences of the class in the training set
- totalcount
- The number of all instances in the training set
These parameters are sufficient to calculate prior class probabilities and conditional attribute value probabilities that you can use for Naive Bayes prediction.
To generate test set predictions, you can use the resulting models as follows:
CALL IDAX.PREDICT_NAIVEBAYES('model=cc_nb_ewd, intable=customer_churn_testing, outtable=customer_churn_nb_ewd, outtableprob = customer_churn_nb_ewd_prob');
CALL IDAX.PREDICT_NAIVEBAYES('model=cc_nb_efd, intable=customer_churn_testing, outtable=customer_churn_nb_efd, outtableprob = customer_churn_nb_efd_prob');
CALL IDAX.PREDICT_NAIVEBAYES('model=cc_nb_ewnd, intable=customer_churn_testing, outtable=customer_churn_nb_ewnd,outtableprob = customer_churn_nb_ewnd_prob');
The test sets for each model application are discretized by using the same intervals as for the corresponding training sets. The specified output table contains the generated predictions. It contains the id column and class column.
Additionally, another output table is created, where the _prob suffix is appended to the name. This output table contains one row for each instance and possible class, and the following columns:
- id
- The instance identifier
- class
- The class
- prob
- The Bayesian numerator of the instance class
The numerator results from the class probablility and the conditional instance attribute value probabilities.
- lnprob
- The natural logarithm of prob
You can use these probabilities to obtain better insight into model predictions or to modify the model operation. You can convert the Bayesian numerators to regular probabilities by normalization. Normalization means the division by the sum of numerators for the same instance and all classes.
The following query shows an example of adding the class probability to each instance and its predicted class label for the third model that was generated by using minimum entropy discretization:
SELECT S.id, S.class, S.prob/S.sump as prob
FROM (SELECT id, class, prob, sum(prob) OVER (PARTITION BY id) AS sump FROM customer_churn_nb_ewnd_prob) S, customer_churn_nb_ewnd P
WHERE S.id=P.id AND S.class=P.class;
To evaluate the quality of the obtained predictions, you can calculate the misclassification error as follows:
CALL IDAX.CERROR('intable=customer_churn_nb_ewd, resulttable=customer_churn_testing, id=id, target=class, resulttarget=churn');
CALL IDAX.CERROR('intable=customer_churn_nb_efd, resulttable=customer_churn_testing, id=id, target=class, resulttarget=churn');
CALL IDAX.CERROR('intable=customer_churn_nb_ewnd, resulttable=customer_churn_testing, id=id, target=class, resulttarget=churn');
Although these models behave similarly, the first one, which is based on equal-frequency discretization, is not as accurate as the other two. Moreover, the misclassification error level, which is greater than 0.37 for all models, leads to the conclusion that the Naive Bayes classifier is not well-suited for the customer churn data set.
To avoid getting no probabilities at all during the prediction, you can enable the m-estimation technique.
To enable this technique, you specify the mestimation=TRUE argument for the PREDICT_NAIVEBAYES procedure as follows:
CALL IDAX.PREDICT_NAIVEBAYES('model=cc_nb_ewnd, intable=customer_churn_testing, mestimation=TRUE, outtable=customer_churn_nb_ewnd_mest');
CALL IDAX.CERROR('pred_table=customer_churn_nb_ewnd_mest, true_table=customer_churn_testing, pred_id=id, true_id=id, pred_column=class, true_column=churn');
The analysis shows that the technique has little impact on the quality of predictions that are generated for the customer_churn data.
To do more in-depth prediction quality analysis, you can use the confusion matrix and the derived quality indicators. For the third model proceed as follows:
CALL IDAX.CONFUSION_MATRIX('intable=customer_churn_testing, resulttable=customer_churn_nb_ewnd, id=id, target=churn, matrixTable=cc_churn_nb_ewnd_cm');
CALL IDAX.CMATRIX_STATS('matrixTable=cc_churn_nb_ewnd_cm');
The result sheds new light on the quality of Naive Bayes predictions for the customer churn data. The large misclassification error translates to a high false positive rate of nearly 0.55, but also to a surprisingly high true positive rate of nearly 0.77, where a customer churn is considered as positive.
You can define all columns by types and roles as shown in the following statement:
CALL IDAX.NAIVEBAYES('model=cc_nb1, intable=customer_churn_training, id=id, target=churn');
This statement is equal to the following one:
CALL IDAX.NAIVEBAYES('model=cc_nb2, intable=customer_churn_training, incolumn=id:id, target=churn');
To define nominal attributes and continuous attributes manually, you can use the nom types and the cont types:
CALL IDAX.NAIVEBAYES('model=cc_nb3, intable=customer_churn_training,incolumn=IN_B2B_INDUSTRY:nom;TOTAL_BUY:cont;CHURN:target;ID:id');
You can also create one-column properties definitions and pass them to the model several times:
CALL IDAX.COLUMN_PROPERTIES('intable=customer_churn_training, outtable=cc_columns, coldeftype=cont, incolumn=IN_B2B_INDUSTRY:nom;TOTAL_BUY:cont;ANNUAL_REVENUE_MIL:cont:ignore;CHURN:nom:target;ID:id');
select * from cc_columns;
CALL IDAX.NAIVEBAYES('model=cc_nb4, intable=customer_churn_training,colPropertiesTable=cc_columns');