In the previous article, I examined how Bayes methods can be used to solve parameter estimation problems. You learned about maximum likelihood estimators, binomial random variables, Bernoulli processes, the beta distribution, and conjugate priors.
In this article, you will learn how to use Bayesian inference to solve classification problems in medical diagnostic testing and the analysis of binary classification surveys. You will begin by examining how conditional probability concepts can be used to assess classification performance in the context of binary classification surveys and medical diagnostic testing. You will then build and apply a Naive Bayes classifier to the results of binary classification surveys.
Bayesian and conditional probability concepts are applicable to both building classifier systems and analyzing the accuracy of their output.
Medical and survey classification
A seminal work in the area of classification theory is Classification and Regression Trees by Leo Breiman, Jerome Friedman, Richard Olshen, and Charles Stone (Wadsworth, 1984). In the opening paragraphs, the authors discuss an interesting medical situation in which large amounts of data are collected on patients and used for classification purposes:
At the University of California, San Diego Medical Center, when a heart attack patient is admitted, 19 variables are measured during the first 24 hours. These include blood pressure, age, and 17 other ordered and binary variables summarizing the medical symptoms considered as important indicators of the patient's condition.
The goal of a recent medical study was the development of a method to identify high risk patients (those who will not survive at least 30 days) on the basis of the initial 24-hour data.
.... This [tree-structured classification] rule classifies incoming patients as F [not high risk] or G [high risk] depending on the yes-no answers to, at most, three questions. Its simplicity raises the suspicion that standard statistical classification methods may give classification rules that are more accurate. When these are tried, the rules produced were considerably more intricate, but less accurate. (p. 1)
The final paragraph of this quote is interesting because it suggests that the goal of data analysis in classifier-oriented research should be to construct a classifier that is, first of all, accurate and, after that, one that is the simplest in terms of the number of variables that are used by the classifier.
I will now examine in more detail the various conditional probability metrics you can use to measure classification accuracy. The classification data that you will examine results from a simple binary classification survey: a survey with one binary test question, q1, and one binary classification question, c1. Later, when you build your Naive Bayes classifier, you will apply it to multivariate binary classification surveys (where the number of test questions qi is unlimited).
Design of a classification survey
I am reading a book, Fire and Ice: The United States, Canada, and the Myth of Converging Values, that compares Canadian and United States residents in terms of their common and diverging value systems. Based on the data reported in the book, I have reason to believe that the following question has some power to differentiate between Americans and Canadians in terms of their value orientations:
Do you agree with the following statement? "The father of the family must be master in his own home." (Y / N)
In 2000, when this question was last put to a representative sample of Canadian and US residents, the results showed that 18 percent of Canadians agreed with this statement, whereas 49 percent of US residents agreed with it.
To determine how good this question is at differentiating between Canadian and US participants, you can conduct a simple binary classification survey with a two-stage design. In the first stage, participants answer a pre-survey question that asks them what country they belong to. In the second stage, the participants answer the main survey question as stated previously.
This type of survey design allows you to ascertain how effectively this question differentiates between Canadian and US respondents. Specifically, this design allows you to compute true positive (TP), false positive (FP), true negative (TN), and false negative (FN) rates, which are basic quantities useful in assessing the accuracy of your classifier and the diagnostic quality of your questions.
These quantities can also be readily used in various conditional probability formulas to compute other accuracy measures. The accuracy metrics discussed here are frequently used to assess the appropriateness and usefulness of medical tests, and they can be expected to be similarly useful in assessing questions in simple binary classification surveys.
The following code analyzes data obtained from four hypothetical survey participants. The first column of the $data matrix represents the response to the main survey question while the second column represents the response to the pre-survey question. The raw data from this survey is fed into the ClassifierDiagnostics class using this two-dimensional array format. You can imagine querying a survey table for such data, storing the result set in this array format, then feeding the array into this class for analysis.
Listing 1. load_raw_data.php script for array data analysis
<?php
/**
* load_raw_data.php
*
* Test constructor method by supplying constructor with two
* dimensional array of survey data.
*
* Store the survey responses as follows:
*
* $data[$i][0] = Main survey response for participant $i.
* $data[$i][1] = Pre-survey response for participant $i.
*/
require_once "../ClassifierDiagnostics.php";
$data[0] = array("1", "1");
$data[1] = array("0", "0");
$data[2] = array("1", "1");
$data[3] = array("0", "1");
$classifier = new ClassifierDiagnostics($data);
$classifier->setRowName("Patriarchy");
$classifier->setRowTrue("Agree");
$classifier->setRowFalse("Disagree");
$classifier->setColumnName("Country");
$classifier->setColumnTrue("US");
$classifier->setColumnFalse("CAN");
$classifier->showCrossTabs();
?>
The showCrossTabs() method outputs the joint-frequency distribution and joint-probability distribution of responses to the patriarchy question and country-classification question. These joint-frequency tables are also commonly referred to as confusion matrices.
Table 1. showCrossTabs outputs joint-frequency and -probability distributions
To understand these tables, I will examine the simplest case first -- the true positive responses. These are cases in which a participant agrees with the question and is from the United States. A true negative occurs when a participant disagrees with the question and is from Canada.
From the point of view of classifying participants, a good question is one that can be used to accurately discriminate among the different types of participants; in other words, one for which most observations fall into the TP and TN cells. The TP and TN counts are based on participant records coded as [1, 1] or [0, 0] respectively.
Two types of misclassifications are possible. A false negative occurs when a participant disagrees with the question and is a US resident. A false positive occurs when a participant agrees with the question but is a Canadian resident. Tables with high FN and FP counts indicate that the question fails to correctly classify participants in many instances. The FN and FP counts are based on participant records coded as [0, 1] or [1, 0] respectively.
Looking at this table you can see that the question appears to be differentiating among Canadian and US respondents, but that it has produced one false negative response already. It is too early to know how good a test question this will be, but current indications are that it is in agreement with past results.
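The cell counts discussed above can be reproduced by hand. The following is a minimal sketch, not the ClassifierDiagnostics source: the function name tallyConfusionMatrix() is hypothetical, and it assumes the record layout from Listing 1 (main survey response first, pre-survey response second).

```php
<?php
/**
 * Hypothetical helper (not part of ClassifierDiagnostics): tally
 * [test, class] records into a joint frequency matrix indexed as
 * $joint_freq[test][class].
 */
function tallyConfusionMatrix(array $data) {
    // Start every cell at zero so empty cells are reported as 0.
    $joint_freq = array(
        0 => array(0 => 0, 1 => 0),
        1 => array(0 => 0, 1 => 0),
    );
    foreach ($data as $record) {
        $test  = (int) $record[0]; // main survey response (1 = agree)
        $class = (int) $record[1]; // pre-survey response (1 = US)
        $joint_freq[$test][$class]++;
    }
    return $joint_freq;
}

// The four hypothetical participants from Listing 1.
$data = array(
    array("1", "1"),
    array("0", "0"),
    array("1", "1"),
    array("0", "1"),
);

$joint_freq = tallyConfusionMatrix($data);

printf("TP=%d FP=%d FN=%d TN=%d\n",
    $joint_freq[1][1],  // agree and US
    $joint_freq[1][0],  // agree and Canadian
    $joint_freq[0][1],  // disagree and US
    $joint_freq[0][0]); // disagree and Canadian
```

For these four records the tally is two true positives, one true negative, one false negative, and no false positives.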
The ClassifierDiagnostics class is also designed to work with higher level summary data. The following code shows how you can directly load information taken from a crosstabs table similar to the confusion matrices generated by the showCrossTabs() method.
The data that is loaded comes from a study called "Graded exercise stress tests in angiographically documented coronary heart disease." Patients in the study were given the equivalent of a test question and a classification question. The test question was whether the patient had a positive or negative electrocardiograph (ECG) test. The classification question was whether the patient had a positive or negative coronary disease test (defined as coronary artery stenosis of at least 70 percent).
Listing 2. Taking data directly from a crosstabs table
<?php
/**
* load_joint_frequency.php
*
* Test loadJointFrequency() method by supplying it with data
* in the form of a joint frequency matrix. All other statistics
* can be computed from such data.
*
* Data used in this example is actual data taken from a study
* of ECG and chest pain. For more details:
*
* @see http://hippocrates.ouhsc.edu/cdmtutor/2x2/2x2tut2.html
*/
require_once "../ClassifierDiagnostics.php";
$joint_freq[1][1] = 137; // True Positives
$joint_freq[0][0] = 112; // True Negatives
$joint_freq[1][0] = 11; // False Positives
$joint_freq[0][1] = 90; // False Negatives
$classifier = new ClassifierDiagnostics();
$classifier->loadJointFrequency($joint_freq);
$classifier->setRowName("ECG");
$classifier->setRowTrue("Positive");
$classifier->setRowFalse("Negative");
$classifier->setColumnName("Coronary Disease");
$classifier->setColumnTrue("Yes");
$classifier->setColumnFalse("No");
$classifier->showCrossTabs();
$classifier->showStats();
?>
The output of the showCrossTabs() method is:
Table 2. Output from showCrossTabs() method
What you have not yet seen are the host of additional statistics that are output when you call the showStats() method:
Table 3. Additional statistics from the showStats() method call
| Description | Statistic |
|---|---|
| Test Sensitivity (TP) | 0.60 |
| False Alarm Rate (FP) | 0.09 |
| Miss Rate (FN) | 0.40 |
| Test Specificity (TN) | 0.91 |
| Base Rate | 0.65 |
| P(+Test) | 0.42 |
| P(-Test) | 0.58 |
| P(+Class | +Test) | 0.93 |
| P(-Class | +Test) | 0.07 |
| P(+Class | -Test) | 0.45 |
| P(-Class | -Test) | 0.55 |
| Likelihood Ratio(+Test) | 6.75 |
| Likelihood Ratio(-Test) | 0.44 |
| Accuracy | 0.71 |
| Gain | 1.43 |
You can verify that these statistics are correct by comparing them to the output of Dr. Hamm's Clinical Decision Making Calculator, upon which I based the output of this class. You can also learn more about how these terms are computed and what they mean by studying this site and other similar sites dedicated to clinical decision making.
The following table provides the formulas used to calculate each showStats() statistic:
Table 4. Formulas used by showStats() method
| Description | Formula |
|---|---|
| Sensitivity (TP) | TP / (TP+FN) |
| False Alarms (FP) | FP / (FP+TN) |
| Misses (FN) | FN / (TP+FN) |
| Specificity (TN) | TN / (FP+TN) |
| Base Rate | (TP+FN) / (TP+FP+FN+TN) |
| P(+Test) | (TP+FP) / (TP+FP+FN+TN) |
| P(-Test) | (FN+TN) / (TP+FP+FN+TN) |
| P(+Class | +Test) | TP / (TP+FP) |
| P(-Class | +Test) | FP / (TP+FP) |
| P(-Class | -Test) | TN / (FN+TN) |
| P(+Class | -Test) | FN / (FN+TN) |
| Likelihood Ratio(+Test) | (TP / (TP+FN)) / (FP / (FP+TN)) |
| Likelihood Ratio(-Test) | (FN / (TP+FN)) / (TN / (FP+TN)) |
| Accuracy | (TP+TN) / (TP+FP+FN+TN) |
| Gain | (TP / (TP+FP)) / ((TP+FN) / (TP+FP+FN+TN)) |
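As a rough cross-check of Tables 3 and 4, the sketch below applies several of these formulas to the joint frequencies from Listing 2. The function name diagnosticStats() and its array keys are assumptions made for illustration; they are not the ClassifierDiagnostics API.

```php
<?php
/**
 * Hypothetical helper: compute a subset of the Table 4 statistics
 * from the four cells of a 2x2 joint frequency table.
 * There is no guard against empty marginals, so a zero row or
 * column total would cause a division by zero.
 */
function diagnosticStats($tp, $fp, $fn, $tn) {
    $n = $tp + $fp + $fn + $tn;
    return array(
        'sensitivity'                => $tp / ($tp + $fn),
        'false_alarm'                => $fp / ($fp + $tn),
        'miss'                       => $fn / ($tp + $fn),
        'specificity'                => $tn / ($fp + $tn),
        'base_rate'                  => ($tp + $fn) / $n,
        'p_pos_test'                 => ($tp + $fp) / $n,
        'p_class_pos_given_test_pos' => $tp / ($tp + $fp),
        'p_class_pos_given_test_neg' => $fn / ($fn + $tn),
        'accuracy'                   => ($tp + $tn) / $n,
    );
}

// ECG study counts from Listing 2: TP, FP, FN, TN.
$stats = diagnosticStats(137, 11, 90, 112);

foreach ($stats as $name => $value) {
    printf("%-28s %.2f\n", $name, $value);
}
```

Rounded to two decimals, the printed values should agree with the corresponding rows of Table 3.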
The sensitivity (TP) and specificity (TN) metrics are two of the most important to look at. In this ECG study, the sensitivity of the ECG test was 0.60. One might conclude that the ECG test is not a particularly good test to use because in 40 out of 100 cases, the ECG does not pick up the fact that a person has heart disease. The specificity score is 0.91 which means that the ECG test is relatively good when used to classify someone as not having heart disease.
Around the middle of Table 4 you will notice that the following conditional probabilities are being computed (by dividing a joint frequency by a marginal frequency):
- P(+Class | +Test) = TP / (TP+FP) = 137 / (137 + 11) = 0.93
- P(-Class | +Test) = FP / (TP+FP) = 11 / (137 + 11) = 0.07
- P(-Class | -Test) = TN / (FN+TN) = 112 / (90 + 112) = 0.55
- P(+Class | -Test) = FN / (FN+TN) = 90 / (90 + 112) = 0.45
Interestingly, even though the sensitivity of the test is not high (0.60), the value of P(+Class | +Test) is quite high (0.93). While both metrics have TP in the numerator, the sensitivity metric divides by a column marginal (sum of the two frequencies in the TP column) while the P(+Class | +Test) metric divides by a row marginal (sum of the two frequencies in the TP row).
So, while the ECG test may not be good at picking up all cases of heart disease, when it does read positive, the person does in fact have heart disease in 93 out of 100 cases.
The likelihood ratio for a positive test is the ratio of the true positive rate (sensitivity) to the false positive rate (false alarm rate). You can use it to compare the likelihood of a positive ECG test result given that a patient has heart disease (TP) with the likelihood given that the patient does not (FP). The obtained value of 6.75 means that a positive ECG test is 6.75 times more likely to occur when a patient has heart disease than when they do not. The likelihood ratio for a negative test is the ratio of the false negative rate (miss rate) to the true negative rate (specificity). You can use it to compare the likelihood of a negative ECG test result given that a patient has heart disease (FN) with the likelihood given that the patient does not (TN). The obtained value of 0.44 means that a negative ECG test result is only 0.44 times as likely to occur when a patient has heart disease as when the patient does not.
The last statistic I will discuss is the gain metric. While this may appear to be a complicated formula, it is really just the conditional probability of the disease given a positive test result, P(+Class | +Test), divided by the base rate of the disease, P(+Class). It tells how much better one is at predicting the disease by using the diagnostic test relative to using the unconditional probability P(+Class) to guide predictions. A gain of 1.43 means that a positive test result makes heart disease 1.43 times as probable as the base rate alone suggests (0.93 versus 0.65), a relative improvement of about 43 percent over not using the test.
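Plugging the ECG counts from Listing 2 into the Table 4 formulas reproduces the likelihood ratios and the gain reported in Table 3 (values rounded to two decimals):

```latex
\begin{align*}
LR(+\text{Test}) &= \frac{TP/(TP+FN)}{FP/(FP+TN)} = \frac{137/227}{11/123} \approx 6.75 \\
LR(-\text{Test}) &= \frac{FN/(TP+FN)}{TN/(FP+TN)} = \frac{90/227}{112/123} \approx 0.44 \\
\text{Gain} &= \frac{TP/(TP+FP)}{(TP+FN)/(TP+FP+FN+TN)} = \frac{137/148}{227/350} \approx 1.43
\end{align*}
```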
Now that you have learned about metrics that you can use to assess the accuracy of a classifier, you can build one. I'll discuss and implement a Naive Bayes classifier. The classification task you will focus on is using data from a multivariate binary classification survey to classify participants into one of two categories.
I previously discussed a survey design that involved asking participants to complete a pre-survey before taking the main survey. The pre-survey asked participants whether they are residents of the United States or Canada. After they answered this question, the survey then asked them a binary response question (such as, whether they agreed or disagreed with a statement). A multivariate binary classification survey differs from a simple binary classification survey in that you can ask an unlimited number of binary-response questions in the main survey portion of the design. The hypothetical survey data involves seven participants, three test questions (denoted q1, q2, q3), and one classification question (denoted c1).
Table 5. Database table used to store results of multivariate binary classification survey
| id | q1 | q2 | q3 | c1 |
|---|---|---|---|---|
| 1 | 1 | 1 | 1 | 1 |
| 2 | 0 | 1 | 1 | 1 |
| 3 | 0 | 0 | 1 | 0 |
| 4 | 0 | 0 | 1 | 0 |
| 5 | 0 | 1 | 1 | 1 |
| 6 | 1 | 1 | 1 | 1 |
| 7 | 0 | 0 | 1 | 0 |
Naive Bayes classifier: The theory
The survey data contains information about how test question responses co-vary with classification question responses. Building a classifier involves devising an algorithm that uses sample data to learn about the relationships between object attributes and classifications so you can apply the classifier to new data that does not have an associated classification.
Imagine that a participant responds to the test questions in the US-Canada values survey as follows: q1=0, q2=1, q3=1. Formally speaking, what you want to know is whether, given this measurement vector, it is more probable that the respondent is an American (coded as 1) or a Canadian (coded as 0):
P(c1 = 0 | q1=0 & q2=1 & q3=1) = ?
P(c1 = 1 | q1=0 & q2=1 & q3=1) = ?
One might immediately note that you are being asked to compute a conditional probability and that Bayes theorem might be relevant. In the Bayes formula below you use ci to denote the classification possibilities and qj to denote each question response:
P(ci | qj) = [ P( qj | ci) * P( ci) ] / P(qj)
To determine which classification is most probable, you do not need to compute the denominator term P(qj) since it is constant for all classifications:
P(ci | qj) ~ P( qj | ci) * P( ci)
Computing the prior probability term P( ci) is easy. You simply compute the proportion of 0s and 1s in the classification column:
P(c=0) = 3/7
P(c=1) = 4/7
Computing the likelihood term P( qj | ci) is more challenging. The Naive Bayes classifier is called naive because it makes the simplifying assumption that the attributes are conditionally independent of one another given the classification. (I won't explain this concept now, but will note that it allows you to compute a likelihood for each question and to multiply these likelihood terms together to get the overall likelihood of the responses given a particular classification.)
P( qj | c1=0 ) = P(q1=0 | c1=0) * P(q2=1 | c1=0) * P(q3=1 | c1=0)
P( qj | c1=1 ) = P(q1=0 | c1=1) * P(q2=1 | c1=1) * P(q3=1 | c1=1)
You can understand how the likelihood terms are computed by understanding how one of the right-hand terms is computed.
P(q1=0 | c1=0) = count(q1=0 AND c1=0) / count(c1=0) = 3 / 3 = 1.
To compute P(q1=0 | c1=0), just use the frequency-based enumeration formula for computing a conditional probability: P(a | b) = count(a AND b) / count(b). In other words, you are dividing a joint frequency by a marginal frequency.
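The same counting rule can be sketched in a few lines of PHP. The $survey array below is a hand-copied, in-memory version of Table 5, and conditionalProbability() is a hypothetical helper rather than a method of the NaiveBayes class.

```php
<?php
// In-memory copy of Table 5 (id column omitted).
$survey = array(
    array('q1' => 1, 'q2' => 1, 'q3' => 1, 'c1' => 1),
    array('q1' => 0, 'q2' => 1, 'q3' => 1, 'c1' => 1),
    array('q1' => 0, 'q2' => 0, 'q3' => 1, 'c1' => 0),
    array('q1' => 0, 'q2' => 0, 'q3' => 1, 'c1' => 0),
    array('q1' => 0, 'q2' => 1, 'q3' => 1, 'c1' => 1),
    array('q1' => 1, 'q2' => 1, 'q3' => 1, 'c1' => 1),
    array('q1' => 0, 'q2' => 0, 'q3' => 1, 'c1' => 0),
);

/**
 * Hypothetical helper: P(attribute = value | c1 = class), estimated as
 * count(attribute = value AND c1 = class) / count(c1 = class).
 */
function conditionalProbability(array $rows, $attribute, $value, $class) {
    $joint = 0;    // rows matching both the attribute value and the class
    $marginal = 0; // rows matching the class
    foreach ($rows as $row) {
        if ($row['c1'] === $class) {
            $marginal++;
            if ($row[$attribute] === $value) {
                $joint++;
            }
        }
    }
    return $joint / $marginal; // assumes the class occurs at least once
}

$p = conditionalProbability($survey, 'q1', 0, 0);
printf("P(q1=0 | c1=0) = %.2f\n", $p); // 3 / 3
```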
The final result you are looking for is obtained when you select the maximum posterior probability computed by these formulas:
P(c1=0 | qj) ~ P( qj | c1=0 ) * P(c1=0) = ?
P(c1=1 | qj) ~ P( qj | c1=1 ) * P(c1=1) = ?
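To make the two posterior formulas concrete, the following sketch evaluates the product of likelihoods and prior for the response q1=0, q2=1, q3=1 against a hand-copied version of Table 5. The function name naiveBayesScores() is hypothetical. Unlike the classify() method shown later in this article, this sketch multiplies in every likelihood factor, including factors equal to zero, so its score for c1=0 will not match the value that classify() reports.

```php
<?php
/**
 * Hypothetical sketch: unnormalized posterior score for each class,
 * P(q | c) * P(c), with P(q | c) factored under the naive assumption.
 * Each row is array(q1, q2, q3, c1). Assumes every class occurs at
 * least once in $rows.
 */
function naiveBayesScores(array $rows, array $query, array $classes) {
    $n = count($rows);
    $scores = array();
    foreach ($classes as $class) {
        // Keep only the rows that belong to this class.
        $in_class = array();
        foreach ($rows as $row) {
            if ($row[3] === $class) {
                $in_class[] = $row;
            }
        }
        $class_count = count($in_class);
        $score = $class_count / $n; // prior P(c)
        foreach ($query as $j => $value) {
            $joint = 0;
            foreach ($in_class as $row) {
                if ($row[$j] === $value) {
                    $joint++;
                }
            }
            $score *= $joint / $class_count; // likelihood P(qj | c)
        }
        $scores[$class] = $score;
    }
    return $scores;
}

// Table 5 as array(q1, q2, q3, c1).
$rows = array(
    array(1, 1, 1, 1), array(0, 1, 1, 1), array(0, 0, 1, 0),
    array(0, 0, 1, 0), array(0, 1, 1, 1), array(1, 1, 1, 1),
    array(0, 0, 1, 0),
);

$scores = naiveBayesScores($rows, array(0, 1, 1), array(0, 1));
foreach ($scores as $class => $score) {
    printf("score(c1=%d) = %.4f\n", $class, $score);
}
```

Under this strict reading of the formulas, c1=0 scores zero because no c1=0 row in Table 5 has q2=1, while c1=1 scores 0.5 * 4/7.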
Naive Bayes classifier: The code
Now that you know the theory behind Naive Bayes classifiers, look at how it all works from a code perspective. Below is some code that invokes the classifier and classifies participants based upon their responses to the main survey questions:
Listing 3. Invoking NaiveBayes classifier on survey data
<?php
require_once "../NaiveBayes.php";
$bayes = new NaiveBayes;
$bayes->setTable("Survey");
$bayes->setAttributes(array("q1","q2","q3"));
$bayes->setClass("c1");
$bayes->setClassValues(array("0","1"));
$bayes->learn();
$bayes->classify(array("0","1","1"));
?>
In Listing 3, the Naive Bayes classifier is designed to work from data contained in a database table:
- The third statement (setTable() method) sets the database table to use.
- The fourth statement (setAttributes() method) tells the classifier which database columns contain the attribute values.
- The fifth statement (setClass() method) tells the classifier which column contains the classification values.
- The sixth statement (setClassValues() method) specifies the classification options.
- The seventh statement (learn() method) is where things get interesting; this is the step where you instruct the classifier to start learning.
As the source of the learn() method shows, learning consists of computing the joint-frequency distribution and the prior distribution. The key to simplifying this computation is to represent the joint-frequency distribution as a three-dimensional array that is updated as you examine each data point in the survey table.
Listing 4. Source of the learn() method
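<?php
/**
 * Learn the prior frequency of each class and the joint
 * frequency of each class and attribute value.
 *
 * Note: the mysql_* functions used here belong to PHP's legacy
 * mysql extension, which was removed in PHP 7.0.
 */
function learn() {
  include "connect.php";
  // Build a comma-separated list of attribute columns.
  $field_list = implode(",", $this->attributes);
  foreach ($this->class_values as $class) {
    $sql = "SELECT $field_list FROM $this->table WHERE $this->class='$class'";
    $result = mysql_query($sql);
    // Stored as a raw count; classify() converts it to a probability.
    $this->priors[$class] = mysql_num_rows($result);
    while ($row = mysql_fetch_assoc($result)) {
      foreach ($this->attributes as $attribute) {
        $attribute_value = $row[$attribute];
        // Initialize the counter the first time this cell is seen.
        if (!isset($this->joint_frequency[$class][$attribute][$attribute_value])) {
          $this->joint_frequency[$class][$attribute][$attribute_value] = 0;
        }
        $this->joint_frequency[$class][$attribute][$attribute_value]++;
      }
    }
  }
}
?>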
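To see the shape of the resulting arrays without a database, here is a minimal in-memory sketch of the same tallying loop. It is not the article's learn() method: learnFromRows() is a hypothetical function, and the rows are copied by hand from Table 5.

```php
<?php
/**
 * Hypothetical in-memory variant of the tallying done by learn().
 * Returns array(priors, joint_frequency) where
 *   priors[class]                            = number of rows in the class
 *   joint_frequency[class][attribute][value] = number of matching rows
 */
function learnFromRows(array $rows, array $attributes, $class_column) {
    $priors = array();
    $joint_frequency = array();
    foreach ($rows as $row) {
        $class = $row[$class_column];
        if (!isset($priors[$class])) {
            $priors[$class] = 0;
        }
        $priors[$class]++;
        foreach ($attributes as $attribute) {
            $value = $row[$attribute];
            if (!isset($joint_frequency[$class][$attribute][$value])) {
                $joint_frequency[$class][$attribute][$value] = 0;
            }
            $joint_frequency[$class][$attribute][$value]++;
        }
    }
    return array($priors, $joint_frequency);
}

// Table 5, keyed by column name (id column omitted).
$rows = array(
    array('q1' => 1, 'q2' => 1, 'q3' => 1, 'c1' => 1),
    array('q1' => 0, 'q2' => 1, 'q3' => 1, 'c1' => 1),
    array('q1' => 0, 'q2' => 0, 'q3' => 1, 'c1' => 0),
    array('q1' => 0, 'q2' => 0, 'q3' => 1, 'c1' => 0),
    array('q1' => 0, 'q2' => 1, 'q3' => 1, 'c1' => 1),
    array('q1' => 1, 'q2' => 1, 'q3' => 1, 'c1' => 1),
    array('q1' => 0, 'q2' => 0, 'q3' => 1, 'c1' => 0),
);

list($priors, $joint_frequency) =
    learnFromRows($rows, array('q1', 'q2', 'q3'), 'c1');

print_r($priors);
print_r($joint_frequency);
```

For these rows the class counts are 3 and 4. Cells that never occur, such as q2=1 within c1=0, are simply absent from the array, which is the same structure that appears in the print_r() dump of the NaiveBayes object later in this article.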
The eighth statement in Listing 3, $bayes->classify(array("0","1","1")), is where you classify a new instance by feeding in the attribute values to use (q1, q2, q3). The classify method looks like this:
Listing 5. Source of the classify() method
<?php
/**
 * Given a set of attribute values, this routine will
 * predict the class they most likely belong to.
 */
function classify($attribute_values) {
  $this->attribute_values = $attribute_values;
  $this->n = array_sum($this->priors);
  $this->max = 0;
  foreach ($this->class_values as $class) {
    $counter = 0;
    $this->likelihoods[$class] = 1;
    foreach ($this->attributes as $attribute) {
      $attribute_value = $attribute_values[$counter];
      // Treat a class/attribute/value cell that was never observed as 0.
      $joint_freq = isset($this->joint_frequency[$class][$attribute][$attribute_value])
        ? $this->joint_frequency[$class][$attribute][$attribute_value]
        : 0;
      // Note: zero-frequency cells are skipped rather than multiplied in,
      // so they do not force the class likelihood to zero. This departs
      // from the strict product of likelihoods.
      if ($joint_freq > 0) {
        $likelihood = $joint_freq / $this->priors[$class];
        $this->likelihoods[$class] = $this->likelihoods[$class] * $likelihood;
      }
      $counter++;
    }
    $prior = $this->priors[$class] / $this->n;
    $this->posterior[$class] = $this->likelihoods[$class] * $prior;
    if ($this->posterior[$class] > $this->max) {
      $this->predict = $class;
      $this->max = $this->posterior[$class];
    }
  }
}
?>
Here you are simply implementing the formulas discussed in the theory section. The only remaining question is what the results are.
Naive Bayes classifier: The output
To examine the results of running the Naive Bayes classifier, you append the following output generation code:
Listing 6. PHP source for outputting classification formula and NaiveBayes object
<?php
$bayes->toHTML();
echo "<pre>";
print_r($bayes);
echo "</pre>";
?>
The line $bayes->toHTML() produces the following output:
P(c1 = 0 | q1=0 & q2=1 & q3=1) = 0.42857142857143 *
P(c1 = 1 | q1=0 & q2=1 & q3=1) = 0.28571428571429
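The asterisk denotes the classification with the highest reported posterior score, which here would classify the participant as a Canadian resident (c1 = 0). Treat this particular result with caution: no Canadian respondent in Table 5 answered q2=1, so P(q2=1 | c1=0) = 0 / 3 = 0, and classify() skips any factor whose joint frequency is zero instead of multiplying by it (the object dump in Listing 7 shows a likelihood of 1 for c1 = 0). Applying the formulas from the theory section without skipping that factor gives a score of 0 for c1 = 0 and 0.5 * 4/7 = 0.2857 for c1 = 1, which would classify the participant as a US resident.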
After you call the toHTML() method, apply the print_r() function to the $bayes object to produce a human-readable depiction of the Naive Bayes object. The print_r($bayes) output summarizes the theory and details of Naive Bayes classification:
Listing 7. Human readable depiction of NaiveBayes object
naivebayes Object
(
[table] => Survey
[attributes] => Array
(
[0] => q1
[1] => q2
[2] => q3
)
[attribute_values] => Array
(
[0] => 0
[1] => 1
[2] => 1
)
[class] => c1
[class_values] => Array
(
[0] => 0
[1] => 1
)
[joint_frequency] => Array
(
[0] => Array
(
[q1] => Array
(
[0] => 3
)
[q2] => Array
(
[0] => 3
)
[q3] => Array
(
[1] => 3
)
)
[1] => Array
(
[q1] => Array
(
[1] => 2
[0] => 2
)
[q2] => Array
(
[1] => 4
)
[q3] => Array
(
[1] => 4
)
)
)
[priors] => Array
(
[0] => 3
[1] => 4
)
[likelihoods] => Array
(
[0] => 1
[1] => 0.5
)
[posterior] => Array
(
[0] => 0.42857142857143
[1] => 0.28571428571429
)
[n] => 7
[predict] => 0
[max] => 0.42857142857143
)
In the first part of this article, I talked about metrics to measure classifier accuracy. Now that you have built a Naive Bayes classifier, you could potentially use these metrics to assess whether the classifier is doing a good job.
In practice, the way this would typically be done is to have a large enough sample that you could split it into a training sample and a validation sample. The idea is to apply the learn() method to your training sample and to sequentially apply the classify() method to each participant response from the validation sample (the values of q1, q2, and q3).
For each participant in the validation sample, you then determine whether the classifier accurately classified them or not. You might display the resulting tallies in a crosstabs table and load this summary data directly into the ClassifierDiagnostics class for a more in-depth analysis of how your classifier is performing. (In general, the performance of a classifier is poorer when run on the validation sample than on the training sample because the classifier overfits the data it is being trained on. A validation sample is required to study how well the classifier generalizes to new data sets.)
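One way to produce that crosstabs summary is sketched below. The predicted and actual class values are made-up placeholders rather than results from a real validation sample, and tallyPredictions() is a hypothetical helper. The matrix is indexed as [predicted][actual] so that it follows the layout used in Listing 2, where [1][1] holds true positives and [1][0] holds false positives.

```php
<?php
/**
 * Hypothetical helper: cross-tabulate predicted classes against actual
 * classes for a validation sample. Both arrays must be the same length
 * and contain only the values 0 and 1.
 */
function tallyPredictions(array $predicted, array $actual) {
    $joint_freq = array(
        0 => array(0 => 0, 1 => 0),
        1 => array(0 => 0, 1 => 0),
    );
    foreach ($predicted as $i => $prediction) {
        $joint_freq[(int) $prediction][(int) $actual[$i]]++;
    }
    return $joint_freq;
}

// Placeholder data for illustration only (not from the article).
$predicted = array(1, 1, 0, 0, 1, 0);
$actual    = array(1, 0, 0, 1, 1, 0);

$joint_freq = tallyPredictions($predicted, $actual);

// Proportion of validation cases classified correctly.
$correct = $joint_freq[1][1] + $joint_freq[0][0];
printf("accuracy = %.2f\n", $correct / count($actual));
```

An array built this way has the same shape as the $joint_freq array in Listing 2, so in principle it could be handed to loadJointFrequency() for the fuller set of statistics.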
The Naive Bayes classifier assumes that the responses to each test question are conditionally independent of each other given the classification. For two questions, you can express this idea as follows:
P(q1 | q2 & c1) = P(q1 | c1)
Do participant responses to one question affect the probability that they will respond in the same or opposite way to another question? If several questions measure the same factor, then the answer is likely yes. One interesting aspect of Naive Bayes classification is that it often still works quite well even when the conditional independence assumption is invalid.
In general, as you pass from simple classification tasks to multivariate classification tasks, the most difficult and complex issue you have to deal with is whether your object attributes are independent or not. If they are not, then you need to examine the nature of that dependence. If, for example, a high linear correlation occurs between the variables, then the additional feature may not be providing much new information to your classifier and you should consider eliminating it. At the end of the day, you want a classifier that is both accurate and simple.
In this article, I demonstrated that conditional probability concepts and Bayes methods play an important role in assessing classifier accuracy and in building classifiers themselves.
I examined various diagnostics statistics that you can compute from confusion matrix data in order to assess classifier accuracy. You learned that classifier diagnostics that are commonly calculated in medical diagnostic testing can also be applied to binary classification surveys. You applied the ClassifierDiagnostics class to a simple binary classification survey; however, it is easy to envision it being effectively applied pairwise to each survey question (q1, q2 ... qN) and classification question (c1) in a multivariate binary classification survey.
You constructed a Naive Bayes classifier and learned how to use it to classify participants taking a multivariate binary classification survey. I discussed the fact that the Naive Bayes classification formula makes a strong independence assumption between classification attributes, but that even when this assumption is violated, a Naive Bayes classifier often works quite well. Locally optimizing the pairwise relationship between attributes and classifications is a fast and powerful technique that may not produce optimal classification performance, but often produces good results and performance on large data sets. Please note that the Naive Bayes classifier can be applied to categorical data in general, not just binary categorical data.
I hope you've started to appreciate the many inference problems to which Bayes methods can be applied, as well as conditional probability concepts and code that you can use to build your own intelligent Web applications.
| Name | Size | Download method |
|---|---|---|
| wa-bayes3.tar.gz | 6KB | HTTP |
- Learn the basic theory and algorithm for Naive Bayes classifiers from Data Mining: Concepts, Models, Methods, and Algorithms (IEEE Press & John Wiley; November 2002), by M. Kantardzic.
- Download the source code used in this article. Updates to article code will be made available at PHPMath.com.
- Try these two statistic calculators:
- Epidemiology: Introduction to CDM Calculators -- the basis for this article's statistics reported by the simple binary classifier
- EpiMax Table Calculator -- modeled on the previous calculator, it reports more stats and can be used online
- Explore the OpenEpi Project for a collection of open source software for epidemiologic statistics in JavaScript and HTML.
- Explore probability concepts in the Virtual Laboratories in Probability and Statistics center.
- Visit Radford Neal's site for research on Bayesian neural networks and essays on the philosophy of Bayesian inference.
- Identify data characteristics for which naive Bayes performs well in "An analysis of data characteristics that affect naive Bayes performance" (PDF file, 349 KB) by Rish, Hellerstein, and Thathachar (IBM Research, 2001).
- Learn a method to improve the probability estimates made by Naive Bayes and to avoid the effects of poor class conditional probabilities in "A Decomposition Of Classes Via Clustering To Explain And Improve Naive Bayes" (PDF file, 160 KB) by Vilalta and Rish (IBM Research, 2003).
- Read earlier articles in the author's series on Bayesian inference:
- "Implement Bayesian inference using PHP, Part 1" implements the underlying conditional probability calculations using PHP (developerWorks, March 2004)
- "Implement Bayesian inference using PHP, Part 2" solves parameter estimation problems (developerWorks, April 2004).
- In "PHP Naive Bayesian Filter," apply the Naive Bayes classifier to text classification problems using PHP.
- Craft Web data-gathering applications using probability models in "Apply probability models to Web data using PHP" (developerWorks, October 2003).
- Learn to construct a user-modeling platform with PHP in "Web site user modeling with PHP" (developerWorks, December 2003).
- In Classification and Regression Trees (Wadsworth; 1984), by Breiman, Friedman, Olshen, and Stone, explore the details of both the practical and theoretical sides of classification developed in the authors' study of tree methods.
- Join a practical discussion on how to interpret and use information reported by the ClassifierDiagnostics class in Handbook of Parametric and NonParametric Statistical Procedures (CRC Press; 2003), by David Sheskin.
- Check out Artificial Intelligence: A Modern Approach (Prentice Hall; 2003), by Russell and Norvig, which inspired the author's implementation of a Naive Bayes classifier.
- Dig into Fire and Ice: The United States, Canada and the Myth of Converging Values (Penguin Books, Canada; 2003), by Michael Adams, the source of the survey comparing US and Canadian value orientations.
- Explore the Bayesian approach to statistics at a level suitable for final year undergraduate and masters students in Bayesian Methods (Cambridge University Press; 1999), by Leonard and Hsu.
- Read Statistics: Probability, Inference, and Decision, 2nd ed. (International Thomson Publishing; 1975), by Winkler and Hayes, a source that the author relied on for this article.
Paul Meagher is a freelance Web developer, writer, and data analyst. Paul has a graduate degree in Cognitive Science and has spent the last six years developing Web applications. His current interests include statistical computing, data mining, content management, and e-learning. You can reach Paul at paul@datavore.com




