
# Implement Bayesian inference using PHP: Part 3

Solving classification problems

Paul Meagher is a freelance Web developer, writer, and data analyst. Paul has a graduate degree in Cognitive Science and has spent the last six years developing Web applications. His current interests include statistical computing, data mining, content management, and e-learning. You can reach Paul at paul@datavore.com

Summary:  In this third article on Bayesian inference, Paul Meagher examines how to use PHP and Bayes methods to solve classification problems in medical diagnostic testing and Web survey analysis. Learn how Bayesian and conditional probability concepts are applicable to both building classifier systems and analyzing the accuracy of their output.

Date:  11 May 2004
Level:  Intermediate

In the previous article, I examined how Bayes methods can be used to solve parameter estimation problems. You learned about maximum likelihood estimators, binomial random variables, Bernoulli processes, the beta distribution, and conjugate priors.

In this article, you will learn how to use Bayesian inference to solve classification problems in medical diagnostic testing and the analysis of binary classification surveys. You will begin by examining how conditional probability concepts can be used to assess classification performance in the context of binary classification surveys and medical diagnostic testing. You will then build and apply a Naive Bayes classifier to the results of binary classification surveys.

Bayesian and conditional probability concepts are applicable to both building classifier systems and analyzing the accuracy of their output.

Medical and survey classification

A seminal work in the area of classification theory is Classification and Regression Trees by Leo Breiman, Jerome Friedman, Richard Olshen, and Charles Stone (Kluwer Academic Publishers, 1984). In the opening paragraphs, the authors discuss an interesting medical situation in which large amounts of data are collected on patients and used for classification purposes:

At the University of California, San Diego Medical Center, when a heart attack patient is admitted, 19 variables are measured during the first 24 hours. These include blood pressure, age, and 17 other ordered and binary variables summarizing the medical symptoms considered as important indicators of the patient's condition.
The goal of a recent medical study was the development of a method to identify high risk patients (those who will not survive at least 30 days) on the basis of the initial 24-hour data.
.... This [tree-structured classification] rule classifies incoming patients as F [not high risk] or G [high risk] depending on the yes-no answers to, at most, three questions. Its simplicity raises the suspicion that standard statistical classification methods may give classification rules that are more accurate. When these are tried, the rules produced were considerably more intricate, but less accurate. (p. 1)

The final paragraph of this quote is interesting because it suggests that the goal of data analysis in classifier-oriented research should be to construct a classifier that is, first of all, accurate, and only then as simple as possible in terms of the number of variables it uses.

I will now examine in more detail the various conditional probability metrics you can use to measure classification accuracy. The classification data that you will examine comes from a simple binary classification survey: a survey with one binary test question, q1, and one binary classification question, c1. Later, when you build your Naive Bayes classifier, you will apply it to multivariate binary classification surveys (where the number of test questions qi is unlimited).

Design of a classification survey

I am reading a book, Fire and Ice: The United States, Canada, and the Myth of Converging Values, that compares Canadian and United States residents in terms of their common and diverging values systems. Based on the data reported in the book, I have reason to believe that the following question has some power to differentiate between Americans and Canadians in terms of their value orientations:

Do you agree with the following statement? "The father of the family must be master in his own home." (Y / N)

In 2000, when this question was last asked to a representative sample of Canadian and US residents, the results showed that 18 percent of Canadians agreed with this statement, whereas 49 percent of US residents agreed with it.

To determine how good this question is at differentiating between Canadian and US participants, you can conduct a simple binary classification survey with a two-stage design. In the first stage, participants answer a pre-survey question that asks them what country they belong to. In the second stage, the participants answer the main survey question as stated previously.

This type of survey design allows you to ascertain how effectively this question differentiates between Canadian and US respondents. Specifically, this design allows you to compute true positive (TP), false positive (FP), true negative (TN), and false negative (FN) rates, which are basic quantities useful in assessing the accuracy of your classifier and the diagnostic quality of your questions.

These quantities can also be readily used in various conditional probability formulas to compute other accuracy measures. The range of accuracy metrics we will be discussing are frequently used to assess the appropriateness and usefulness of medical tests and might also be expected to be useful in assessing the appropriateness and usefulness of questions in simple binary classification surveys.
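These tallies are easy to compute by hand. The following Python sketch shows how the four counts fall out of raw [response, country] records; it is illustrative only (the article's own code is the PHP `ClassifierDiagnostics` class, and the function name here is hypothetical):

```python
# Tally a confusion matrix from raw binary classification survey records.
# Each record is [response, country]: 1 = agree / US, 0 = disagree / CAN.
def confusion_counts(records):
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for response, country in records:
        if response == 1 and country == 1:
            counts["TP"] += 1   # agreed and is a US resident
        elif response == 1 and country == 0:
            counts["FP"] += 1   # agreed but is a Canadian resident
        elif response == 0 and country == 1:
            counts["FN"] += 1   # disagreed but is a US resident
        else:
            counts["TN"] += 1   # disagreed and is a Canadian resident
    return counts

# The four hypothetical participants analyzed in Listing 1:
data = [[1, 1], [1, 1], [0, 1], [0, 0]]
print(confusion_counts(data))   # {'TP': 2, 'FP': 0, 'FN': 1, 'TN': 1}
```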

Classifier diagnostics

The following code analyzes data obtained from four hypothetical survey participants. The first column of the `$data` matrix represents the response to the main survey question while the second column represents the response to the pre-survey question. The raw data from this survey is fed into the `ClassifierDiagnostics` class using this two-dimensional array format. You can imagine querying a survey table for such data, storing the result set in this array format, then feeding the array into this class for analysis.

Listing 1. load_raw_data.php script for array data analysis
```php
<?php
// NOTE: the require line, the $data array, and the loading call were
// lost in extraction and are reconstructed here (names assumed); the
// record values are chosen to match the counts in Table 1.
require_once "ClassifierDiagnostics.php";

// Four participant records: [main question response, pre-survey country].
// 1 = Agree / US, 0 = Disagree / CAN.
$data = array(
    array(1, 1),
    array(1, 1),
    array(0, 1),
    array(0, 0)
);

$classifier = new ClassifierDiagnostics();
$classifier->loadRawData($data);   // method name assumed
$classifier->setRowName("Patriarchy");
$classifier->setRowTrue("Agree");
$classifier->setRowFalse("Disagree");
$classifier->setColumnName("Country");
$classifier->setColumnTrue("US");
$classifier->setColumnFalse("CAN");
$classifier->showCrossTabs();
?>
```

The `showCrossTabs()` method outputs the joint-frequency distribution and joint-probability distribution of responses to the patriarchy question and country-classification question. These joint-frequency tables are also commonly referred to as confusion matrices.

Table 1. showCrossTabs outputs joint-frequency and -probability distributions
Joint-frequency

| Patriarchy \ Country | US | CAN |
|---|---|---|
| Agree | 2 (TP) | 0 (FP) |
| Disagree | 1 (FN) | 1 (TN) |

Joint-probability

| Patriarchy \ Country | US | CAN |
|---|---|---|
| Agree | 0.67 (TP) | 0.00 (FP) |
| Disagree | 0.33 (FN) | 1.00 (TN) |

To understand these tables, I will examine the simplest case first -- the true positive responses. These are cases in which a participant agrees with the question and is from the United States. A true negative occurs when a participant disagrees with the question and is from Canada.

From the point of view of classifying participants, a good question is one that can be used to accurately discriminate among the different types of participants. In other words, one with most observations falling into the TP and TN cells. The TP and TN counts are based on participant records coded as [1, 1] or [0, 0] respectively.

Two types of misclassifications are possible. A false negative occurs when a participant disagrees with the question and is a US resident. A false positive occurs when a participant agrees with the question but is a Canadian resident. Tables with high FN and FP counts indicate that the question fails to correctly classify participants in many instances. The FN and FP counts are based on participant records coded as [0, 1] or [1, 0] respectively.

Looking at this table, you can see that the question appears to be differentiating between Canadian and US respondents, but it has already produced one false negative response. It is too early to know how good a test question this will be, but current indications are that it agrees with past results.

Working from summary data

The `ClassifierDiagnostics` class is also designed to work with higher level summary data. The following code shows how you can directly load information taken from a crosstabs table similar to the confusion matrices generated by the `showCrossTabs()` method.

The data that is loaded comes from a study called "Graded exercise stress tests in angiographically documented coronary heart disease." Patients in the study were given the equivalent of a test question and a classification question. The test question was whether the patient had a positive or negative electrocardiograph (ECG) test. The classification question was whether the patient had a positive or negative coronary disease test (defined as coronary artery stenosis of at least 70 percent).

Listing 2. Taking data directly from a crosstabs table
```php
<?php
// NOTE: the require line, the $joint_freq array, and the constructor
// call were lost in extraction and are reconstructed here. The array
// layout is assumed to match Table 2:
// rows = ECG (Positive, Negative), columns = Coronary Disease (Yes, No).
require_once "ClassifierDiagnostics.php";

$joint_freq = array(
    array(137, 11),   // TP, FP
    array(90, 112)    // FN, TN
);

$classifier = new ClassifierDiagnostics();
$classifier->loadJointFrequency($joint_freq);
$classifier->setRowName("ECG");
$classifier->setRowTrue("Positive");
$classifier->setRowFalse("Negative");
$classifier->setColumnName("Coronary Disease");
$classifier->setColumnTrue("Yes");
$classifier->setColumnFalse("No");
$classifier->showCrossTabs();
$classifier->showStats();
?>
```

The output of the `showCrossTabs()` method is:

Table 2. Output from showCrossTabs() method
Joint-frequency

| ECG \ Coronary Disease | Yes | No |
|---|---|---|
| Positive | 137 (TP) | 11 (FP) |
| Negative | 90 (FN) | 112 (TN) |

Joint-probability

| ECG \ Coronary Disease | Yes | No |
|---|---|---|
| Positive | 0.60 (TP) | 0.09 (FP) |
| Negative | 0.40 (FN) | 0.91 (TN) |

What you have not yet seen are the host of additional statistics that are output when you call the `showStats()` method:

Table 3. Additional statistics from the showStats() method call
| Description | Statistic |
|---|---|
| Test Sensitivity (TP) | 0.60 |
| False Alarm Rate (FP) | 0.09 |
| Miss Rate (FN) | 0.40 |
| Test Specificity (TN) | 0.91 |
| Base Rate | 0.65 |
| P(+Test) | 0.42 |
| P(-Test) | 0.58 |
| P(+Class \| +Test) | 0.93 |
| P(-Class \| +Test) | 0.07 |
| P(+Class \| -Test) | 0.45 |
| P(-Class \| -Test) | 0.55 |
| Likelihood Ratio (+Test) | 6.75 |
| Likelihood Ratio (-Test) | 0.44 |
| Accuracy | 0.71 |
| Gain | 1.43 |

You can verify that these statistics are correct by comparing them to the output of Dr. Hamm's Clinical Decision Making Calculator, upon which I based the output of this class. You can also learn more about how these terms are computed and what they mean by studying this site and other similar sites dedicated to clinical decision making.

Classifier metrics explained

The following table provides the formulas used to calculate each `showStats()` statistic:

Table 4. Formulas used by `showStats()` method
| Description | Formula |
|---|---|
| Sensitivity (TP) | TP / (TP+FN) |
| False Alarms (FP) | FP / (FP+TN) |
| Misses (FN) | FN / (TP+FN) |
| Specificity (TN) | TN / (FP+TN) |
| Base Rate | (TP+FN) / (TP+FP+FN+TN) |
| P(+Test) | (TP+FP) / (TP+FP+FN+TN) |
| P(-Test) | (FN+TN) / (TP+FP+FN+TN) |
| P(+Class \| +Test) | TP / (TP+FP) |
| P(-Class \| +Test) | FP / (TP+FP) |
| P(-Class \| -Test) | TN / (FN+TN) |
| P(+Class \| -Test) | FN / (FN+TN) |
| Likelihood Ratio (+Test) | (TP / (TP+FN)) / (FP / (FP+TN)) |
| Likelihood Ratio (-Test) | (FN / (TP+FN)) / (TN / (FP+TN)) |
| Accuracy | (TP+TN) / (TP+FP+FN+TN) |
| Gain | (TP / (TP+FP)) / ((TP+FN) / (TP+FP+FN+TN)) |

The sensitivity (TP) and specificity (TN) metrics are two of the most important to look at. In this ECG study, the sensitivity of the ECG test was 0.60. One might conclude that the ECG test is not a particularly good test to use because in 40 out of 100 cases, the ECG does not pick up the fact that a person has heart disease. The specificity score is 0.91, which means that the ECG test is relatively good when used to classify someone as not having heart disease.
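You can check these two figures (and a few others from Table 3) directly against the confusion matrix counts. This is a quick Python sketch of the arithmetic, not part of the article's PHP code:

```python
# Recompute the headline metrics from the ECG confusion matrix in Table 2.
TP, FP, FN, TN = 137, 11, 90, 112
n = TP + FP + FN + TN

sensitivity = TP / (TP + FN)    # P(+Test | +Class): cases of disease caught
specificity = TN / (FP + TN)    # P(-Test | -Class): healthy cases cleared
base_rate   = (TP + FN) / n     # P(+Class): prevalence in the sample
accuracy    = (TP + TN) / n     # proportion of all cases classified correctly

print(round(sensitivity, 2))    # 0.6
print(round(specificity, 2))    # 0.91
print(round(base_rate, 2))      # 0.65
print(round(accuracy, 2))       # 0.71
```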

Around the middle of Table 4 you will notice that the following conditional probabilities are being computed (by dividing a joint frequency by a marginal frequency):

• P(+Class | +Test) = TP / (TP+FP) = 137 / (137 + 11) = 0.93
• P(-Class | +Test) = FP / (TP+FP) = 11 / (137 + 11) = 0.07
• P(-Class | -Test) = TN / (FN+TN) = 112 / (90 + 112) = 0.55
• P(+Class | -Test) = FN / (FN+TN) = 90 / (90 + 112) = 0.45

Interestingly, even though the sensitivity of the test is not especially high (0.60), the value of P(+Class | +Test) is quite high (0.93). While both metrics have TP in the numerator, the sensitivity metric divides by a column marginal (the sum of the two frequencies in the TP column) while the P(+Class | +Test) metric divides by a row marginal (the sum of the two frequencies in the TP row).

So, while the ECG test may not be good at picking up all cases of heart disease, when it does read positive in 93 out of 100 cases, the person does in fact have heart disease.

The likelihood ratio for a positive test is the ratio of the true positive rate (sensitivity) to the false alarm rate. You can use it to compare how likely a positive ECG test result is when a patient has heart disease versus when they do not. The obtained value of 6.75 means that a positive ECG result is 6.75 times more likely to occur when a patient has heart disease than when they do not. The likelihood ratio for a negative test is the ratio of the miss rate to the specificity. You can use it to compare how likely a negative ECG result is when a patient has heart disease versus when they do not. The obtained value of 0.44 means that a negative ECG test result is more likely to occur when a patient does not have heart disease.

The last statistic I will discuss is the gain metric. While this may appear to be a complicated formula, it is really just the conditional probability of the disease given a positive test result, P(+Class | +Test), divided by the base rate of the disease, P(+Class). It tells you how much better you are at predicting the disease by using the diagnostic test than by using the unconditional probability P(+Class) alone. A gain of 1.43 means that a positive diagnostic test makes a correct diagnosis of heart disease 1.43 times more likely, an improvement of 43 percent over predicting from the base rate alone.
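The likelihood ratios and the gain can be verified the same way; again, this is a Python check of the arithmetic, not the article's PHP:

```python
# Likelihood ratios and gain from the ECG confusion matrix in Table 2.
TP, FP, FN, TN = 137, 11, 90, 112
n = TP + FP + FN + TN

lr_pos = (TP / (TP + FN)) / (FP / (FP + TN))   # sensitivity / false alarm rate
lr_neg = (FN / (TP + FN)) / (TN / (FP + TN))   # miss rate / specificity
gain   = (TP / (TP + FP)) / ((TP + FN) / n)    # P(+Class|+Test) / base rate

print(round(lr_pos, 2))   # 6.75
print(round(lr_neg, 2))   # 0.44
print(round(gain, 2))     # 1.43
```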

Survey data format

Now that you have learned about metrics that you can use to assess the accuracy of a classifier, you can build one. I'll discuss and implement a Naive Bayes classifier. The classification task you will focus on is using data from a multivariate binary classification survey to classify participants into one of two categories.

I previously discussed a survey design that involved asking participants to complete a pre-survey before taking the main survey. The pre-survey asked participants whether they are residents of the United States or Canada. After they answered this question, the survey then asked them a binary response question (such as, whether they agreed or disagreed with a statement). A multivariate binary classification survey differs from a simple binary classification survey in that you can ask an unlimited number of binary-response questions in the main survey portion of the design. The hypothetical survey data involves seven participants, three test questions (denoted q1, q2, q3), and one classification question (denoted c1).

Table 5. Database table used to store results of multivariate binary classification survey
| id | q1 | q2 | q3 | c1 |
|---|---|---|---|---|
| 1 | 1 | 1 | 1 | 1 |
| 2 | 0 | 1 | 1 | 1 |
| 3 | 0 | 0 | 1 | 0 |
| 4 | 0 | 0 | 1 | 0 |
| 5 | 0 | 1 | 1 | 1 |
| 6 | 1 | 1 | 1 | 1 |
| 7 | 0 | 0 | 1 | 0 |

Naive Bayes classifier: the theory

The survey data contains information about how test question responses co-vary with classification question responses. Building a classifier involves devising an algorithm that uses sample data to learn about the relationships between object attributes and classifications so you can apply the classifier to new data that does not have an associated classification.

Imagine that a participant responds to the test questions in the US-Canada values survey as follows: q1=0, q2=1, q3=1. Formally speaking, what you want to know is, given this measurement vector, whether it is more probable that the respondent is an American (coded as 1) or a Canadian (coded as 0):

P(c1 = 0 | q1=0 & q2=1 & q3=1) = ?
P(c1 = 1 | q1=0 & q2=1 & q3=1) = ?

One might immediately note that you are being asked to compute a conditional probability and that Bayes theorem might be relevant. In the Bayes formula below, ci denotes a classification possibility and qj denotes the vector of question responses:

P(ci | qj) = [ P( qj | ci) * P( ci) ] / P(qj)

To determine which classification is most probable, you do not need to compute the denominator term P(qj) since it is constant for all classifications:

P(ci | qj) ~ P( qj | ci) * P( ci)

Computing the prior probability term P( ci) is easy. You simply compute the proportion of 0s and 1s in the classification column:

P(c=0) = 3/7
P(c=1) = 4/7

Computing the likelihood term P(qj | ci) is more challenging. The Naive Bayes classifier is called naive because it makes the simplifying assumption that the attributes are conditionally independent of one another given the classification. (I won't explain this concept in depth now, but will note that it allows you to compute a likelihood for each question and to multiply these likelihood terms together to get the overall likelihood of the responses given a particular classification.)

P( qj | c1=0 ) = P(q1=0 | c1=0) * P(q2=1 | c1=0) * P(q3=1 | c1=0)
P( qj | c1=1 ) = P(q1=0 | c1=1) * P(q2=1 | c1=1) * P(q3=1 | c1=1)

You can understand how the likelihood terms are computed by understanding how one of the right-hand terms is computed.

P(q1=0 | c1=0) = count(q1=0 AND c1=0) / count(c1=0) = 3 / 3 = 1.

To compute P(q1=0 | c1=0), just use the frequency-based formula for computing a conditional probability, P(a | b) = count(a AND b) / count(b). In other words, you are dividing a joint frequency by a marginal frequency.
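The same counting can be sketched in a few lines of Python using the seven rows from Table 5 (the helper here is hypothetical, not the article's PHP class):

```python
# Conditional probabilities P(q=v | c) counted directly from Table 5.
rows = [  # (q1, q2, q3, c1)
    (1, 1, 1, 1), (0, 1, 1, 1), (0, 0, 1, 0), (0, 0, 1, 0),
    (0, 1, 1, 1), (1, 1, 1, 1), (0, 0, 1, 0),
]

def cond_prob(q_index, q_value, c_value):
    # joint frequency count(q=v AND c) divided by marginal count(c)
    in_class = [r for r in rows if r[3] == c_value]
    matching = [r for r in in_class if r[q_index] == q_value]
    return len(matching) / len(in_class)

print(cond_prob(0, 0, 0))   # P(q1=0 | c1=0) = 3/3 = 1.0
print(cond_prob(0, 0, 1))   # P(q1=0 | c1=1) = 2/4 = 0.5
```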

The final result you are looking for is obtained when you select the maximum posterior probability computed by these formulas:

P(c1=0 | qj) ~ P(qj | c1=0) * P(c1=0) = ?
P(c1=1 | qj) ~ P(qj | c1=1) * P(c1=1) = ?

Naive Bayes classifier: The code

Now that you know the theory behind Naive Bayes classifiers, look at how it all works from a code perspective. Below is some code that invokes the classifier and classifies participants based upon their responses to the main survey questions:

Listing 3. Invoking NaiveBayes classifier on survey data
```php
<?php
// NOTE: the require and constructor lines were lost in extraction and
// are reconstructed here (file name assumed).
require_once "NaiveBayes.php";

$bayes = new NaiveBayes();
$bayes->setTable("Survey");
$bayes->setAttributes(array("q1", "q2", "q3"));
$bayes->setClass("c1");
$bayes->setClassValues(array("0", "1"));
$bayes->learn();
$bayes->classify(array("0", "1", "1"));
?>
```

In Listing 3, the Naive Bayes classifier is designed to work from data contained in a database table:

• The third statement (`setTable()` method) sets the database table to use.
• The fourth statement (`setAttributes()` method) tells the classifier which database columns contain the attribute values.
• The fifth statement (`setClass()` method) tells the classifier which column contains the classification values.
• The sixth statement (`setClassValues()` method) specifies the classification options.
• The seventh statement (`learn()` method) is where things get interesting; this is the step where you instruct the classifier to start learning.

As you can see, learning consists of computing the joint-frequency distribution and prior distribution. Note also that the key to simplifying this computation is to represent the joint-frequency distribution as a three-dimensional array that is updated as you examine each data point in the survey table.

Listing 4. Source of the learn() method
```php
<?php
// NOTE: the method signature was lost in extraction and is reconstructed
// here. The mysql_* calls reflect the PHP of the era (2004).
function learn()
{
    // Build a comma-separated list of the attribute columns
    $field_list = "";
    foreach ($this->attributes as $attribute)
        $field_list .= "$attribute,";
    $field_list = substr($field_list, 0, -1);

    foreach ($this->class_values as $class) {
        $sql = "SELECT $field_list FROM $this->table
                WHERE $this->class = '$class'";
        $result = mysql_query($sql);

        // Class frequency; converted to a prior probability in classify()
        $this->priors[$class] = mysql_num_rows($result);

        // Tally the joint-frequency distribution as a three-dimensional
        // array indexed by class, attribute, and attribute value
        while ($row = mysql_fetch_assoc($result)) {
            foreach ($this->attributes as $attribute) {
                $attribute_value = $row[$attribute];
                $this->joint_frequency[$class][$attribute][$attribute_value]++;
            }
        }
    }
}
?>
```

The eighth statement in Listing 3, `$bayes->classify(array("0","1","1"))`, is where you classify a new instance by feeding in the attribute values to use (q1, q2, q3). The `classify()` method looks like this:

Listing 5. Source of the classify() method
```php
<?php
// NOTE: the method signature was lost in extraction and is reconstructed
// here; classify() receives the new instance's attribute values.
function classify($attribute_values)
{
    $this->attribute_values = $attribute_values;
    $this->n = array_sum($this->priors);
    $this->max = 0;
    foreach ($this->class_values as $class) {
        $counter = 0;
        $this->likelihoods[$class] = 1;
        foreach ($this->attributes as $attribute) {
            $attribute_value = $attribute_values[$counter];
            $joint_freq = $this->joint_frequency[$class][$attribute][$attribute_value];
            $likelihood = $joint_freq / $this->priors[$class];
            // Skip zero-frequency terms so they do not zero out the product
            if ($joint_freq > 0) {
                $this->likelihoods[$class] = $this->likelihoods[$class] * $likelihood;
            }
            $counter++;
        }
        $prior = $this->priors[$class] / $this->n;
        $this->posterior[$class] = $this->likelihoods[$class] * $prior;
        // Track the classification with the maximum posterior probability
        if ($this->posterior[$class] > $this->max) {
            $this->predict = $class;
            $this->max = $this->posterior[$class];
        }
    }
}
?>
```

Here you are simply implementing the formulas discussed in the theory section. The only remaining question is what the results are.

Naive Bayes classifier: The output

To examine the results of running the Naive Bayes classifier, you append the following output generation code:

Listing 6. PHP source for outputting classification formula and NaiveBayes object
```php
<?php
// NOTE: the echoed markup was stripped during page extraction;
// <pre> tags are assumed here to make the print_r output readable.
$bayes->toHTML();
echo "<pre>";
print_r($bayes);
echo "</pre>";
?>
```

The line `$bayes->toHTML()` produces the following output:

P(c1 = 0 | q1=0 & q2=1 & q3=1) = 0.42857142857143 *
P(c1 = 1 | q1=0 & q2=1 & q3=1) = 0.28571428571429

The asterisk denotes the classification that has the highest posterior probability and could be used to automatically classify the participant as a Canadian resident.
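You can reproduce these posteriors independently from the Table 5 rows. The Python sketch below mirrors the `classify()` method's behavior of skipping zero-frequency terms rather than multiplying them in (which would zero out the whole product); the names here are illustrative:

```python
# Recompute the Naive Bayes posteriors for the response vector
# (q1=0, q2=1, q3=1) from the Table 5 survey data.
rows = [(1, 1, 1, 1), (0, 1, 1, 1), (0, 0, 1, 0), (0, 0, 1, 0),
        (0, 1, 1, 1), (1, 1, 1, 1), (0, 0, 1, 0)]
query = (0, 1, 1)           # q1=0, q2=1, q3=1
n = len(rows)

posteriors = {}
for c in (0, 1):
    in_class = [r for r in rows if r[3] == c]
    likelihood = 1.0
    for i, v in enumerate(query):
        freq = sum(1 for r in in_class if r[i] == v)
        if freq > 0:        # skip-zero behavior, as in Listing 5
            likelihood *= freq / len(in_class)
    posteriors[c] = likelihood * len(in_class) / n

print(round(posteriors[0], 4))   # 0.4286 -> classified as Canadian
print(round(posteriors[1], 4))   # 0.2857
```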

After you call the `toHTML()` method, apply the `print_r()` function to the `$bayes` object to produce a human-readable depiction of the Naive Bayes object. The `print_r($bayes)` output summarizes the state and workings of the Naive Bayes classifier:

Listing 7. Human readable depiction of NaiveBayes object
```
naivebayes Object
(
    [table] => Survey
    [attributes] => Array
        (
            [0] => q1
            [1] => q2
            [2] => q3
        )
    [attribute_values] => Array
        (
            [0] => 0
            [1] => 1
            [2] => 1
        )
    [class] => c1
    [class_values] => Array
        (
            [0] => 0
            [1] => 1
        )
    [joint_frequency] => Array
        (
            [0] => Array
                (
                    [q1] => Array ( [0] => 3 )
                    [q2] => Array ( [0] => 3 )
                    [q3] => Array ( [1] => 3 )
                )
            [1] => Array
                (
                    [q1] => Array ( [1] => 2 [0] => 2 )
                    [q2] => Array ( [1] => 4 )
                    [q3] => Array ( [1] => 4 )
                )
        )
    [priors] => Array
        (
            [0] => 3
            [1] => 4
        )
    [likelihoods] => Array
        (
            [0] => 1
            [1] => 0.5
        )
    [posterior] => Array
        (
            [0] => 0.42857142857143
            [1] => 0.28571428571429
        )
    [n] => 7
    [predict] => 0
    [max] => 0.42857142857143
)
```

Measuring classifier accuracy

In the first part of this article, I talked about metrics to measure classifier accuracy. Now that you have built a Naive Bayes classifier, you could potentially use these metrics to assess whether the classifier is doing a good job.

In practice, the way this would typically be done is to have a large enough sample that you could split it into a training sample and a validation sample. The idea is to apply the `learn()` method to your training sample and to sequentially apply the `classify()` method to each participant response from the validation sample (the values of q1, q2, and q3).

For each participant in the validation sample, you then determine whether the classifier accurately classified them or not. You might display the resulting tallies in a crosstabs table and load this summary data directly into the `ClassifierDiagnostics` class for a more in-depth analysis of how your classifier is performing. (In general, the performance of a classifier is poorer when run on the validation sample than on the training sample because the classifier overfits the data it is being trained on. A validation sample is required to study how well the classifier generalizes to new data sets.)

Independence assumption

The Naive Bayes classifier assumes that the responses to each test question are conditionally independent of each other given the classification. You can express this idea as follows:

P(q1 | q2, c1) = P(q1 | c1)

Do participant responses to one question affect the probability that they will respond in the same or opposite way to another question? If several questions measure the same factor, then the answer is likely yes. One interesting aspect of Naive Bayes classification is that it still works quite well even when the conditional independence assumption is invalid.

In general, as you pass from simple classification tasks to multivariate classification tasks, the most difficult and complex issue you have to deal with is whether your object attributes are independent or not. If they are not, then you need to examine the nature of that dependence. If, for example, a high linear correlation occurs between the variables, then the additional feature may not be providing much new information to your classifier and you should consider eliminating it. At the end of the day, you want a classifier that is both accurate and simple.

Conclusions

In this article, I demonstrated that conditional probability concepts and Bayes methods play an important role in assessing classifier accuracy and in building classifiers themselves.

I examined various diagnostics statistics that you can compute from confusion matrix data in order to assess classifier accuracy. You learned that classifier diagnostics that are commonly calculated in medical diagnostic testing can also be applied to binary classification surveys. You applied the `ClassifierDiagnostics` class to a simple binary classification survey; however, it is easy to envision it being effectively applied pairwise to each survey question (q1, q2 ... qN) and classification question (c1) in a multivariate binary classification survey.

You constructed a Naive Bayes classifier and learned how to use it to classify participants taking a multivariate binary classification survey. I discussed the fact that the Naive Bayes classification formula makes a strong independence assumption between classification attributes, but that even when this assumption is violated, a Naive Bayes classifier often works quite well. Locally optimizing the pairwise relationship between attributes and classifications is a fast and powerful technique that may not produce optimal classification performance, but often produces good results and performance on large data sets. Please note that the Naive Bayes classifier can be applied to categorical data in general, not just binary categorical data.

I hope you've started to appreciate the many inference problems to which Bayes methods can be applied, as well as conditional probability concepts and code that you can use to build your own intelligent Web applications.

Download: wa-bayes3.tar.gz (6 KB)

