Take Web data analysis to the next level with PHP

Design your data analysis to go beyond simple raw counts

Effective, multi-level analysis of Web data is a critical element for the survival of many Web-oriented businesses, and the design (and determination) of data-analysis tests is often the job of systems administrators and in-house application designers who may not have an understanding of statistics beyond tabulating raw counts. In this article, Paul Meagher delivers the skills and concepts Web developers need to be able to apply inferential statistics to their Web data streams.

Paul Meagher (paul@datavore.com), CEO, Datavore Productions

Paul Meagher is a freelance Web developer, writer, and data analyst. Paul has a graduate degree in Cognitive Science and has spent the last six years developing Web applications. His current projects and interests center around e-learning, content management, and math-enabled Web applications. Paul resides in Truro, Nova Scotia and can be reached at paul@datavore.com.



05 August 2003

Dynamic Web sites generate an enormous amount of data -- access logs, poll and survey results, customer profiles and orders, and more -- so increasingly, the job of a Web developer is not just to create the applications that generate this data, but also to develop applications and approaches to make sense of these data streams.

Often, the response of Web developers to the growing data-analytic requirements of managing their sites is inadequate. For the most part, Web developers haven't progressed much beyond reporting various descriptive statistics to characterize the data streams. An array of inferential statistical procedures (methodologies for estimating population parameters based upon sample data) could be fruitfully exploited, but at present are not being applied.

For example, Web-access statistics (as currently compiled) are little more than frequency counts grouped in various ways. The results of polls and surveys are too often expressed in terms of simple raw counts and percentages.

Maybe developers shouldn't be expected to deal with the statistical analysis of data streams except in superficial ways. After all, there are those who devote careers to the more complex data-stream analysis; they're called statisticians and trained analysts. They can be brought in when an organization needs more than just descriptive statistics.

However, an alternative response is to acknowledge that increasing savvy with inferential statistics is becoming part of the job description for Web developers. Dynamic sites are generating more and more data and it is arguably the responsibility of Web developers and system administrators to find ways of turning this data into actionable knowledge.

I advocate the latter response; this article is intended to help Web developers and systems administrators learn (or activate, in the case of inert knowledge) the design and analysis skills necessary to apply inferential statistics to their Web data streams.

Relate Web data to experimental design

The application of inferential statistics to Web data streams involves more than learning the math underlying various statistical tests. Equally important is the ability to relate the data-collection process to critical distinctions in experimental design: What is my measurement scale? How representative is my sample? What's my population? What's the hypothesis I'm testing?

To apply inferential statistics to your Web data streams, you need to first think of your results as if they were generated by an experimental design; then select an analysis procedure appropriate to that experimental design. Even if you consider it a stretch to think of Web polls and access log data as the results of an experiment, it is critical for you to do so. Why?

  1. It will help you select an appropriate statistical test.
  2. It will help you draw the appropriate conclusions from your collected data.

One aspect of experimental design that is critical in determining the appropriate statistical test to use is the choice of measurement scale for data collection.


Examples of measurement scales

A measurement scale simply specifies a procedure for assigning symbols, letters, or numbers to a phenomenon of interest. For example, the kilogram scale allows you to assign a number to an object indicating its weight based upon standardized displacements of measuring instruments.

Four measurement scales are of importance:

  • Ratio - The kilogram scale is an example of a ratio scale -- the symbols that are assigned to object attributes have numerical meaning. You can perform various operations on those symbols (such as computing ratios) that you cannot perform using numerical values obtained using less powerful measurement scales.
  • Interval - An interval scale is a scale of measurement in which the distance between any two adjacent units of measurement (also known as intervals) is the same, but the zero point is arbitrary. Examples of interval scales include the measurement of longitude, the heights of tides, and measurements between the start and end of various years. The values on an interval scale can be added and subtracted, but cannot be meaningfully multiplied or divided.
  • Rank - A rank scale applies to a set of data that is ordinal (the values and observations belonging to it can be put in order or have a rating scale attached). A common example includes "like-dislike" polls, in which numerals have been assigned to attributes (1 = Strongly dislike to 5 = Strongly like). Usually, categories for an ordinal set of data have a natural order, but the distinction between adjoining points on the scale is not necessarily always the same. You can count and order, but not measure, ordinal data.
  • Nominal - A nominal scale of measurement is the weakest form of measurement scale and involves assigning items to groups or categories. No quantitative information is conveyed and no ordering of the items is implied by the measurements. The main numerical operation you perform on nominal scale data is counting the frequency of items in each category.

The following table contrasts the features of each scale of measurement:

Measurement scales
Scale    | Absolute numerical meaning to attributes?   | Can perform most math operations?
Ratio    | Yes.                                        | Yes.
Interval | To the interval; zero point is arbitrary.   | Addition and subtraction.
Rank     | No.                                         | Counting and ordering.
Nominal  | No.                                         | Counting only.

In this article I will focus on data collected using a nominal scale of measurement and the inferential techniques appropriate for nominal data.

Use the nominal scale

Almost all Web users -- designers, customers, sysadmins -- are familiar with the nominal scale. Web polls and access logs are similar in that a nominal scale is often used as the measurement scale. In Web polls, you often measure preferences by asking people to choose a response option (such as, "Do you prefer Brand A, Brand B, or Brand C?"). The data is summarized by counting the frequency of each type of response.

In a similar vein, a common way to measure Web site traffic is by assigning each hit or visit to the day of the week it occurred and counting the numbers of hits or visits that occurred on each day. Also, you can (and do) count hits by browser type, operating system type, country of origin -- any categorical dimension you can imagine.
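
As a minimal sketch of this kind of tally, the following function assigns each hit to a day-of-week category and counts the hits per category. The function name and the sample timestamps are made up for illustration; real access-log data would first need to be parsed into Unix timestamps.

```php
<?php
// Hypothetical helper: tally Unix timestamps into day-of-week categories.
function countHitsByWeekday(array $timestamps): array
{
    // Start every category at zero so days with no hits still appear.
    $counts = array_fill_keys(
        array('Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'),
        0
    );
    foreach ($timestamps as $ts) {
        // gmdate('D', ...) returns a three-letter day name in UTC.
        $counts[gmdate('D', $ts)]++;
    }
    return $counts;
}

// Made-up sample: three hits on 2003-08-04 (a Monday), one on 2003-08-05.
$hits = array(
    gmmktime(9, 0, 0, 8, 4, 2003),
    gmmktime(12, 30, 0, 8, 4, 2003),
    gmmktime(23, 59, 0, 8, 4, 2003),
    gmmktime(1, 15, 0, 8, 5, 2003),
);
print_r(countHitsByWeekday($hits));
```

The same pattern applies to any categorical dimension (browser type, country of origin): map each record to a category label and increment that label's counter.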

Because Web polls and access statistics both involve counting the number of times data falls into a particular qualitative category, they can be analyzed with similar non-parametric statistical tests (tests that allow you to make inferences based on distribution shape rather than population parameters).

David Sheskin, in his Handbook of Parametric and Non-Parametric Statistical Procedures (p.19, 1997), differentiates between parametric and non-parametric tests in this way:

The distinction employed in this book for categorizing a procedure as parametric versus a nonparametric test is primarily based on the level of measurement represented by the data that are being analyzed. As a general rule, inferential statistical tests which evaluate categorical/nominal data and ordinal/rank-order data are categorized as nonparametric tests, while those tests that evaluate interval data or ratio data are categorized as parametric tests.

Non-parametric tests are also useful when certain assumptions underlying a parametric test are questionable; they are powerful in detecting population differences when parametric assumptions are not satisfied. In the case of Web polls, I am using a non-parametric analysis procedure because Web polls often use a nominal scale to record voter preference.

I am not suggesting that Web polls and Web access statistics should always use a nominal measurement scale, or that non-parametric statistical tests are the only ones that can be used for the analysis of this data. It is easy to imagine polls and surveys, for example, that would require users to provide numerical ratings (1 to 100) for each option and for which parametric statistical tests would be appropriate.

Still, many Web data streams involve compiling categorical count data and that data (measured using more powerful measurement scales) can be turned into nominal scale data by defining intervals (such as, 17 to 21) and assigning each data point to an interval (such as, "young adult"). The ubiquity of frequency data, embedded in the experience of Web developers, makes focusing on non-parametric statistics an appropriate starting point to learn how to apply inferential techniques to data streams.
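
To illustrate the conversion, here is a small sketch that turns ages (ratio-scale data) into nominal categories and counts them. Only the 17-to-21 "young adult" interval comes from the text above; the other brackets, the function name, and the sample ages are assumptions chosen for the example.

```php
<?php
// Hypothetical age brackets; only "young adult" (17 to 21) is from the article.
function ageToCategory(int $age): string
{
    if ($age < 17) {
        return 'minor';
    } elseif ($age <= 21) {
        return 'young adult';
    } elseif ($age <= 64) {
        return 'adult';
    }
    return 'senior';
}

$ages = array(16, 17, 19, 21, 22, 35, 70);   // made-up sample
// Map each age to its category label, then count the labels.
$freq = array_count_values(array_map('ageToCategory', $ages));
print_r($freq);
```

Once the data is in this form, it is plain frequency data and can be analyzed with the same techniques as poll counts.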

To keep this article to a manageable size, I'll limit the discussion of the analysis of Web data streams to Web polls. Remember, however, that many Web data streams can be represented in terms of nominal count data and that the inferential techniques I discuss will allow you to go beyond reporting simple count data.


Start with the sampling

Imagine you run a weekly poll on your site -- www.NovaScotiaBeerDrinkers.com -- asking members for their opinions on various topics. You've created a poll that asks members their favorite brand of beer (in Nova Scotia, there are three well-known brands: Keiths, Olands, and Schooner). To make the survey as inclusive as possible, you include "Other" among the responses.

You receive 1,000 responses and observe the results in Table 1. (The results shown in this article are for demonstration purposes only and not based on actual surveys.)

Table 1. The beer poll
Keiths       | Olands       | Schooner     | Other
285 (28.50%) | 250 (25.00%) | 215 (21.50%) | 250 (25.00%)

The data appears to support the conclusion that Keiths is the most popular brand among Nova Scotia residents. Based on these numbers, can you draw this conclusion? In other words, can you make an inference about the population of Nova Scotia beer drinkers on the basis of results obtained from the sample?

Many factors related to how the sample was collected could render your relative popularity inferences incorrect. Perhaps the sample consists of an inordinate number of employees of Keith's Brewery; perhaps you didn't properly guard against multiple votes by one person who may have biased the outcome; perhaps those who elected to vote are different from those who elected not to vote; perhaps the online voters are different from the offline voters.

Most Web polls are subject to such interpretive difficulties. These interpretive difficulties arise when you try to draw conclusions about a population parameter from a sample statistic. From an experimental design point of view, one of the first questions to ask before you collect data is whether you can take steps to help ensure that your sample is representative of the population of interest.

If drawing conclusions about the population of interest is your motivation for a Web poll (versus entertainment for site visitors), then you should implement techniques to ensure one vote per person (for example, require voters to log in with a unique ID) and to randomize the sample of voters (for instance, select a random subset of members and e-mail them encouragement to vote).
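
Both safeguards can be sketched in a few lines. The helper names, member IDs, and in-memory vote store below are hypothetical; a real site would back them with its own login system and database.

```php
<?php
// Hypothetical helpers; member IDs and option names are placeholders.

// Record a vote only if this member has not voted yet.
function recordVote(array &$votes, string $memberId, string $option): bool
{
    if (isset($votes[$memberId])) {
        return false;           // duplicate vote is ignored
    }
    $votes[$memberId] = $option;
    return true;
}

// Pick $n distinct members at random to invite to the poll.
function pickRandomMembers(array $memberIds, int $n): array
{
    $ids = array_values($memberIds);
    shuffle($ids);                       // put the IDs in random order
    return array_slice($ids, 0, $n);     // keep the first $n of them
}

$votes = array();
recordVote($votes, 'member42', 'Keiths');
recordVote($votes, 'member42', 'Schooner');   // rejected: already voted
recordVote($votes, 'member7', 'Olands');

$invited = pickRandomMembers(array('m1', 'm2', 'm3', 'm4', 'm5'), 2);
print_r($votes);
print_r($invited);
```

Keying the vote store on the member ID is what enforces one vote per person; the shuffle-and-slice step gives every member the same chance of being invited.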

Ultimately, the aim is to eliminate, or at least reduce, various biases that might impair your ability to draw inferences about your population of interest.


Test the hypothesis

Assuming that the sample of Nova Scotia beer drinkers is not biased, can you now conclude that Keith's is the most popular brand?

To answer this question, consider a related question: If you were to obtain another sample of Nova Scotia beer drinkers, would you expect to see exactly the same results? Realistically, you would expect some variability in the observed outcomes from sample to sample.

Given this expected sampling variability, you might wonder whether the observed brand preferences might be better accounted for by random sampling variability rather than by reflecting real differences in the population of interest. In statistical terminology, this sampling variability explanation is called the null hypothesis. (The null hypothesis is designated by the symbol Ho.) In this instance, formulate it as the statement that the expected number of responses is the same across all categories of response:

Ho: # Keiths = # Olands = # Schooner = # Other

If you can rule out the null hypothesis, then you will make some progress towards answering the original question about whether Keith's is the most popular brand. The alternative hypothesis that you can then entertain is that the proportions are different in the population of interest.

This test-the-null-hypothesis-first logic applies at multiple stages in the analysis of poll data. Having ruled out the null hypothesis that there are no overall differences in your data, you may then proceed to test a more specific null hypothesis, namely that no difference exists between Keith's and Schooner or between Keith's and all other brands.

You proceed by testing the null hypothesis rather than directly assessing the alternative hypothesis because it is easier to statistically model what one would expect to observe under the null hypothesis. Next, I'll demonstrate how to model what would be expected under the null hypothesis so that I can compare the observed results to it.


Model the null hypothesis: The Chi Square statistic

So far you have summarized the results of your Web poll using a table that reports frequency counts (and percentages) for each response option. To test the null hypothesis (that no difference exists between table cell frequencies), it helps to compute an overall measure of how much each table cell deviates from the value you would expect under the null hypothesis.

In the case of this beer poll, the expected frequency under the null hypothesis is the following:

Expected Frequency = Number of Observations / Number of Response Options 
Expected Frequency = 1000 / 4 
Expected Frequency = 250

To compute an overall measure of how much the responses deviate from the expected frequency per cell, you can sum up all the differences into an overall measure of how much the observed frequencies differ from the expected frequencies: (285 - 250) + (250 - 250) + (215 - 250) + (250 - 250).

If you do this, you find that the sum is 0 because deviations from a mean always sum to 0. To get around this problem, square all the difference scores (hence the square in Chi Square). Finally, to make the score comparable across samples with different numbers of observations (in other words, to standardize it), divide each squared difference by the expected frequency. So, the formula for the Chi Square statistic looks like this ("O" means "observed frequency" and "E" means "expected frequency"):

Figure 1. The formula for the Chi Square statistic
$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

If you calculate the Chi Square statistic for the beer poll data, you obtain a value of 9.80. To test your null hypothesis, you want to know the probability of obtaining a value this extreme under the assumption that it is due to random sampling variability. To find this probability, you need to understand what the sampling distribution for Chi Square looks like.
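
The arithmetic behind the 9.80 figure can be checked with a few lines of PHP. This is a minimal sketch that assumes equal expected frequencies in every cell; the function name is made up for the example and is not part of the article's software package.

```php
<?php
// Sketch: one-dimensional Chi Square with equal expected frequencies.
function chiSquareUniform(array $observed): float
{
    // Expected frequency = number of observations / number of options.
    $expected = array_sum($observed) / count($observed);
    $chiSq = 0.0;
    foreach ($observed as $o) {
        // Squared deviation, standardized by the expected frequency.
        $chiSq += pow($o - $expected, 2) / $expected;
    }
    return $chiSq;
}

$beerPoll = array(285, 250, 215, 250);
printf("%.2f\n", chiSquareUniform($beerPoll));   // prints 9.80
```

Only the Keiths and Schooner cells contribute: each deviates by 35 from the expected 250, giving 1225 / 250 = 4.90 per cell.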


Look at the Chi Square sampling distribution

(The reference image for the following graphic comes from the online NIST/SEMATECH Engineering Statistics Internet Handbook.)

Figure 2. Chi Square graphs

In each of the graphs, the bottom axis reflects the size of an obtained Chi Square score (the range shown is 0 to 10). The left axis shows the probability, or relative frequency of occurrence, of various Chi Square values.

As you study these Chi Square graphs, note that the shape of the probability function changes when you vary the degrees of freedom, or df, in your experiment. In the case of poll data, the degrees of freedom are computed by noting the number of response options in the poll (k) and subtracting 1 from that value (df = k - 1).

In general, the probability of obtaining a large Chi Square value goes up as you increase the number of response options in your study. This is because as you add response options, you increase the number of squared difference scores -- (Observed - Expected)^2 -- you sum over. So, as you add response options, the probability of obtaining a large Chi Square value increases and the probability of obtaining a smaller Chi Square value decreases. This is why the shape of the Chi Square sampling distribution changes for different df values.

Also, note that you are generally not interested in the point probability of the Chi Square outcome, but rather are interested in the summed area of the curve falling to the right of the obtained value. This tail probability tells you whether obtaining a value as extreme as the one you observe is likely (a large tail area) or not (a small tail area). (In practice, I don't use such graphs to compute tail probabilities because I can implement mathematical functions to return the tail probability for a given Chi Square value. This is performed in the Chi Square program that I discuss later in this article.)
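
One way such a function might be written is sketched below. It computes the right-tail area from the series expansion of the regularized incomplete gamma function. This is an illustrative stand-in under that assumption, not the code in the article's Distribution.php, and the function name is hypothetical.

```php
<?php
// Sketch: upper-tail probability of a Chi Square value, using the series
// P(a, x) = x^a e^(-x) / Gamma(a + 1) * sum_n x^n / ((a+1)...(a+n)),
// with a = df / 2 and x = chiSq / 2. The tail area is 1 - P(a, x).
function chiSquareTailProb(float $chiSq, int $df): float
{
    if ($chiSq <= 0) {
        return 1.0;             // the whole distribution lies to the right
    }
    $a = $df / 2;
    $x = $chiSq / 2;

    // Build Gamma(a + 1) with the recurrence Gamma(z + 1) = z * Gamma(z),
    // starting from Gamma(1) = 1 (even df) or Gamma(1/2) = sqrt(pi) (odd df).
    $z = ($df % 2 == 0) ? 1.0 : 0.5;
    $gamma = ($df % 2 == 0) ? 1.0 : sqrt(M_PI);
    for (; $z < $a + 1; $z += 1.0) {
        $gamma *= $z;
    }

    // Sum the series until the terms no longer change the total.
    $sum = 1.0;
    $term = 1.0;
    for ($n = 1; $n < 1000 && $term > 1e-15 * $sum; $n++) {
        $term *= $x / ($a + $n);
        $sum += $term;
    }

    $lowerArea = exp($a * log($x) - $x) / $gamma * $sum;
    return 1.0 - $lowerArea;
}

printf("%.2f\n", chiSquareTailProb(9.80, 3));   // prints 0.02
```

Because the result is computed as one minus the lower area, it loses precision for very small tail probabilities; for deciding whether a value falls below a cutoff such as 0.05, that is not a concern.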

To gain further insight into how these graphs were derived, look at how you can simulate the contents of the graph corresponding to df = 2 (which implies k = 3). Imagine putting the numbers 1, 2, and 3 in a hat, shaking it, selecting a number, and recording the selected number for that trial. Run this experiment for 300 trials and compute the frequencies at which 1, 2, and 3 occur.

Each time you run this experiment you should expect a slightly different frequency distribution for the outcomes that reflects sampling variability and is not a real bias among the response alternatives.

The Multinomial class that follows implements this idea. You initialize the class with values indicating the number of experiments you want to run, the number of trials per experiment, and the number of options per trial. The outcome of each experiment is recorded in an array called Outcomes.

Listing 1. Multinomial class outcomes
<?php

// Multinomial.php

// Copyright 2003, Paul Meagher
// Distributed under LGPL  

class Multinomial {

  var $NExps;
  var $NTrials;
  var $NOptions;
  var $Outcomes = array();

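  // Constructor: run all experiments and store each outcome
  function __construct($NExps, $NTrials, $NOptions) {
    $this->NExps    = $NExps;
    $this->NTrials  = $NTrials;
    $this->NOptions = $NOptions;
    for ($i=0; $i < $this->NExps; $i++) {
      $this->Outcomes[$i] = $this->runExperiment();      
    }
  }
  
  // Run one experiment: NTrials random draws from NOptions alternatives
  function runExperiment() {
    // Start every option at zero so unchosen options still get a cell
    $Outcome = array_fill(1, $this->NOptions, 0);
    for ($i = 0; $i < $this->NTrials; $i++){
      $choice = rand(1,$this->NOptions);
      $Outcome[$choice]++;
    }
    return $Outcome;
  }     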
     
}
?>

Note that the runExperiment method is the critical part of the script and implements the random choice of a response alternative and keeps track of which choices have been made so far in the simulated experiment.

To find the sampling distribution of the Chi Square statistic, simply take the outcome of each experiment and compute a Chi Square statistic for that result. This Chi Square statistic will vary from experiment to experiment due to random sampling variability.

The following script writes the obtained Chi Square statistic from each experiment to an output file for later plotting.

Listing 2. Writing the obtained Chi Square statistic to output file
<?php

// simulate.php

// Copyright 2003, Paul Meagher
// Distributed under LGPL  

// Set time limit to 0 so script doesn't time out
set_time_limit(0);

require_once "../init.php";
require PHP_MATH . "chi/Multinomial.php";
require PHP_MATH . "chi/ChiSquare1D.php";

// Initialization parameters
$NExps    = 10000;
$NTrials  = 300;
$NOptions = 3;

$multi = new Multinomial($NExps, $NTrials, $NOptions);

$output = fopen("./data.txt","w") OR die("file won't open");
for ($i=0; $i<$NExps; $i++) {    
  // For each multinomial experiment, do chi square analysis
  $chi = new ChiSquare1D($multi->Outcomes[$i]);

  // Load obtained chi square value into sampling distribution array 
  $distribution[$i] = $chi->ChiSqObt;  

  // Write obtained chi square value to file
  fputs($output, $distribution[$i]."\n");  
}
fclose ($output);

?>

To visualize the results expected from running this experiment, the simplest route for me was to load the data.txt file into the open source statistics package R, run the histogram command, and edit the plot in a graphics editor, as in the following:

x = scan("data.txt") 
hist(x, 50)

As you can see, the histogram of these Chi Square values approximates the continuous Chi Square distribution presented above for df = 2.

Figure 3. Values approximate continuous distribution for df = 2
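
If you want a numeric check on the simulation instead of a plot, the following self-contained sketch repeats the hat-drawing experiment and reports how often the simulated Chi Square value exceeds 5.991, the standard tabled critical value for df = 2 at the 0.05 level. The function name, the seed, and the choice of 2,000 experiments are assumptions made for this example; it does not use the Multinomial or ChiSquare1D classes.

```php
<?php
// Sketch: simulate the sampling distribution of Chi Square for k = 3
// equally likely options (df = 2) and measure the size of the right tail.
mt_srand(2003);                 // fixed seed so reruns give the same numbers

function simulateChiSquare(int $nTrials, int $nOptions): float
{
    $counts = array_fill(1, $nOptions, 0);
    for ($t = 0; $t < $nTrials; $t++) {
        $counts[mt_rand(1, $nOptions)]++;    // draw one number from the hat
    }
    $expected = $nTrials / $nOptions;
    $chiSq = 0.0;
    foreach ($counts as $observed) {
        $chiSq += pow($observed - $expected, 2) / $expected;
    }
    return $chiSq;
}

$nExps = 2000;
$critical = 5.991;              // tabled critical value, df = 2, alpha = 0.05
$exceed = 0;
for ($e = 0; $e < $nExps; $e++) {
    if (simulateChiSquare(300, 3) > $critical) {
        $exceed++;
    }
}
$tailFraction = $exceed / $nExps;
printf("Fraction beyond %.3f: %.3f\n", $critical, $tailFraction);
```

The reported fraction should land near 0.05, which is what you would expect if the simulated values follow the df = 2 curve.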

In the next few sections, I focus on explaining how the Chi Square software used in this simulation works. Ordinarily the Chi Square software would be used to analyze real nominal data (such as Web poll results, weekly traffic reports, or customer brand-preference reports) instead of the simulated data you used. You might also be interested in other outputs that the software generates, such as summary tables and tail probability.


Chi Square instance variables

The PHP-based Chi Square software package I developed consists of classes for analyzing frequency data that is classified along one or two dimensions (ChiSquare1D.php and ChiSquare2D.php). I'll limit my discussion to explaining how the ChiSquare1D.php class works and how it can be applied to one-dimensional Web poll data.

Before moving on, I should note that classifying data along two dimensions (for instance, beer preference by gender) allows you to begin to explain your outcomes by looking for systematic relationships, or conditional probabilities, among your "contingency" table cells. While much of the discussion that follows will help you to understand how the ChiSquare2D.php software works, you will need to address additional experimental, analysis, and visualization issues that are not discussed in this article before using that class.

Listing 3 looks at a fragment of the ChiSquare1D.php class which consists of:

  1. A file that is included
  2. The class instance variables

Listing 3. Fragment of Chi Square class with included file and instance variables
<?php

// ChiSquare1D.php

// Copyright 2003, Paul Meagher
// Distributed under LGPL  

require_once PHP_MATH . "dist/Distribution.php";

class ChiSquare1D {
  
  var $Total;
  var $ObsFreq  = array(); // Observed frequencies
  var $ExpFreq  = array(); // Expected frequencies
  var $ExpProb  = array(); // Expected probabilities  
  var $NumCells;
  var $ChiSqObt;    
  var $DF;
  var $Alpha;  
  var $ChiSqProb;      
  var $ChiSqCrit;  

}

?>

A file called Distribution.php is included at the top of this script in Listing 3. The included path incorporates a PHP_MATH constant set in an init.php file that is assumed to have been included in a calling script.

The included file, Distribution.php, contains methods that generate sampling-distribution statistics for several commonly used sampling distributions (Student T, Fisher F, Chi Square). The ChiSquare1D.php class needs access to the Chi Square methods in Distribution.php to compute the tail probability of an obtained Chi Square value.

The list of instance variables in this class is worth noting because they define the result object that is generated by the analysis procedure. This result object contains all the important details about the test, including three critical Chi Square statistics -- ChiSqObt, ChiSqProb, and ChiSqCrit. For details on how each instance variable is computed, you can look at the constructor method for the class where all these values are derived.


The Constructor: Backbone of the Chi Square test

Listing 4 looks at the Chi Square constructor code which forms the backbone of the Chi Square test.

Listing 4. The Chi Square constructor
<?php

class ChiSquare1D {     

  function __construct($ObsFreq, $Alpha=0.05, $ExpProb=FALSE) {   
    $this->ObsFreq    = $ObsFreq;        
    $this->ExpProb    = $ExpProb;            
    $this->Alpha      = $Alpha;
    $this->NumCells   = count($this->ObsFreq);        
    $this->DF         = $this->NumCells - 1;        
    $this->Total      = $this->getTotal(); 
    $this->ExpFreq    = $this->getExpFreq();
    $this->ChiSqObt   = $this->getChiSqObt();    
    $this->ChiSqCrit  = $this->getChiSqCrit();    
    $this->ChiSqProb  = $this->getChiSqProb();     
  }

}

?>

Four noteworthy aspects of the constructor method are:

  1. The constructor accepts an array of observed frequencies, an alpha probability cutoff score, and an optional array of expected probabilities.
  2. The first six lines involve relatively simple assignments and computed values that are recorded so a complete result object is available to calling scripts.
  3. The final four lines do the bulk of the work in obtaining the Chi Square statistics you are most interested in.
  4. The class implements only the Chi Square test logic. No output methods are associated with this class.

You can examine the class methods included in the code download for this article to find out more about how each result object value is computed (see Resources).


Handle output issues

The code in Listing 5 shows how easy it is to perform a Chi Square analysis using the ChiSquare1D.php class. It also demonstrates the handling of output issues.

The script invokes a wrapper script called ChiSquare1D_HTML.php. The purpose of this wrapper script is to separate the logic of the Chi Square procedure from its presentational aspects. The _HTML suffix indicates that the output is intended for a standard Web browser or other HTML-rendering device.

Another purpose of the wrapper script is to organize the output in ways that facilitate understanding the data. Towards this end, this class contains two methods for displaying the results of the Chi Square analysis. The showTableSummary method displays the first output table shown following the code (Table 2), while the showChiSquareStats method displays the second output table (Table 3).

Listing 5. Organizing data with a wrapper script
<?php

// beer_poll_analysis.php

require_once "../init.php";

require_once PHP_MATH . "chi/ChiSquare1D_HTML.php";

$Headings = array("Keiths", "Olands", "Schooner", "Other");

$ObsFreq  = array(285, 250, 215, 250);
$Alpha    = 0.05;
$Chi      = new ChiSquare1D_HTML($ObsFreq, $Alpha);

$Chi->showTableSummary($Headings);
echo "<br><br>";
$Chi->showChiSquareStats();

?>

The script generates the following output:

Table 2. Expected frequencies and variances from running the wrapper script
         | Keiths | Olands | Schooner | Other | Totals
Observed | 285    | 250    | 215      | 250   | 1000
Expected | 250    | 250    | 250      | 250   | 1000
Variance | 4.90   | 0.00   | 4.90     | 0.00  | 9.80

Table 3. Various Chi Square statistics from running the wrapper script
Statistic  | DF | Obtained | Prob | Critical
Chi Square | 3  | 9.80     | 0.02 | 7.81

Table 2 displays the expected frequencies and the variance measure for each cell, (O - E)^2 / E. The sum of the variance scores equals the obtained Chi Square value (9.80), which is reported in the lower right cell of the summary table.

Table 3 reports various Chi Square statistics. It includes the degrees of freedom used in the analysis and reports the obtained Chi Square value again. The obtained Chi Square value is re-expressed as a tail probability value -- in this case, 0.02. This means that the probability of observing a Chi Square value as extreme as 9.80 under the null hypothesis is 2 percent (which is quite a low probability).

Most statisticians would not argue if you decided to reject the null hypothesis that the results can be accounted for in terms of random sampling variability from the null distribution. It is more likely that your poll results reflect a real difference in brand preference among the population of Nova Scotia beer drinkers.

Just to confirm this conclusion, you can compare the obtained Chi Square value to the Critical value.

Why is the Critical value important? The Critical value is based upon the significance level (or alpha-cutoff level) set for the analysis. The alpha-cutoff value is conventionally set at 0.05 (and used for the above analysis). This setting is used to find the location (or critical value) on the Chi Square sampling distribution that includes a tail area equal to the alpha-cutoff value (0.05).
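
To make the link between the alpha-cutoff and the Critical value concrete, the sketch below searches for the Chi Square value whose right-tail area equals alpha. It uses bisection over a tail-area helper built from a series expansion; both function names are hypothetical, and this is not necessarily how the getChiSqCrit method in the article's class obtains its value.

```php
<?php
// Compact upper-tail helper (series form of the incomplete gamma function);
// an illustrative stand-in for the routines in Distribution.php.
function upperTail(float $chiSq, int $df): float
{
    if ($chiSq <= 0) {
        return 1.0;
    }
    $a = $df / 2;
    $x = $chiSq / 2;
    $z = ($df % 2 == 0) ? 1.0 : 0.5;
    $gamma = ($df % 2 == 0) ? 1.0 : sqrt(M_PI);
    for (; $z < $a + 1; $z += 1.0) {
        $gamma *= $z;           // builds up Gamma(a + 1)
    }
    $sum = 1.0;
    $term = 1.0;
    for ($n = 1; $n < 1000 && $term > 1e-15 * $sum; $n++) {
        $term *= $x / ($a + $n);
        $sum += $term;
    }
    return 1.0 - exp($a * log($x) - $x) / $gamma * $sum;
}

// Bisection: narrow [lo, hi] until the tail area at the midpoint is alpha.
function criticalValue(float $alpha, int $df): float
{
    $lo = 0.0;
    $hi = 100.0;                // assumed to bracket the answer
    for ($i = 0; $i < 60; $i++) {
        $mid = ($lo + $hi) / 2;
        if (upperTail($mid, $df) > $alpha) {
            $lo = $mid;         // tail still too large: move right
        } else {
            $hi = $mid;         // tail too small: move left
        }
    }
    return ($lo + $hi) / 2;
}

printf("%.2f\n", criticalValue(0.05, 3));   // prints 7.81
```

Bisection works here because the tail area shrinks steadily as the Chi Square value grows, so there is exactly one crossing point for any alpha between 0 and 1.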

In this study, the obtained Chi Square value was larger than the Critical value. This means that the threshold for retaining the null hypothesis explanation was exceeded. The alternative hypothesis -- that a difference in proportions exists in the population -- is statistically more likely to be true.

In the automated analysis of data streams, an alpha-cutoff setting could be used to set an output filter for a knowledge-discovery algorithm (such as Chi Square Automatic Interaction Detection, or CHAID) that does not have the benefit of detailed human guidance in discovering real and useful patterns.
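
A very simple version of such a filter is sketched below. It is not CHAID; it only shows the idea of reporting the count tables whose Chi Square value exceeds the critical value at alpha = 0.05. The stream names and the "browser share" counts are made up, the function name is hypothetical, and the critical values are the standard tabled ones for 1 to 4 degrees of freedom.

```php
<?php
// Sketch of an alpha-cutoff filter over several count tables.
// Tabled critical values for alpha = 0.05, indexed by degrees of freedom.
$critical05 = array(1 => 3.841, 2 => 5.991, 3 => 7.815, 4 => 9.488);

// Chi Square against equal expected frequencies.
function uniformChiSquare(array $observed): float
{
    $expected = array_sum($observed) / count($observed);
    $chiSq = 0.0;
    foreach ($observed as $o) {
        $chiSq += pow($o - $expected, 2) / $expected;
    }
    return $chiSq;
}

// Made-up data streams; only the beer poll counts come from the article.
$streams = array(
    'beer poll'     => array(285, 250, 215, 250),
    'browser share' => array(260, 245, 255, 240),
);

$flagged = array();
foreach ($streams as $name => $counts) {
    $df = count($counts) - 1;
    // Keep only the streams that pass the alpha-cutoff.
    if (uniformChiSquare($counts) > $critical05[$df]) {
        $flagged[] = $name;
    }
}
print_r($flagged);
```

With these inputs only the beer poll is flagged: its Chi Square value of 9.80 exceeds 7.815, while the made-up browser counts give a value of 1.00.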


Repoll

Another interesting application of the one-way Chi Square test is to repoll to see if responses have changed.

Imagine that you were to do another Web poll of Nova Scotia beer drinkers after a period had elapsed. You again ask about their favorite brand of beer and now observe the following:

Table 4. A new beer poll
Keiths       | Olands       | Schooner     | Other
385 (27.50%) | 350 (25.00%) | 315 (22.50%) | 350 (25.00%)

Recall that the past data looked like this:

Table 1. The old beer poll, yet again
Keiths       | Olands       | Schooner     | Other
285 (28.50%) | 250 (25.00%) | 215 (21.50%) | 250 (25.00%)

The obvious difference between the poll outcomes is that the first poll had 1,000 observations and the second one had 1,400 observations. The main effect of these additional observations is a 100-point increase in the frequency count for each response alternative.

When ready to do the analysis of the new poll, you can choose to analyze the data using the default method of computing the expected frequencies or you can initialize the analysis with the expected probability of each outcome based on the proportions observed in the previous poll. In the second case, you load the previously obtained proportions into an expected probability array ($ExpProb) and use them to compute the expected frequency values for each response option.

Listing 6 shows the beer-poll analysis code for detecting changing preferences:

Listing 6. Detecting changing preferences
<?php

// beer_repoll_analysis.php

require_once "../init.php";

require PHP_MATH . "chi/ChiSquare1D_HTML.php";

$Headings = array("Keiths", "Olands", "Schooner", "Other");

$ObsFreq  = array(385, 350, 315, 350);
$Alpha    = 0.05;
$ExpProb  = array(.285, .250, .215, .250);

$Chi = new ChiSquare1D_HTML($ObsFreq, $Alpha, $ExpProb);

$Chi->showTableSummary($Headings);
echo "<br><br>";
$Chi->showChiSquareStats();

?>

Tables 5 and 6 show the HTML output that the beer_repoll_analysis.php script generates:

Table 5. Expected frequencies and variances from running beer_repoll_analysis.php
            Keiths    Olands    Schooner    Other    Totals
Observed    385       350       315         350      1400
Expected    399       350       301         350      1400
Variance    0.49      0.00      0.65        0.00     1.14

Table 6. Various Chi Square statistics from running beer_repoll_analysis.php
Statistic     DF    Obtained    Prob    Critical
Chi Square    3     1.14        0.77    7.81

Table 6 shows that, under the null hypothesis, you have a 77 percent probability of obtaining a Chi Square value of 1.14 or larger (the obtained value is also well below the critical value of 7.81). You therefore cannot reject the null hypothesis that the preferences of Nova Scotia beer drinkers have not changed since your last poll. Any discrepancies between the observed and expected frequencies can be accounted for as expected sampling variability from the same population of Nova Scotia beer drinkers. This null finding should not be a surprise, given that the new poll results were produced simply by adding a constant of 100 to each outcome of the previous poll.
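
To see where the numbers in Tables 5 and 6 come from, the arithmetic can be reproduced by hand. The following is a minimal sketch that does not use or reproduce the ChiSquare1D_HTML class; the function name chiSquareFromProbabilities is hypothetical. It multiplies the total number of observations by each expected probability to get the expected frequencies, computes the per-category term (observed - expected)^2 / expected (the row labeled Variance in Table 5), and sums those terms to get the Chi Square value. The critical value of 7.81 is taken from Table 6 rather than computed.

```php
<?php
$obsFreq  = array(385, 350, 315, 350);          // new poll counts (Table 4)
$expProb  = array(0.285, 0.250, 0.215, 0.250);  // proportions from the old poll (Table 1)
$critical = 7.81;                               // df = 3, alpha = 0.05 (from Table 6)

/**
 * Compute expected frequencies, per-category terms, and the
 * Chi Square statistic from observed counts and expected probabilities.
 */
function chiSquareFromProbabilities(array $obsFreq, array $expProb)
{
    $total      = array_sum($obsFreq);
    $expFreq    = array();
    $components = array();
    foreach ($obsFreq as $i => $observed) {
        $expFreq[$i]    = $total * $expProb[$i];
        $components[$i] = pow($observed - $expFreq[$i], 2) / $expFreq[$i];
    }
    return array(
        'expected'   => $expFreq,
        'components' => $components,
        'chiSquare'  => array_sum($components),
        'df'         => count($obsFreq) - 1,
    );
}

$result = chiSquareFromProbabilities($obsFreq, $expProb);

// Prints "Chi Square = 1.14, df = 3"
printf("Chi Square = %.2f, df = %d\n", $result['chiSquare'], $result['df']);

// The obtained value is below the critical value, so this prints "Cannot reject ..."
echo ($result['chiSquare'] > $critical ? "Reject" : "Cannot reject"),
     " the null hypothesis\n";
```

Only the Keiths and Schooner categories contribute to the statistic (196/399 and 196/301, respectively), because the observed and expected frequencies for Olands and Other are identical.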

You can imagine, however, that the results might have been different and might have suggested that a different brand of beer was becoming more popular (which you could detect by noting the size of the variance reported below each column in Table 5). You can further imagine that such a finding would have significant financial implications for the breweries in question, since bar owners tend to stock the most popular beer in their locality.

These results would be subjected to intense scrutiny by brewery owners who would question the appropriateness of the analytic procedures and experimental methodology; in particular, they would question the representativeness of the samples. If you plan to conduct a Web experiment that may have significant practical implications, you need to pay equal attention to the experimental methodologies you use to collect the data and the analysis techniques you employ to make inferences from your data.

So this article can not only give you a good grounding in the effective analysis of Web data, it can also offer some advice on how to defend your choice of statistical test and lend additional legitimacy to the conclusions you draw from the data.


Apply the knowledge

In this article, you have learned how to apply inferential statistics to the ubiquitous frequency data used to summarize Web data streams, focusing on the analysis of Web poll data. However, the simple one-way Chi Square analysis procedure discussed can be fruitfully applied to other types of data streams (access logs, survey results, customer profiles, customer orders) to turn raw data into actionable knowledge.

I also argued that, when applying inferential statistics to Web data, it is desirable to regard data streams as the outcomes of Web experiments, so that you are more likely to take experimental design considerations into account when making your inferences. Often you cannot make inferences because you do not have adequate controls in your data-collection process. This can change, however, if you become more proactive in applying experimental design tenets to your Web data collection procedures (for example, by randomizing the selection of voters in your Web polls).

Finally, I demonstrated how to simulate the Chi Square sampling distribution for different degrees of freedom, going beyond simply commenting on its derivation. In doing so, I also demonstrated a workaround (simulating the sampling distribution for experiments using a small $NTrials value) to the prohibition against using the Chi Square test when the expected frequency of any measurement category is less than 5 (in other words, a small N experiment). So, instead of using only the df from the study to compute the probability of a sample outcome, for small numbers of trials you might also need to use the $NTrials value as a parameter to evaluate the probability of the observed Chi Square result.

It is worth pondering how you might analyze small N experiments because often you might want to analyze your data before data collection is complete -- when each observation is costly, when observations take a long time to obtain, or simply because you are curious. These two questions are good to keep in mind when attempting this level of Web-data analysis:

  • Are you justified in making inferences under conditions of small N or not?
  • Can simulation help you determine what inferences to draw under these circumstances?
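
One way to explore the second question is to estimate the probability of an observed Chi Square value directly by simulation. The sketch below is not the simulation code from this article's download; the function names (chiSquareStat, simulatePoll, simulatedProb) and the small poll used in the usage lines are hypothetical. It repeatedly draws a poll of $nTrials responses under the null hypothesis, computes the Chi Square value for each simulated poll, and reports the proportion of simulated values at least as large as the obtained one.

```php
<?php
/**
 * Chi Square statistic for observed counts against expected probabilities.
 */
function chiSquareStat(array $obsFreq, array $expProb)
{
    $total = array_sum($obsFreq);
    $chi   = 0.0;
    foreach ($obsFreq as $i => $observed) {
        $expected = $total * $expProb[$i];
        $chi += pow($observed - $expected, 2) / $expected;
    }
    return $chi;
}

/**
 * Draw one simulated poll of $nTrials responses under the null hypothesis.
 */
function simulatePoll(array $expProb, $nTrials)
{
    $counts = array_fill(0, count($expProb), 0);
    $last   = count($expProb) - 1;
    for ($t = 0; $t < $nTrials; $t++) {
        $u          = mt_rand() / (mt_getrandmax() + 1); // uniform in [0, 1)
        $cumulative = 0.0;
        $chosen     = $last; // fallback guards against rounding in the running sum
        foreach ($expProb as $i => $p) {
            $cumulative += $p;
            if ($u < $cumulative) {
                $chosen = $i;
                break;
            }
        }
        $counts[$chosen]++;
    }
    return $counts;
}

/**
 * Estimate P(Chi Square >= $obtained) from $nSamples simulated polls.
 */
function simulatedProb($obtained, array $expProb, $nTrials, $nSamples)
{
    $asExtreme = 0;
    for ($s = 0; $s < $nSamples; $s++) {
        $simulated = chiSquareStat(simulatePoll($expProb, $nTrials), $expProb);
        if ($simulated >= $obtained) {
            $asExtreme++;
        }
    }
    return $asExtreme / $nSamples;
}

// Hypothetical small poll: 12 responses, 4 options, equal expected probabilities.
$expProb  = array(0.25, 0.25, 0.25, 0.25);
$obsFreq  = array(6, 2, 2, 2);
$obtained = chiSquareStat($obsFreq, $expProb);
$prob     = simulatedProb($obtained, $expProb, array_sum($obsFreq), 10000);
printf("Obtained Chi Square = %.2f, simulated Prob = %.3f\n", $obtained, $prob);
```

Because the estimate depends on random draws, it varies slightly from run to run; increasing $nSamples reduces that variation at the cost of a longer running time.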

Download

Description    Name                             Size
Code sample    wa-phpolla/phpmath_003.tar.gz    ---

Resources

  • Download the source code used in this article.
  • Find future enhancements of this code at www.phpmath.com.
  • Read "Simple linear regression with PHP: Part 1," by Paul Meagher. It offers examples of how to develop and apply a library of analytic models, focusing on a Simple Linear Regression algorithm package using PHP as the implementation language (developerWorks, March 2003).
  • Continue with "Simple linear regression with PHP: Part 2," as Paul Meagher shows how to add features to the package (from Part 1) to develop a useful data-analysis tool for small- to medium-sized datasets and includes more information about probability distribution functions (developerWorks, April 2003).
  • Take "Card sorting and cluster analysis" by Thomas Myer, an introductory-level tutorial designed to deliver card-sorting and cluster-analysis user-data-gathering statistical techniques to information architects and usability engineers (developerWorks, January 2001).
  • I would like to thank Dr. Tessema Astatkie for useful discussions on experimental methodology and Chi Square analysis; you'll find a cornucopia of articles on statistical analyses of data here.
  • Visit the R Project for Statistical Computing Web site for more on the open source statistics package R.
  • Explore the IBM Information Aggregation business pattern (or U2D pattern) for tools to capture, access, and manipulate data that is aggregated from multiple sources and tools to personalize data to suit user preferences, distill summary information from large volumes of data, use algorithms to identify trends hidden in data, and answer hypothetical "what-if" questions about potential business scenarios.
  • Review the TLTP/Steps Statistics Glossary 1.1 by Valerie Easton and John McCall. It offers a concise, interactive education in the definition of statistical terms in the following topic areas: presenting data; sampling; probability; confidence intervals; hypothesis testing; paired data, correlation, and regression; design of experiments; and more.
  • Find an introduction to the philosophy of experimentation and the part that statistics play in experimentation in the classic Statistics for Experimenters: An Introduction to Design, Data Analysis, and Model Building, by George Box, Stuart Hunter, and William G. Hunter (Wiley, 1978).
  • Read a classic work in the application of statistical procedures: Handbook of Parametric and Nonparametric Statistical Procedures, 3rd Ed., by David J. Sheskin (CRC Press, 2000).
  • Review the NIST/SEMATECH Engineering Statistics Internet Handbook, an invaluable reference to statistical analysis.
