Apply probability models to Web data using PHP

PHP together with probability distributions build effective probability models

Paul Meagher (paul@datavore.com), CEO, Datavore Productions
Paul Meagher is a freelance Web developer, writer, and data analyst. Paul has a graduate degree in cognitive science and has spent the last six years developing Web applications. His current interests include statistical computing, data mining, content management, and e-learning.

Summary:  To help developers learn to fit the benefits of probability modeling into Web application development, Paul Meagher introduces you to basic concepts, techniques, and PHP-based tools that define the area of probability modeling and probability distributions. He demonstrates how to develop univariate probability models in PHP; discusses how to fit empirical data distributions to a theoretical probability distribution; and showcases an important tool for all this -- the Probability Distributions Library (PDL).

Date:  07 Oct 2003
Level:  Introductory

One open source project that has received considerable attention in the last year is the SpamBayes project, a project that continues to provide one of the best examples of how probability theory can inform the design of applications to solve practical problems. The SpamBayes filtering engine uses machine learning and Bayesian inference techniques to compute the probability that a given piece of e-mail is spam.

This project is also interesting because most developers' exposure to software applications of probability theory comes through math-oriented applications such as statistics programs. The project teaches us that many fruitful hybrid technologies can result from the cross-fertilization of traditional application domains with ideas and techniques from probability theory. To take advantage of such cross-fertilization, it is not necessary to learn advanced aspects of probability theory; some of the most elementary aspects of probability theory could be used today to inform the design of your next application.

In this article, I introduce you to some of the most basic concepts, techniques, and tools that define the area of probability modeling, focusing in particular on the role played by probability distributions in constructing univariate probability models. So that you can use these concepts in practice, I will show you how to develop univariate probability models that are completely implemented in the popular and easy-to-use scripting language PHP. The concepts are universal enough, however, that those who prefer other scripting languages will be able to understand and learn from the implementations as well.

In the first part of the article, I discuss three related concepts needed to construct univariate probability models (probability models based on a single random variable):

  1. What is a random variable?
  2. What is a frequency distribution?
  3. What is a probability distribution?

I then discuss the critical issue of how to fit an empirical data distribution to a theoretical probability distribution, demonstrating how you can use the Chi Square goodness-of-fit test for this purpose.

After discussing concepts and techniques for constructing univariate probability models, I talk about an important software tool you will need to construct your probability models, a Probability Distributions Library (PDL). I demonstrate how to build a PDL in PHP and show how it can be used to model goal scoring in World Cup Soccer.

Finally, I discuss theory and future directions, as well as flag some random variables that Web developers should consider adopting.

Defining a random variable

The frequency distribution of a random variable can be represented graphically with the y-axis displaying the frequency and the x-axis displaying the range of values the random variable can take on. The graph in Figure 1 depicts an actual frequency distribution for the random variable "Male Height in Inches":


Figure 1. Frequency distribution for a random variable

Intuitively speaking, a random variable is simply any variable whose value is determined in some way by chance. Because chance plays a role, each value that a random variable can take on will occur more or less frequently. A frequency histogram (such as in Figure 1) is a useful tool for understanding how frequently different values of a random variable occur.

When developing a probability model for a random variable, it is often more useful to express the expected frequency of different outcomes in terms of probabilities that vary between 0 and 1 instead of using raw frequency counts. You can derive these y-axis probabilities by computing the number of observations that fall within a given interval and then dividing that number by the total number of observations. If you do this for each interval, you will get the probability distribution for your random variable (as shown in Figure 2).

Note that the male height probability distribution in Figure 2 is identical in shape to the frequency distribution. The only difference is that the y-axis now measures probability instead of raw frequency.
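The conversion from frequency counts to probabilities described above can be sketched in a few lines of PHP. This is only an illustration: the height values are made-up sample data, the function name observedProbabilities is hypothetical, and each distinct integer value is treated as its own one-inch interval.

```php
<?php
// Hypothetical sample of male heights in inches (made-up data for illustration).
$heights = array(68, 70, 70, 71, 69, 70, 72, 68, 70, 73);

/**
 * Convert raw observations into an observed probability distribution.
 * Returns an array mapping each distinct value to its relative frequency.
 */
function observedProbabilities(array $observations) {
  $total  = count($observations);
  $counts = array_count_values($observations); // value => frequency
  ksort($counts);                              // order by value for plotting
  $probs  = array();
  foreach ($counts as $value => $count) {
    $probs[$value] = $count / $total;          // frequency / total observations
  }
  return $probs;
}

$probs = observedProbabilities($heights);
foreach ($probs as $height => $p) {
  printf("%d inches: %.2f\n", $height, $p);
}
```

Because every count is divided by the same total, the resulting probabilities sum to 1 and the shape of the distribution is unchanged.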


Figure 2. Probability distribution for a random variable

Once you have clearly defined the random variable you are interested in, the next step is to measure the values your random variable generates. Ultimately you want to use this empirical information to construct the observed probability distribution for your random variable.

The graph of the observed probability distribution may immediately suggest a theoretical probability distribution (such as a normal distribution). You can use a theoretical distribution, in lieu of the observed probability distribution, to derive inferences about the probability of observing various types of outcomes of concern.

In other words, after you clearly define your random variable (for instance, customer orders per week), then gather measurements of it (through experiments, questionnaires, sales logs, access logs, data mining), you can proceed to the model-fitting stage to:

  • Plot your observed probability distribution
  • Find a theoretical probability distribution to use in place of the observed probability distribution

Model fitting

In the model-fitting stage, your goal is to replace the observed probability distribution with a better understood theoretical probability distribution. This substitution enables you to more easily make probability statements about your random variable (answering questions such as "What is the probability of meeting someone seven feet tall?"; "What is the probability of getting 10 orders this week?"; or "What is the probability of getting a visitor to the Web site in the next 10 minutes?").

When you look at the observed probability distribution for male height, it appears to have a symmetrical bell shape that is reminiscent of the plot for a normally distributed random variable. This observation suggests that you should do your model fitting by comparing the observed distribution of heights to the distribution of heights predicted using a normal probability distribution.

If you can establish that the difference between the observed and predicted height distributions is small, then you can use the normal distribution to assign probability values for various statements about male height (or the probability that a hard drive will fail within a certain period of time, or the probability of having X number of motor vehicle accidents this week, and so on).

Before you can compare the observed distribution with the distribution predicted using the normal distribution, you must first compute the mean and standard deviation of the observed distribution. This is because the normal probability distribution function which generates the best-fitting normal distribution accepts a mean and standard deviation as adjustment parameters.

In the normal distribution function in Figure 3, you can see that the mean (mu, μ) and the standard deviation (sigma, σ) appear as fixed parameters in the formula:


Figure 3. Formula for normal distribution function

The formula returns the probability density associated with each height value.
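For reference, the normal density is conventionally written in the following textbook form, with x as the variable argument and μ and σ as the fixed parameters (the notation used in Figure 3 may differ slightly):

```latex
f(x \mid \mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}}
  \exp\!\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)
```

Note that the exponent divides by twice the variance, σ squared, not by twice the standard deviation.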

The mean and standard deviation parameters are used to tweak the location and shape of the normal distribution. The observed estimates are used as the most likely candidates for the mean and standard deviation of the best-fitting normal distribution curve. Graphically speaking, Figure 4 demonstrates that supplying the parameters should cause the theoretical normal distribution (red line) to overlay the observed distribution so that goodness of fit can be visually assessed.


Figure 4. Using deviation parameters to visually assess fit

The red line is the plot of the probability densities for each height value between 64 and 78 using a mean of 70.31 and a standard deviation of 2.61 as the formula parameters.

Use the following PHP script to generate the normal density values as depicted by the red line. You might also find it instructive to compare the textbook formula (Figure 3) with the PHP implementation of this formula. (Note in particular that the fixed parameters, the mean and standard deviation, can be represented as instance variables which are set via a constructor, while the x values are supplied as the variable argument to the normal probability density function, or PDF.)


Listing 1. PHP script for generating normal density values
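<?php

class Normal { 
  
 var $mean;
 var $stdev;
    
 // Constructor: store the fixed parameters of the distribution.
 // (PHP 4 named the constructor after the class; PHP 5 and later
 // use __construct.)
 function __construct($mu, $sigma) {
  $this->mean  = $mu;
  $this->stdev = $sigma;
 }
  
 // Probability density at $x for the stored mean and standard deviation.
 function PDF($x) {
  $denominator = sqrt(2 * pi()) * $this->stdev;        
  // The exponent divides by twice the variance (stdev squared).
  $numerator   = exp(-($x - $this->mean) * ($x - $this->mean)
        / (2 * $this->stdev * $this->stdev));
  $density     = $numerator / $denominator;
  return $density;
 }

}

$norm = new Normal(70.31, 2.61);

echo $norm->PDF(70);

// answer is approximately 0.15

?>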

To assess how good the fit is between the observed probability distribution and a normal distribution, you could generate expected frequencies for each height interval based on what would be expected from a normally distributed random variable with a mean of 70.31 and a standard deviation of 2.61. You then could compare the difference between the observed and expected frequencies for each height interval.

If you squared each such difference, divided it by the corresponding expected frequency, and summed the results, you could use the size of the resulting value to indicate whether a normal distribution is a good fit for the data. This value is known as the obtained Chi Square value. The Chi Square value can be used to analyze Web polls, stats, and other data streams, as well as to assess goodness of fit between obtained and theoretical distributions. (See Resources for an article on the use of the Chi Square test.)
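In symbols, the Pearson Chi Square statistic is conventionally defined as follows, where O_i and E_i are the observed and expected frequencies for interval i and k is the number of intervals (the symbol names are my own choice of notation):

```latex
\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}
```

Each squared difference is scaled by its own expected frequency, so a discrepancy of 10 counts matters more in an interval where 20 observations were expected than in one where 200 were expected.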

The Chi Square test is not the only test you can use to establish goodness-of-fit between theoretical and observed distributions. In the case of male height, a better test to apply is the Kolmogorov-Smirnov test or the Anderson-Darling test. The Kolmogorov-Smirnov test is designed to test the hypothesis that a given data set could have been drawn from a given distribution. Unlike the Chi Square test, it is primarily intended for use with continuous distributions and is independent of arbitrary computational choices such as bin width. The Anderson-Darling test (a modification of Kolmogorov-Smirnov) is used to test if a sample of data came from a population with a specific distribution. It gives more weight to the tails than does Kolmogorov-Smirnov. Kolmogorov-Smirnov is distribution-free in the sense that the critical values do not depend on the specific distribution being tested. Anderson-Darling makes use of the specific distribution in calculating critical values, potentially affording the advantage of allowing a more sensitive test and suffering the disadvantage that critical values must be calculated for each distribution.

My reason for mentioning other common goodness-of-fit tests is that you want to use the best test for the job. The Chi Square test has the virtue that it can be used to assess model fit for most distributions, although it may be less sensitive than other goodness-of-fit tests for particular distributions.

Next, look at how to perform goodness-of-fit testing using the Chi Square test. I'll apply it to an example for which it is arguably more appropriate than male height (where the Anderson-Darling normality test might be better) -- determining whether the numbers generated by PHP's mt_rand function can be fit to a uniform distribution.


Is mt_rand() really random?

To obtain a pseudo-random number using PHP's random-number generator, call the mt_rand() function. It returns a value between 0 and a system-defined upper limit, which you can inspect by calling the mt_getrandmax() function. You can also pass a minimum and maximum, as in mt_rand(0, 9), to restrict the range of returned values.

The mt_rand() function uses the Mersenne Twister algorithm and is four times faster and better characterized than PHP's older rand() function.
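As a quick illustration of the calls just mentioned, the following minimal sketch uses only the built-in mt_rand() and mt_getrandmax() functions; the variable names are arbitrary.

```php
<?php
// Largest value mt_rand() can return when called without arguments.
$max = mt_getrandmax();

// A pseudo-random integer between 0 and $max inclusive.
$raw = mt_rand();

// A pseudo-random integer restricted to the range 0..9 inclusive.
$digit = mt_rand(0, 9);

echo "max = $max, raw = $raw, digit = $digit\n";
```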

Before you use PHP's mt_rand() in your probability models, you might want to convince yourself that the mt_rand() function works correctly. How could you do this?

Most developers are content to write a script, get it to generate a few random values, and then accept that it is working correctly if they don't notice any obvious biases in the numbers that are appearing. This eyeball analysis might convince you, but it won't, as they say, convince the lawyers.

One approach to find more convincing evidence is to precisely define what it means for a sequence of numbers to be random. A random sequence of numbers should have many properties, but one of the most important properties is that each number in the range of possible values should have an equal likelihood of appearing at each point in the sequence.

A way to measure whether this is true is by counting the number of times each value occurs and graphing the frequency counts for each value. The resulting graph should approximate a uniform distribution of counts for each value in your range. If you limit the range of allowable sequence numbers from 0 to 9 and generate a sequence of 1,000 numbers, then the graph should approximate the discrete uniform distribution depicted in Figure 5.


Figure 5. Uniform distribution for truly random numbers

To test whether PHP's mt_rand() function generates a uniform distribution of random values, I've created a script that uses the Chi Square test to determine this. The first half of the script is primarily concerned with creating a frequency distribution from the output of mt_rand(). The second half performs the Chi Square test.

The test involves setting the alpha cutoff used to compute a critical Chi Square value. If the obtained Chi Square value exceeds the critical Chi Square value, then you reject the null hypothesis that the mt_rand() values come from a uniform distribution. If mt_rand() is working as it should, you would expect not to reject the null hypothesis.


Listing 2. PHP Chi Square script to determine the accuracy of mt_rand()
<?php 

/**
* @package PHPMath_ChiSquare
*/

require_once "PHPMath/ChiSquare/ChiSquare1D_HTML.php"; 

/**
* Script tests whether mt_rand function generates a sequence of
* values that can be fit to a discrete uniform distribution 
* @version 1.0
* @author Paul Meagher
*/

// Set range of random values that you want to generate
$Low  = 0;     
$High = 9;     

// Set number of random values you want to generate 
$Iterations = 1000;   

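// Compute probability of each range value
$NumVals = count(range($Low, $High)); 
$Prob    = 1 / $NumVals; 

// Zero the frequency distribution array (keys $Low..$High), so that
// incrementing a count never touches an undefined index
$FreqDist = array_fill($Low, $NumVals, 0); 

// Tally how often each value is generated
for($i = 0; $i < $Iterations; $i++) { 
  $RandValue = mt_rand($Low, $High); 
  $FreqDist[$RandValue]++; 
} 

// Build zero-based arrays of observed counts and expected probabilities
$ObsFreq = array();
$ExpProb = array();
for($i=0; $i < $NumVals; $i++) { 
  $ObsFreq[$i] = $FreqDist[$i + $Low]; 
  $ExpProb[$i] = $Prob;   
}   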

$Alpha    = 0.05; 
$Chi      = new ChiSquare1D_HTML($ObsFreq, $Alpha, $ExpProb); 
$Headings = range($Low, $High); 
echo "<p>". $Chi->showTableSummary($Headings) ."</p>";
echo "<p>". $Chi->showChiSquareStats() ."</p>";

?> 

The following table shows a sample output from this script. As the obtained Chi Square value of 7.90 is less than the critical value of 16.92, you cannot reject the null hypothesis that your observed frequencies were sampled from a uniform distribution.


Table 1. Output from PHP Chi Square script
              0     1     2     3     4     5     6     7     8     9   Totals
Observed     91   115    90   104   101    95   105   113    88    98     1000
Expected    100   100   100   100   100   100   100   100   100   100     1000
Variance   0.81  2.25  1.00  0.16  0.01  0.25  0.25  1.69  1.44  0.04     7.90

Statistic     DF   Obtained   Prob   Critical
Chi Square     9       7.90   0.54      16.92
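The obtained value in Table 1 can be reproduced by hand. The sketch below is independent of the PHPMath_ChiSquare package: chiSquareStatistic is a hypothetical helper name, and the observed counts are copied from the table.

```php
<?php
/**
 * Pearson Chi Square statistic: the sum over categories of
 * (observed - expected)^2 / expected.
 */
function chiSquareStatistic(array $observed, array $expected) {
  $chi = 0.0;
  foreach ($observed as $i => $obs) {
    $diff = $obs - $expected[$i];
    $chi += ($diff * $diff) / $expected[$i];
  }
  return $chi;
}

// Observed counts copied from Table 1; 1,000 draws over 10 values
// gives an expected count of 100 per value.
$observed = array(91, 115, 90, 104, 101, 95, 105, 113, 88, 98);
$expected = array_fill(0, 10, 100);

printf("%.2f\n", chiSquareStatistic($observed, $expected)); // prints 7.90
```

Each term of the sum corresponds to one entry in the Variance row of the table (for example, (91 - 100)^2 / 100 = 0.81).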

It can be instructive to run this script a number of times and observe that on some occasions you reject the null hypothesis. Why do you think this occurs? How often can this occur before you need to reject the null hypothesis? And is there a tool to help make these determinations?
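One way to explore these questions is by simulation. With an alpha of 0.05, a correctly working generator should still be rejected in roughly 5 percent of runs, because that is what the alpha cutoff means. The following sketch, in which rejectionRate is a hypothetical function name and 16.92 is the critical value taken from Table 1, repeats the test many times and reports how often it rejects.

```php
<?php
/**
 * Estimate how often the Chi Square test rejects uniformity for mt_rand().
 * $critical is the critical value for the chosen alpha and degrees of freedom.
 */
function rejectionRate($runs, $draws, $low, $high, $critical) {
  $numVals  = $high - $low + 1;
  $expected = $draws / $numVals;   // expected count per value
  $rejected = 0;
  for ($r = 0; $r < $runs; $r++) {
    // Build a zeroed frequency distribution, then tally the draws.
    $freq = array_fill($low, $numVals, 0);
    for ($i = 0; $i < $draws; $i++) {
      $freq[mt_rand($low, $high)]++;
    }
    // Obtained Chi Square value for this run.
    $chi = 0.0;
    foreach ($freq as $count) {
      $chi += ($count - $expected) * ($count - $expected) / $expected;
    }
    if ($chi > $critical) {
      $rejected++;
    }
  }
  return $rejected / $runs;
}

// 16.92 is the critical value reported in Table 1 (alpha = 0.05, DF = 9).
$rate = rejectionRate(500, 1000, 0, 9, 16.92);
printf("Rejected in %.1f%% of runs\n", 100 * $rate);
```

A rejection rate far above the alpha level across many runs would be better evidence of a problem than any single rejection.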


Designing a PDL

I've introduced some important concepts such as frequency distribution, probability distribution, and goodness-of-fit testing. Now, I want to talk about an important tool you'll need to have at your disposal for ongoing probability-based modeling.

The Probability Distributions Library (PDL) can best be explained by a simple exercise -- constructing a feature list for a PDL. I'll begin building a PHP-based PDL, then provide some simple examples of how it can be used to construct probability models.

Before I built my own PHP-based PDL, I studied the feature set and code base of several existing PDLs. The two PDLs that influenced my own approach belong to the R package and the JSci packages. Let's discuss their respective strengths and highlight the functional and source code features that I felt were important to incorporate into my own PHP-based PDL.

R probability distributions

R is part of the open source R Project and is a high-level interactive environment for performing statistical work. The API for using the probability distributions component is optimized for interactive statistical work in the sense that all commands are short (the average command length is six characters) and the naming conventions for accessing particular probability distribution functions are regular (distribution names are prefixed by d, p, q, or r to indicate what type of distribution function is being requested).

For example, to invoke the distribution functions for the Poisson distribution (which models some discrete random variables, such as a count of the number of events that occur in a certain time interval or spatial area), you would use these four commands:

dpois(x, lambda, log = FALSE)

ppois(q, lambda, lower.tail = TRUE, log.p = FALSE)

qpois(p, lambda, lower.tail = TRUE, log.p = FALSE)

rpois(n, lambda)

Similarly, to invoke the distribution functions for the Exponential distribution (a relatively simple, commonly used distribution used to model the behavior of units that have a constant failure rate; more on this later in the article), you would use these four commands:

dexp(x, rate = 1, log = FALSE)

pexp(q, rate = 1, lower.tail = TRUE, log.p = FALSE)

qexp(p, rate = 1, lower.tail = TRUE, log.p = FALSE)

rexp(n, rate = 1)

As you can see, not much typing is involved (a useful feature for interactive computing), so once you understand what the d, p, q, and r prefixes mean (which is not initially obvious), you can easily infer how to access corresponding distribution functions for other probability distributions.

The d prefix stands for Density Function and signals to R that you want the probability value associated with a particular x value -- for instance, Prob[X = x]. In other words, given a contiguous range of x values, the density function will give you a corresponding range of probability values that could be used to graph the shape of the probability distribution for the supplied range of x values. Most textbook authors refer to this distribution function as the Probability Density Function or PDF.

The p prefix stands for Probability Function and signals to R that you want the probability that your random variable is less than or equal to some x value -- for example, Prob[X <= x]. This function is the one that users often care about the most because it is used in statistical tests to evaluate the probability of some observed outcome. Most textbook authors refer to this distribution function as the Cumulative Distribution Function or CDF.

The q prefix stands for Quantile Function and signals to R that you want the inverse of the Cumulative Distribution Function. In other words, given a probability value such as 0.05, it finds the x value such that Prob[X <= x] = 0.05 (with the default lower.tail = TRUE). It is commonly used to find a "critical value" for your study outcome such that you will reject the null hypothesis if you observe a result greater than your critical value; for such an upper-tail cutoff, you would supply 1 minus your alpha level or set lower.tail = FALSE. Instead of calling this distribution function the Quantile Function as R does, I prefer to call it the Inverse Cumulative Distribution Function or Inverse CDF.

The r prefix stands for Random Number Generating Function and signals to R that you want it to generate a number or numbers distributed according to the specified distribution. It is very useful in simulation work.

A final noteworthy aspect of the R distribution functions is that they are vector oriented. If you supply more than one value in an argument slot, it returns more than one value. For example, if you want the critical values from an exponential distribution corresponding to probabilities of 0.1, 0.05, and 0.01, you can simply type this:

qexp(cbind(0.1,0.05,.01))

The command returns this list of critical values:

0.1053605   0.05129329   0.01005034

The vector orientation of the R distribution functions makes them convenient for both interactive and non-interactive use.
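A similar vector-oriented interface can be approximated in PHP by accepting either a single value or an array. The sketch below is an assumption about how one might do this, not the PDL's actual API: exponentialInverseCDF is a hypothetical function that inverts the exponential CDF, 1 - exp(-rate * x), and reproduces the qexp values shown above.

```php
<?php
/**
 * Inverse CDF (quantile function) of the exponential distribution:
 * solves 1 - exp(-rate * x) = p for x. Accepts a single probability
 * or an array of probabilities, mimicking R's vector orientation.
 */
function exponentialInverseCDF($p, $rate = 1.0) {
  if (is_array($p)) {
    $out = array();
    foreach ($p as $key => $value) {
      $out[$key] = exponentialInverseCDF($value, $rate);
    }
    return $out;
  }
  return -log(1.0 - $p) / $rate;
}

// Same probabilities as the qexp() call above, with the default rate of 1.
print_r(exponentialInverseCDF(array(0.1, 0.05, 0.01)));
```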

JSci probability distributions

JSci is a SourceForge project that espouses the following mission:

JSci is a set of open source Java packages. The aim is to encapsulate scientific methods/principles in the most natural way possible. As such, they should greatly aid the development of scientific-based software.

The JSci package is similar to the R package in that it implements a uniform set of distribution-related functions for all implemented probability distributions. Also like R, it offers a regular interface to these distribution-related functions for all implemented statistical distributions.

The JSci package does not, however, implement as many probability distributions as R; it does not include the random number generating functions for each distribution; and it is not vector oriented. From the point of view of coverage and functionality, the JSci package is not yet as extensive or powerful as the R statistical distributions library.

At the level of source code though, the JSci package is a well-crafted object-oriented library of probability distributions. It is at the source-code level that the JSci package shines and its architecture heavily influenced my own approach to coding a PHP-based library of probability distributions. Essentially, my design objective was to implement much of the same functionality as R's PDL, but to code it in a style more like the JSci approach.

The notable JSci source-code features I sought to emulate were the following:

  • All probability distribution classes reside in the same directory.
  • The probability distribution functions for a particular type of probability distribution reside in a single class file (for example, PoissonDistribution.java).
  • All probability distributions extend an abstract ProbabilityDistribution.java class. The ProbabilityDistribution.java class defines a set of methods that all specific probability distribution types are expected to implement. The architecture is simple and provides a straightforward framework for extending the library to new probability distributions.
  • The abstract ProbabilityDistribution.java class also contains other helper methods that all specific distribution classes can use. The purity of the object is diluted a bit by adding these helper methods here -- as more methods non-specific to the Probability Distributions object are added a separate class would likely be needed.
  • The constructor method for each probability distribution allows you to set the slowly changing parameters for your probability distribution and use these instantiated parameters in subsequent calls to the class methods. You do not, for example, need to continue supplying the mean and standard deviation parameters once you have instantiated your distribution with these parameters. With R you must keep supplying these parameters in your function calls. This difference boils down to the fact that JSci is a more object-oriented implementation of a PDL while R's PDL is more function oriented (but implements a consistent method interface that all probability distribution functions are expected to adhere to).
  • JSci has a more verbose, Java-like API for its PDL that is more functionally descriptive than the R API (what does runif() mean?), but it is not as well suited to interactive usage. In devising my own method name labels, I sought a functionally descriptive API that was as abbreviated as possible.

A final consideration that inspired my decision to base the source-code architecture more on JSci than R is that the JSci package is released under the LGPL license, a license which is more PHP-friendly than the GPL license under which R is usually released.


Probability distribution superclass

If you examine the JSci Probability Distribution superclass (ProbabilityDistribution.java), you will notice obvious similarities to the PHP version of this class shown in Listing 3. One major difference, however, is that the PHP version is designed so that it can co-exist nicely with other PEAR classes.

PEAR is short for PHP Extension and Application Repository and is the official structured library of open source code for PHP users. The PEAR Group also advocates a standard style for code written in PHP. The recommended PEAR coding style and good Java programming style have many similarities. These similarities mean that it is relatively easy to turn good Java code into PEAR-conformant code.

Three main issues arise and can cause some difficulties in porting code from Java to PHP4:

  • Lack of native support for namespaces in PHP means that my class names are longer than you might see in Java (such as PHPMath_ProbabilityDistribution_General).
  • Lack of native support for overloaded (polymorphic) constructors in PHP means that rather than declaring several constructors with different argument lists, you try to achieve the same effect by setting argument defaults and branching on whether an argument was supplied and what type it is. Often I cannot achieve the same effect in PHP (except through ugly workarounds), so I might just implement the most commonly used constructor.

    Also, in PHP you do not need to statically define the type of your function arguments. When calling any PHP function for any given argument slot, you can pass in a single value or an array as your argument. Within such PHP functions, you can add logic that detects the argument type and uses this information to carry out different operations. In other words, you can use PHP's type-indifferent argument-passing protocol and type-detection code to achieve constructor polymorphism.

  • Lack of native support for advanced exception handling in PHP (which will change with PHP5) means that you have to rely upon the forward-compatible PEAR.php error handling class to flag and deal with errors.

While support for these OO features in PHP would be desirable, they are arguably not necessary; workarounds can achieve similar effects. The tradeoff is that PHP is easier to learn and to use productively for solving Web scripting problems.

The probability distribution superclass (Listing 3) defines the methods that need to be implemented by all probability distribution classes. It also defines methods and constants that can be used by child classes.
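The constructor workaround described above, argument defaults combined with type detection, might look something like the following. The class Example_Normal is hypothetical and exists only for illustration. The sketch also assumes the __construct syntax of PHP 5 and later; in PHP 4 the constructor would be a method named after the class.

```php
<?php
/**
 * Hypothetical class illustrating one constructor that behaves
 * differently depending on the arguments it receives.
 */
class Example_Normal {
  public $mean;
  public $stdev;

  // Accepts (mean, stdev), an array('mean' => ..., 'stdev' => ...),
  // or no arguments at all (standard normal defaults).
  function __construct($meanOrParams = 0.0, $stdev = 1.0) {
    if (is_array($meanOrParams)) {
      // Array form: parameters supplied by name.
      $this->mean  = $meanOrParams['mean'];
      $this->stdev = $meanOrParams['stdev'];
    } else {
      // Scalar form: parameters supplied by position (or defaulted).
      $this->mean  = $meanOrParams;
      $this->stdev = $stdev;
    }
  }
}

$a = new Example_Normal();                       // defaults: 0 and 1
$b = new Example_Normal(70.31, 2.61);            // positional arguments
$c = new Example_Normal(array('mean' => 70.31, 'stdev' => 2.61));
```

The three calls play the role of three overloaded constructors in Java, at the cost of some branching logic inside the single PHP constructor.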


Listing 3. Probability distribution superclass, PHPMath_ProbabilityDistribution_General
<?php

/**
* @package PHPMath_ProbabilityDistribution
*/

define("PHPMATH_MAX_FLOAT", 3.40282346638528860e+305);

include_once 'PEAR.php';

/**
* The PHPMath_ProbabilityDistribution_General superclass 
* provides an object for encapsulating probability distributions.
* @version 1.0
* @author Jaco van Kooten
* @author Paul Meagher
* @author Jesus Castagnetto
*/

class PHPMath_ProbabilityDistribution_General {

  /**
  * Constructs a probability distribution.
  */
  function PHPMath_ProbabilityDistribution_General() {}

  /**
  * Probability density function.
  * @return the probability that a stochastic variable x 
  * has the value X, i.e. P(x=X).
  */
  function PDF($X) {}

  /**
  * Cumulative distribution function.
  * @return the probability that a stochastic 
  * variable x is less than X, i.e. P(x<X).
  */
  function CDF($X) {}

  /**
  * Inverse of the cumulative distribution function.
  * @return the value X for which P(x<X) equals the given probability.
  */        
  function InverseCDF($probability) {}

  /**
  * Random number generator.
  * @return $num_vals random values drawn from the distribution.
  */        
  function RNG($num_vals) {}

  /**
  * Check if the range of the argument of the distribution 
  * method is between <code>lo</code> and <code>hi</code>.
  * @exception OutOfRangeException If the argument is out of range.
  */
  function checkRange($x, $lo=0.0, $hi=1.0) {
    if (($x < $lo) || ($x > $hi)) {
      return PEAR::raiseError("The argument of the distribution method 
           should be between $lo and $hi.");
    }
  }

  /**
  * Get the factorial of the argument
  * @return factorial of n.
  */ 
  function getFactorial($n) {
    return $n <= 1 ? 1 : $n * $this->getFactorial($n-1);
  }
    
  /**
  * This method approximates the value of X for which P(x<X)=<I>prob</I>.
  * It applies a combination of a Newton-Raphson procedure and bisection 
  * method with the value <I>guess</I> as a starting point. Furthermore, 
  * to ensure convergence and stability, one should supply an interval 
  * [<I>xLo</I>,<I>xHi</I>] in which the probability distribution reaches 
  * the value <I>prob</I>. The method does no checking, it will produce
  * bad results if wrong values for the parameters are supplied - use it 
  * with care.
  */    
  function findRoot($prob, $guess, $xLo, $xHi) {                    
    $accuracy     = 1.0e-10;
    $maxIteration = 150;
    $x     = $guess;
    $xNew  = $guess;
    $error = 0.0;
    $pdf   = 0.0; 
    $dx    = 1000.0;
    $i     = 0;    
    while ( (abs($dx) > $accuracy) && ($i++ < $maxIteration) ) {
      // Apply Newton-Raphson step
      $error = $this->CDF($x) - $prob;
      if($error < 0.0) {
        $xLo = $x;
      } else {
        $xHi = $x;
      }
      $pdf = $this->PDF($x);
      // Avoid division by zero      
      if ($pdf != 0.0) { 
        $dx   = $error / $pdf;
        $xNew = $x - $dx;
      }

      // If the NR fails to converge (which for example may be the
      // case if the initial guess is too rough) we apply a bisection
      // step to determine a narrower interval around the root.
      if ( ($xNew < $xLo) || ($xNew > $xHi) || ($pdf==0.0) ) {
        $xNew = ($xLo + $xHi) / 2.0;
        $dx   = $xNew - $x;
      }
      $x = $xNew;
    }
    return $x;
  }  
    
}
?>

It should be noted that I modified the JSci API to use common textbook abbreviations for the core distribution functions (PDF(), CDF(), InverseCDF(), and RNG()). I also enforce the idea that every distribution class should implement a random number generating (RNG) method. RNG methods are generally harder to implement than the other methods, which may be one reason they were not included in the initial implementation of the probability distribution superclass.

I have also provisionally added a PHPMATH_MAX_FLOAT constant and added a getFactorial utility method. A more mature PHP Math library might include these generally useful constants and methods in separate files so that they could be included in a wider range of math classes.
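The findRoot() method is easier to follow when it is exercised on a distribution whose inverse CDF is known in closed form. The following is a minimal standalone sketch of the same Newton-Raphson plus bisection scheme; the function name find_root() and its closure-based signature are illustrative assumptions, not part of the package. It inverts the CDF of an exponential distribution with a rate of 1 and compares the result with the analytic answer.

```php
<?php
// Standalone sketch of the Newton-Raphson plus bisection search described in
// the findRoot() docblock. The function name and the closure-based signature
// are illustrative assumptions, not part of the PHPMath package.
function find_root(callable $cdf, callable $pdf, $prob, $guess, $xLo, $xHi)
{
    $accuracy     = 1.0e-10;
    $maxIteration = 150;
    $x  = $guess;
    $dx = 1000.0;
    for ($i = 0; abs($dx) > $accuracy && $i < $maxIteration; $i++) {
        // The sign of the error shows on which side of the root $x lies,
        // so the bracketing interval can be tightened on every pass.
        $error = $cdf($x) - $prob;
        if ($error < 0.0) {
            $xLo = $x;
        } else {
            $xHi = $x;
        }
        // Newton-Raphson step: the PDF is the derivative of the CDF.
        $density = $pdf($x);
        $xNew = $x;
        if ($density != 0.0) {
            $dx   = $error / $density;
            $xNew = $x - $dx;
        }
        // Fall back to bisection when the Newton step leaves the bracket.
        if ($xNew < $xLo || $xNew > $xHi || $density == 0.0) {
            $xNew = ($xLo + $xHi) / 2.0;
            $dx   = $xNew - $x;
        }
        $x = $xNew;
    }
    return $x;
}

// Exponential distribution with rate 1, used here only as a test case.
$rate = 1.0;
$cdf = function ($x) use ($rate) { return 1.0 - exp(-$rate * $x); };
$pdf = function ($x) use ($rate) { return $rate * exp(-$rate * $x); };

$numeric = find_root($cdf, $pdf, 0.95, 1.0, 0.0, 50.0);
$closed  = -log(1.0 - 0.95) / $rate;   // analytic inverse CDF
printf("numeric = %.8f, closed form = %.8f\n", $numeric, $closed);
```

A numerical search like this is mainly useful for distributions that have no closed-form inverse; for the exponential distribution the analytic expression is both faster and more accurate.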


And now, Exponential distribution

When probability modeling, you will not be using the probability distributions superclass directly. Instead, you will be interacting with specific subclasses of it, one for each common probability distribution for which the analytic work has been done to figure out how to implement the PDF, CDF, InverseCDF, and RNG methods.

R, for example, includes such methods for all of these univariate distributions:

  • Beta
  • Binomial
  • Cauchy
  • Chi Square
  • Exponential
  • F
  • Gamma
  • Geometric
  • Hypergeometric
  • Log-Normal
  • Logistic
  • Negative Binomial
  • Poisson
  • Student's t
  • Uniform
  • Weibull
  • Wilcoxon

As you can see, when it comes to fitting the probability distribution for your random variable to a specific theoretical probability distribution, you can choose from many such distributions. For you to become proficient in basic probability modeling, you need to become familiar with:

  1. The visual representation of such distributions
  2. The shape and location adjustment parameters that most probability distributions accept
  3. Which adjustment parameters are used for model fitting

Statistics textbooks and Web sites can be consulted for these details. See Resources.

I want to focus on one particular probability distribution called the exponential distribution. What you learn from this example can be used to understand how to use other probability distributions to construct univariate probability models.

The exponential distribution in particular possesses these four advantages:

  • The implementation of the distribution function methods can be easily found in textbooks or online so that you can see where they came from. Many math books focus on using positive and negative exponential functions to model change in some variable (for example, balance) as a function of time (for example, term interest rate).
  • The exponential distribution can be used to construct basic probability models (such as assigning probabilities to various statements about a random variable) for the behavior of a large number of phenomena (that is, random variables of interest). It is commonly used to model the distribution of waiting times -- either the time elapsed until some event occurs or the interval between each occurrence of an event. I will illustrate such an application in the next section.
  • Textbooks on applied probability modeling often focus on the Exponential and Poisson distributions in particular because together they have demonstrated general utility and can be used to construct more elaborate probability models. In other words, not all probability distributions are equivalent in terms of their usefulness in modeling a variety of real-world phenomena. The Exponential and Poisson distributions in particular are extremely useful for applied, real-world probability modeling. The Uniform, Normal and Binomial probability distributions are three other distributions you should make an effort to study in detail.
  • When the observed distribution of your random variable is exponentially distributed, understanding the likelihood of different outcomes is much harder to do by intuition, so it helps in these cases especially to have a formal probability model for the random variable.

Start by looking at some examples of the Exponential distribution. The shape and location aspects of the Exponential distribution are determined by a single parameter called lambda (λ, also known as the rate or decay parameter). Figure 6 plots the probability distributions for four different values of lambda.


Figure 6. Four lambda probability distributions

Listing 4 is the code that implements the probability distribution methods for an Exponential distribution. Notice that the PDF method returns probability values for given values of X; this information was used to construct the lines in Figure 6. Also notice that the core distribution methods (PDF(), CDF(), InverseCDF(), and RNG()) return a single answer or a vector of answers depending on how the method is called. This feature will be demonstrated later in a script that tests the methods of this class.
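For reference, these are the standard textbook expressions for an exponential distribution with rate λ, written in the notation the listing follows (x is the value of the random variable and p is a probability):

```latex
\begin{align*}
f(x;\lambda)      &= \lambda e^{-\lambda x}, & x &\ge 0 \\
F(x;\lambda)      &= 1 - e^{-\lambda x},     & x &\ge 0 \\
F^{-1}(p;\lambda) &= -\frac{\ln(1-p)}{\lambda}, & 0 &\le p < 1 \\
\mathrm{E}[X]     &= \frac{1}{\lambda}, & \mathrm{Var}[X] &= \frac{1}{\lambda^{2}}
\end{align*}
```

Because 1 - U is uniformly distributed whenever U is, a random-number generator can evaluate -ln(U)/λ on a uniform draw U instead of -ln(1 - U)/λ.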


Listing 4. Implementing probability distribution methods for an Exponential distribution
<?php

/**
* @package PHPMath_ProbabilityDistribution
*/

require_once 'PHPMath/ProbabilityDistribution/General.php';

/**
* The PHPMath_ProbabilityDistribution_Exponential class provides  
* an object for encapsulating exponential distributions.
* @version 0.2
* @author Mark Hale
* @author Paul Meagher
*/

class PHPMath_ProbabilityDistribution_Exponential 
     extends PHPMath_ProbabilityDistribution_General {
  
  var $rate; 

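  function PHPMath_ProbabilityDistribution_Exponential($decay=1) {
    // The rate must be strictly positive; a zero rate would lead to
    // division by zero in Mean(), Variance(), InverseCDF(), and RNG().
    if($decay <= 0.0) {
      return PEAR::raiseError("Decay parameter should be positive.");
    }    
    $this->rate = $decay;
  }
   
  function Mean() {
    // The mean of an exponential distribution is 1 / lambda.
    return 1.0 / $this->rate;
  }
  
  function Variance() {
    // The variance of an exponential distribution is 1 / lambda^2.
    return 1.0 / ($this->rate * $this->rate);
  }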
  
  function PDF($x) {    
    if (is_array($x)) {      
      $pdf_vals = array();
      $num_vals = count($x);
      for($i=0; $i < $num_vals; $i++) {                
        if ($x[$i] < 0.0) {
          return PEAR::raiseError("Input values must be greater than 0.");
        }
        $pdf_vals[$i] = $this->rate * exp(-$this->rate * $x[$i]);
      }    
      return $pdf_vals;    
    } else {          
      if ($x < 0.0) { 
        return PEAR::raiseError("Input value must be greater than 0.");
      }
      $pdf_val = $this->rate * exp(-$this->rate * $x);
      return $pdf_val;
    }
  }
  
  function CDF($x) {
    if (is_array($x)) {
      $cdf_vals = array();
      $num_vals = count($x);
      for($i=0; $i < $num_vals; $i++) {                
        if ($x[$i] < 0.0) {
          return PEAR::raiseError("Input values must be greater than 0.");
        }
        $cdf_vals[$i] = 1.0 - exp(-$this->rate * $x[$i]);
      }
      return $cdf_vals;
    } else {          
      if ($x < 0.0) {
        return PEAR::raiseError("Input value must be greater than 0.");
      }
      $cdf_val = 1.0 - exp(-$this->rate * $x);
      return $cdf_val;
    }
  }
  
  function InverseCDF($prob) {
    if (is_array($prob)) {
      $inv_vals = array();
      $num_vals = count($prob);
      for($i=0; $i < $num_vals; $i++) {                
        if ($this->checkRange($prob[$i])) {
          return PEAR::raiseError("Probability values must be 
               between 0.0 and 1.0");
        }
        $inv_vals[$i] = -log(1.0 - $prob[$i]) / $this->rate;
      }
      return $inv_vals;
    } else {
      if ($this->checkRange($prob)) {
        return PEAR::raiseError("Probability value must be 
             between 0.0 and 1.0");
      }
      $inv_val = -log(1.0 - $prob) / $this->rate;
      return $inv_val;
    }  
  }

  function RNG($num_vals=1) {
    if ($num_vals < 1) {
      return PEAR::raiseError("Number of random values must be 
           greater than 0");
    }
    if ($num_vals == 1) {
      // Shift the uniform draw into (0, 1] so that log() never receives 0.
      $rand_val = (mt_rand() + 1) / (mt_getrandmax() + 1);
      return -log($rand_val) / $this->rate;
    } else {
      $rand_vals = array();
      for($i=0; $i < $num_vals; $i++) {
        $rand_val = (mt_rand() + 1) / (mt_getrandmax() + 1);
        $rand_vals[$i] = -log($rand_val) / $this->rate;
      }
      return $rand_vals;
    }
  }
    
}

?>

To gain insight into how this class is used and what it does, I have created a script in the tests directory of the accompanying PHPMath_ProbabilityDistribution package that demonstrates usage and method-return values. The script is called exponential.php and looks like this:


Listing 5. The exponential.php script demonstrates usage and method-return values
<?php 
// exponential.php 
// Script to test the PHPMath_ProbabilityDistribution_Exponential methods 

require_once 'PHPMath/ProbabilityDistribution/Exponential.php'; 
require_once "make_table.php"; 

$exp = new PHPMath_ProbabilityDistribution_Exponential(1); 

$Methods    = array("Mean()","Variance()","PDF(2)","CDF(3)",
     "InverseCDF(0.95)", "RNG(1)"); 
$Output[0] = $exp->Mean(); 
$Output[1] = $exp->Variance(); 
$Output[2] = $exp->PDF(2); 
$Output[3] = $exp->CDF(3); 
$Output[4] = $exp->InverseCDF(0.95); 
$Output[5] = $exp->RNG(1); 
make_table("Exponential Distribution (lambda=1)", "Methods", "Output", 
     $Methods, $Output); 

// Test PDF function by feeding an array of $x_vals and 
// getting a corresponding array of $p_vals. 
$X_Vals = array(0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5); 
$P_Vals = $exp->PDF($X_Vals); 
make_table("PDF(X_VALS)", "X Vals Input", "P Vals Output", $X_Vals, 
     $P_Vals); 

// Test CDF function by feeding an array of $x_vals and getting 
// a corresponding array of $p_vals where each p_val corresponds 
// to p(x < $x_vals[$i]) 
$X_Vals = array(0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5); 
$P_Vals = $exp->CDF($X_Vals); 
make_table("Vector Test of CDF(X_VALS)", "X Vals Input", "P Vals Output", 
     $X_Vals, $P_Vals); 

// Test InverseCDF function by feeding in P_Vals from previous 
// test. Result should be mirror of CDF output. 
$X_Vals = $exp->InverseCDF($P_Vals); 
make_table("InverseCDF(P_VALS)", "P Vals Input", "X Vals Output", 
     $P_Vals, $X_Vals); 

// Test RNG function by passing the number of values you want generated. 
// Result is an array of random numbers from an exponential distribution 
$Counter = range(0, 7); 
$Rnd_Vals = $exp->RNG(8); 
make_table("RNG(N_VALS)", "Counter", "Rnd Vals", $Counter, $Rnd_Vals); 
?> 

The exponential.php test script generates five HTML tables.

Table 2 shows the single values that the methods of the Exponential class return. You can verify that these methods are producing accurate results by comparing their output to what a more mature statistical computing environment like R would generate (for an example, revisit the dexp(), pexp(), qexp(), and rexp() functions).


Table 2. Single values returned by the methods of the Exponential class
Exponential Distribution (lambda=1)

Methods            Output
Mean()             1
Variance()         1
PDF(2)             0.13533528323661
CDF(3)             0.95021293163214
InverseCDF(0.95)   2.995732273554
RNG(1)             2.4945071670198

You can also use these tables to assist you in understanding the types of values the functions return when passed different arguments. Tables 3, 4, and 5 are particularly useful for this. They illustrate the vector orientation of these methods and consequently allow you to see the behavior of the three major distribution functions (PDF, CDF, and InverseCDF) over a range of input values (when lambda is set to 1).


Table 3. Lambda = 1, vector orientation of PDF function
PDF(X_VALS)

X Vals Input   P Vals Output
0              1
0.5            0.60653065971263
1              0.36787944117144
1.5            0.22313016014843
2              0.13533528323661
2.5            0.082084998623899
3              0.049787068367864
3.5            0.030197383422319

Table 4. Lambda = 1, vector orientation of CDF function
CDF(X_VALS)

X Vals Input   P Vals Output
0              0
0.5            0.39346934028737
1              0.63212055882856
1.5            0.77686983985157
2              0.86466471676339
2.5            0.9179150013761
3              0.95021293163214
3.5            0.96980261657768

Table 5. Lambda = 1, vector orientation of InverseCDF function
InverseCDF(P_VALS)

P Vals Input       X Vals Output
0                  0
0.39346934028737   0.5
0.63212055882856   1
0.77686983985157   1.5
0.86466471676339   2
0.9179150013761    2.5
0.95021293163214   3
0.96980261657768   3.5

Table 6 displays output from the random-number generator. The RNG() method generates exponentially distributed random numbers using the value of lambda to adjust the shape and location aspects of the exponential distribution you are sampling from.


Table 6. Output from the random-number generator
RNG(N_VALS)

Counter   Rnd Vals
0         1.658002921542
1         2.0097168931621
2         0.41562004896617
3         1.9121062991322
4         0.21394501966731
5         0.8054845401079
6         0.50914083865687
7         1.2056725040028
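One way to gain some confidence in a generator like this is to draw a large sample and compare the sample mean with the theoretical mean of 1/lambda. The following is a minimal sketch using inverse-transform sampling; exp_rng() is a hypothetical standalone helper rather than the package's RNG() method, and the seed, rate, and sample size are arbitrary choices.

```php
<?php
// Sketch of inverse-transform sampling for an exponential variate.
// exp_rng() is a hypothetical standalone helper, not the package's RNG().
function exp_rng($rate, $numVals)
{
    $vals = array();
    for ($i = 0; $i < $numVals; $i++) {
        // Shift the uniform draw into (0, 1] so that log() never sees 0.
        $u = (mt_rand() + 1) / (mt_getrandmax() + 1);
        $vals[] = -log($u) / $rate;
    }
    return $vals;
}

mt_srand(42);                 // fixed seed so that runs are repeatable
$rate   = 2.0;
$sample = exp_rng($rate, 100000);
$sampleMean = array_sum($sample) / count($sample);
printf("sample mean = %.4f, theoretical mean = %.4f\n",
    $sampleMean, 1.0 / $rate);
```

Agreement of the means is only a coarse check. A histogram of the sample, or a goodness-of-fit test against the CDF, would say more about whether the shape is right.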

Building a probability model

Now that you understand the mechanics of how to use the Exponential distribution class, you're ready to see how it can be used to gain insight into a real-world problem.

And this will be a fun problem: to develop a model for when and how many goals are likely to occur in World Cup soccer games. What I want to focus on here is showing you how to use the Exponential distribution class I just discussed to derive some of the results reported in Singfat Chu's article on this problem (see Resources).

As you know, the Exponential distribution accepts a rate parameter, also referred to as lambda. To use the Exponential distribution to develop a probability model for World Cup soccer goals, you need to be able to derive this rate parameter from your measurements.

A rate is defined as the number of occurrences of some phenomenon over a unit of time or space. The rate of soccer goals in World Cup tournaments between 31-May-2002 and 30-June-2002 is equal to 575/232. This can be thought of as the mean number of goals scored in a 90-minute regulation game. It was computed as follows:

average goal rate = Total Number of Goals / Total Number of Games

In PHP, you represent this concept by setting up variables called $num_goals and $num_games that are placeholders for the evolving quantities that are needed to compute the goal rate. The code snippet that follows shows a fragment of the PHP-based probability model for World Cup soccer goals.

<?php

// world_cup_soccer.php

/**
* @package PHPMath_ProbabilityDistribution
*/

require_once 'PHPMath/ProbabilityDistribution/Exponential.php'; 

$num_goals = 575; 
$num_games = 232;
$rate = $num_goals / $num_games;
echo "Value of rate parameter = $rate<br>";
$exp = new PHPMath_ProbabilityDistribution_Exponential($rate); 

?>

This fragment produces the following output:

     Value of rate parameter = 2.4784482758621

One question you might be curious about is the probability that a goal will be scored in the first 10 minutes of a soccer game. I am going to ignore some prior modeling steps in which you would have developed code to plot inter-goal intervals and to test whether the exponential distribution is the best fitting distribution. Instead, I will assume that this code has been developed in a separate script. Therefore, I'll proceed to a stage where you would use the theoretical exponential distribution to calculate some probabilities of interest. The next code snippet, for example, is added to the previous bit of code and used to compute the probability of a goal in the first 10 minutes of play.

<?php

$mins = 10;
$mins_per_game = 90;
$interval = $mins / $mins_per_game;
echo "Probability of goal in t <= 10 mins: ". $exp->CDF($interval);

?>

The output this fragment generates is:

     Probability of goal in t <= 10 mins: 0.24071884483247

In other words, in about 24 percent of games a goal is scored in the first 10 minutes of play. In the next 100 games that are played, you can expect that in about 24 of those games, a goal will be scored in the first 10 minutes.
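The arithmetic behind this figure can be written out by hand. Here T is the waiting time until the first goal, measured in games, and the intermediate values are rounded:

```latex
\begin{align*}
\lambda &= \frac{575}{232} \approx 2.4784 \ \text{goals per game},
\qquad t = \frac{10}{90} \approx 0.1111 \ \text{games} \\
P(T \le t) &= 1 - e^{-\lambda t}
            \approx 1 - e^{-0.27538}
            \approx 1 - 0.75928
            \approx 0.2407
\end{align*}
```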

You might also be interested in the inverse question: In P percent of games, a goal occurs within how many minutes of play? You can answer this question with three different P values using the following code snippet:

<?php

$prob = 0.25;
$num_mins = $exp->InverseCDF($prob) * $mins_per_game;
echo "In 25 percent of games, a goal will occur within $num_mins 
     minutes of play.<br>";

$prob = 0.50;
$num_mins = $exp->InverseCDF($prob) * $mins_per_game;
echo "In 50 percent of games, a goal will occur within $num_mins 
     minutes of play.<br>";

$prob = 0.75;
$num_mins = $exp->InverseCDF($prob) * $mins_per_game;
echo "In 75 percent of games, a goal will occur within $num_mins 
     minutes of play.<br>";

?>

The output this fragment generates is:

In 25 percent of games, a goal will occur within 10.446611604858 
     minutes of play.
In 50 percent of games, a goal will occur within 25.170283704507 
     minutes of play.
In 75 percent of games, a goal will occur within 50.340567409014 
     minutes of play.
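The conversion between games and minutes can also be folded into the rate itself. The following is a minimal sketch that expresses the rate in goals per minute, so that the inverse CDF returns minutes directly; exp_quantile() is a hypothetical standalone helper, not a method of the package.

```php
<?php
// Sketch: the same quantiles computed with a per-minute rate, so that no
// conversion between "games" and "minutes" is needed afterwards.
// exp_quantile() is a hypothetical helper, not part of the package.
function exp_quantile($prob, $rate)
{
    return -log(1.0 - $prob) / $rate;
}

$num_goals     = 575;
$num_games     = 232;
$mins_per_game = 90;
$rate_per_min  = $num_goals / ($num_games * $mins_per_game);

$probs   = array(0.25, 0.50, 0.75);
$minutes = array();
foreach ($probs as $prob) {
    $minutes[] = exp_quantile($prob, $rate_per_min);
}
foreach ($probs as $i => $prob) {
    printf("P = %.2f: a goal within %.2f minutes\n", $prob, $minutes[$i]);
}
```

Both parameterizations describe the same model. Which one is more convenient depends on the unit in which the questions are asked.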

You might also be interested in understanding the issue of how likely it is that X number of goals are scored in a game. The easiest way to compute this probability is to use a mathematically related distribution called the Poisson distribution which is useful for obtaining answers to such discrete counting problems.

I have also implemented a PHP-based version of the Poisson distribution functions. The following code fragment shows how to use the Poisson distribution to compute the probability of scoring various numbers of goals in a game.

<?php

require_once 'PHPMath/ProbabilityDistribution/Poisson.php'; 
$lambda = $rate;
$pois = new PHPMath_ProbabilityDistribution_Poisson($lambda); 
echo "Probability that goal count = 0: ". $pois->PDF(0) ."<br>";
echo "Probability that goal count = 1: ". $pois->PDF(1) ."<br>";
echo "Probability that goal count = 2: ". $pois->PDF(2) ."<br>";
echo "Probability that goal count = 3: ". $pois->PDF(3) ."<br>";
echo "Probability that goal count = 4: ". $pois->PDF(4) ."<br>";
echo "Probability that goal count = 5: ". $pois->PDF(5) ."<br>";
echo "Probability that goal count = 6: ". $pois->PDF(6) ."<br>";

?>

The output this fragment generates is:

Probability that goal count = 0: 0.083873272849375
Probability that goal count = 1: 0.20787556848444
Probability that goal count = 2: 0.25760442215206
Probability that goal count = 3: 0.2128197453124
Probability that goal count = 4: 0.13186568270973
Probability that goal count = 5: 0.065364454791462
Probability that goal count = 6: 0.027000403380094
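The package's Poisson class is not listed in this article, so the following is a minimal standalone sketch of the Poisson probability mass function, P(N = k) = exp(-lambda) * lambda^k / k!, that such a PDF() method would need to evaluate. The helper name poisson_pmf() is an assumption made for illustration.

```php
<?php
// Sketch of the Poisson probability mass function,
// P(N = k) = exp(-lambda) * lambda^k / k!.
// poisson_pmf() is a hypothetical helper, not the package's Poisson class.
function poisson_pmf($k, $lambda)
{
    $factorial = 1.0;
    for ($i = 2; $i <= $k; $i++) {
        $factorial *= $i;
    }
    return exp(-$lambda) * pow($lambda, $k) / $factorial;
}

$lambda     = 575 / 232;   // mean number of goals per game
$cumulative = 0.0;
for ($k = 0; $k <= 6; $k++) {
    $p = poisson_pmf($k, $lambda);
    $cumulative += $p;
    printf("P(goals = %d) = %.6f\n", $k, $p);
}
// Whatever probability is left belongs to games with seven or more goals.
printf("P(goals >= 7) = %.6f\n", 1.0 - $cumulative);
```

Summing the probabilities is a quick sanity check: the seven values listed above account for most, but not all, of the total probability of 1.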

The Poisson distribution differs from the Exponential distribution in that Poisson is for modeling discrete random variables and Exponential is for modeling continuous random variables. I used Exponential to calculate waiting-time probabilities because inter-arrival time is a continuous random variable, and the Exponential distribution is often a good candidate for describing the distribution of waiting times.

The Poisson distribution is for modeling discrete random variables involving a counting process (such as the number of times a certain event occurs in some period of time). A count falls into a discrete list of values from 0 to some upper bound. In the case of World Cup soccer, the Poisson distribution can be used to compute the probability of a game ending with different goal counts.
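The two distributions are linked through the event "no goal yet." If goals arrive as a Poisson process with rate λ, then N(t), the number of goals scored by time t, and T, the waiting time until the first goal, satisfy:

```latex
P\bigl(N(t) = 0\bigr) \;=\; e^{-\lambda t} \;=\; P(T > t) \;=\; 1 - F(t;\lambda)
```

With t equal to one game and λ = 575/232, both sides evaluate to roughly 0.0839, which is the Poisson probability of a goal count of 0 reported above.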

Space precludes a more complete discussion of this important probability distribution; however, I hope that this brief discussion has raised your awareness of the distinction between discrete and continuous random variables and probability distributions and how different probability distributions can be applied to data to construct a more detailed probability model that answers different types of questions.


Some thoughts on probability modeling

I'd like to cover three other topics in probability modeling before I end this article. They concern the following issues:

  • Is there such a thing as too much when it comes to fitting data?
  • How much can the determination of which distribution and what adjustment parameters to use be automated?
  • A potential new path to developing a Probability Distributions Library implemented in PHP.

Over-fitting data

One question you might ask is why bother fitting an empirical data distribution to a theoretical probability distribution. Wouldn't it be possible to use the empirical data distribution to directly compute the probabilities of certain outcomes? After all, a relative frequency histogram can easily be converted to a probability histogram. You could also develop a program that would compute the probability of observing a male taller than 72 inches by counting the number of data points with a value greater than 72 inches and dividing this number by the total number of observations you have to work with. Wouldn't it be better to use an actual empirical probability distribution rather than a less accurate theoretical probability distribution to construct your probability models?

In some cases, this is the correct route. Software can be written that allows you to make inferences about the probability of certain outcomes using empirical probability distributions with irregular shapes. These inferences may be more accurate than using any of the available theoretical probability distributions because it is possible that none of them fit the distribution data well.

The main problem you run into, though, if you use an empirical probability distribution, is over-fitting your data. The purpose of constructing a probability model is to generalize to new cases of your random variable. For example, if you accept that the real distribution of male height should always have a dip at 69 inches, do you think that this will be true for future cases? If one run of the random-number generator produces fewer sixes than sevens, should you slavishly predict that future runs will produce the same relative frequencies?

The argument against using empirical probability distributions is that they tend to over-fit the data and can reduce your ability to generalize to new instances of your random variable. This argument should also be kept in mind when one resorts to more advanced techniques, such as curve fitting with Fourier components, to represent the irregularly shaped probability distributions. While the curve might conform more closely to the probability distribution, is it trying to conform too closely? What happens when you take your next sample? Do you need to redraw the curve?

In univariate probability modeling (versus a curve-fitting approach), you frame your investigation in terms of having an observed distribution and wanting to use this information to estimate the simplest possible probability model for the data. The argument against curve fitting is that your models might be marginally more accurate but much more complicated than a simple univariate probability model. Also, your parameter estimates may be less robust as new information comes in -- they may not serve the purpose of generalization to new cases.
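The contrast between the two approaches can be made concrete with a small sketch. The waiting times below are made-up values chosen only for illustration, and empirical_tail() is a hypothetical helper. The empirical estimate counts observations above a threshold, while the fitted estimate uses an exponential model whose rate is the reciprocal of the sample mean.

```php
<?php
// Sketch contrasting empirical and fitted tail probabilities.
// The waiting times are made-up illustrative values, not data from the
// article, and empirical_tail() is a hypothetical helper.
function empirical_tail(array $sample, $threshold)
{
    $count = 0;
    foreach ($sample as $value) {
        if ($value > $threshold) {
            $count++;
        }
    }
    return $count / count($sample);
}

$waits = array(0.3, 1.2, 0.7, 2.9, 0.1, 0.8, 1.6, 0.4, 3.5, 0.9);

// Fitted model: exponential with rate = 1 / sample mean.
$rate = count($waits) / array_sum($waits);

foreach (array(2.0, 4.0) as $threshold) {
    $empirical = empirical_tail($waits, $threshold);
    $fitted    = exp(-$rate * $threshold);
    printf("P(X > %.1f): empirical = %.3f, fitted = %.3f\n",
        $threshold, $empirical, $fitted);
}
```

For a threshold beyond the largest observation, the empirical estimate is exactly zero, while the fitted model still assigns a small probability. Whether that extrapolation deserves trust depends on how well the theoretical distribution fits the data in the first place.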

Automating the distribution choice

Another issue I offer for pondering is whether to automate the process of finding the appropriate theoretical distribution and estimating which adjustment parameters to use. Tools are available that allow you to feed in a vector of measurements representing your random variable -- these tools automatically:

  • Generate parameter estimates for a variety of theoretical distribution parameters (using the method of moments in most cases)
  • Do goodness-of-fit testing to rule out certain probability distributions as candidates
  • Rank the remaining theoretical distributions
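The first two steps in this list can be sketched for a single candidate distribution. The example below estimates the rate of an exponential distribution by the method of moments and then measures the fit with a Kolmogorov-Smirnov distance. The choice of that statistic is an assumption made for illustration, since no particular test is named here, and the sample values and helper names are made up.

```php
<?php
// Sketch of parameter estimation and a goodness-of-fit measure for one
// candidate distribution (the exponential). The sample is made up, and the
// Kolmogorov-Smirnov distance is just one possible measure of fit.
function exp_rate_by_moments(array $sample)
{
    // Method of moments: match the sample mean to 1 / lambda.
    return count($sample) / array_sum($sample);
}

function ks_distance_exponential(array $sample, $rate)
{
    sort($sample);   // sorts the local copy and reindexes it from 0
    $n = count($sample);
    $d = 0.0;
    foreach ($sample as $i => $x) {
        $model = 1.0 - exp(-$rate * $x);
        // Compare the model CDF with the empirical CDF just after and
        // just before its jump at $x.
        $d = max($d, abs(($i + 1) / $n - $model), abs($model - $i / $n));
    }
    return $d;
}

$sample = array(0.3, 1.2, 0.7, 2.9, 0.1, 0.8, 1.6, 0.4, 3.5, 0.9);
$rate   = exp_rate_by_moments($sample);
$d      = ks_distance_exponential($sample, $rate);
printf("lambda = %.4f, KS distance = %.4f\n", $rate, $d);
```

Repeating the calculation for each candidate distribution and sorting by distance would give a crude ranking. Note that the usual critical values for the Kolmogorov-Smirnov test assume the parameters were not estimated from the same data, so a formal accept-or-reject decision needs more care than this sketch shows.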

While such tools are definitely useful, they are not a substitute for an intelligent analyst bringing experience, knowledge and theory to bear on the issue, performing exploratory data analysis, applying various goodness-of-fit tests to the data and making an all-things-considered judgment. In some cases, for theoretical or rational reasons, you might expect a random variable to be distributed in a certain manner. For example, a skewed normal distribution might suggest a mixture of two underlying normal distributions if you mixed male and female heights together to produce the height distribution. Also, a random variable that can be fit to a distribution using one adjustment parameter might be preferable, on the grounds of simplicity, to a theoretical distribution that requires two adjustment parameters, especially if the model using a single adjustment parameter makes empirical sense.

It would be interesting and useful to develop a PHP-based tool that would help automate the mechanical aspects of fitting data to various theoretical probability distributions. Such a tool would be designed (and used) as an exploratory tool -- a decision-making aid rather than a substitute for human insight and common sense.

A new PHP PDL?

A promising development is the recent announcement of a statistics-processing extension for PHP that wraps two libraries -- DCDFLIB (a library of C routines for CDFs, Inverses, and other parameters) and RANDLIB. This opens up the possibility of having a Probability Distributions Library implemented in PHP that is a wrapper around these tried and tested probability extensions. One advantage to using the API discussed in this article: It is vector-based and object-oriented while the proposed C-based extension is not.

The following code snippet illustrates how easy it is to integrate these new functions into the suggested PHP-based API (note the call to the stats_dens_normal() function):

<?php 

class Normal { 
   
 var $mean; 
 var $stdev; 
     
 function Normal($mu, $sigma) { 
  $this->mean  = $mu; 
  $this->stdev = $sigma; 
 } 
   
 function PDF($x) {     
  if (is_array($x)) {       
   $pdf_vals = array(); 
   $num_vals = count($x); 
   for($i=0; $i < $num_vals; $i++) {                 
    // A normal variate can take any real value, so no range check is needed.
    $pdf_vals[$i] = stats_dens_normal($x[$i], $this->mean, $this->stdev);
   }     
   return $pdf_vals;     
  } else {           
   $pdf_val = stats_dens_normal($x, $this->mean, $this->stdev);      
   return $pdf_val; 
  } 
 } 

} 

?> 

Another advantage of using a PHP-based API involves providing a uniform API to distribution functions that are implemented in C, PHP, or another language. If the DCDFLIB or RANDLIB libraries do not implement a particular distribution or method, you could implement the relevant distribution or method in PHP and the user would (and arguably should) be oblivious to these details. The API is of paramount importance to the user; the details of how it is implemented are of less importance.
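The idea of hiding the back end behind one entry point can be sketched as follows. The wrapper name normal_pdf() is hypothetical. It calls stats_dens_normal() when the statistics extension is loaded and otherwise falls back to the closed-form normal density written in plain PHP.

```php
<?php
// Sketch of one uniform entry point with two possible back ends.
// normal_pdf() is a hypothetical wrapper; stats_dens_normal() is only
// available when the PECL stats extension is loaded.
function normal_pdf($x, $mean, $stdev)
{
    if (function_exists('stats_dens_normal')) {
        return stats_dens_normal($x, $mean, $stdev);
    }
    // Pure-PHP fallback using the closed-form normal density.
    $z = ($x - $mean) / $stdev;
    return exp(-0.5 * $z * $z) / ($stdev * sqrt(2.0 * M_PI));
}

// Density of a standard normal variate at its mean.
printf("%.6f\n", normal_pdf(0.0, 0.0, 1.0));
```

The caller sees the same signature and the same results either way, which is the point of putting the API first.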


If it don't fit, don't force it

In this article, I discussed univariate probability modeling techniques, including fitting data to a theoretical probability distribution and using the fitted theoretical probability distribution to assign probabilities to various outcomes. I provided a sample model (of soccer goals) to demonstrate the efficacy of this basic form of probability modeling and served up a tool to help construct probability models, a Probability Distributions Library. And you started building the foundation for understanding and developing more complex forms of probability models.

In conclusion, the following are some random variables for which you might want to construct probability models as practice:

  • The time interval between customer orders
  • The number of customer orders in a given week
  • The number of people purchasing a particular product in a given week
  • The time interval between purchases of a particular product
  • The number of visitors to your Web site at any given minute, hour, or day

The Exponential and Poisson distributions included in the PHPMath_ProbabilityDistribution package are particularly suitable for investigating these random variables further.
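As a starting point for the first exercise, here is a minimal sketch that turns a list of order timestamps into inter-arrival intervals and estimates an exponential rate from them. The timestamps are made-up Unix times and the variable names are illustrative; real data would come from your order log.

```php
<?php
// Sketch: derive inter-arrival intervals from order timestamps and estimate
// an exponential rate. The timestamps are made-up Unix times.
$order_times = array(1065500000, 1065500420, 1065500540, 1065501500, 1065501800);
sort($order_times);

$intervals = array();
for ($i = 1; $i < count($order_times); $i++) {
    // Convert the gap between consecutive orders from seconds to minutes.
    $intervals[] = ($order_times[$i] - $order_times[$i - 1]) / 60.0;
}

$mean_wait  = array_sum($intervals) / count($intervals);
$rate       = 1.0 / $mean_wait;              // orders per minute
$p_within_5 = 1.0 - exp(-$rate * 5.0);       // exponential CDF at 5 minutes

printf("mean wait = %.2f minutes, rate = %.4f orders per minute\n",
    $mean_wait, $rate);
printf("P(next order within 5 minutes) = %.4f\n", $p_within_5);
printf("expected orders per hour = %.2f\n", $rate * 60.0);
```

Four intervals are far too few to judge whether an exponential model is appropriate. With real data you would first plot the intervals and test the fit before relying on any probabilities computed this way.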



Download

Name                                       Size   Download method
wa-probab/ProbabilityDistribution.tar.gz          HTTP


Resources

  • Updates to the probability distribution classes will be available at PHPMath.com. You can download the source for this article.

  • Get the details on the wrapper (and a download of the wrapper) for the two scientific libraries mentioned in this article -- DCDFLIB and RANDLIB.

  • Read "Using Soccer Goals to Motivate the Poisson Process" by Singfat Chu for more on the context and analysis of the problem posed in this article -- developing a model for numbers and time-based frequencies of goals scored in World Cup soccer matches.

  • Explore the author's source of the male height data, "Fitting Percentage of Body Fat to Simple Body Measurements" by Roger W. Johnson. This interesting article on the use of multiple regression is from the Journal of Statistics Education.

  • Read about the four lambdas figure (Figure 6) from this article. The figure is courtesy of a fine article, "Distributions" by Paul Bourke which details the following distribution types -- Gaussian (Normal), Poisson, Gamma, Exponential, Rayleigh, and Rice.

  • Read "Make your software behave: Playing the numbers" to discover how to get hardware and software to create a truly random number (quite a task, actually, since they're designed not to produce random results) (developerWorks, April 2000).

  • In "Server clinic: R handy for crunching data," get the details on how to use the sophisticated, open source R software for managing statistical calculations (developerWorks, July 2003).

  • Read "Take Web data analysis to the next level with PHP" for a thorough introduction to Chi Square analysis and its impact on multi-level analysis of Web data (developerWorks, August 2003).

  • Check out the open source R Project, the central repository on information about R, a language and environment for statistical computing and graphics, similar to the S language.

  • Explore the PEAR Group (PHP Extension and Application Repository) site. PEAR Group is dedicated to providing a structured library of open source code for PHP users; a system for code distribution and package maintenance; a standard style for code written in PHP; the Foundation Classes (PFC) and Extension Code Library (PECL); and a Web site, mailing lists, and download mirrors to support the PHP/PEAR community.

  • Visit the JSci site, the repository for information on JSci. JSci is a set of open source Java packages designed to encapsulate scientific methods and principles in the most natural way possible, to aid in the development of scientific-based software.

  • Visit the IBM General Decimal Arithmetic portal, an ideal place to start to learn how computers represent numbers.

  • Check out the SpamBayes Project which is developing a mostly-client-side, Bayesian anti-spam filter with core code that is a message classifier written in Python.

  • Try this helpful book for learning probability concepts and algorithms, Concepts in Probability and Stochastic Modeling by James J. Higgins and Sallie Keller-McNulty (Duxbury, 1995).
