In contrast with other open source languages like Perl and Python, PHP lacks a robust community effort to develop a math library.
One reason for this state of affairs may be the abundance of existing sophisticated math tools that might dwarf any home-grown PHP effort. For example, a powerful tool that I've been researching, the S System, has an impressive array of statistical libraries designed for analyzing datasets and received an ACM Award in 1998 for its language design. If S or its open source cousin R is merely a shell_exec call away, why bother implementing the same statistical computing functionality in PHP? For more on the S System, its ACM Award, or R, see Resources.
After all, isn't this a waste of developer effort? If the motivations for developing a PHP math library are bounded by concerns for conserving developer effort and using the best tool for the job, then PHP's present course makes sense.
On the other hand, educational motivations might inspire the development of a PHP math library. For about 10 percent of the population, mathematics is an interesting topic to explore. For those who are also fluent in PHP, the development of a PHP math library can reinforce the mathematics learning process -- in other words, don't just read a chapter on T tests, implement a class that computes the relevant intermediate values and displays them in a standard format.
With guidance and coaching, I hope to demonstrate that developing PHP math libraries is not a difficult task and may represent an interesting technical and learning challenge. In this article, I'll provide an example of a PHP math library called SimpleLinearRegression that demonstrates a general approach that can be used to develop PHP math libraries. Let's begin by discussing some general principles that guided my development of this SimpleLinearRegression class.
I used six general principles to guide the development of the SimpleLinearRegression class.
- Establish one class per analytical model.
- Employ backward chaining to develop the class.
- Expect an abundance of getters.
- Store intermediate results.
- Develop a preference for a verbose API.
- Perfection is not the goal.
Let's look at each of these guidelines in more detail.
Establish one class per analytical model
Each major type of analytical test or procedure should have a PHP class by the same name that contains input functions, functions that compute intermediate and summary values, and output functions (functions that dump the intermediate and summary values to the screen in textual or graphic form).
Employ backward chaining to develop the class
In mathematical programming, the goal of coding is often the standard output values that the analytical procedure (such as MultipleRegression, TimeSeries, or ChiSquared) is expected to generate. From a problem-solving point of view, this means that you can employ backward chaining to develop the methods of a mathematical class.
For example, the summary output screen displays one or more summary statistics. These summary statistics depend upon computing intermediate statistics, which may in turn involve further intermediate statistics, and so on. This backward-chaining-based development methodology leads to the next principle.
Expect an abundance of getters
The majority of the class-development work for a mathematical class involves computing intermediate and summary values. Practically, this means that you should not be surprised if your class contains many getter methods that compute intermediate and summary values.
Store intermediate results
Store the results of intermediate calculations inside a result object so you can use intermediate results as input to subsequent calculations. This principle is enforced in the design of the S language. In the present context, it is enforced by selecting instance variables to represent computed intermediate and summary results.
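To make the last two principles concrete, here is a minimal sketch of a class whose getters compute intermediate values and whose constructor stores each result in an instance variable for later steps to reuse. The class name DescriptiveStats and its members are hypothetical illustrations, not part of the SimpleLinearRegression download, and the code uses current PHP syntax rather than the PHP 4 style of the listings in this article.

```php
<?php
// Hypothetical example class (an assumption for illustration only).
// It expects at least two numeric values.
class DescriptiveStats
{
    public $n;
    public $mean;           // intermediate value
    public $sumSquaredDev;  // intermediate value that reuses $mean
    public $variance;       // summary value that reuses $sumSquaredDev

    public function __construct(array $values)
    {
        // Each getter result is stored so that later getters can reuse it.
        $this->n             = count($values);
        $this->mean          = $this->getMean($values);
        $this->sumSquaredDev = $this->getSumSquaredDev($values);
        $this->variance      = $this->getVariance();
    }

    public function getMean(array $values)
    {
        return array_sum($values) / count($values);
    }

    public function getSumSquaredDev(array $values)
    {
        $sum = 0.0;
        foreach ($values as $value) {
            $sum += pow($value - $this->mean, 2);
        }
        return $sum;
    }

    public function getVariance()
    {
        // Sample variance: sum of squared deviations over n - 1.
        return $this->sumSquaredDev / ($this->n - 1);
    }
}

$stats = new DescriptiveStats(array(2, 4, 4, 4, 5, 5, 7, 9));
```

Working backward from the summary value (the variance) tells you which intermediate values (the sum of squared deviations, then the mean) need getters and instance variables of their own.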
Develop a preference for a verbose API
When developing the naming scheme for member functions and instance variables in my SimpleLinearRegression class, I found it easier to keep track of what my functions did and what my variables stood for when I used longer names to describe them (names like getSumSquaredError instead of getYY2).
I did not totally give up on abbreviating the names; however, when I did abbreviate, I tried to provide comments to fully elaborate the meaning of the name. Highly abbreviated naming schemes are too common in mathematical programming in my opinion -- they make it more difficult than necessary to understand and verify that a mathematical routine works as it should.
Perfection is not the goal
The goal of this coding exercise is not necessarily to develop a highly optimized and rigorous math engine for PHP. In these early stages, the learning and challenge aspects of implementing significant analytical tests should be emphasized.
When modeling a statistical test or procedure, you need to figure out which instance variables to declare.
The selection of instance variables can be determined by noting the intermediate and summary values to be generated by the analytic procedure. Each intermediate and summary value can have a corresponding instance variable to hold its value as an object attribute.
I conducted such an analysis to determine which variables to declare for the SimpleLinearRegression class in Listing 1. A similar analysis would be performed for a MultipleRegression, ANOVA, or TimeSeries procedure.
Listing 1. Instance variables for the SimpleLinearRegression class
<?php
// Copyright 2003, Paul Meagher
// Distributed under GPL
class SimpleLinearRegression {
var $n;
var $X = array();
var $Y = array();
var $ConfInt;
var $Alpha;
var $XMean;
var $YMean;
var $SumXX;
var $SumXY;
var $SumYY;
var $Slope;
var $YInt;
var $PredictedY = array();
var $Error = array();
var $SquaredError = array();
var $TotalError;
var $SumError;
var $SumSquaredError;
var $ErrorVariance;
var $StdErr;
var $SlopeStdErr;
var $SlopeTVal; // T value of Slope
var $YIntStdErr;
var $YIntTVal; // T value for Y Intercept
var $R;
var $RSquared;
var $DF; // Degrees of Freedom
var $SlopeProb; // Probability of Slope Estimate
var $YIntProb; // Probability of Y Intercept Estimate
var $AlphaTVal; // T Value for given alpha setting
var $ConfIntOfSlope;
var $RPath = "/usr/local/bin/R"; // Your path here
var $format = "%01.2f"; // Used for formatting output
}
?>
The constructor method for the SimpleLinearRegression class accepts an X and a Y vector with the same number of values in each vector. You can also set a confidence interval for your predicted Y values (default is a 95 percent confidence interval).
The constructor method begins by verifying that the data is in a form suitable for processing. Once the input vectors have passed the "equal size" and "size greater than 1" tests, the heart of the algorithm is executed.
Performing this task involves computing intermediate and summary values for the statistical procedure through a series of getter methods. The return value from each method call is assigned to an instance variable for the class. Storing the calculated results in this way ensures that the intermediate and summary values can be used by the calling routine in chained calculations. The results can also be displayed by calling the output methods for the class. The constructor is shown in Listing 2.
Listing 2. Constructor for the SimpleLinearRegression class
<?php
// Copyright 2003, Paul Meagher
// Distributed under GPL
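// PHP 4-style constructor (a method named after its class).
// PHP 8 no longer treats such a method as a constructor; name it __construct there.
function SimpleLinearRegression($X, $Y, $ConfidenceInterval="95") {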
$numX = count($X);
$numY = count($Y);
if ($numX != $numY) {
die("Error: Size of X and Y vectors must be the same.");
}
if ($numX <= 1) {
die("Error: Size of input array must be at least 2.");
}
$this->n = $numX;
$this->X = $X;
$this->Y = $Y;
$this->ConfInt = $ConfidenceInterval;
$this->Alpha = (1 + ($this->ConfInt / 100) ) / 2;
$this->XMean = $this->getMean($this->X);
$this->YMean = $this->getMean($this->Y);
$this->SumXX = $this->getSumXX();
$this->SumYY = $this->getSumYY();
$this->SumXY = $this->getSumXY();
$this->Slope = $this->getSlope();
$this->YInt = $this->getYInt();
$this->PredictedY = $this->getPredictedY();
$this->Error = $this->getError();
$this->SquaredError = $this->getSquaredError();
$this->SumError = $this->getSumError();
$this->TotalError = $this->getTotalError();
$this->SumSquaredError = $this->getSumSquaredError();
$this->ErrorVariance = $this->getErrorVariance();
$this->StdErr = $this->getStdErr();
$this->SlopeStdErr = $this->getSlopeStdErr();
$this->YIntStdErr = $this->getYIntStdErr();
$this->SlopeTVal = $this->getSlopeTVal();
$this->YIntTVal = $this->getYIntTVal();
$this->R = $this->getR();
$this->RSquared = $this->getRSquared();
$this->DF = $this->getDF();
$this->SlopeProb = $this->getStudentProb($this->SlopeTVal, $this->DF);
$this->YIntProb = $this->getStudentProb($this->YIntTVal, $this->DF);
$this->AlphaTVal = $this->getInverseStudentProb($this->Alpha, $this->DF);
$this->ConfIntOfSlope = $this->getConfIntOfSlope();
return true;
}
?>
The method names and their sequence were derived by a combination of backward chaining and consulting an undergraduate statistics textbook that provided step-by-step instructions for computing intermediate values. The names of the intermediate values that I needed to compute were prefixed with "get" to derive the method name.
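The input checks and the Alpha calculation at the top of the constructor can also be tried in isolation. The sketch below is one possible variation, not the author's code: the helper names checkRegressionInput and getAlpha are hypothetical, and the checks throw exceptions instead of calling die() so that a calling script can decide how to react to bad input.

```php
<?php
// Hypothetical helper (an assumption): the same two checks as the
// constructor, but reported through exceptions rather than die().
function checkRegressionInput(array $X, array $Y)
{
    $numX = count($X);
    $numY = count($Y);
    if ($numX != $numY) {
        throw new InvalidArgumentException("Size of X and Y vectors must be the same.");
    }
    if ($numX <= 1) {
        throw new InvalidArgumentException("Size of input array must be at least 2.");
    }
    return $numX;
}

// Hypothetical helper (an assumption): the Alpha formula used in the
// constructor. A 95 percent confidence interval gives 0.975.
function getAlpha($confidenceInterval)
{
    return (1 + ($confidenceInterval / 100)) / 2;
}

$n     = checkRegressionInput(array(1, 2, 3), array(4, 5, 6));
$alpha = getAlpha(95);
```

With the default setting, Alpha works out to (1 + 0.95) / 2 = 0.975, which is the value later passed to getInverseStudentProb.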
The SimpleLinearRegression procedure is used to fit a straight line to the data in which the straight line has the following standard form:
y = b + mx
The PHP form of this equation would look something like Listing 3:
Listing 3. PHP equation that fits the model to the data
$PredictedY[$i] = $YIntercept + $Slope * $X[$i];
The SimpleLinearRegression class uses a least-squares criterion for deriving estimates of what the Y Intercept and Slope parameters should be. These estimated parameters are used to construct a linear equation (see Listing 3) to model the relationship between the X and Y values.
Using the derived linear equation, you can then obtain predicted Y values for each X value. If the linear equation is a good fit to the data, then the observed and predicted Y values tend to agree.
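The article does not list the getter methods that produce the Slope and Y Intercept estimates, so the following is only a sketch under the assumption that they follow the standard least-squares formulas: the slope is the sum of cross-products of deviations divided by the sum of squared X deviations, and the intercept is the Y mean minus the slope times the X mean. The function names fitLine and predictY are hypothetical.

```php
<?php
// Hypothetical helper (an assumption): standard least-squares estimates.
// Expects two equal-length arrays whose X values are not all identical.
function fitLine(array $X, array $Y)
{
    $n     = count($X);
    $xMean = array_sum($X) / $n;
    $yMean = array_sum($Y) / $n;

    $sumXX = 0.0; // sum of squared deviations of X from its mean
    $sumXY = 0.0; // sum of cross-products of X and Y deviations
    for ($i = 0; $i < $n; $i++) {
        $sumXX += ($X[$i] - $xMean) * ($X[$i] - $xMean);
        $sumXY += ($X[$i] - $xMean) * ($Y[$i] - $yMean);
    }

    $slope = $sumXY / $sumXX;
    $yInt  = $yMean - $slope * $xMean;

    return array('slope' => $slope, 'yInt' => $yInt);
}

// Hypothetical helper (an assumption): apply the fitted equation to each X.
function predictY(array $fit, array $X)
{
    $predicted = array();
    foreach ($X as $x) {
        $predicted[] = $fit['yInt'] + $fit['slope'] * $x;
    }
    return $predicted;
}

// Points that lie exactly on y = 1 + 2x, so the fit recovers that line.
$X = array(1, 2, 3, 4);
$Y = array(3, 5, 7, 9);
$fit       = fitLine($X, $Y);
$predicted = predictY($fit, $X);
```

Because these sample points fall exactly on a line, the observed and predicted Y values agree; with real data, the differences between them are the errors that the class goes on to summarize.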
The SimpleLinearRegression class generates a fairly large number of summary values. One important summary value is a T statistic that can be used to measure how well a linear equation fits the data. If the fit is good, then the T statistic tends to have a large value. If the T statistic is small, the linear equation should be replaced by a model that assumes the mean of the Y values is the best predictor (that is, the mean of a set of values is often a useful predictor of the next observed value making it the default model).
To test whether the T statistic is large enough to reject the mean of the Y values as the best predictor, you need to compute the probability of obtaining the T statistic by chance. If the probability of obtaining a T statistic is low, then you can reject the null hypothesis that the mean is the best predictor and, correspondingly, gain confidence that a simple linear model offers a good fit for the data.
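As a rough sketch of how the T statistic for the slope might be computed, the function below assumes the textbook approach: estimate the standard error of the regression from the squared errors with n - 2 degrees of freedom, divide by the square root of the sum of squared X deviations to get the standard error of the slope, and then divide the slope by that standard error. The function name slopeTValue is hypothetical, and the class's own getters may be organized differently. The data are the burnout study scores used later in this article.

```php
<?php
// Hypothetical helper (an assumption): T value for the slope estimate.
// Expects equal-length arrays with more than two points.
function slopeTValue(array $X, array $Y)
{
    $n     = count($X);
    $xMean = array_sum($X) / $n;
    $yMean = array_sum($Y) / $n;

    $sumXX = 0.0;
    $sumXY = 0.0;
    for ($i = 0; $i < $n; $i++) {
        $sumXX += ($X[$i] - $xMean) * ($X[$i] - $xMean);
        $sumXY += ($X[$i] - $xMean) * ($Y[$i] - $yMean);
    }
    $slope = $sumXY / $sumXX;
    $yInt  = $yMean - $slope * $xMean;

    // Sum of squared differences between observed and predicted Y values.
    $sumSquaredError = 0.0;
    for ($i = 0; $i < $n; $i++) {
        $error = $Y[$i] - ($yInt + $slope * $X[$i]);
        $sumSquaredError += $error * $error;
    }

    $df          = $n - 2;                       // degrees of freedom
    $stdErr      = sqrt($sumSquaredError / $df); // standard error of the regression
    $slopeStdErr = $stdErr / sqrt($sumXX);       // standard error of the slope

    return array('t' => $slope / $slopeStdErr, 'df' => $df);
}

$Concentration = array(20,60,38,88,79,87,
                       68,12,35,70,80,92,
                       77,86,83,79,75,81,
                       75,77,77,77,17,85,96);
$ExhaustionIndex = array(100,525,300,980,310,900,
                         410,296,120,501,920,810,
                         506,493,892,527,600,855,
                         709,791,718,684,141,400,970);

$result = slopeTValue($Concentration, $ExhaustionIndex);
printf("T = %01.2f with %d degrees of freedom\n", $result['t'], $result['df']);
```

The larger the T value relative to its degrees of freedom, the less plausible it is that the slope differs from zero only by chance.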
So, how do you compute the probability of the T statistic?
Compute the T statistic probability
Because PHP lacks mathematical routines to compute the probability of a T statistic, I decided to shell out to the statistical computing package R (see www.r-project.org in Resources) to obtain the necessary values. I also wanted to raise awareness about this package because:
- R provides quite a few ideas that PHP developers might want to emulate in a PHP math library.
- With R, you can confirm that values obtained from a PHP math library agree with those obtained from a mature, freely available, open source statistical package.
The code in Listing 4 demonstrates just how easy it is to shell out to R for one value.
Listing 4. Shell out to the R statistical computing package for one value
<?php
// Copyright 2003, Paul Meagher
// Distributed under GPL
class SimpleLinearRegression {
var $RPath = "/usr/local/bin/R"; // Your path here
function getStudentProb($T, $df) {
$Probability = 0.0;
$cmd = "echo 'dt($T, $df)' | $this->RPath --slave";
$result = shell_exec($cmd);
list($LineNumber, $Probability) = explode(" ", trim($result));
return $Probability;
}
function getInverseStudentProb($alpha, $df) {
$InverseProbability = 0.0;
$cmd = "echo 'qt($alpha, $df)' | $this->RPath --slave";
$result = shell_exec($cmd);
list($LineNumber, $InverseProbability) = explode(" ", trim($result));
return $InverseProbability;
}
}
?>
Note that the path to the R executable is set and used in the two functions. The first function returns a probability value associated with a T statistic based upon the Student's t distribution, while the second, inverse function computes the T statistic corresponding to a given alpha setting. The getStudentProb method is used to assess the fit of the linear model; the getInverseStudentProb method returns an intermediate value used to compute a confidence interval for each predicted Y value. Be aware that R's dt function returns the density of the t distribution at the given value; if you want a cumulative (tail) probability instead, R's pt function is the one to call.
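Both functions rely on R printing a single value in the form "[1] 2.068658" and on the command string being safe to hand to the shell. The sketch below shows one way those two steps could be made a little more defensive. The helper names parseRScalar and buildRCommand are hypothetical and not part of the class; escapeshellarg is a standard PHP function, and the --slave flag and default R path are simply carried over from the listing above.

```php
<?php
// Hypothetical helper (an assumption): pull the number out of R output
// such as "[1] 2.068658". Returns null if the output is not in that form.
function parseRScalar($output)
{
    if (!is_string($output)) {
        return null; // shell_exec gives no string when the command fails
    }
    $parts = preg_split('/\s+/', trim($output));
    if (count($parts) < 2 || !is_numeric($parts[1])) {
        return null;
    }
    return (float) $parts[1];
}

// Hypothetical helper (an assumption): build the command line, quoting
// the R expression and the executable path for the shell.
function buildRCommand($expression, $rPath = "/usr/local/bin/R")
{
    return "echo " . escapeshellarg($expression)
         . " | " . escapeshellarg($rPath) . " --slave";
}

$cmd   = buildRCommand("qt(0.975, 23)");
$value = parseRScalar("[1] 2.068658\n");
```

A null return makes it possible to detect a missing or misconfigured R installation instead of silently carrying an empty value into later calculations.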
Space constraints keep me from going into detail about all the functions in this class, so I encourage you to consult an undergraduate statistics textbook if you want to understand the terminology and steps involved in a Simple Linear Regression analysis.
To demonstrate how the class is used, I can use data from a study on burnout in human services. Michael Leiter and Kimberly Ann Meechan studied the relationship between a measure of burnout called the Exhaustion Index and an independent variable they called Concentration. Concentration refers to the proportion of a person's social contacts that come from their work environment.
To examine the relationship between Exhaustion Index scores and Concentration scores for individuals from their sample, load these scores into appropriately named arrays and instantiate the class with these array values. After instantiating the class, display some of the summary values generated by the class to assess the degree to which a linear model fits the data.
Listing 5 shows a script that loads the data and displays summary values:
Listing 5. A script for loading data and displaying summary values
<?php
// BurnoutStudy.php
// Copyright 2003, Paul Meagher
// Distributed under GPL
include "SimpleLinearRegression.php";
// Load data from burnout study
$Concentration = array(20,60,38,88,79,87,
68,12,35,70,80,92,
77,86,83,79,75,81,
75,77,77,77,17,85,96);
$ExhaustionIndex = array(100,525,300,980,310,900,
410,296,120,501,920,810,
506,493,892,527,600,855,
709,791,718,684,141,400,970);
$slr = new SimpleLinearRegression($Concentration, $ExhaustionIndex);
$YInt = sprintf($slr->format, $slr->YInt);
$Slope = sprintf($slr->format, $slr->Slope);
$SlopeTVal = sprintf($slr->format, $slr->SlopeTVal);
$SlopeProb = sprintf("%01.6f", $slr->SlopeProb);
?>
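<table border='1' cellpadding='5'>
<tr>
<th align='right'>Equation:</th>
<td><?php echo "Exhaustion = $YInt + ($Slope * Concentration)"; ?></td>
</tr>
<tr>
<th align='right'>T:</th>
<td><?php echo $SlopeTVal; ?></td>
</tr>
<tr>
<th align='right'>Prob &gt; T:</th>
<td><?php echo $SlopeProb; ?></td>
</tr>
</table>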
Run this script through a Web browser to produce the following output:
| Equation: | Exhaustion = -29.50 + (8.87 * Concentration) |
| T: | 6.03 |
| Prob > T: | 0.000005 |
The bottom row of this table tells you that the probability of obtaining a T value this large by chance is very low. You can conclude that a simple linear model offers more predictive power than simply using the mean of the Exhaustion scores.
Knowing the concentration of workplace contacts for an individual lets you predict the degree of burnout they are likely to be experiencing. The equation tells us that for each one-unit increase in Concentration scores, a person in the social services field experiences an increase of about 8.87 units in Exhaustion scores. This provides some evidence that to reduce the potential for burnout, individuals in the social services field should consider making friends outside of their workplace.
This is a quick sketch of what these results might mean. To fully explore the implications of this dataset, you would want to study the data in more detail to make sure this is the proper interpretation. In the next article I discuss what additional analyses should be performed.
One lesson from this exercise is that you don't have to be a rocket scientist to develop a significant PHP-based math package. Adhering to standard object-oriented techniques, as well as explicitly adopting a backward-chaining problem-solving approach, makes implementing some of the more basic statistical procedures using PHP relatively straightforward.
From an educational point of view, I would argue that this exercise is useful if for no other reason than it forces you to think at both a higher and lower level of abstraction about a statistical test or routine. In other words, a good way to complement your learning of a statistical test or procedure is to implement that procedure as an algorithm.
To implement a statistical test often requires going beyond the information given and engaging in creative problem solving and discovery. It is also a good way to discover holes in your knowledge of a subject.
On a negative note, you discovered that PHP has no native access to sampling distributions, which is a requirement for implementing most statistical tests. You shelled out to R to get these values, but I am worried that you will not have the time or inclination to install R. A native PHP implementation of some common probability functions would resolve that.
Another problem is that the class generates many intermediate and summary values, but the summary output really didn't take advantage of this. I provided some teaser output, but it was neither sufficiently extensive nor well enough organized for you to adequately interpret the results of the analysis. In fact, I was entirely silent about how output methods might be integrated into this class. This needs to be addressed.
Finally, understanding your data means more than just looking at summary values. You also need to get a sense of how individual data points are distributed. One of the best ways to do this is by graphing your data. Again, I have been silent on this topic, but it needs to be addressed if the class is to be used for analysis of real data.
In the next article in this series, I'll implement some probability functions using native PHP code, extend the SimpleLinearRegression class with several output methods, and generate a report that presents intermediate and summary values in tabular and graphical formats so that conclusions can more readily be drawn from the data. Stay tuned!
| Name | Size | Download method |
|---|---|---|
| SimpleLinearRegression_class.txt | | HTTP |
- The popular undergraduate textbook, Statistics, 9th ed., by James T. McClave and Terry Sincich (Prentice-Hall, online) was consulted for the algorithm steps and for the "Burnout Study" example used in this article.
- Consult the PEAR repository, which currently contains a small number of low-level PHP math classes. Eventually, it would be nice to see PEAR contain packages that implement standard higher-level numerical methods such as SimpleLinearRegression, MultipleRegression, TimeSeries, ANOVA, FactorAnalysis, FourierAnalysis, and others.
- View or download all of the source code for the author's SimpleLinearRegression class.
- Look at the Numerical Python Project which extends Python with a full scientific array language complete with sophisticated indexing. Mathematical operations with this extension are close to what one would expect from a compiled language.
- Explore a number of math resources available for Perl, including an index of CPAN Math modules and the modules in the Algorithm section at CPAN, as well as the Perl Data Language, designed to deliver to Perl the ability to compactly store and speedily manipulate large N-dimensional data arrays.
- For more on John Chambers' S programming language, check out these links to his publications and various research projects at Bell Labs. Also read about its ACM Award in 1998 for language design.
- R is a language and environment for statistical computing and graphics, similar to the award-winning S system, that provides such statistical and graphical techniques as linear and nonlinear modeling, statistical tests, time series analysis, classification, clustering, and such. Learn about R at the R Project homepage.
- If you are new to PHP, read Amol Hatwar's developerWorks series, "Develop rock-solid code in PHP": "Part 1: Laying the Foundation" (August 2002), "Part 2: Use variables effectively" (September 2002), and "Part 3: Write reusable functions" (November 2002).
- Try Steven Gould's IBM tutorial on "Writing Efficient PHP" for a catalog of code-optimization techniques (developerWorks, July 2002).
- Read this developerWorks roundup of math library articles:
- Lou Grinzo's "Mondo math libs" for Linux (December 1999)
- Michael Juntao Yuan's "Building dynamic Web sites with mathematical content" (February 2002)
- The IBM Accurate Portable Mathlib Project
Paul Meagher is a freelance Web developer, writer, and data analyst. Paul has a graduate degree in Cognitive Science and has spent the last six years developing Web applications. His current projects and interests center around e-learning, content management, statistical computing, and database technology. Paul works out of his home in Truro, Nova Scotia and can be reached at paul@datavore.com.