Simple linear regression with PHP: Part 1

The importance of a math library in PHP

Paul Meagher (paul@datavore.com), CEO, Datavore Productions
Paul Meagher is a freelance Web developer, writer, and data analyst. Paul has a graduate degree in Cognitive Science and has spent the last six years developing Web applications. His current projects and interests center around e-learning, content management, statistical computing, and database technology. Paul works out of his home in Truro, Nova Scotia and can be reached at paul@datavore.com.

Summary:  A missing, but powerful, tool in the PHP arena is a language-based math library. In this two-part series, Paul Meagher hopes to inspire PHP developers to develop and implement a PHP-based math library by providing you with an example of how a library of analytic models might be developed. In this first part, he demonstrates how to develop and implement the heart of a Simple Linear Regression algorithm package using PHP as the implementation language. In Part 2, the author adds features to the package for a useful data-analysis tool for small- to medium-sized datasets.

Date:  01 Mar 2003
Level:  Intermediate

Introduction

In contrast with other open source languages like Perl and Python, PHP lacks a robust community effort to develop a math library.

One reason for this state of affairs may be the abundance of existing sophisticated math tools that might dwarf any home-grown PHP effort. For example, a powerful tool that I've been researching, the S System, has an impressive array of statistical libraries designed for analyzing datasets and was awarded an ACM Award in 1998 for its language design. If S or its open source cousin R is merely a shell_exec call away, why bother implementing the same statistical computing functionality in PHP? For more on the S System, its ACM Award, or R, see Resources.

After all, isn't this a waste of developer effort? If the motivations for developing a PHP math library are bounded by concerns for conserving developer effort and using the best tool for the job, then PHP's present course makes sense.

On the other hand, educational motivations might inspire the development of a PHP math library. For about 10 percent of the population, mathematics is an interesting topic to explore. For those who are also fluent in PHP, the development of a PHP math library can reinforce the mathematics learning process -- in other words, don't just read a chapter on T tests, implement a class that computes the relevant intermediate values and displays them in a standard format.

With guidance and coaching, I hope to demonstrate that developing PHP math libraries is not a difficult task and may represent an interesting technical and learning challenge. In this article, I'll provide an example of a PHP math library called SimpleLinearRegression that demonstrates a general approach that can be used to develop PHP math libraries. Let's begin by discussing some general principles that guided my development of this SimpleLinearRegression class.


Guiding principles

I used six general principles to guide the development of the SimpleLinearRegression class.

  1. Establish one class per analytical model.
  2. Employ backward chaining to develop the class.
  3. Expect an abundance of getters.
  4. Store intermediate results.
  5. Develop a preference for a verbose API.
  6. Perfection is not the goal.

Let's look at each of these guidelines in more detail.

Establish one class per analytical model

Each major type of analytical test or procedure should have a PHP class by the same name that contains input functions, functions that compute intermediate and summary values, and output functions (functions that dump the intermediate and summary values to the screen in textual or graphic form).

Employ backward chaining to develop the class

In mathematical programming, the goal of coding is often the standard output values that the analytical procedure (such as MultipleRegression, TimeSeries, or ChiSquared) is expected to generate. From a problem-solving point of view, this means that you can employ backward chaining to develop the methods of a mathematical class.

For example, the summary output screen displays one or more summary statistics. These summary statistics depend upon computing intermediate statistics, which may in turn involve further intermediate statistics, and so on. This backward-chaining-based development methodology leads to the next principle.
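A minimal sketch of backward chaining, assuming a much smaller goal than regression: the sample standard deviation. The function names are hypothetical and are not part of the SimpleLinearRegression class. The goal function is written first, and each value it needs becomes the next function to write.

```php
<?php
// Hypothetical helpers (not from the article's class), written by working
// backward from the summary value we want: the sample standard deviation.

// Step 3 (written last): the mean is needed by the sum of squared deviations.
function getMean(array $values) {
    return array_sum($values) / count($values);
}

// Step 2: the variance needs the sum of squared deviations from the mean.
function getSumSquaredDeviations(array $values) {
    $mean = getMean($values);
    $sum  = 0.0;
    foreach ($values as $value) {
        $sum += ($value - $mean) * ($value - $mean);
    }
    return $sum;
}

// Step 1: the standard deviation needs the sample variance (n - 1 divisor).
function getVariance(array $values) {
    return getSumSquaredDeviations($values) / (count($values) - 1);
}

// Goal (written first): the summary value shown on the output screen.
function getStandardDeviation(array $values) {
    return sqrt(getVariance($values));
}

printf("%01.2f\n", getStandardDeviation(array(1, 2, 3, 4, 5))); // prints 1.58
```

Reading the file from the bottom up retraces the chain: each function exists only because the one below it asked for its value.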

Expect an abundance of getters

The majority of the class-development work for a mathematical class involves computing intermediate and summary values. Practically, this means that you should not be surprised if your class contains many getter methods that compute intermediate and summary values.

Store intermediate results

Store the results of intermediate calculations inside a result object so you can use intermediate results as input to subsequent calculations. This is a principle that is enforced in the design of the S language. In the present context, it is enforced by selecting instance variables to represent computed intermediate and summary results.
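A small sketch of this principle, assuming a hypothetical DescriptiveStats class that is not part of the article's package. It is written for current PHP versions, so it uses public properties and __construct rather than the var declarations and class-named constructor used in the listings.

```php
<?php
// Hypothetical class for illustration; not part of the article's package.
class DescriptiveStats {

    public $n;
    public $Mean;
    public $SumSquaredDeviations;
    public $Variance;

    public function __construct(array $values) {
        $this->n    = count($values);
        $this->Mean = array_sum($values) / $this->n;
        // Each later step reads stored results rather than recomputing them.
        $this->SumSquaredDeviations = $this->getSumSquaredDeviations($values);
        $this->Variance             = $this->getVariance();
    }

    // Uses the stored mean as input.
    public function getSumSquaredDeviations(array $values) {
        $sum = 0.0;
        foreach ($values as $value) {
            $sum += pow($value - $this->Mean, 2);
        }
        return $sum;
    }

    // Uses the stored sum of squared deviations as input (n - 1 divisor).
    public function getVariance() {
        return $this->SumSquaredDeviations / ($this->n - 1);
    }
}

$stats = new DescriptiveStats(array(1, 2, 3, 4, 5));
// The caller can chain further calculations off the stored values.
$standardDeviation = sqrt($stats->Variance);
```

Because every intermediate value is an object attribute, the calling script can inspect or reuse any of them without triggering a second pass over the data.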

Develop a preference for a verbose API

When developing the naming scheme for member functions and instance variables in my SimpleLinearRegression class, I found it easier to keep track of what my functions did and what my variables stood for when I used longer names to describe them (names like getSumSquaredError instead of getYY2).

I did not totally give up on abbreviating the names; however, when I did abbreviate, I tried to provide comments to fully elaborate the meaning of the name. Highly abbreviated naming schemes are too common in mathematical programming in my opinion -- they make it more difficult than necessary to understand and verify that a mathematical routine works as it should.

Perfection is not the goal

The goal of this coding exercise is not necessarily to develop a highly optimized and rigorous math engine for PHP. In these early stages, the learning and challenge aspects of implementing significant analytical tests should be emphasized.


The instance variables

When modeling a statistical test or procedure, you need to figure out which instance variables to declare.

The selection of instance variables can be determined by noting the intermediate and summary values to be generated by the analytic procedure. Each intermediate and summary value can have a corresponding instance variable to hold its value as an object attribute.

I conducted such an analysis to determine which variables to declare for the SimpleLinearRegression class in Listing 1. A similar analysis would be performed for a MultipleRegression, ANOVA, or TimeSeries procedure.


Listing 1. Instance variables for the SimpleLinearRegression class

<?php 

// Copyright 2003, Paul Meagher 
// Distributed under GPL   

class SimpleLinearRegression { 

  var $n; 
  var $X = array();
  var $Y = array();  
  var $ConfInt;  
  var $Alpha;
  var $XMean;
  var $YMean;
  var $SumXX;
  var $SumXY;
  var $SumYY;  
  var $Slope;
  var $YInt;  
  var $PredictedY   = array();
  var $Error        = array();
  var $SquaredError = array();
  var $TotalError;  
  var $SumError;
  var $SumSquaredError;  
  var $ErrorVariance;
  var $StdErr;
  var $SlopeStdErr;  
  var $SlopeTVal;  // T value of Slope 
  var $YIntStdErr;    
  var $YIntTVal;   // T value for Y Intercept
  var $R;
  var $RSquared;    
  var $DF;         // Degrees of Freedom
  var $SlopeProb;  // Probability of Slope Estimate
  var $YIntProb;   // Probability of Y Intercept Estimate
  var $AlphaTVal;  // T Value for given alpha setting
  var $ConfIntOfSlope;        
  
  var $RPath  = "/usr/local/bin/R";  // Your path here
  
  var $format = "%01.2f"; // Used for formatting output 
  
}
?>


The constructor

The constructor method for the SimpleLinearRegression class accepts an X and a Y vector with the same number of values in each vector. You can also set a confidence interval for your predicted Y values (default is a 95 percent confidence interval).

The constructor method begins by verifying that the data is in a form suitable for processing. Once the input vectors have passed the "equal size" and "size greater than 1" tests, the heart of the algorithm is executed.

Performing this task involves computing intermediate and summary values for the statistical procedure through a series of getter methods. The return value from each method call is assigned to an instance variable for the class. Storing the results of each calculation in this way ensures that the intermediate and summary values can be used by the calling routine in chained calculations. The results can also be displayed by calling the output methods for the class. The constructor itself is shown in Listing 2.


Listing 2. The constructor method for the SimpleLinearRegression class

<?php 

// Copyright 2003, Paul Meagher 
// Distributed under GPL   

// Constructor: this method is defined inside the SimpleLinearRegression class.
function SimpleLinearRegression($X, $Y, $ConfidenceInterval="95") {

  $numX = count($X);
  $numY = count($Y);

  if ($numX != $numY) {
    die("Error: Size of X and Y vectors must be the same.");

  } 
  if ($numX <= 1) { 
    die("Error: Size of input array must be at least 2.");
  }
  
  $this->n               = $numX;
  $this->X               = $X;
  $this->Y               = $Y;  
  
  $this->ConfInt         = $ConfidenceInterval; 
  $this->Alpha           = (1 + ($this->ConfInt / 100) ) / 2;

  $this->XMean           = $this->getMean($this->X);
  $this->YMean           = $this->getMean($this->Y);
  $this->SumXX           = $this->getSumXX();
  $this->SumYY           = $this->getSumYY();
  $this->SumXY           = $this->getSumXY();    
  $this->Slope           = $this->getSlope();
  $this->YInt            = $this->getYInt();
  $this->PredictedY      = $this->getPredictedY();
  $this->Error           = $this->getError();
  $this->SquaredError    = $this->getSquaredError();
  $this->SumError        = $this->getSumError();
  $this->TotalError      = $this->getTotalError();    
  $this->SumSquaredError = $this->getSumSquaredError();
  $this->ErrorVariance   = $this->getErrorVariance();
  $this->StdErr          = $this->getStdErr();  
  $this->SlopeStdErr     = $this->getSlopeStdErr();     
  $this->YIntStdErr      = $this->getYIntStdErr();         
  $this->SlopeTVal       = $this->getSlopeTVal();            
  $this->YIntTVal        = $this->getYIntTVal();                
  $this->R               = $this->getR();   
  $this->RSquared        = $this->getRSquared();
  $this->DF              = $this->getDF();          
  $this->SlopeProb       = $this->getStudentProb($this->SlopeTVal, $this->DF);
  $this->YIntProb        = $this->getStudentProb($this->YIntTVal, $this->DF);
  $this->AlphaTVal       = $this->getInverseStudentProb($this->Alpha, $this->DF);
  $this->ConfIntOfSlope  = $this->getConfIntOfSlope(); 

  return true;
}

?>

The method names and their sequence were derived by a combination of backward chaining and consulting an undergraduate statistics textbook that provided step-by-step instructions for computing intermediate values. The names of the intermediate values that I needed to compute were prefixed with "get" to derive the method name.
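The first few getters in that sequence might look something like the following sketch. The function names follow Listing 2, but the bodies are my assumption based on the standard least-squares formulas, and they are written as standalone functions with explicit parameters so that the example runs on its own.

```php
<?php
// Illustrative getters; names follow Listing 2, bodies are assumptions
// based on the standard least-squares formulas.

function getMean(array $values) {
    return array_sum($values) / count($values);
}

// Sum of squared deviations of X from its mean.
function getSumXX(array $X, $XMean) {
    $SumXX = 0.0;
    foreach ($X as $x) {
        $SumXX += ($x - $XMean) * ($x - $XMean);
    }
    return $SumXX;
}

// Sum of cross-products of the X and Y deviations from their means.
function getSumXY(array $X, array $Y, $XMean, $YMean) {
    $SumXY = 0.0;
    foreach ($X as $i => $x) {
        $SumXY += ($x - $XMean) * ($Y[$i] - $YMean);
    }
    return $SumXY;
}

// Least-squares slope: SumXY / SumXX.
function getSlope($SumXY, $SumXX) {
    return $SumXY / $SumXX;
}

// Least-squares intercept: the fitted line passes through (XMean, YMean).
function getYInt($YMean, $Slope, $XMean) {
    return $YMean - $Slope * $XMean;
}

// Toy data lying exactly on the line y = 1 + 2x.
$X = array(1, 2, 3, 4);
$Y = array(3, 5, 7, 9);

$XMean = getMean($X);
$YMean = getMean($Y);
$Slope = getSlope(getSumXY($X, $Y, $XMean, $YMean), getSumXX($X, $XMean));
$YInt  = getYInt($YMean, $Slope, $XMean);

printf("Slope: %01.2f, Y Intercept: %01.2f\n", $Slope, $YInt);
```

Note that getSlope divides by SumXX, which is zero when every X value is identical. A production version would need to check for that case.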


Fit the model to the data

The SimpleLinearRegression procedure is used to fit a straight line to the data in which the straight line has the following standard form:

y = b + mx

The PHP form of this equation would look something like Listing 3:


Listing 3. PHP equation that fits the model to the data

$PredictedY[$i] = $YIntercept + $Slope * $X[$i];

The SimpleLinearRegression class uses a least-squares criterion for deriving estimates of what the Y Intercept and Slope parameters should be. These estimated parameters are used to construct a linear equation (see Listing 3) to model the relationship between the X and Y values.

Using the derived linear equation, you can then obtain predicted Y values for each X value. If the linear equation is a good fit to the data, then the observed and predicted Y values tend to agree.
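To make the comparison of observed and predicted values concrete, here is a sketch that applies the linear equation to a toy dataset. The function names echo the getters called in Listing 2, but the bodies and the explicit parameters are assumptions for illustration, not the class's actual source.

```php
<?php
// Illustrative sketch; the bodies are assumptions, not the class source.

// Apply the fitted line to every X value.
function getPredictedY(array $X, $YInt, $Slope) {
    $PredictedY = array();
    foreach ($X as $i => $x) {
        $PredictedY[$i] = $YInt + $Slope * $x;
    }
    return $PredictedY;
}

// Error (residual) = observed Y minus predicted Y.
function getError(array $Y, array $PredictedY) {
    $Error = array();
    foreach ($Y as $i => $y) {
        $Error[$i] = $y - $PredictedY[$i];
    }
    return $Error;
}

// The least-squares criterion minimizes this sum.
function getSumSquaredError(array $Error) {
    $SumSquaredError = 0.0;
    foreach ($Error as $e) {
        $SumSquaredError += $e * $e;
    }
    return $SumSquaredError;
}

$X = array(1, 2, 3);
$Y = array(2, 4, 7);

// Least-squares estimates for this toy data, worked out by hand.
$Slope = 2.5;
$YInt  = -2 / 3;

$PredictedY      = getPredictedY($X, $YInt, $Slope);
$Error           = getError($Y, $PredictedY);
$SumSquaredError = getSumSquaredError($Error);
```

For a least-squares line with an intercept, the errors sum to zero, which is why the squared errors, rather than the raw errors, are used to judge the fit.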

How to determine a good fit

The SimpleLinearRegression class generates a fairly large number of summary values. One important summary value is a T statistic that can be used to measure how well a linear equation fits the data. If the fit is good, then the T statistic tends to have a large value. If the T statistic is small, the linear equation should be replaced by a model that assumes the mean of the Y values is the best predictor. (The mean of a set of values is often a useful predictor of the next observed value, which makes it the default model.)

To test whether the T statistic is large enough to reject the mean of the Y values as the best predictor, you need to compute the probability of obtaining the T statistic by chance. If the probability of obtaining a T statistic is low, then you can reject the null hypothesis that the mean is the best predictor and, correspondingly, gain confidence that a simple linear model offers a good fit for the data.
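The T statistic for the slope is the slope estimate divided by its standard error. The sketch below computes it in one hypothetical function, getSlopeTValue, using the textbook formulas. This is my assumption about what the chain of getters in Listing 2 amounts to, not a copy of the class source.

```php
<?php
// Hypothetical function; a sketch of t = Slope / SlopeStdErr using
// textbook simple-regression formulas.
function getSlopeTValue(array $X, array $Y) {
    $n     = count($X);
    $XMean = array_sum($X) / $n;
    $YMean = array_sum($Y) / $n;

    $SumXX = 0.0;
    $SumXY = 0.0;
    $SumYY = 0.0;
    for ($i = 0; $i < $n; $i++) {
        $dx = $X[$i] - $XMean;
        $dy = $Y[$i] - $YMean;
        $SumXX += $dx * $dx;
        $SumXY += $dx * $dy;
        $SumYY += $dy * $dy;
    }

    $Slope           = $SumXY / $SumXX;
    // Variation in Y left unexplained by the fitted line.
    $SumSquaredError = $SumYY - $Slope * $SumXY;
    // Two parameters are estimated, leaving n - 2 degrees of freedom.
    $ErrorVariance   = $SumSquaredError / ($n - 2);
    $SlopeStdErr     = sqrt($ErrorVariance / $SumXX);

    return $Slope / $SlopeStdErr;
}

$X = array(1, 2, 3, 4, 5);
$Y = array(2, 4, 5, 4, 5);
printf("T: %01.2f\n", getSlopeTValue($X, $Y));
```

The function assumes at least three data points and an imperfect fit. With a perfect fit the standard error is zero and the division fails, so a real implementation would need to guard against that.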

So, how do you compute the probability of the T statistic?

Compute the T statistic probability

Because PHP lacks mathematical routines to compute the probability of a T statistic, I decided to shell out to the statistical computing package R (see www.r-project.org in Resources) to obtain the necessary values. I also wanted to raise awareness about this package because:

  1. R provides quite a few ideas PHP developers might want to emulate in a PHP math library.
  2. With R, you can confirm that values obtained from a PHP math library agree with those obtained from a mature, freely available, open source statistical package.

The code in Listing 4 demonstrates just how easy it is to shell out to R for one value.


Listing 4. Shell out to the R statistical computing package for one value

<?php 

// Copyright 2003, Paul Meagher 
// Distributed under GPL   

class SimpleLinearRegression { 
   
  var $RPath  = "/usr/local/bin/R";  // Your path here

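  // Pipes a one-line expression to R and parses output of the form "[1] value".
  // Note: R's dt() is the density function of Student's t distribution;
  // pt() is the corresponding cumulative distribution function.
  function getStudentProb($T, $df) {    
    $Probability = 0.0;   
    $cmd = "echo 'dt($T, $df)' | $this->RPath --slave"; 
    $result = shell_exec($cmd);    
    list($LineNumber, $Probability) = explode(" ", trim($result)); 
    return $Probability;
  }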

  function getInverseStudentProb($alpha, $df) {  
    $InverseProbability = 0.0; 
    $cmd = "echo 'qt($alpha, $df)' | $this->RPath --slave"; 
    $result = shell_exec($cmd);  
    list($LineNumber, $InverseProbability) = explode(" ", trim($result)); 
    return $InverseProbability;
  }

}

?>

Note that the path to the R executable is set as an instance variable and used in both functions. The first function returns a probability value associated with a T statistic based upon Student's t distribution, while the second, inverse function computes the T statistic corresponding to a given alpha setting. The getStudentProb method is used to assess the fit of the linear model; the getInverseStudentProb method returns an intermediate value used to compute a confidence interval for each predicted Y value.
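If the values passed to R could ever come from user input, it may be worth quoting the shell arguments and validating what comes back. The following is a sketch under that assumption. The helper names buildRCommand and parseRValue are hypothetical, and the quoting shown assumes a Unix-like shell.

```php
<?php
// Hypothetical helpers, not part of the article's class.

// Build the shell command for a one-line R expression. escapeshellarg()
// quotes the expression and the path so the shell does not interpret them.
function buildRCommand($expression, $RPath) {
    return "echo " . escapeshellarg($expression)
         . " | " . escapeshellarg($RPath) . " --slave";
}

// R prints a single value as "[1] 0.975". Return the value as a float,
// or null when the output is missing or not in that form.
function parseRValue($output) {
    $parts = preg_split('/\s+/', trim((string) $output));
    if (count($parts) != 2 || $parts[0] !== "[1]" || !is_numeric($parts[1])) {
        return null;
    }
    return (float) $parts[1];
}

// Formatting the arguments as a float and an integer keeps arbitrary
// strings out of the R expression.
$expression = sprintf("qt(%F, %d)", 0.975, 23);
$cmd        = buildRCommand($expression, "/usr/local/bin/R");

// With R installed at that path, the call would then be:
// $AlphaTVal = parseRValue(shell_exec($cmd));
```

Returning null instead of a partial result makes a missing R installation visible to the caller, whereas the list() assignment in Listing 4 would quietly leave the value unset.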

Space constraints keep me from going into detail about all the functions in this class, so I encourage you to consult an undergraduate statistics textbook if you want to understand the terminology and steps involved in a Simple Linear Regression analysis.


The burnout study

To demonstrate how the class is used, I can use data from a study on burnout in human services. Michael Leiter and Kimberly Ann Meechan studied the relationship between a measure of burnout called the Exhaustion Index and an independent variable they called Concentration. Concentration refers to the proportion of a person's social contacts that come from their work environment.

To examine the relationship between Exhaustion Index scores and Concentration scores for individuals from their sample, load these scores into appropriately named arrays and instantiate the class with these array values. After instantiating the class, display some of the summary values generated by the class to assess the degree to which a linear model fits the data.

Listing 5 shows a script that loads the data and displays summary values:


Listing 5. A script for loading data and displaying summary values

<?php 

// BurnoutStudy.php

// Copyright 2003, Paul Meagher 
// Distributed under GPL   

include "SimpleLinearRegression.php"; 

// Load data from burnout study 

$Concentration   = array(20,60,38,88,79,87, 
                         68,12,35,70,80,92, 
                         77,86,83,79,75,81, 
                         75,77,77,77,17,85,96);  
                          
$ExhaustionIndex = array(100,525,300,980,310,900, 
                         410,296,120,501,920,810, 
                         506,493,892,527,600,855, 
                         709,791,718,684,141,400,970);  
                          
$slr = new SimpleLinearRegression($Concentration, $ExhaustionIndex);  

$YInt      = sprintf($slr->format, $slr->YInt);
$Slope     = sprintf($slr->format, $slr->Slope);    
$SlopeTVal = sprintf($slr->format, $slr->SlopeTVal);    
$SlopeProb = sprintf("%01.6f", $slr->SlopeProb);    

?>

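<table border='1' cellpadding='5'>
  <tr>
    <th align='right'>Equation:</th>
    <td><?php echo "Exhaustion = $YInt + ($Slope * Concentration)"; ?></td>
  </tr>
  <tr>
    <th align='right'>T:</th>
    <td><?php echo $SlopeTVal; ?></td>
  </tr>
  <tr>
    <th align='right'>Prob > T:</th>
    <td><?php echo $SlopeProb; ?></td>
  </tr>
</table>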

Run this script through a Web browser to produce the following output:

Equation:  Exhaustion = -29.50 + (8.87 * Concentration)
T:         6.03
Prob > T:  0.000005

The bottom row of this table tells us that the probability of obtaining a T value this large by chance is very low. You can conclude that a simple linear model offers more predictive power than simply using the mean of the Exhaustion scores.

Knowing the concentration of workplace contacts for an individual lets you predict the degree of burnout they are likely to be experiencing. The equation tells us that for each one-unit increase in Concentration scores, a person in the social services field experiences an increase of roughly 8.87 units in Exhaustion scores. This provides some evidence that to reduce the potential for burnout, individuals in the social services field should consider making friends outside of their workplace.

This is a quick sketch of what these results might mean. To fully explore the implications of this dataset, you would want to study the data in more detail to make sure this is the proper interpretation. In the next article I discuss what additional analyses should be performed.
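As a cross-check that does not require R or the class file, the slope and Y intercept of the reported equation can be recomputed from the Listing 5 data with a few lines of plain PHP. This is a standalone sketch using the standard least-squares formulas, not the class's own code path.

```php
<?php
// Standalone cross-check of the reported equation (not the class itself).

$Concentration   = array(20,60,38,88,79,87,
                         68,12,35,70,80,92,
                         77,86,83,79,75,81,
                         75,77,77,77,17,85,96);

$ExhaustionIndex = array(100,525,300,980,310,900,
                         410,296,120,501,920,810,
                         506,493,892,527,600,855,
                         709,791,718,684,141,400,970);

$n     = count($Concentration);
$XMean = array_sum($Concentration) / $n;
$YMean = array_sum($ExhaustionIndex) / $n;

// Accumulate the sums of squares and cross-products about the means.
$SumXX = 0.0;
$SumXY = 0.0;
for ($i = 0; $i < $n; $i++) {
    $dx = $Concentration[$i] - $XMean;
    $SumXX += $dx * $dx;
    $SumXY += $dx * ($ExhaustionIndex[$i] - $YMean);
}

$Slope = $SumXY / $SumXX;
$YInt  = $YMean - $Slope * $XMean;

$equation = sprintf("Exhaustion = %01.2f + (%01.2f * Concentration)",
                    $YInt, $Slope);
echo $equation, "\n";
// prints: Exhaustion = -29.50 + (8.87 * Concentration)
```

Agreement between an independent calculation like this one and the class output is the same kind of confirmation that comparing against R provides.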


What have you learned?

For one, you don't have to be a rocket scientist to develop a significant PHP-based math package. Adhering to standard object-oriented techniques, as well as explicitly adopting a backward-chaining problem-solving approach, makes implementing some of the more basic statistical procedures using PHP relatively straightforward.

From an educational point of view, I would argue that this exercise is useful if for no other reason than it forces you to think at both a higher and lower level of abstraction about a statistical test or routine. In other words, a good way to complement your learning of a statistical test or procedure is to implement that procedure as an algorithm.

To implement a statistical test often requires going beyond the information given and engaging in creative problem solving and discovery. It is also a good way to discover holes in your knowledge of a subject.

On a negative note, you discovered that PHP has no native access to sampling distributions, which is a requirement for implementing most statistical tests. You shelled out to R to get these values, but I am worried that you will not have the time or inclination to install R. A native PHP implementation of some common probability functions would resolve that.

Another problem is that the class generates many intermediate and summary values, but the summary output really didn't take advantage of this. I provided some teaser output, but it was neither extensive enough nor organized well enough for you to adequately interpret the results of the analysis. In fact, I was entirely silent about how output methods might be integrated into this class. This needs to be addressed.

Finally, understanding your data means more than just looking at summary values. You also need to get a sense of how individual data points are distributed. One of the best ways to do this is by graphing your data. Again, I have been silent on this topic, but it needs to be addressed if the class is to be used for analysis of real data.

In the next article in this series, I'll implement some probability functions using native PHP code, extend the SimpleLinearRegression class with several output methods, and generate a report that presents intermediate and summary values in tabular and graphical formats so that conclusions can more readily be drawn from the data. Stay tuned!



Download

Name: SimpleLinearRegression_class.txt
Download method: HTTP


Resources

  • The popular undergraduate textbook, Statistics, 9th ed., by James T. McClave and Terry Sincich (Prentice-Hall, online) was consulted for the algorithm steps and for the "Burnout Study" example used in this article.

  • Consult the PEAR repository, which currently contains a small number of low-level PHP math classes. Eventually, it would be nice to see PEAR contain packages that implement standard higher-level numerical methods such as SimpleLinearRegression, MultipleRegression, TimeSeries, ANOVA, FactorAnalysis, FourierAnalysis, and others.

  • View or download all of the source code for the author's SimpleLinearRegression class.

  • Look at the Numerical Python Project which extends Python with a full scientific array language complete with sophisticated indexing. Mathematical operations with this extension are close to what one would expect from a compiled language.

  • Explore a number of math resources available for Perl, including an index of CPAN Math modules and the modules in the Algorithm section at CPAN, as well as the Perl Data Language, designed to deliver to Perl the ability to compactly store and speedily manipulate large N-dimensional data arrays.

  • For more on John Chambers' S programming language, check out these links to his publications and various research projects at Bell Labs. Also read about its ACM Award in 1998 for language design.

  • R is a language and environment for statistical computing and graphics, similar to the award-winning S system, that provides such statistical and graphical techniques as linear and nonlinear modeling, statistical tests, time series analysis, classification, clustering, and such. Learn about R at the R Project homepage.

  • If you are new to PHP, read Amol Hatwar's developerWorks series, "Develop rock-solid code in PHP": "Part 1: Laying the Foundation" (August 2002), "Part 2: Use variables effectively" (September 2002), and "Part 3: Write reusable functions" (November 2002).

  • Try Steven Gould's IBM tutorial on "Writing Efficient PHP" for a catalog of code-optimization techniques (developerWorks, July 2002).

  • Read this developerWorks roundup of math library articles:

