Statistics is nothing if not an exercise in optimization, beginning with the sample average and moving on from there. The average of a group of numbers minimizes the sum of squared deviations. In other words, the average minimizes the sum of terms like (x_i-avg)^2. The median minimizes the sum of absolute deviations—terms like |x_i-med|.
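Both claims are easy to verify numerically. The following sketch (illustrative data, not from the article's script) recovers both estimates with R's one-dimensional optimizer, `optimize`:

```r
# The mean minimizes the sum of squared deviations;
# the median minimizes the sum of absolute deviations.
x <- c(2, 3, 5, 7, 11, 13, 17)

sse <- function(m) sum((x - m)^2)   # sum of squared deviations
sad <- function(m) sum(abs(x - m))  # sum of absolute deviations

m_sq  <- optimize(sse, interval = range(x))$minimum
m_abs <- optimize(sad, interval = range(x))$minimum

c(m_sq, mean(x))     # both close to 58/7
c(m_abs, median(x))  # both close to 7
```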
The underlying strategy for most statistical reasoning is:
- Write down a probability model that should account for the data.
- This model will contain some unknown constants, or parameters.
- Collect the data.
- Find values of the parameters that best account for the data.
That last point contains the optimization. Maximum likelihood is an optimization procedure that selects the most plausible parameter values for the data you got. Parameters can be estimated in a number of ways, but all of them involve an optimization.
Interestingly, you can optimize most of the models taught in a beginning statistics course with a closed-form expression. Mathematically, finding the mean and variance of normal data, estimating a proportion, fitting a regression line, and modeling treatment and block effects in experimental design are all optimization problems. Their solutions emerge from a system of linear equations. The statistician hands the problem over to a software package, trusting that the linear algebra algorithm will produce stable estimates of the parameters.
For just about any model outside of the linear models I mentioned, closed-form solutions do not exist. You need not go far to find practical examples. Consider a bank that wants to predict whether prospective customers will default on their mortgages. The bank's historical data show the default status (a binary outcome coded Yes-No) of numerous customers, together with personal and financial information, such as employment status, years with current employer, salary, marital status. Estimating the contributions of each of these pieces of information boils down to estimating the parameters of a logistic equation. No closed-form expression exists for the best parameters of this model, so a numerical procedure is required to estimate the parameters and optimize the likelihood.
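A hedged sketch of what such a fit looks like in R, on simulated data with invented predictor names (`salary`, `years`) and made-up coefficients, since the bank's actual data are not available:

```r
# Simulate a default outcome from an assumed logistic model.
set.seed(42)
n <- 500
salary <- rnorm(n, mean = 60, sd = 15)   # hypothetical salary, $1000s
years  <- rpois(n, lambda = 5)           # hypothetical years with employer
eta    <- -1 + 0.03 * (60 - salary) - 0.2 * years  # assumed true model
default <- rbinom(n, 1, 1 / (1 + exp(-eta)))

# glm() has no closed-form solution here; it iterates
# (iteratively reweighted least squares) to maximize the likelihood.
fit <- glm(default ~ salary + years, family = binomial)
coef(fit)
```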
The same is true of other generalized linear models, structural equation models, and many other models that are used in modern statistics. Most of the heavy lifting in these problems, and the software that delivers the results, relies on a numerical optimization algorithm.
The usual graduate program in statistics, even at a good school, teaches you a lot about the theoretical properties of these estimates. And the theory, to be sure, paints an optimistic picture. It shows that under certain conditions (conditions most people forget with time), and given a large enough sample size, the solution to the optimization problem (the maximum likelihood estimate) is the solution to the estimation problem. Solving the maximum likelihood problem gives you the estimates necessary to complete the specification of the model. Even better, the estimates behave nicely (they are normally distributed), and you can even estimate how accurate they are from the curvature of the likelihood function near its optimum. At least, if the sample size is large enough, these nice properties will hold.
You then hand the problem over to an optimization algorithm, confident that your work is done. But is that confidence well placed?
A logistic growth curve model
Consider the example in Figure 1. Twelve observations were made in successive time periods, and a logistic growth curve was fit to the data.
Figure 1. Logistic fit to weed data
The model clearly fits well, and you might expect that the optimization went smoothly. The example comes from the help file of the nlmrt package, also written by John Nash. The data come from a study on weeds that was carried out at Agriculture Canada. This simple data set manages to break most optimizers sent its way. Parameter estimates were obtained by using nlmrt, but the popular nls function fails to converge on this data set. Both nls and the functions in nlmrt are designed for the special case of minimizing a nonlinear least squares objective function. The code and the output for this example are shown in Figure 2. An R script is available from Downloads.
Figure 2. The weed data with nlmrt
Now, Figure 3 shows what happens with nls: it fails to converge.
Figure 3. The weed data with nls
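To see the shape of such a fit, here is a hedged sketch with nls() on simulated logistic growth data, not the actual weed measurements (those are in the downloadable script), so this clean example converges where the real data do not:

```r
# Three-parameter logistic growth curve fit with nls() on simulated data.
period <- 1:12
set.seed(1)
y <- 100 / (1 + 50 * exp(-0.8 * period)) + rnorm(12, sd = 0.5)

# With reasonable starting values, nls() converges on this clean data...
fit <- nls(y ~ b1 / (1 + b2 * exp(-b3 * period)),
           start = list(b1 = 90, b2 = 40, b3 = 0.7))
coef(fit)
# ...but from poor starting values, or on harder data like the weed set,
# it can stop with a "singular gradient" error instead.
```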
R has many optimizers to handle special cases, but for general-purpose optimization, many people choose optim. This function is part of base R. Not long ago, I ran into optim author John Nash at an Ottawa R user meet-up group. The subject of optimization and optim came up, and John announced with enthusiasm, "I wrote it; but I don't want you to use it." I was intrigued and wanted to learn more. John agreed to answer my questions about it, and a few days later, we sat down in a Timothy's cafe with our lattes and laptops. This is the substance of our conversation.
About John Nash
No relation to the Beautiful Mind game theorist John Nash, optimizer John Nash (see Figure 4) is a Canadian mathematician. He earned a D.Phil. in 1972 from the Mathematics Institute at Oxford under Charles Coulson. He then worked at Agriculture Canada in Ottawa until 1980. While there, he wrote the routines that later became part of the optim function. The routines were written in Basic, with the constraint that program and data take no more than 8 KB of memory. Yes, you read that correctly: 8 kilobytes. This was the 1970s. Based on this experience, John wrote Compact Numerical Methods for Computers: Linear Algebra and Function Minimization, published in 1979. A second edition (1990) of this book is still in print, which must be something of a record for a book of computer algorithms. He joined the faculty of the Telfer School of Management (University of Ottawa) in 1981 and remained there until his retirement in 2005. He continues to be active in consulting and R programming and is the maintainer of several R packages on optimization.
Figure 4. John Nash in traditional English costume
- Tell me something about the optimization functions in R.
- The first thing you need to understand is that there are two classes of optimization problems: function minimization and mathematical programming. Both seek an optimum, and both involve constraints, but mathematical programming begins with the constraints. It emphasizes solving the constraints, and then looks for the best solution. Take a look at Mike Trick's website. Every year, he optimizes the team schedule for major league baseball. The goal is to maximize attendance, but many constraints exist. The teams can't be on the road for more than seven days at a time, and so on. Function minimization starts with the objective, and then worries about satisfying the constraints. Many problems can be solved using either approach. A huge amount of commercial effort is going into math programming solvers these days. Most of the optimizers in R are function minimizers.
- I get that. In classical statistics, function minimization is usually what you need to do, either for maximum likelihood or least squares.
I confessed that most of what I knew about optimization came from the day in first-year calculus that covered the Newton-Raphson method. John was not surprised. Optimization is a black box for many statisticians who know little about it, and he set out to put me straight.
- Newton will take you a long way, and many successful algorithms still follow gradients in their search for the optima of the objective function.
John took a sheet from my notebook and made a quick sketch (see Figure 5).
Figure 5. Iteration step in Newton-Raphson
Most optimizers like to be consistent: They recast all problems as either a search for the minimum or a search for the maximum. John likes minima. If the objective function is continuous, he explained, then the minimum occurs where the derivative is 0. So, the optimization problem becomes a problem of finding a zero. With Newton-Raphson, you pick a starting point, S_0, then follow the tangent down to the x-axis. Repeat with the new starting point. If all goes well, it converges quickly. The problem—and John made another sketch (see Figure 6)—is what happens in this case.
Figure 6. Wrong direction
"It takes three lines of code to write the iteration step, and 3000 lines to avoid getting stuck or heading off in the wrong direction. The bad thing about optimization is that everyone puts together quasi-Newton methods that are supposed to be better, but they only test the new method on the method where X (the old method) failed, not on the examples were X works. So, there is an ongoing process of re-inventing the wheel."
- How did you come to write the optim routines?
- I wrote three of the routines that figure in optim in the 1970s, in Basic. In 1990, for the second edition of the book, the codes were in Pascal. In 1995, I was in Oxford and went to South Parks Road to say hello to Brian Ripley. He asked if he could use my codes in R. I said yes. Ripley packaged the codes and named the result optim. He included the CG function because he needed something to handle large numbers of parameters, say, around 50,000.
The optim function bundles six different optimization methods; the user selects one through the method argument:
- Nelder-Mead simplex (the default)
- Broyden-Fletcher-Goldfarb-Shanno (BFGS)
- Conjugate gradient (CG)
- Limited-memory BFGS with box constraints (L-BFGS-B)
- Simulated annealing (SANN)
- Brent (for one-dimensional problems)
For a complete description, see Resources.
The default method is to run a Nelder-Mead simplex algorithm. The method does not require gradients and relies only on evaluations of the function itself. The idea is fairly intuitive. To minimize a function of n parameters, construct a simplex of n+1 vertices. Evaluate the function at each vertex. For two parameters, the simplex is a triangle (see Figure 7).
Figure 7. Nelder-Mead simplex in two dimensions
The function is evaluated at each vertex. Suppose that the highest value is at H and the lowest at L, with P taking on a value in between. The idea is to "walk" the simplex around the plane until it locates the minimum.
- Replace H by one of the four candidate points: C1, C2, C3, or C4. The candidates lie along the line that joins H to the centroid (mid-point) of L and P.
- Evaluate the function at C1. If this value is smaller than the value at L, this line looks like a good direction of descent; look further at C2.
- Check the value at C2. If this value is the new minimum, replace H with C2; the simplex stretches along that direction. Otherwise, replace H with C1.
- If the value at C1 is between the values at L and P, replace H with C1.
- Suppose that the value at C1 is greater than the value at P. If it is lower than the value at H, replace H with C3 (low side). If C1 marks a new high point, replace H with C4 (high side). In both of these cases, the simplex shrinks.
In the three-parameter case, the simplex is a tetrahedron that rolls around three-dimensional (3D) space. Duncan Murdoch wrote a nice visualization with his 3D package, rgl (see Figure 8). The objective function is a mixture of three normal distributions, and the Nelder-Mead seeks a mode. An R script (the Weeds file in the Code folder) is available in Downloads.
Figure 8. A Nelder-Mead simulation for a three-parameter problem
The Nelder-Mead method is a direct search method. No gradients are required. It can be slow but is usually reliable, making it a good default, non-customized option.
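To see the default method on a small statistical problem, here is a sketch (simulated data) of maximum likelihood for normal observations. The standard deviation is parametrized on the log scale, so every point the simplex visits corresponds to a legal (positive) value:

```r
# Nelder-Mead via optim() on a two-parameter negative log-likelihood.
set.seed(7)
x <- rnorm(100, mean = 10, sd = 2)

# p[1] is the mean; p[2] is log(sd), so exp(p[2]) is always positive.
negll <- function(p) -sum(dnorm(x, mean = p[1], sd = exp(p[2]), log = TRUE))

res <- optim(c(0, 0), negll)   # method = "Nelder-Mead" is the default
c(res$par[1], mean(x))         # ML estimate of the mean is the sample mean
exp(res$par[2])                # ML estimate of the standard deviation
```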
BFGS, CG, and L-BFGS-B
These methods use gradients. The user can supply code to calculate the gradient, or gradients can be calculated from function evaluations. In John's experience, "analytic derivatives usually but not always give more accurate solutions much more efficiently." BFGS and L-BFGS-B are quasi-Newton methods. L-BFGS-B optimizes in the presence of box constraints.
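The classic Rosenbrock banana function, which also appears in optim's own help page, shows how a user-supplied analytic gradient is passed in:

```r
# BFGS with an analytic gradient on the Rosenbrock test function.
fr <- function(p) (1 - p[1])^2 + 100 * (p[2] - p[1]^2)^2
grr <- function(p) c(-2 * (1 - p[1]) - 400 * p[1] * (p[2] - p[1]^2),
                     200 * (p[2] - p[1]^2))

res <- optim(c(-1.2, 1), fr, gr = grr, method = "BFGS")
res$par  # near the true minimum at (1, 1)
```

Omitting `gr` makes optim fall back on finite-difference gradients, which cost extra function evaluations and are usually less accurate, as John notes.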
CG is a conjugate gradient method. Conjugate gradient methods work by forming a sequence of quadratic approximations to the function (by using the Hessian). Each quadratic is optimized by seeking through n orthogonal directions (where orthogonality is measured with respect to the Hessian). A linear search restarts the quadratic approximation in a better location until convergence is obtained.
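A quadratic objective is the case conjugate gradients handle best; a small sketch with an invented diagonal Hessian:

```r
# Conjugate gradient on a 100-parameter quadratic bowl.
w  <- seq(1, 10, length.out = 100)   # made-up positive curvatures
fn <- function(x) sum(w * x^2)
gr <- function(x) 2 * w * x          # exact gradient

res <- optim(rep(1, 100), fn, gr = gr, method = "CG",
             control = list(maxit = 500))
res$value  # near 0, the minimum at the origin
```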
Simulated annealing is a stochastic method but not a true optimizer. It performs a fixed number of iterations and stops, whether it found anything or not, so it cannot report anything about convergence. SANN was dropped in the more recent optimx package.
The Brent method calls the function optimize, which performs one-dimensional minimization over an interval.
Performance criteria and nonconvergence
Every optimizer uses some kind of iterative algorithm. In happy situations, the result does not depend on the starting values that you supply. A useful check is to start the optimizer from random starting points to see if you have the same result each time. Failure might be a feature of the particular problem and the data you have, but some optimizers are better than others.
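A hedged illustration of the multi-start check, using a made-up objective with two local minima of different depths:

```r
# Run the optimizer from several random starting points and
# compare the objective values at the reported "optima".
fn <- function(p) (p[1]^2 - 1)^2 + p[1] / 2 + p[2]^2

set.seed(11)
starts <- matrix(runif(20, min = -3, max = 3), ncol = 2)  # 10 random starts
values <- apply(starts, 1, function(s) optim(s, fn)$value)

round(sort(values), 3)  # two clusters of values: different starts
                        # land in different local minima
```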
R is an interpreted language, but it can make calls to compiled code in C or Fortran for added speed, so I ask John about using C or Fortran to speed up the optimization.
You want to spend your effort on speeding up the objective function. It is sometimes worth putting the objective function in Fortran. The optim() function can accept a call to compiled code instead of an R function (to evaluate the objective function). Write the optimizer in R so everyone can see the implementation. There is an R package that runs the code something like a hundred times and gives you the distribution of times taken. This information is useful if you want to see where a problem is taking time.
To make a point about speed, John types in a 10,000-parameter problem on his laptop to see how long it takes with Rcgmin. We sit and watch. A minute later, it returns with the correct answer. Not bad for an interpreted language.
Good benchmarking is a challenge in R, John points out. As an open source project, R gets many of its contributors from academia, but benchmarking is not rewarded there.
When the optimizer fails
An optimizer can fail for a number of reasons, some easier to spot than others. The iteration might reach a singular gradient and be forced to stop, or it exceeds the maximum number of iterations without converging. These conditions issue a warning to the user, but some problems are more subtle. Many statistical problems have constraints. Variances must be positive; covariance matrices must be positive definite. Often, an unconstrained optimization is done in the hope that the discovered optimum meets the constraints.
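When the constraint is a simple bound, such as a positive variance, L-BFGS-B can enforce it directly rather than leaving it to hope. A sketch with simulated data:

```r
# Box-constrained maximum likelihood: the sd parameter is kept
# strictly positive by a lower bound instead of by luck.
set.seed(7)
x <- rnorm(50, mean = 5, sd = 3)
negll <- function(p) -sum(dnorm(x, mean = p[1], sd = p[2], log = TRUE))

res <- optim(c(0, 1), negll, method = "L-BFGS-B",
             lower = c(-Inf, 1e-6))   # sd >= 1e-6
res$par  # near the sample mean and the ML standard deviation
```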
Statistical theory asserts that if the model is correct and the sample size large enough, the likelihood function achieves a maximum near the true parameters, and the log likelihood is roughly quadratic in that neighbourhood. You would think that the optimum would not be difficult to find, but you would be mistaken. The model might be false, the sample might be too small, and the function might have plateaus or local optima that confuse the optimizer.
Newton-style optimizers seek zeros of the derivatives, which might correspond to minima, maxima, or saddle points. As John told me, "the message is we have lots of good tools, but you can cut off your fingers pretty easily."
To avoid digital amputation, John developed the optimx package with Ravi Varadhan. This package improves optim in several ways and is its recommended replacement. Optimx is primarily a wrapper for a number of optimizers in R. It currently offers 12 methods, and you can add more.
Optimx provides a standard interface to
these functions and runs as many of them as you choose. Output includes
parameter estimates, convergence status, and the value of the objective
function at the minimum.
Optimx can also check the Karush-Kuhn-Tucker conditions to
determine whether the optimizer indeed converged to a minimum, as distinct
from a saddle point or maximum.
Optimx also contains improved versions of some of the original
optim functions. The new
Rcgmin replaces CG as
the conjugate gradient optimizer for problems with many parameters. SANN
was dropped because it gives no measure of convergence.
Figure 9 shows the weed data again. The result is sobering to consider. Only four of the seven methods found plausible (well-fitting) values, and none of the methods believed that it found the minimum.
Figure 9. The weed data with optimx
Reading the output
optimx gives one row of output for each method. The estimated parameters are given first, followed by the value of the objective function at the supposed minimum. The next three columns show the number of times the objective function was evaluated, the number of times the gradient function was evaluated, and the number of iterations performed (for those methods that report iterations).
optimx reports several convergence codes that indicate the type of failure; consult the manual for details. Code 0 is good. In this example, CG reached its maximum number of iterations (Code 1) and stopped.
kkt1 of True means that the final gradient was close to 0
(the optimizer found an extremum), and a
kkt2 of True means
that the Hessian is positive definite (it's a minimum). Both should be
True. In this example, Nelder-Mead stopped at an extremum but did not find
a minimum—this is evident from its stopping value, which is three
orders of magnitude larger than the "minima" other methods found.
optimx gives you much information about your data and your
problem. Certainly, if the methods fail to agree, this disagreement
suggests that the problem is a tricky one and that you should look further
to determine why. For many statistical situations, the difficulty is that
you are attempting the foolish: The model does not fit, or the sample is
too small. Under those circumstances, the likelihood function is not
guaranteed to have a strongly defined minimum at realistic parameter
values. By realistic, I mean variances greater than 0 or
probabilities between 0 and 1. But even where a problem makes sense from a
statistical standpoint, the optimization itself might be problematic.
Optimization is a key element in the analysis of many data sets, yet one that statisticians and data analysts frequently ignore. We leave this important task to the default settings of the software, often without considering whether those settings are the best choices for our particular problem. As I learned from John, optimx offers several tools to the analyst. You might need to try several algorithms and various starting points to assess difficulties in convergence that might be present. The Karush-Kuhn-Tucker tests can check whether a minimum was found or something else. These methods do not guarantee complete success, but most of the time, they can indicate the presence of problems in the optimization.
Download: Sample R code for this article (R_code_examples.zip, 4 KB)
- Compact Numerical Methods for Computers: Linear Algebra and Function Minimisation, second edition (J. C. Nash, Taylor & Francis, 1990): Explore the best resource for the methods that are used in R's optimx functions. The text offers algorithms and explanations for why they work.
- Statistical Inference (George Casella and Roger L. Berger. Wadsworth and Brooks, 2001): Check out this solid introduction to statistical inference, including the theory of maximum likelihood estimation.
- Convergence theorems for a class of simulated annealing algorithms on Rd (C.J.P. Belisle, Journal of Applied Probability 29:885–895, 1992): Read more about SANN in this article.
- The Comprehensive R Archive Network: Visit the main site for the R project and each R package. The help pages and manuals associated with Rcgmin are detailed, and numerous references are provided.
- Unifying Optimization Algorithms to Aid Software System Users: optimx for R (John Nash and Ravi Varadhan, Journal of Statistical Software 43(9):1-14, 2011): Read more on optimx in this article.