Statistics is nothing if not an exercise in optimization, beginning with the sample average and moving on from there. The average of a group of numbers minimizes the sum of squared deviations. In other words, the average minimizes the sum of terms like (x_iavg)^2. The median minimizes the sum of absolute deviations—terms like x_imed.
The underlying strategy for most statistical reasoning is:
 Write down a probability model that should account for the data.
 This model will contain some unknown constants, or parameters.
 Collect the data.
 Find values of the parameters that best account for the data.
That last point contains the optimization. Maximum likelihood is an optimization procedure that selects the most plausible parameter values for the data you got. Parameters can be estimated in a number of ways, but all of them involve an optimization.
Interestingly, you can optimize most of the models taught in a beginning statistics course with a closedform expression. Mathematically, finding the mean and variance of normal data, estimating a proportion, fitting a regression line, and modeling treatment and block effects in experimental design are all optimization problems. Their solutions emerge as the solution of a system of linear equations. The statistician hands over the problem to a software package, trusting that the linear algebra algorithm will produce stable estimates of the parameters.
For just about any model outside of the linear models I mentioned, closedform solutions do not exist. You need not go far to find practical examples. Consider a bank that wants to predict whether prospective customers will default on their mortgages. The bank's historical data show the default status (a binary outcome coded YesNo) of numerous customers, together with personal and financial information, such as employment status, years with current employer, salary, marital status. Estimating the contributions of each of these pieces of information boils down to estimating the parameters of a logistic equation. No closedform expression exists for the best parameters of this model, so a numerical procedure is required to estimate the parameters and optimize the likelihood.
The same is true of other generalized linear models, structural equation models, and many other models that are used in modern statistics. Most of the heavy lifting in these problems, and the software that delivers the results, relies on a numerical optimization algorithm.
The usual graduate program in statistics, even at a good school, teaches you a lot about the theoretical properties of these estimates. And the theory, to be sure, paints an optimistic picture. Theory shows that under certain conditions, most people forget with time, and given a large enough sample size, the solution to the optimization problem (the maximum likelihood estimate) is the solution to the estimation problem. Solving the maximum likelihood problem gives you the estimates necessary to complete specification of the model. Even better, the estimates behave nicely (are normally distributed). You can even estimate how accurate they are from the curvature of the maximum likelihood function near its optimum value. At least, if the sample size is large enough, these nice properties will hold.
You then hand the problem over to an optimization algorithm, confident that your work is done. But is that confidence well placed?
A logistic growth curve model
Consider the example in Figure 1. Twelve observations were made in successive time periods, and a logistic growth curve was fit to the data.
Figure 1. Logistic fit to weed data
The model clearly fits well, and you might expect that the optimization
went smoothly. The example comes from the help file of the
nlmrt
function, also written by John Nash. The data is from a
study on weeds that was carried out at Agriculture Canada. This simple
data set manages to break most optimizers sent its way. Parameter
estimates were obtained by using nlmrt
, but the popular and
traditional nls
function fails to converge on this data set.
Function nls
and the functions in nlmrt
are
designed for the special case of minimizing a nonlinear least squares
objective function. The code and the output for this example are shown in
Figure 2. An R script is available from Downloads.
Figure 2. The weed data with nlmrt
Now, Figure 3 shows what happens by using
nls
. It fails to converge.
Figure 3. The weed data with nls
Optim
R has many optimizers to handle special cases, but for generalpurpose
optimization, many people choose optim
. This function is part
of BaseR. Not long ago, I ran into optim
author John Nash at
an Ottawa R user meetup group. The subject of optimization and
optim
came up, and John announced with enthusiasm, "I wrote
it; but I don't want you to use it." I was intrigued and wanted to learn
more. John agreed to answer my questions about it, and a few days later,
we sat down in a Timothy's cafe with our lattes and laptops. This is the
substance of our conversation.
About John Nash
No relation to beautiful mind game theorist John Nash, optimizer
Nash (see Figure 4) is a Canadian mathematician. He
earned a D.Phil. from Oxford in 1972 from the Mathematics Institute under
Charles Coulson. He then worked at Agriculture Canada in Ottawa until
1980. While there, he wrote the routines that later became part of the
optim
function. The routines were written in Basic, with the
constraint that program and data take no more than 8 KB of memory. Yes,
you read that correctly: 8 kilobytes. This was the 1970s. Based on this
experience, John wrote Compact Numerical Methods for Computers: Linear
Algebra and Function Minimization, published in 1979. A second
edition (1990) of this book is still in print, which much be something of
a record for a book of computer algorithms. He joined the faculty of the
Telfer School of Management in 1981 (University of Ottawa) and remained
there until his retirement in 2005. He continues to be active in
consulting and R programming and is the maintainer of several R packages
on optimization, including optim
and optimx
.
Figure 4. John Nash in traditional English costume
 Catherine
 Tell me something about the optimization functions in R.
 John
 The first thing you need to understand is that there are two classes of optimization problems: function minimization and mathematical programming. Both seek an optimum, and both involve constraints, but mathematical programming begins with the constraints. It emphasizes solving the constraints, and then looks for the best solution. Take a look at Mike Trick's website. Every year, he optimizes the team schedule for major league baseball. The goal is to maximize attendance, but many constraints exist. The teams can't be on the road for more than seven days at a time, and so on. Function minimization starts with the objective, and then worries about satisfying the constraints. Many problems can be solved using either approach. A huge amount of commercial effort is going into math programming solvers these days. Most of the optimizers in R are function minimizers.
 Catherine
 I get that. In classical statistics, function minimization is usually what you need to do, either for maximum likelihood or least squares.
I confessed that most of what I knew about optimization came from the day in firstyear calculus that covered the NewtonRaphson method. John was not surprised. Optimization is a black box for many statisticians who know little about it, and he set out to put me straight.
 John
 Newton will take you a long way, and many successful algorithms still follow gradients in their search for the optima of the objective function.
John took a sheet from my notebook and made a quick sketch (see Figure 5).
Figure 5. Iteration step in NewtonRaphson
Most optimizers like to be consistent: They recast all problems as either a search for the minimum or a search for the maximum. John likes minima. If the objective function is continuous, he explained, then the minimum occurs where the derivative is 0. So, the optimization problem becomes a problem of finding a zero. With NewtonRaphson, you pick a starting point, S_0, then follow the tangent down to the xaxis. Repeat with the new starting point. If all goes well, it converges quickly. The problem—and John made another sketch (see Figure 6)—is what happens in this case.
Figure 6. Wrong direction
"It takes three lines of code to write the iteration step, and 3000 lines to avoid getting stuck or heading off in the wrong direction. The bad thing about optimization is that everyone puts together quasiNewton methods that are supposed to be better, but they only test the new method on the method where X (the old method) failed, not on the examples were X works. So, there is an ongoing process of reinventing the wheel."
 Catherine
 How did you come to write the
optim
function?  John
 I wrote three of the routines that figure in
optim
in the 1970s in Basic. In 1990, for the second edition of the book, the codes were in Pascal. In 1995, I was in Oxford and went to South Parks Road to say hello to Brian Ripley. He asked if he could use my codes in R. I said yes. Ripley packaged the codes and named itoptim
. He included theCG
function, because he needed something to handle large numbers of parameters —say, around 50,000.
The optim function bundles six different optimization methods
The user can select one of six different optimization methods:
 NelderMead
 BroydenFletcherGoldfarbShanno (BFGS)
 Conjugate gradient (CG)
 Limitedmemory BFGS (LBFGSB)
 Simulated annealing (SANN)
 Brent
For a complete description, see Resources.
NelderMead
The default method is to run a NelderMead simplex algorithm. The method does not require gradients and relies only on evaluations of the function itself. The idea is fairly intuitive. To minimize a function of n parameters, construct a simplex of n+1 vertices. Evaluate the function at each vertex. For two parameters, the simplex is a triangle (see Figure 7).
Figure 7. NelderMead simplex in two dimensions
The function is evaluated at each vertex. Suppose that the highest values are at H, the lowest at L, with P taking on a value in between. The idea is to "walk" the simplex around the plane until it locates the minimum.
 Replace H by one of the four candidate points: C_{1}, C_{2}, C_{3}, or C_{4}.
 The candidates lie along the line that joins H to the centroid (midpoint) of L and P.
 Evaluate the function at C_{1}.
If this value is smaller than L, then this line looks like a good line of descent. Look further at C_{2}.
 Check the value at C_{2}.
If this value is the new minimum, replace H with C_{2}. The simplex will stretch along that direction.
 Otherwise, replace H with C_{1}.
 If the value at C_{1} is between the values at L and P, replace H with C_{1}.
 Suppose that C_{1} is greater than P. If it
is lower than H, replace H with
C_{3} (low side). If C_{1} marks
a new high point, replace H with C_{4} (high
side).
In both of these cases, the simplex is shrunk down.
In the threeparameter case, the simplex is a tetrahedron that rolls around
threedimensional (3D) space. Duncan Murdoch wrote a nice visualization
with his 3D package, rgl
(see Figure 8).
The objective function is a mixture of three normal distributions, and the
NelderMead seeks a mode. An R script (the Weeds file in the Code folder)
for the code is available in Downloads.
Figure 8. A NelderMead simulation for a threeparameter problem
The NelderMead method is a direct search method. No gradients are required. It can be slow but is usually reliable, making it a good default, noncustomized option.
BFGS, CG, and LBFGSB
These methods use gradients. The user can supply code to calculate the gradient, or gradients can be calculated from function evaluations. In John's experience, "analytic derivatives usually but not always give more accurate solutions much more efficiently." BFGS and LBFGSB are quasiNewton methods. LBFGSB optimizes in the presence of box constraints.
CG is a conjugate gradient method. Conjugate gradient methods work by forming a sequence of quadratic approximations to the function (by using the Hessian). Each quadratic is optimized by seeking through n orthogonal directions (where orthogonality is measured with respect to the Hessian). A linear search restarts the quadratic approximation in a better location until convergence is obtained.
SANN
Simulated annealing is a stochastic method but not a true optimizer. It
performs a certain number of iterations and stops, whether it found
anything or not. It cannot report anything about convergence. SANN was
dropped in the more recent optimx
package.
Brent
The Brent method calls function optimize
, which does one
parameter optimization.
Performance criteria and nonconvergence
Every optimizer uses some kind of iterative algorithm. In happy situations, the result does not depend on the starting values that you supply. A useful check is to start the optimizer from random starting points to see if you have the same result each time. Failure might be a feature of the particular problem and the data you have, but some optimizers are better than others.
Speed
R is an interpreted language, but it can make calls to compiled code in
C
or Fortran for added speed. I ask John about calls to
C
or Fortran for speeding up the optimization.
You want to spend your effort on speeding up the objective function. It is
sometimes worth putting the objective function in Fortran. The
optim()
function can accept a .Call
expression
to compiled code instead of an R function (to evaluate the objective
function). Write the optimizer in R so everyone can see the
implementation. The R package that is called microbenchmark
runs something like a hundred times and gives you the distribution of
times taken. This information is useful if you want to see where a problem
is taking time.
To make a point about speed, John types in a 10,000parameter problem on
his laptop to see how long it takes with Rcgmin
. We sit and
watch. A minute later, it returns with the correct answer. Not bad for a
laptop.
Good benchmarking is a challenge in R, John points out. As an open source project, R gets many of its contributors from academia. But benchmarking is not rewarded.
When the optimizer fails
An optimizer can fail for a number of reasons, some easier to spot than others. The iteration might reach a singular gradient and be forced to stop, or it exceeds the maximum number of iterations without converging. These conditions issue a warning to the user, but some problems are more subtle. Many statistical problems have constraints. Variances must be positive; covariance matrices must be positive definite. Often, an unconstrained optimization is done in the hope that the discovered optimum meets the constraints.
Statistical theory asserts that if the model is correct and the sample size large enough, the likelihood function achieves a maximum near the true parameters, and the log likelihood is roughly quadratic in that neighbourhood. You would think that the optimum would not be difficult to find, but you would be mistaken. The model might be false, the sample might be too small, and the function might have plateaus or local optima that confuse the optimizer.
Newtonstyle optimizers seek zeros of the derivatives, which might correspond to minima, maxima, or saddle points. As John told me, "the message is we have lots of good tools, but you can cut off your fingers pretty easily."
To avoid digital amputation, John developed the optimx
package
with Ravi Varadhan. This package improves optim
in several
ways and is its recommended replacement. Optimx
is primarily
a wrapper for a number of optimizers in R. It currently offers 12 methods,
and you can add more. Optimx
provides a standard interface to
these functions and runs as many of them as you choose. Output includes
parameter estimates, convergence status, and the value of the objective
function at the minimum.
Optimx
can also check the KarushKuhnTucker conditions to
determine whether the optimizer indeed converged to a minimum, as distinct
from a saddle point or maximum.
Optimx
also contains improved versions of some of the original
optim
functions. The new Rcgmin
replaces CG as
the conjugate gradient optimizer for problems with many parameters. SANN
was dropped because it gives no measure of convergence.
Figure 9 shows the weed data again. The result is sobering to consider. Only four of the seven methods found plausible (wellfitting) values, and none of the methods believed that it found the minimum.
Figure 9. The weed data with optimx
Reading the output
optimx
gives one row of output for each method. The estimated
parameters are given first, followed by the value
of the
function at the supposed minimum. The next three columns show the number
of times the objective function was evaluated, the number of times the
gradient function was evaluated, and the number of iterations performed
(for those methods that report iterations.)
optimx
offers several convergence codes that indicate the type
of failure, so consult the manual for details. Code 0 is good. In this
example, CG reached its maximum number of iterations (Code 1) and stopped.
A kkt1
of True means that the final gradient was close to 0
(the optimizer found an extremum), and a kkt2
of True means
that the Hessian is positive definite (it's a minimum). Both should be
True. In this example, NelderMead stopped at an extremum but did not find
a minimum—this is evident from its stopping value, which is three
orders of magnitude larger than the "minima" other methods found.
optimx
gives you much information about your data and your
problem. Certainly, if the methods fail to agree, this disagreement
suggests that the problem is a tricky one and that you should look further
to determine why. For many statistical situations, the difficulty is that
you are attempting the foolish: The model does not fit, or the sample is
too small. Under those circumstances, the likelihood function is not
guaranteed to have a strongly defined minimum at realistic parameter
values. By realistic, I mean variances greater than 0 or
probabilities between 0 and 1. But even where a problem makes sense from a
statistical standpoint, the optimization itself might be problematic.
Conclusions
Optimization is a key element in the analysis of many data sets, yet one
that statisticians and data analysts frequently ignore. We leave this
important task to the default settings of the software, often without
considering whether these settings are the best choices for our particular
problem. As I learned from John, optimx
offers several tools
to the analyst. You might need to try several algorithms and various
starting points to assess difficulties in convergence that might be
present. The KarushKuhnTucker tests can check whether a minimum was
found or something else. These methods do not guarantee complete success,
but most of the time, they can indicate the presence of problems in the
data.
Download
Description  Name  Size 

Sample R code for this article  R_code_examples.zip  4KB 
Resources
Learn
 Compact Numerical Methods for Computers: Linear Algebra and Function
Minimisation, second edition (J. C. Nash, Taylor & Francis,
1990): Explore the best resource for the methods that are used in R's
optim
andoptimx
functions. The text offers algorithms and explanations for why they work.  Statistical Inference (George Casella and Roger L. Berger. Wadsworth and Brooks, 2001): Check out this solid introduction to statistical inference, including the theory of maximum likelihood estimation.
 Convergence theorems for a class of simulated annealing algorithms on Rd (C.J.P. Belisle, Journal of Applied Probability 29:885–895, 1992): Read more about SANN in this article.
 The Comprehensive R Archive Network: Visit the main site for the R
project and each R package. The help pages and manuals that are associated
with
optimx
,nlmrt
, andRcgmin
are detailed. Numerous references are provided.  Unifying Optimization
Algorithms to Aid Software System Users: optimx for R (John Nash
and Ravi Varadhan, Journal of Statistical Software 43(9):114,
2011): Read more on
optimx
in this article.  Find the resources that you need to improve outcomes and control risk in the developerWorks Business analytics zone.
 Follow developerWorks on Twitter.
 Watch developerWorks ondemand demos ranging from product installation and setup demos for beginners to advanced functionality for experienced developers.
Discuss
 Join the developerWorks community, a professional network and unified set of community tools for connecting, sharing, and collaborating.
Comments
Dig deeper into Big data and analytics on developerWorks

Bluemix Developers Community
Get samples, articles, product docs, and community resources to help build, deploy, and manage your cloud apps.

Big data and analytics
Crazy about Big data and analytics? Sign up for our monthly newsletter and the latest Big data and analytics news.

DevOps Services
Software development in the cloud. Register today to create a project.

IBM evaluation software
Evaluate IBM software and solutions, and transform challenges into opportunities.