# Optimization in R

A conversation with John Nash about optim and optimx

Statistics is nothing if not an exercise in optimization, beginning with the sample average and moving on from there. The average of a group of numbers minimizes the sum of squared deviations. In other words, the average minimizes the sum of terms like (x_i-avg)^2. The median minimizes the sum of absolute deviations—terms like |x_i-med|.

The underlying strategy for most statistical reasoning is:

1. Write down a probability model that should account for the data.
2. This model will contain some unknown constants, or parameters.
3. Collect the data.
4. Find values of the parameters that best account for the data.

That last point contains the optimization. Maximum likelihood is an optimization procedure that selects the most plausible parameter values for the data you got. Parameters can be estimated in a number of ways, but all of them involve an optimization.

Interestingly, you can optimize most of the models taught in a beginning statistics course with a closed-form expression. Mathematically, finding the mean and variance of normal data, estimating a proportion, fitting a regression line, and modeling treatment and block effects in experimental design are all optimization problems. Their solutions emerge as the solution of a system of linear equations. The statistician hands over the problem to a software package, trusting that the linear algebra algorithm will produce stable estimates of the parameters.

For just about any model outside of the linear models I mentioned, closed-form solutions do not exist. You need not go far to find practical examples. Consider a bank that wants to predict whether prospective customers will default on their mortgages. The bank's historical data show the default status (a binary outcome coded Yes-No) of numerous customers, together with personal and financial information, such as employment status, years with current employer, salary, marital status. Estimating the contributions of each of these pieces of information boils down to estimating the parameters of a logistic equation. No closed-form expression exists for the best parameters of this model, so a numerical procedure is required to estimate the parameters and optimize the likelihood.

The same is true of other generalized linear models, structural equation models, and many other models that are used in modern statistics. Most of the heavy lifting in these problems, and the software that delivers the results, relies on a numerical optimization algorithm.

The usual graduate program in statistics, even at a good school, teaches you a lot about the theoretical properties of these estimates. And the theory, to be sure, paints an optimistic picture. Theory shows that under certain conditions, most people forget with time, and given a large enough sample size, the solution to the optimization problem (the maximum likelihood estimate) is the solution to the estimation problem. Solving the maximum likelihood problem gives you the estimates necessary to complete specification of the model. Even better, the estimates behave nicely (are normally distributed). You can even estimate how accurate they are from the curvature of the maximum likelihood function near its optimum value. At least, if the sample size is large enough, these nice properties will hold.

You then hand the problem over to an optimization algorithm, confident that your work is done. But is that confidence well placed?

## A logistic growth curve model

Consider the example in Figure 1. Twelve observations were made in successive time periods, and a logistic growth curve was fit to the data.

##### Figure 1. Logistic fit to weed data  The model clearly fits well, and you might expect that the optimization went smoothly. The example comes from the help file of the `nlmrt` function, also written by John Nash. The data is from a study on weeds that was carried out at Agriculture Canada. This simple data set manages to break most optimizers sent its way. Parameter estimates were obtained by using `nlmrt`, but the popular and traditional `nls` function fails to converge on this data set. Function `nls` and the functions in `nlmrt` are designed for the special case of minimizing a nonlinear least squares objective function. The code and the output for this example are shown in Figure 2. An R script is available from Downloadable resources.

##### Figure 2. The weed data with nlmrt  Now, Figure 3 shows what happens by using `nls`. It fails to converge.

##### Figure 3. The weed data with nls  ## Optim

R has many optimizers to handle special cases, but for general-purpose optimization, many people choose `optim`. This function is part of Base-R. Not long ago, I ran into `optim` author John Nash at an Ottawa R user meet-up group. The subject of optimization and `optim` came up, and John announced with enthusiasm, "I wrote it; but I don't want you to use it." I was intrigued and wanted to learn more. John agreed to answer my questions about it, and a few days later, we sat down in a Timothy's cafe with our lattes and laptops. This is the substance of our conversation.

No relation to beautiful mind game theorist John Nash, optimizer Nash (see Figure 4) is a Canadian mathematician. He earned a D.Phil. from Oxford in 1972 from the Mathematics Institute under Charles Coulson. He then worked at Agriculture Canada in Ottawa until 1980. While there, he wrote the routines that later became part of the `optim` function. The routines were written in Basic, with the constraint that program and data take no more than 8 KB of memory. Yes, you read that correctly: 8 kilobytes. This was the 1970s. Based on this experience, John wrote Compact Numerical Methods for Computers: Linear Algebra and Function Minimization, published in 1979. A second edition (1990) of this book is still in print, which much be something of a record for a book of computer algorithms. He joined the faculty of the Telfer School of Management in 1981 (University of Ottawa) and remained there until his retirement in 2005. He continues to be active in consulting and R programming and is the maintainer of several R packages on optimization, including `optim` and `optimx`.

##### Figure 4. John Nash in traditional English costume Catherine
Tell me something about the optimization functions in R.
John
The first thing you need to understand is that there are two classes of optimization problems: function minimization and mathematical programming. Both seek an optimum, and both involve constraints, but mathematical programming begins with the constraints. It emphasizes solving the constraints, and then looks for the best solution. Take a look at Mike Trick's website. Every year, he optimizes the team schedule for major league baseball. The goal is to maximize attendance, but many constraints exist. The teams can't be on the road for more than seven days at a time, and so on. Function minimization starts with the objective, and then worries about satisfying the constraints. Many problems can be solved using either approach. A huge amount of commercial effort is going into math programming solvers these days. Most of the optimizers in R are function minimizers.
Catherine
I get that. In classical statistics, function minimization is usually what you need to do, either for maximum likelihood or least squares.

I confessed that most of what I knew about optimization came from the day in first-year calculus that covered the Newton-Raphson method. John was not surprised. Optimization is a black box for many statisticians who know little about it, and he set out to put me straight.

John
Newton will take you a long way, and many successful algorithms still follow gradients in their search for the optima of the objective function.

John took a sheet from my notebook and made a quick sketch (see Figure 5).

##### Figure 5. Iteration step in Newton-Raphson  Most optimizers like to be consistent: They recast all problems as either a search for the minimum or a search for the maximum. John likes minima. If the objective function is continuous, he explained, then the minimum occurs where the derivative is 0. So, the optimization problem becomes a problem of finding a zero. With Newton-Raphson, you pick a starting point, S_0, then follow the tangent down to the x-axis. Repeat with the new starting point. If all goes well, it converges quickly. The problem—and John made another sketch (see Figure 6)—is what happens in this case.

##### Figure 6. Wrong direction  "It takes three lines of code to write the iteration step, and 3000 lines to avoid getting stuck or heading off in the wrong direction. The bad thing about optimization is that everyone puts together quasi-Newton methods that are supposed to be better, but they only test the new method on the method where X (the old method) failed, not on the examples were X works. So, there is an ongoing process of re-inventing the wheel."

Catherine
How did you come to write the `optim` function?
John
I wrote three of the routines that figure in `optim` in the 1970s in Basic. In 1990, for the second edition of the book, the codes were in Pascal. In 1995, I was in Oxford and went to South Parks Road to say hello to Brian Ripley. He asked if he could use my codes in R. I said yes. Ripley packaged the codes and named it `optim`. He included the `CG` function, because he needed something to handle large numbers of parameters —say, around 50,000.

## The optim function bundles six different optimization methods

The user can select one of six different optimization methods:

• Broyden-Fletcher-Goldfarb-Shanno (BFGS)
• Limited-memory BFGS (L-BFGS-B)
• Simulated annealing (SANN)
• Brent

For a complete description, see Related topics.

The default method is to run a Nelder-Mead simplex algorithm. The method does not require gradients and relies only on evaluations of the function itself. The idea is fairly intuitive. To minimize a function of n parameters, construct a simplex of n+1 vertices. Evaluate the function at each vertex. For two parameters, the simplex is a triangle (see Figure 7).

##### Figure 7. Nelder-Mead simplex in two dimensions  The function is evaluated at each vertex. Suppose that the highest values are at H, the lowest at L, with P taking on a value in between. The idea is to "walk" the simplex around the plane until it locates the minimum.

1. Replace H by one of the four candidate points: C1, C2, C3, or C4.
2. The candidates lie along the line that joins H to the centroid (mid-point) of L and P.
3. Evaluate the function at C1.

If this value is smaller than L, then this line looks like a good line of descent. Look further at C2.

4. Check the value at C2.

If this value is the new minimum, replace H with C2. The simplex will stretch along that direction.

5. Otherwise, replace H with C1.
6. If the value at C1 is between the values at L and P, replace H with C1.
7. Suppose that C1 is greater than P. If it is lower than H, replace H with C3 (low side). If C1 marks a new high point, replace H with C4 (high side).

In both of these cases, the simplex is shrunk down.

In the three-parameter case, the simplex is a tetrahedron that rolls around three-dimensional (3D) space. Duncan Murdoch wrote a nice visualization with his 3D package, `rgl` (see Figure 8). The objective function is a mixture of three normal distributions, and the Nelder-Mead seeks a mode. An R script (the Weeds file in the Code folder) for the code is available in Downloadable resources.

##### Figure 8. A Nelder-Mead simulation for a three-parameter problem  The Nelder-Mead method is a direct search method. No gradients are required. It can be slow but is usually reliable, making it a good default, non-customized option.

### BFGS, CG, and L-BFGS-B

These methods use gradients. The user can supply code to calculate the gradient, or gradients can be calculated from function evaluations. In John's experience, "analytic derivatives usually but not always give more accurate solutions much more efficiently." BFGS and L-BFGS-B are quasi-Newton methods. L-BFGS-B optimizes in the presence of box constraints.

CG is a conjugate gradient method. Conjugate gradient methods work by forming a sequence of quadratic approximations to the function (by using the Hessian). Each quadratic is optimized by seeking through n orthogonal directions (where orthogonality is measured with respect to the Hessian). A linear search restarts the quadratic approximation in a better location until convergence is obtained.

### SANN

Simulated annealing is a stochastic method but not a true optimizer. It performs a certain number of iterations and stops, whether it found anything or not. It cannot report anything about convergence. SANN was dropped in the more recent `optimx` package.

### Brent

The Brent method calls function `optimize`, which does one parameter optimization.

## Performance criteria and nonconvergence

Every optimizer uses some kind of iterative algorithm. In happy situations, the result does not depend on the starting values that you supply. A useful check is to start the optimizer from random starting points to see if you have the same result each time. Failure might be a feature of the particular problem and the data you have, but some optimizers are better than others.

### Speed

R is an interpreted language, but it can make calls to compiled code in `C` or Fortran for added speed. I ask John about calls to `C` or Fortran for speeding up the optimization.

You want to spend your effort on speeding up the objective function. It is sometimes worth putting the objective function in Fortran. The `optim()` function can accept a `.Call` expression to compiled code instead of an R function (to evaluate the objective function). Write the optimizer in R so everyone can see the implementation. The R package that is called `microbenchmark` runs something like a hundred times and gives you the distribution of times taken. This information is useful if you want to see where a problem is taking time.

To make a point about speed, John types in a 10,000-parameter problem on his laptop to see how long it takes with `Rcgmin`. We sit and watch. A minute later, it returns with the correct answer. Not bad for a laptop.

Good benchmarking is a challenge in R, John points out. As an open source project, R gets many of its contributors from academia. But benchmarking is not rewarded.

### When the optimizer fails

An optimizer can fail for a number of reasons, some easier to spot than others. The iteration might reach a singular gradient and be forced to stop, or it exceeds the maximum number of iterations without converging. These conditions issue a warning to the user, but some problems are more subtle. Many statistical problems have constraints. Variances must be positive; covariance matrices must be positive definite. Often, an unconstrained optimization is done in the hope that the discovered optimum meets the constraints.

Statistical theory asserts that if the model is correct and the sample size large enough, the likelihood function achieves a maximum near the true parameters, and the log likelihood is roughly quadratic in that neighbourhood. You would think that the optimum would not be difficult to find, but you would be mistaken. The model might be false, the sample might be too small, and the function might have plateaus or local optima that confuse the optimizer.

Newton-style optimizers seek zeros of the derivatives, which might correspond to minima, maxima, or saddle points. As John told me, "the message is we have lots of good tools, but you can cut off your fingers pretty easily."

To avoid digital amputation, John developed the `optimx` package with Ravi Varadhan. This package improves `optim` in several ways and is its recommended replacement. `Optimx` is primarily a wrapper for a number of optimizers in R. It currently offers 12 methods, and you can add more. `Optimx` provides a standard interface to these functions and runs as many of them as you choose. Output includes parameter estimates, convergence status, and the value of the objective function at the minimum.

`Optimx` can also check the Karush-Kuhn-Tucker conditions to determine whether the optimizer indeed converged to a minimum, as distinct from a saddle point or maximum.

`Optimx` also contains improved versions of some of the original `optim` functions. The new `Rcgmin` replaces CG as the conjugate gradient optimizer for problems with many parameters. SANN was dropped because it gives no measure of convergence.

Figure 9 shows the weed data again. The result is sobering to consider. Only four of the seven methods found plausible (well-fitting) values, and none of the methods believed that it found the minimum.

##### Figure 9. The weed data with optimx  `optimx` gives one row of output for each method. The estimated parameters are given first, followed by the `value` of the function at the supposed minimum. The next three columns show the number of times the objective function was evaluated, the number of times the gradient function was evaluated, and the number of iterations performed (for those methods that report iterations.)

`optimx` offers several convergence codes that indicate the type of failure, so consult the manual for details. Code 0 is good. In this example, CG reached its maximum number of iterations (Code 1) and stopped. A `kkt1` of True means that the final gradient was close to 0 (the optimizer found an extremum), and a `kkt2` of True means that the Hessian is positive definite (it's a minimum). Both should be True. In this example, Nelder-Mead stopped at an extremum but did not find a minimum—this is evident from its stopping value, which is three orders of magnitude larger than the "minima" other methods found.

`optimx` gives you much information about your data and your problem. Certainly, if the methods fail to agree, this disagreement suggests that the problem is a tricky one and that you should look further to determine why. For many statistical situations, the difficulty is that you are attempting the foolish: The model does not fit, or the sample is too small. Under those circumstances, the likelihood function is not guaranteed to have a strongly defined minimum at realistic parameter values. By realistic, I mean variances greater than 0 or probabilities between 0 and 1. But even where a problem makes sense from a statistical standpoint, the optimization itself might be problematic.

## Conclusions

Optimization is a key element in the analysis of many data sets, yet one that statisticians and data analysts frequently ignore. We leave this important task to the default settings of the software, often without considering whether these settings are the best choices for our particular problem. As I learned from John, `optimx` offers several tools to the analyst. You might need to try several algorithms and various starting points to assess difficulties in convergence that might be present. The Karush-Kuhn-Tucker tests can check whether a minimum was found or something else. These methods do not guarantee complete success, but most of the time, they can indicate the presence of problems in the data.