May 6, 2019 | Written by: IBM Research Editorial Staff
Editor’s note: This post is written by Xiangyi Chen (PhD candidate, University of Minnesota), Sijia Liu (Research Staff Member, MIT-IBM Watson AI Lab, IBM Research), Ruoyu Sun (Assistant Professor, University of Illinois Urbana-Champaign), and Mingyi Hong (Assistant Professor, University of Minnesota).
Choosing the right optimization algorithm for deep learning models is critical to achieving good results within a limited time budget. Adam (Kingma & Ba, 2014) is one of the most popular adaptive algorithms for training neural networks and has been widely used in deep learning applications in computer vision and natural language processing. However, it was recently found that this method, despite its superior empirical performance, may diverge (Reddi et al., 2018). How can you tell whether Adam will work for your model?
Our recent work “On the convergence of a class of Adam-type algorithms for non-convex optimization” (authors: Xiangyi Chen (UMN), Sijia Liu (IBM), Ruoyu Sun (UIUC), Mingyi Hong (UMN)) gives a rather comprehensive answer to the above question. We study a class of adaptive gradient-based momentum algorithms that update the search directions and learning rates simultaneously using past gradients. We provide a few conditions that guarantee convergence and show these conditions are “tight”, in the sense that we construct specific examples showing that Adam-type algorithms diverge if they violate our conditions. We also explore the question of whether these algorithms can converge for non-convex problems under the aforementioned conditions and provide an affirmative answer.
What is an Adam-type optimization method?
While well-known stochastic gradient descent (SGD) takes steps based on the current gradient alone, Adam-type optimization algorithms also look at past updates to incorporate momentum and use an adaptive learning rate. Formally, an Adam-type algorithm takes the following form:

mt = β1,t mt−1 + (1 − β1,t) gt
v̂t = ht(g1, g2, …, gt)
xt+1 = xt − αt mt / √v̂t

where x is the optimization variable, gt is a stochastic gradient at step t, αt is the stepsize, β1,t is a non-increasing sequence of momentum parameters, and ht is an arbitrary function that outputs a vector v̂t of the same dimension as x. This class of algorithms is very general, and it includes most of the popular training algorithms, such as SGD (Ghadimi & Lan, 2013), AdaGrad (Duchi et al., 2011), AMSGrad (Reddi et al., 2018), and Adam (Kingma & Ba, 2014), as special cases.
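To make the class concrete, here is a simplified NumPy sketch of the generic template: momentum on the gradient, plus a per-coordinate scaling produced by a pluggable function h. All function names and hyperparameters below are ours, chosen for illustration (real implementations would update v incrementally rather than keep the whole gradient history):

```python
import numpy as np

def adam_type(grad, x0, h, steps=500, alpha=0.5, beta1=0.9, eps=1e-8):
    """Generic Adam-type update: momentum m_t plus a scaling v_t = h(g_1..g_t).

    `grad` returns a stochastic gradient at x; `h` maps the gradient history
    to a per-coordinate second-moment estimate.
    """
    x = np.asarray(x0, dtype=float)
    m = np.zeros_like(x)
    g_hist = []
    for t in range(1, steps + 1):
        g = grad(x)
        g_hist.append(g)
        m = beta1 * m + (1 - beta1) * g              # first-moment (momentum) update
        v = h(g_hist)                                # second-moment estimate
        x = x - (alpha / np.sqrt(t)) * m / (np.sqrt(v) + eps)
    return x

# Two special cases, obtained by choosing h:
sgd_h     = lambda gs: np.ones_like(gs[-1])              # SGD with momentum (v ≡ 1)
adagrad_h = lambda gs: np.mean(np.array(gs) ** 2, axis=0)  # AdaGrad-style average of g²
```

For instance, `adam_type(lambda x: 2*x, np.array([5.0]), adagrad_h)` runs the AdaGrad-style member of the class on the toy quadratic f(x) = x².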
When will generalized Adam algorithms converge?
Surprisingly, a flaw in the original convergence proof for Adam was discovered in recent work by Reddi et al. (2018), who also proposed AMSGrad, a fix to Adam for convex optimization.
Our work provides the first analysis framework that guarantees convergence of Adam-type methods for non-convex stochastic optimization under a set of mild conditions, leading to a convergence rate of O(1/√T), where T is the number of iterations. In other words, Adam-type methods converge to a stationary point with error on the order of 1/√T after running T iterations. We derived a set of sufficient conditions that guarantee the convergence of Adam-type algorithms. Intuitively, the conditions involve three quantities:
(a) the accumulation of effective stepsizes;
(b) the accumulation of squared effective stepsizes;
(c) the accumulation of oscillation of effective stepsizes.
Based on quantities (a)-(c), the convergence can then be guaranteed when the following two conditions are satisfied:
1) variance condition: b/a goes to 0 as T goes to infinity;
2) bias condition: c/a goes to 0 as T goes to infinity.
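With the effective stepsize at iteration t taken to be αt/√v̂t, these quantities can be tracked alongside training. A minimal NumPy monitor along these lines might look like the sketch below (class and variable names are ours, not from the paper):

```python
import numpy as np

class AdamTypeMonitor:
    """Accumulate quantities (a)-(c) for the effective stepsizes alpha_t / sqrt(v_t)."""

    def __init__(self, eps=1e-8):
        self.eps = eps
        self.a = 0.0    # (a) accumulated effective stepsizes
        self.b = 0.0    # (b) accumulated squared effective stepsizes
        self.c = 0.0    # (c) accumulated oscillation of effective stepsizes
        self._prev = None

    def update(self, alpha_t, v_t):
        eff = alpha_t / (np.sqrt(v_t) + self.eps)    # effective per-coordinate stepsize
        self.a += float(np.sum(eff))
        self.b += float(np.sum(eff ** 2))
        if self._prev is not None:
            self.c += float(np.sum(np.abs(eff - self._prev)))
        self._prev = eff

    def ratios(self):
        """Convergence requires both ratios to vanish as T grows:
        b/a (variance condition) and c/a (bias condition)."""
        return self.b / self.a, self.c / self.a
```

As a sanity check, feeding in SGD-style stepsizes αt = 1/√t with v̂t ≡ 1 makes both ratios shrink as T grows, consistent with the conditions being satisfied by SGD.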
Both conditions admit simple and intuitive explanations. When we use gradient-based algorithms, we want the decrease in the objective predicted by the gradient directions to dominate the increases brought by other factors. In Adam-type algorithms, three factors can increase the objective: gradient variance, higher-order curvature, and "skewed" update directions. The variance condition quantifies the possible increase in the objective caused by gradient variance and higher-order curvature during optimization; limiting this quantity generalizes the square-summable-stepsize requirement of SGD.
The bias condition is a new condition introduced by adaptive learning rates. It reflects the effect of "skewed" update directions on the objective function; the update directions are "skewed" because the expected update direction can differ from the gradient direction when adaptive learning rates are used. Surprisingly, we found that this set of conditions is automatically satisfied by SGD, AdaGrad, and AMSGrad. Thus, as a byproduct of our analysis, we established the first convergence results for AdaGrad and AMSGrad in non-convex optimization, which may be of independent interest. Adam, however, can fail to converge when it violates the bias condition or the variance condition.
Benefit of the proposed general convergence conditions
Our convergence conditions can be monitored during the execution of the algorithm; thus, if practitioners want to know whether a new variant of Adam will converge, they can compute the quantities appearing in the two conditions at each iteration.
Using our conditions to monitor Adam, we found that its divergence is precisely due to a violation of the bias or variance condition. We construct two representative examples in which Adam diverges. Figures 1 and 2 show the growth of quantities (a)-(c), which determine the variance and bias conditions for the convergence of Adam-type methods. In Figure 1, Adam diverges because it violates the variance condition while the other algorithms converge. In Figure 2, AMSGrad converges but Adam diverges because it violates the bias condition.
Figure 1. Divergence due to squared effective stepsize.
Figure 2. Divergence due to oscillation of effective stepsizes.
The above examples also imply the conditions we derived are essential and “tight”, in the sense that violating them can make an Adam-type algorithm diverge.
Apart from providing a simple and effective way to monitor the convergence of Adam-type algorithms, we also fix the divergence issue with a simple variant we call AdaFom (AdaGrad with First Order Moment), which adds momentum only to the first-moment estimate while using the same second-moment estimate as AdaGrad.
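The combination can be sketched in a few lines of NumPy. This is our own simplified rendering with illustrative hyperparameters, not the exact algorithm from the paper: momentum on the first moment as in Adam, and a running average of squared gradients (AdaGrad's second moment) for the scaling:

```python
import numpy as np

def adafom(grad, x0, steps=500, alpha=0.5, beta1=0.9, eps=1e-8):
    """AdaFom sketch: Adam's momentum for the first moment, AdaGrad's
    running average of squared gradients for the second moment."""
    x = np.asarray(x0, dtype=float)
    m = np.zeros_like(x)
    v = np.zeros_like(x)
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g   # first moment: same as Adam
        v = v + (g ** 2 - v) / t          # second moment: running mean of g^2 (AdaGrad style)
        x = x - (alpha / np.sqrt(t)) * m / (np.sqrt(v) + eps)
    return x
```

The incremental update v + (g² − v)/t keeps v equal to the average of all past squared gradients without storing the history, which is what makes the second moment non-decreasing in its information and avoids the short effective memory that can make Adam violate the convergence conditions.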
Our work builds the theory for understanding the behavior of a generic class of adaptive gradient methods in non-convex optimization. In particular, we provide mild sufficient conditions that guarantee convergence for Adam-type methods.
Join us at ICLR to hear more about our work. We’ll present on Wednesday, May 8, during the poster session from 4:30 to 6:30 pm in Great Hall BC.
Kingma, Diederik P., and Jimmy Ba. “Adam: A method for stochastic optimization.” arXiv preprint arXiv:1412.6980 (2014).
Reddi, Sashank J., Satyen Kale, and Sanjiv Kumar. “On the convergence of adam and beyond.” In International Conference on Learning Representations, 2018.
Duchi, John, Elad Hazan, and Yoram Singer. "Adaptive subgradient methods for online learning and stochastic optimization." Journal of Machine Learning Research 12 (2011): 2121-2159.
Chen, Xiangyi, Sijia Liu, Ruoyu Sun, and Mingyi Hong. “On the convergence of a class of adam-type algorithms for non-convex optimization.” In International Conference on Learning Representations, 2019.
Ghadimi, Saeed, and Guanghui Lan. "Stochastic first- and zeroth-order methods for nonconvex stochastic programming." SIAM Journal on Optimization 23.4 (2013): 2341-2368.