June 13, 2019 | Written by: Irina Rish
Since its introduction in the 1970s, the backpropagation algorithm (backprop) has been the workhorse for training neural networks and has contributed to impressive successes in deep learning across a wide range of applications. Backprop plays an important role in enabling neural networks to track the errors they make, learn from those mistakes and improve over time.
However, backprop also suffers from a number of flaws, including several well-known computational issues such as the vanishing gradient problem: as the networks get deeper, the gradients of the loss function may start approaching zero, making the network hard to train.
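The vanishing gradient problem can be illustrated with a few lines of code (a minimal sketch, not from the paper): backprop multiplies one per-layer derivative factor onto the gradient at each step of the chain rule, and with a saturating activation such as the sigmoid, whose slope never exceeds 0.25, that product shrinks geometrically with depth.

```python
import math

def sigmoid_derivative(x):
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

# Chain-rule product of per-layer derivative factors through a 20-layer chain.
grad = 1.0
for _ in range(20):
    grad *= sigmoid_derivative(0.0)  # 0.25, the sigmoid's maximum slope

print(grad)  # about 9.1e-13 -- effectively zero for the earliest layers
```

Even in this best case (every unit sitting at the sigmoid's steepest point), the gradient reaching the first layer is on the order of 10⁻¹³, which is why deep networks trained this way can stall.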
Other limitations of backprop include its inability to handle non-differentiable nonlinearities, e.g., in binary neural networks, which are important for memory- and energy-efficient computing, especially on mobile devices with limited hardware resources. Furthermore, the sequential nature of backprop (i.e., chain-rule differentiation) does not allow weight updates to be parallelized across network layers. Doing so could speed up computation considerably, especially in very deep or recurrent networks. Finally, backprop is often criticized as a biologically implausible learning mechanism that does not explicitly model neural activity. Backprop uses non-local synaptic updates and has several other properties that do not conform to known biological mechanisms of learning.
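The difficulty with non-differentiable nonlinearities is easy to see (an illustrative sketch, not from the paper): the sign activation used in binary networks is piecewise constant, so its derivative is zero everywhere it is defined, and no gradient signal can flow backward through it.

```python
# The sign activation is flat away from its jump at x = 0, so a finite-difference
# estimate of its derivative is exactly zero at every probed point -- backprop
# would receive no learning signal through such a layer.

def sign(x):
    return 1.0 if x >= 0 else -1.0

eps = 1e-6
diffs = [(sign(x + eps) - sign(x - eps)) / (2 * eps)
         for x in (-2.0, -0.5, 0.5, 2.0)]
print(diffs)  # [0.0, 0.0, 0.0, 0.0]
```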
Various limitations of backprop continue to motivate the exploration of alternative neural net learning methods. In fact, one of its creators previously said he is “deeply suspicious of back-propagation” and his view is “throw it all away and start again.”
Our study, “Beyond Backprop: Online Alternating Minimization with Auxiliary Variables,” in collaboration with NYU and MIT, presented this week at the 2019 ICML conference, proposes a novel alternative to backprop. This new approach shifts the focus towards explicit propagation of neuronal activity by introducing noisy “auxiliary variables,” which break the “gradient chain” into local, independent, layer-wise weight updates that can be done in a parallel manner.
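The paper's online algorithm is more elaborate, but the core idea of auxiliary variables can be sketched on a toy scalar two-layer linear model (the variable names and the quadratic penalty with weight `rho` below are illustrative assumptions, not the paper's notation):

```python
# Toy sketch of alternating minimization with auxiliary variables, loosely in
# the spirit of the approach (a scalar two-layer linear model y ~ w2*w1*x, not
# the paper's online algorithm).  Instead of backpropagating through w2*w1, we
# introduce an auxiliary activation a_i per sample and alternate closed-form,
# layer-local updates:
#   a_i <- argmin (y_i - w2*a_i)^2 + rho*(a_i - w1*x_i)^2
#   w1  <- argmin sum_i (a_i - w1*x_i)^2     (local to layer 1)
#   w2  <- argmin sum_i (y_i - w2*a_i)^2     (local to layer 2)
# The two weight updates are independent of each other, so they could run in
# parallel -- the "gradient chain" between layers is broken.

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]          # noiseless data: y = 2*x
w1, w2, rho = 0.5, 0.5, 1.0

for _ in range(10):
    a = [(w2 * y + rho * w1 * x) / (w2 ** 2 + rho) for x, y in zip(xs, ys)]
    w1 = sum(ai * x for ai, x in zip(a, xs)) / sum(x * x for x in xs)
    w2 = sum(y * ai for y, ai in zip(ys, a)) / sum(ai * ai for ai in a)

print(w1 * w2)  # close to 2.0, the true slope
```

Each update solves a small local least-squares problem, so no gradient ever has to travel through the whole network; this is what makes the layer-wise updates independent and parallelizable.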
The paper provides novel theoretical convergence guarantees for a general class of online alternating optimization methods. Promising empirical results using multiple datasets and network architectures demonstrate that the new approach can perform on par with state-of-the-art stochastic gradient descent (SGD) implementations of backprop, and often learns faster initially, when only a small amount of data is available for training.
Our initial goal is not to outperform backprop, but rather to explore alternative learning methods that show competitive performance and, more importantly, provide new and useful properties that backprop lacks. In our method, such properties are a natural consequence of breaking backprop’s gradient chains into simpler, local optimization problems. As a result, we get parallel/asynchronous weight updates, elimination of vanishing gradients and easier ways of handling non-differentiable nonlinearities, enabling more energy-efficient computation in binary networks.
Auxiliary-variable methods such as the one developed in this study are also a step closer to biologically plausible learning mechanisms, due to their explicit propagation of neural activity and local synaptic updates.
Beyond Backprop: Online Alternating Minimization with Auxiliary Variables, Anna Choromanska, Benjamin Cowen, Sadhana Kumaravel, Ronny Luss, Mattia Rigotti, Irina Rish, Brian Kingsbury, Paolo DiAchille, Viatcheslav Gurev, Ravi Tejwani, Djallel Bouneffouf