Beyond Backprop: Online Alternating Minimization with Auxiliary Variables

Share this post:

Since its introduction in the 1970s, the backpropagation algorithm (backprop) has been the workhorse for training neural networks has and contributed to impressive successes in deep learning for a wide range of applications. Backprop plays an important role in enabling neural networks to track the errors they make, learn from those mistakes and improve over time.

However, backprop also suffers from a number of flaws, including several well-known computational issues such as the vanishing gradient problem: as the networks get deeper, the gradients of the loss function may start approaching zero, making the network hard to train.

Other limitations of backprop include its inability to handle non-differentiable nonlinearities, e.g. in binary neural networks, which is important for memory- and energy-efficient computing, especially in mobile devices that have limited hardware resources. Furthermore, the sequential nature of backprop (i.e., chain-rule differentiation) does not across networks layers.  Doing so could speed up computation considerably, especially in very deep or recurrent networks. Finally, backprop is often criticized as a biologically implausible learning mechanism that does not explicitly model neural activity.  Backprop uses non-local synaptic updates and has several other properties that do not conform to known biological mechanisms of learning.

Various limitations of backprop continue to motivate the exploration of alternative neural net learning methods. In fact, one of its creators previously said he is “deeply suspicious of back-propagation’’ and his view is “throw it all away and start again.”

Our study, “Beyond Backprop: Online Alternating Minimization with Auxiliary Variables,” in collaboration with NYU and MIT, presented this week at the 2019 ICML conference, proposes a novel alternative to backprop. This new approach shifts the focus towards explicit propagation of neuronal activity by introducing noisy “auxiliary variables,” which break the “gradient chain” into local, independent, layer-wise weight updates that can be done in a parallel manner.


The paper provides novel theoretical convergence guarantees for a general class of online alternating optimization methods. Promising empirical results using multiple datasets and network architectures demonstrate that the new approach can perform on par with the state-of-art stochastic gradient descent (SGD) implementations of backprop and often learns faster initially, when only a small amount of data is available for training.


Our goal initially is not to outperform backprop, but rather to explore alternative learning methods that show competitive performance and, more importantly, and provide new and useful properties that backprop lacks. In our method, such properties are a natural consequence of breaking backprop’s gradient chains into simpler, local optimization problems. As a result, we get  parallel/asynchronous weight updates, elimination of vanishing gradients and easier ways of handling non-differentiable nonlinearities, which enables more energy-efficient computation in binary networks.


Auxiliary-variable methods such as the one developed in this study are also a step closer to biologically plausible learning mechanisms, due to their explicit propagation of neural activity and local synaptic updates.

Beyond Backprop: Online Alternating Minimization with Auxiliary Variables, Anna Choromanska, Benjamin Cowen, Sadhana Kumaravel, Ronny Luss, Mattia Rigotti, Irina Rish, Brian Kingsbury, Paolo DiAchille, Viatcheslav Gurev, Ravi Tejwani, Djallel Bouneffouf

Staff Member, IBM Research

More AI stories

Pushing the boundaries of convex optimization

Convex optimization problems, which involve the minimization of a convex function over a convex set, can be approximated in theory to any fixed precision in polynomial time. However, practical algorithms are known only for special cases. An important question is whether it is possible to develop algorithms for a broader subset of convex optimization problems that are efficient in both theory and practice.

Continue reading

Making Neural Networks Robust with New Perspectives

IBM researchers have partnered with scientists from MIT, Northeastern University, Boston University and University of Minnesota to publish two papers on novel attacks and defenses for graph neural networks and on a new robust training algorithm called hierarchical random switching at IJCAI 2019.

Continue reading

Improving the Scalability of AI Planning when Memory is Limited

We report new research results relevant to AI planning in our paper, "Depth-First Memory-Limited AND/OR Search and Unsolvability in Cyclic Search Spaces," presented at the International Joint Conference on Artificial Intelligence, IJCAI-19.

Continue reading