In a spotlight paper from the 2017 NIPS Conference, my team and I presented an AI optimization framework we call Net-Trim, which is a layer-wise convex scheme to prune a pre-trained deep neural network.
Deep learning has become a method of choice for many AI applications, ranging from image recognition to language translation. Thanks to algorithmic and computational advances, we are now able to train bigger and deeper neural networks resulting in increased AI accuracy. However, because of increased power consumption and memory usage, it is impractical to deploy such models on embedded devices with limited hardware resources and power constraints.
One practical way to overcome this challenge is to reduce the model complexity without sacrificing accuracy. The solution involves removing potentially redundant weights to make the network sparser. The well-known L1 regularization method has been widely used to efficiently discover sparse solutions for shallow models such as linear and logistic regression. However, when applied to deeper networks, this technique does not show any benefit, partly because the loss functions associated with deep networks are highly nonconvex, and the optimization algorithm is not able to find a solution that is both sparse and highly accurate.
Alireza Aghasi, a former IBM researcher and now an assistant professor at Georgia State University, Afshin Abdi at Georgia Tech, Justin Romberg, an associate professor at Georgia Tech, and I set out to tackle this challenge. Our approach is described in the paper “Net-Trim: convex pruning of deep neural networks with performance guarantee.” When Net-Trim is applied to a pre-trained network, it finds the sparsest set of weights for each layer that keeps the output responses consistent with the initial training. The standard L1 relaxation for sparsity, together with the fact that the rectified linear unit (ReLU) activation is piecewise linear, allows us to perform this search by solving a convex program.
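For readers who want to see the shape of such a program, here is a rough sketch for a single layer. The notation is an assumption made for this illustration: X holds the layer's inputs with one training sample per column, W is the pre-trained weight matrix, Y = max(WᵀX, 0) is the pre-trained response, Ω marks the entries where that response is positive, ε is a tolerance chosen by the user, and ‖U‖₁ is the sum of the absolute values of the entries of U.

```latex
\hat{U} \;=\; \arg\min_{U}\ \|U\|_1
\quad \text{subject to} \quad
\bigl\| (U^\top X - Y)_{\Omega} \bigr\|_F \le \epsilon,
\qquad
(U^\top X)_{\Omega^c} \le 0,
\qquad
\Omega = \{(m,p) : Y_{m,p} > 0\}.
```

On Ω the ReLU acts as the identity, so the first constraint compares linear functions of U with the stored response. Off Ω the original output was zero, so it is enough to require a nonpositive pre-activation, which is again linear in U. Both constraints are therefore convex, and so is the L1 objective.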
More specifically, the training data is passed through the learned network layer by layer and, within each layer, we solve an optimization problem that promotes weight sparsity while enforcing consistency between the resulting response and the pre-trained network response. In a sense, if we consider each layer’s response to the transmitted data as a checkpoint, Net-Trim ensures the checkpoints remain roughly the same, while a simpler path between the checkpoints is discovered. An advantage of Net-Trim is that this problem has a convex formulation, which can be handled by any standard convex solver.
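To make the checkpoint idea concrete, here is a minimal NumPy sketch for one ReLU layer. It is not the implementation from the paper: instead of a constrained convex program handed to a solver, it minimizes a penalized variant (a squared mismatch term plus an L1 penalty) with plain proximal gradient descent (ISTA). The names `prune_layer`, `lam`, and `n_iter`, the synthetic data, and all parameter values are assumptions chosen for illustration.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def soft_threshold(v, t):
    # Proximal operator of t * ||.||_1: shrinks entries toward zero
    # and sets the small ones exactly to zero.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prune_layer(X, W, lam=5.0, n_iter=500):
    """Sparsify one ReLU layer while keeping its response close to the original.

    X: (n_in, n_samples) layer input; W: (n_in, n_out) pre-trained weights.
    """
    Y = relu(W.T @ X)        # pre-trained response (the "checkpoint")
    active = Y > 0           # entries where the original ReLU was active
    step = 1.0 / np.linalg.norm(X, 2) ** 2   # 1 / Lipschitz constant of the gradient
    U = W.copy()
    for _ in range(n_iter):
        Z = U.T @ X
        # Active entries: match the original response.
        # Inactive entries: only penalize pre-activations that rise above zero.
        R = np.where(active, Z - Y, relu(Z))
        U = soft_threshold(U - step * (X @ R.T), step * lam)
    return U

# Synthetic layer (an assumption): a few large weights per unit plus many tiny ones.
rng = np.random.default_rng(0)
n_in, n_out, n_samples = 20, 10, 200
W = 0.01 * rng.standard_normal((n_in, n_out))
for j in range(n_out):
    rows = rng.choice(n_in, size=3, replace=False)
    W[rows, j] = rng.choice([-1.0, 1.0], size=3)
X = rng.standard_normal((n_in, n_samples))

U = prune_layer(X, W)
Y_old, Y_new = relu(W.T @ X), relu(U.T @ X)
rel_err = np.linalg.norm(Y_new - Y_old) / np.linalg.norm(Y_old)
print("nonzero weights:", np.count_nonzero(W), "->", np.count_nonzero(U))
print("relative change in layer response:", round(float(rel_err), 3))
```

In a multi-layer network, the same routine would be applied layer by layer, feeding each layer the responses from the layer before it. Larger values of `lam` trade a bigger change in the response for more zeros.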
Our approach differs substantially from recent lines of work in this area. First, our method comes with mathematical guarantees: we have shown that the network retains similar performance before and after pruning by Net-Trim. In addition, in contrast to recent techniques based on thresholding, Net-Trim does not require multiple time-consuming retraining steps after the initial pruning. Furthermore, because it works as a post-processing step, Net-Trim can be conveniently combined with state-of-the-art learning techniques in neural networks. Regardless of the original process used to train the model, Net-Trim can be applied afterward to reduce the model size and further improve the model’s stability and prediction accuracy.
It is important to note that, along with making the computations tractable, Net-Trim’s convex formulation also allows us to derive theoretical guarantees on how far the retrained model is from the initial model, and to establish sample complexity arguments about the number of random samples required to retrain a presumably sparse layer. Net-Trim is the first pruning scheme with such performance guarantees. It is also easy to adapt to other structural constraints on the weights by adding penalty terms or introducing further convex constraints.
Using MNIST data, our approach can prune more than 95 percent of the weights with no loss in classification accuracy. On the more difficult SVHN dataset, 90 percent of the weights can be removed. In terms of model size, Net-Trim can reduce a model from 100MB to only 5MB, which makes it far easier to deploy on mobile devices.
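As a back-of-the-envelope check, the size figure is what one would expect if 95 percent of the weights are removed and only the surviving values are counted. The helper below is a hypothetical illustration of that arithmetic. It assumes every weight takes the same number of bytes and ignores the index overhead that a real sparse storage format would add.

```python
def pruned_size_mb(dense_size_mb, pruned_fraction):
    # Size of the surviving weight values only; any per-weight index
    # overhead of a sparse storage format is deliberately ignored.
    return dense_size_mb * (1.0 - pruned_fraction)

size_after = pruned_size_mb(100.0, 0.95)
print(f"{size_after:.1f} MB")
```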
Interestingly, development of the human brain follows a similar paradigm, whereby “pruning” of synapses is an integral component of the process of learning. The famed neurologist Peter Richard Huttenlocher (1931-2013) showed through ground-breaking studies that billions of synapses are formed in the human cerebral cortex during the first few months of a baby’s life. However, during subsequent years, many of these synapses are eliminated: little-used connections are pruned away while the functionally important ones are kept.