
Highly Accurate Deep Learning Inference with 2-bit Precision


Three months ago, at NeurIPS 2018, we presented a robust 8-bit platform that toppled the 16-bit barrier for deep learning training while fully preserving model accuracy. Today, we’re excited to share new results that push the envelope for deep learning inference, enabling models to be deployed with high accuracy down to 2-bit precision.

IBM Research has pioneered the pursuit of reduced precision and approximate computing techniques. In the 2015 seminal paper at the International Conference on Machine Learning (ICML), we demonstrated a quadratic improvement in performance efficiency by reducing precision from 32-bit to 16-bit for both training and deployment. Our approach has been broadly adopted, while we have continued to lead with innovations in approximate computing and to push the boundaries of reduced precision.

At the 2019 SysML conference, we share new results that go beyond the leading edge of 8-bit precision for deep learning training: our new activation technique for creating a quantized neural network (QNN) achieves high accuracy with 4-bit and 2-bit computations, generating multiplicative gains in performance and efficiency. These gains translate directly into better performance, lower power consumption, and lower cost for chips performing AI inference in edge devices, as shown in Figure 1.

FIGURE 1: The benefits of moving from 32-bit to 2-bit precision for deep learning inference can be exploited for power consumption or performance, using an example workload of classifying 20,000 images.

Reducing precision down to 2 bits has been extensively pursued, with a wide spectrum of quantization techniques. But previous attempts have failed to match the accuracy of 8-bit or 16-bit precision hardware, particularly for convolutional neural networks (CNNs).

Quantized neural networks (QNNs) are formed by quantizing weights and activations to reduce a CNN's computation and storage costs. Accuracy degradation under quantization is particularly challenging for CNNs because the ReLU (Rectified Linear Unit) activation function inherently relies on high dynamic range and precision to capture the unbounded, continuous range of activation values. Collapsing the weights and the unlimited range of ReLU-generated activations into four discrete bins, as required for 2-bit inference computations, causes large accuracy losses for deep learning inference.
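To make the dynamic-range problem concrete, here is a minimal sketch in plain Python. It assumes a simple uniform quantizer whose range is set by the largest observed activation; the function name `quantize_uniform` and the sample values are hypothetical and chosen only for illustration, not taken from the paper.

```python
def quantize_uniform(x, max_value, bits=2):
    """Round x to the nearest of 2**bits evenly spaced levels in [0, max_value]."""
    steps = 2 ** bits - 1                  # 3 intervals -> 4 levels for 2 bits
    step = max_value / steps               # spacing between adjacent levels
    clipped = min(max(x, 0.0), max_value)  # keep x inside the representable range
    return round(clipped / step) * step

# Hypothetical ReLU outputs: mostly small values plus one large outlier.
activations = [0.1, 0.4, 0.7, 1.2, 9.0]

# Without a learned bound, the range has to cover the largest activation,
# so the four levels are 0, 3, 6 and 9.
relu_range = max(activations)
relu_q = [quantize_uniform(a, relu_range) for a in activations]
print(relu_q)  # [0.0, 0.0, 0.0, 0.0, 9.0]
```

In this toy case a single outlier stretches the step size to 3.0, so every small activation lands in the zero bin and the distinctions among them are lost.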

In PArameterized Clipping acTivation (PACT), a maximum activation value is automatically derived for the otherwise unlimited range of activation values as part of the model training itself. This reduced, fixed dynamic range is then “quantized”, creating four discrete activation values and enabling high accuracy to be retained with 2-bit inference, as shown in Figures 2 and 3. Deriving the maximum activation value, or clipping coefficient, is incorporated in the model training cycles, without any penalty to model training times. The derived clipping coefficients are then incorporated into the deployed inference model, with minimal impact on model accuracy compared with 8-bit or 16-bit inference on CNNs.
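The clip-then-quantize step can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the post does not spell out the formulas, so the forward rule (clip to [0, alpha], then round onto 2^k levels) and the straight-through gradient with respect to alpha are assumptions based on the commonly cited PACT formulation. The names `pact_forward`, `pact_grad_alpha`, and `alpha` are hypothetical.

```python
def pact_forward(x, alpha, bits=2):
    """Clip x to [0, alpha], then round onto 2**bits evenly spaced levels."""
    levels = 2 ** bits - 1               # 3 intervals -> 4 levels for 2 bits
    y = min(max(x, 0.0), alpha)          # clipping with coefficient alpha
    return round(y * levels / alpha) * alpha / levels

def pact_grad_alpha(x, alpha):
    """Assumed straight-through gradient of the output w.r.t. alpha:
    1 where the input is clipped at alpha, 0 elsewhere."""
    return 1.0 if x >= alpha else 0.0

# Hypothetical activations and a clipping coefficient as training might learn it.
activations = [0.1, 0.4, 0.7, 1.2, 9.0]
alpha = 1.0

pact_q = [pact_forward(a, alpha) for a in activations]   # levels 0, 1/3, 2/3, 1
grads = [pact_grad_alpha(a, alpha) for a in activations]
print(pact_q)
print(grads)  # [0.0, 0.0, 0.0, 1.0, 1.0]
```

Because only clipped inputs contribute a gradient to `alpha`, a training loop can treat it as one more parameter and adjust it alongside the weights, which is one way to read "derived as part of the model training itself".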

FIGURE 2: Convolutional neural networks (CNNs) are commonly used for image recognition tasks. A given neuron in the network is activated based on a function applied to the sum of incoming signals. (a) In ReLU activation, there is a continuous and potentially infinite range of activation values. Collapsing the infinite range into four discrete bins, as required for 2-bit calculations, creates large accuracy loss. (b) In PACT activation, a max value is derived for the dynamic range of activation values during the model training. The reduced, fixed dynamic range is then “quantized”, creating four discrete activation values, enabling high accuracy to be retained with 2-bit inference.


FIGURE 3: Train error (left) and validation error (right) when the activation of CIFAR10 ResNet10 with ReLU or clipping activation function (clipping level = 1.0) is quantized to 2 bits. ReLU is more sensitive to activation quantization due to its large dynamic range.

In addition, we also introduce a novel quantization scheme for weights: statistics-aware weight binning (SAWB). The main idea in SAWB is to exploit both first and second moments of the neural network weight distribution to minimize the weight quantization error for the deep network. These statistics help the quantized states capture the shape changes in the weight distribution. Combining the PACT and SAWB advances allows us to perform deep learning inference computations with high accuracy down to 2-bit precision.
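As a rough illustration of the idea, the sketch below derives a weight clipping scale from the first moment (mean absolute value) and the second moment (mean square) of the weights, then maps each weight to one of four symmetric levels. Everything specific here is an assumption: the linear form of `sawb_scale`, the coefficients `c1` and `c2` (placeholder values, not the fitted values from the paper), the evenly spaced level layout, and the sample weights are all hypothetical.

```python
import math

def sawb_scale(weights, c1, c2):
    """Clipping scale built from the first and second moments of the weights.
    The linear combination and its coefficients are illustrative assumptions."""
    n = len(weights)
    mean_abs = sum(abs(w) for w in weights) / n        # first moment, E[|w|]
    rms = math.sqrt(sum(w * w for w in weights) / n)   # sqrt of second moment, E[w^2]
    return c1 * rms - c2 * mean_abs

def quantize_weights(weights, alpha, bits=2):
    """Map each weight to the nearest of 2**bits evenly spaced levels in [-alpha, alpha]."""
    steps = 2 ** bits - 1
    step = 2 * alpha / steps
    quantized = []
    for w in weights:
        clipped = min(max(w, -alpha), alpha)
        index = round((clipped + alpha) / step)        # bin index 0..steps
        quantized.append(-alpha + index * step)
    return quantized

# Hypothetical weights and placeholder coefficients.
weights = [-0.9, -0.2, -0.05, 0.1, 0.3, 0.8]
alpha_w = sawb_scale(weights, c1=3.0, c2=2.0)
weights_q = quantize_weights(weights, alpha_w)
print(alpha_w, weights_q)
```

Using two statistics rather than one lets the scale respond to the shape of the distribution: two weight tensors with the same mean absolute value but different spread produce different clipping scales.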

Our work is part of the Digital AI Core research featured in the recently announced IBM Research AI Hardware Center. Beyond Digital AI Cores, our AI hardware roadmap extends to the new devices and materials we are introducing in our Analog AI Cores, which will drive AI hardware leadership through the next decade, building on a common architecture and software foundation and further exploiting our algorithmic advances, such as PACT.

We present this paper, “Accurate and Efficient 2-bit Quantized Neural Networks,” at SysML in Stanford, CA, today.

Research Staff Member, IBM Research

Swagath Venkataramani

Research Staff Member, IBM Research
