Ultra-Low-Precision Training of Deep Neural Networks

Over the past few years, reduced-precision techniques have proven exceptionally effective in accelerating deep learning training and inference applications. IBM Research has played a leadership role in developing reduced-precision technologies and pioneered a number of key breakthroughs, including the first 16-bit reduced-precision systems for deep learning training (presented at ICML 2015), the first 8-bit training techniques (presented recently at NeurIPS 2018), and state-of-the-art 2-bit inference results (published at SysML 2019). Following this line of work, we now introduce a new breakthrough that solves a long-ignored yet important problem in reduced-precision deep learning: accumulation bit-width scaling for ultra-low-precision training of deep neural networks (DNNs).

The most commonly used arithmetic function in deep learning is the dot product, which is the building block of generalized matrix multiplication (GEMM) and convolution computations. A dot product is computed using multiply-accumulate (MAC) operations and thus requires two floating-point computations: multiplication of two numbers and accumulation of the product into partial sums. Today, much of the effort on reduced-precision deep learning focuses solely on quantizing representations, i.e., the input operands to the multiplication operation. The other major portion of dot product computations, i.e., the partial sum accumulation, has always been kept in full (32-bit) precision. The reason is that reduced-precision accumulation can result in severe training instability and degradation in model accuracy, as shown in Fig. 1a. This is especially unfortunate, since the accumulator bit-width comes to dominate the area and power of the hardware as the operand precisions are aggressively reduced. As shown in Fig. 1b, accumulating in high precision severely limits the hardware benefits of reduced-precision data representations and computation. The absence of any framework to analyze the precision requirements of partial sum accumulations inevitably results in very conservative design choices.
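To make the role of the accumulator concrete, here is a minimal Python sketch of a dot product computed as a chain of MAC operations, with the partial sum rounded to a chosen number of mantissa bits after every addition. It is an illustration only, not the hardware or software used in this work: `round_to_bits` and `mac_dot` are hypothetical helpers, the multiplication is kept exact, and exponent range, overflow, and subnormals are ignored.

```python
import math

def round_to_bits(x: float, m_bits: int) -> float:
    """Round x to m_bits explicit mantissa bits (hypothetical helper).

    Simplification: exponent range, overflow and subnormals are ignored.
    """
    if x == 0.0:
        return 0.0
    frac, exp = math.frexp(x)        # x == frac * 2**exp with 0.5 <= |frac| < 1
    scale = 1 << (m_bits + 1)        # implicit leading bit plus m_bits
    return math.ldexp(round(frac * scale) / scale, exp)

def mac_dot(a, b, m_acc: int) -> float:
    """Dot product as a chain of MACs with an m_acc-bit accumulator."""
    acc = 0.0
    for x, y in zip(a, b):
        product = x * y                            # multiply (kept exact here)
        acc = round_to_bits(acc + product, m_acc)  # accumulate, then round
    return acc

a = [0.5] * 256
b = [2.0] * 256                      # every product is exactly 1.0
print(mac_dot(a, b, m_acc=23))       # fp32-like accumulator: 256.0
print(mac_dot(a, b, m_acc=5))        # narrow accumulator stalls at 64.0
```

With the 5-bit accumulator in this toy case, the partial sum stops growing once it reaches 64, because adding 1.0 no longer changes the rounded value. The operands themselves are exact, so the error comes entirely from the accumulation step.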

Figure 1. The importance of accumulation precision. (a) Convergence curves of an ImageNet ResNet18 experiment using reduced precision accumulation. The current practice is to keep the accumulation in full precision to avoid such divergence. (b) Estimated area benefits when reducing the precision of a floating-point unit (FPU). The terminology FPa/b denotes an FPU whose multiplier and adder use a and b bits, respectively. Our work enables convergence in reduced precision accumulation and gains an extra 1.5–2.2× area reduction.

In our paper published at ICLR 2019, we take a step forward and present a comprehensive statistical model to analyze the impact of reduced-precision accumulation in deep learning training. Two critical insights emerged. First, when we accumulate a dot product in reduced precision, the loss of information is primarily due to a so-called “swamping error”. In floating-point arithmetic, swamping occurs when a small number is added to a much larger one: the small number is partially or completely truncated out of the sum. Second, swamping error harms the statistics (i.e., the variance) of a dot product. To ensure stable convergence of DNNs, it is necessary to preserve the variance of dot products under reduced precision.
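Swamping can be reproduced with a single addition. The sketch below is a simplified model, not the analysis from the paper: it rounds the result of an addition to a given number of mantissa bits and reports how much of the small operand was lost. The helper names `round_to_bits` and `lost_in_addition` are hypothetical, the 10-bit mantissa is chosen to resemble fp16, and exponent range is ignored.

```python
import math

def round_to_bits(x: float, m_bits: int) -> float:
    """Round x to m_bits explicit mantissa bits (exponent range ignored)."""
    if x == 0.0:
        return 0.0
    frac, exp = math.frexp(x)        # x == frac * 2**exp with 0.5 <= |frac| < 1
    scale = 1 << (m_bits + 1)        # implicit leading bit plus m_bits
    return math.ldexp(round(frac * scale) / scale, exp)

def lost_in_addition(big: float, small: float, m_bits: int) -> float:
    """Portion of `small` that does not survive the rounded addition."""
    result = round_to_bits(big + small, m_bits)
    return small - (result - big)

big = 2048.0
print(lost_in_addition(big, 1.0, 10))   # 1.0: the small operand is fully swamped
print(lost_in_addition(big, 2.5, 10))   # 0.5: the small operand is partially truncated
print(lost_in_addition(big, 2.5, 23))   # 0.0: nothing lost with a 23-bit mantissa
```

With 10 mantissa bits, representable values near 2048 are spaced 2 apart, so an addend of 1.0 disappears entirely and an addend of 2.5 contributes only 2.0.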

Using these insights, we derived a set of equations and introduced a new metric, the variance retention ratio (VRR) of a reduced-precision accumulation, in the context of the three deep learning GEMM functions. The VRR is a function of only the accumulation length and the number of accumulation bits, so it can be computed without any simulation (Fig. 2). The VRR can be used to assess the suitability, or lack thereof, of a precision configuration, which allows us to determine accumulation bit-widths for precise tailoring of deep learning computation hardware. The VRR calculations in Fig. 2 show that chunk-based accumulation for typical deep learning computations can preserve accuracy down to 9 bits of accumulation precision, m_acc (which corresponds to fp16 accumulation), while traditional non-chunked accumulation needs a much higher m_acc (up to 15 bits, corresponding to fp32 accumulation). This reduced accumulation bit-width requirement translates directly to a 1.5–2.2× improvement in hardware energy efficiency (as indicated in Fig. 1).
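The benefit of chunking can be seen in a small simulation. The sketch below is not the closed-form VRR from the paper, which is not reproduced here. It only contrasts a plain running sum with a chunk-based sum, where values are first accumulated within fixed-size chunks and the chunk totals are then accumulated. The helpers `round_to_bits`, `naive_sum`, and `chunked_sum` are hypothetical, the 5-bit accumulator is an arbitrary illustrative choice, and the chunk size of 64 follows Fig. 2.

```python
import math

def round_to_bits(x: float, m_bits: int) -> float:
    """Round x to m_bits explicit mantissa bits (exponent range ignored)."""
    if x == 0.0:
        return 0.0
    frac, exp = math.frexp(x)        # x == frac * 2**exp with 0.5 <= |frac| < 1
    scale = 1 << (m_bits + 1)        # implicit leading bit plus m_bits
    return math.ldexp(round(frac * scale) / scale, exp)

def naive_sum(values, m_acc: int) -> float:
    """Single running sum, rounded to m_acc bits after every addition."""
    acc = 0.0
    for v in values:
        acc = round_to_bits(acc + v, m_acc)
    return acc

def chunked_sum(values, m_acc: int, chunk_size: int = 64) -> float:
    """Sum within chunks first, then sum the chunk totals (same precision)."""
    total = 0.0
    for start in range(0, len(values), chunk_size):
        partial = naive_sum(values[start:start + chunk_size], m_acc)
        total = round_to_bits(total + partial, m_acc)
    return total

values = [1.0] * 4096
print(naive_sum(values, m_acc=5))     # 64.0: the running sum swamps new terms
print(chunked_sum(values, m_acc=5))   # 4096.0: operands stay comparable in size
```

Chunking helps because each addition combines operands of similar magnitude: within a chunk the partial sum never grows far beyond the addends, and across chunks the totals being added are of the same order as the running total.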

Figure 2. Normalized variance lost as a function of accumulation length for different values of accumulation bit-width, m_acc, for (a) a normal accumulation (no chunking) and (b) a chunk-based accumulation (chunk size of 64). The “knees” in each plot correspond to the maximum accumulation length for a given precision, which indicates how the VRR is to be used to select a suitable precision.

Using the analysis, we successfully predicted and experimentally verified the minimum accumulation precisions required by the three GEMM functions across three popular benchmarking networks (CIFAR-10 ResNet34, ImageNet ResNet18, and ImageNet AlexNet). Our results show that our method accurately pinpoints the minimum precision needed for the benchmark networks to converge to the full-precision baseline. On the practical side, this analysis is a useful tool for hardware designers implementing reduced-precision processing hardware. We believe this work addresses a critical missing link on the path to ultra-low-precision hardware for DNN training.

Research Staff Member, IBM Research

Chia-Yu Chen

Research Staff Member, IBM Research

Kailash Gopalakrishnan

IBM Fellow and Senior Manager, Accelerator Architectures and Machine Learning, IBM Research
