Ultra-Low-Precision Training of Deep Neural Networks

Over the past few years, reduced-precision techniques have proven exceptionally effective in accelerating deep learning training and inference applications. IBM Research has played a leadership role in developing reduced-precision technologies and pioneered a number of key breakthroughs, including the first 16-bit reduced-precision systems for deep learning training (presented at ICML 2015), the first 8-bit training techniques (presented recently at NeurIPS 2018), and state-of-the-art 2-bit inference results (published at SysML 2019). Following this line of work, we now introduce a new breakthrough that solves a long-ignored yet important problem in reduced-precision deep learning: accumulation bit-width scaling for ultra-low-precision training of deep neural networks (DNNs).

The most commonly used arithmetic function in deep learning is the dot product, which is the building block of generalized matrix multiplication (GEMM) and convolution computations. A dot product is computed using multiply-accumulate (MAC) operations and thus requires two floating-point computations: multiplication of two numbers and accumulation of the product into a partial sum. Today, much of the effort on reduced-precision deep learning focuses solely on quantizing representations, i.e., the input operands to the multiplication operation. The other major portion of dot product computations, i.e., the partial sum accumulation, has always been kept in full (32-bit) precision. The reason is that reduced-precision accumulation can result in severe training instability and degradation in model accuracy, as shown in Fig. 1a. This is especially unfortunate, since the area and power of the hardware become dominated by the accumulator bit-width as the precisions are aggressively reduced. As shown in Fig. 1b, accumulating in high precision severely limits the hardware benefits of reduced-precision data representations and computation. The absence of any framework to analyze the precision requirements of partial sum accumulations inevitably results in very conservative design choices.
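To make the two halves of a MAC operation concrete, here is a minimal sketch in Python with NumPy. The function name `mac_dot` and the `acc_dtype` parameter are hypothetical names chosen for illustration, and standard IEEE fp16 stands in for a reduced-precision accumulator; the code does not model any particular hardware design.

```python
import numpy as np

def mac_dot(a, b, acc_dtype=np.float32):
    """Dot product via explicit multiply-accumulate (MAC) steps.

    Each product is formed in float32 (an assumption made for simplicity),
    then added into a partial sum that is rounded to acc_dtype every step.
    """
    acc = acc_dtype(0.0)
    for x, y in zip(a, b):
        product = np.float32(x) * np.float32(y)    # multiplication step
        acc = acc_dtype(acc + acc_dtype(product))  # accumulation step
    return acc

ones = np.ones(4096, dtype=np.float32)

full = mac_dot(ones, ones, acc_dtype=np.float32)     # 32-bit accumulator
reduced = mac_dot(ones, ones, acc_dtype=np.float16)  # 16-bit accumulator

print(full)     # 4096.0
print(reduced)  # 2048.0: the fp16 partial sum stops growing at 2048
```

The inputs and products are identical in both calls; only the accumulator width differs. The all-ones input is a deliberately adversarial case, chosen because it makes the effect exact and easy to check by hand.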

Figure 1. The importance of accumulation precision. (a) Convergence curves of an ImageNet ResNet18 experiment using reduced precision accumulation. The current practice is to keep the accumulation in full precision to avoid such divergence. (b) Estimated area benefits when reducing the precision of a floating-point unit (FPU). The terminology FPa/b denotes an FPU whose multiplier and adder use a and b bits, respectively. Our work enables convergence in reduced precision accumulation and gains an extra 1.5–2.2× area reduction.

In our paper published at ICLR 2019, we take a step forward and present a comprehensive statistical model to analyze the impact of reduced-precision accumulation in deep learning training. Two critical insights emerged. First, when we accumulate a dot product in reduced precision, the loss of information is primarily due to so-called “swamping error”. In floating-point arithmetic, swamping occurs when a small number is added to a much larger one: the bits of the small number that fall below the precision of the large number are completely or partially truncated out of the addition. Second, swamping error harms the statistics (i.e., the variance) of a dot product. To ensure stable convergence of DNNs, it is necessary to preserve the variance of dot products under reduced precision.
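Swamping is easy to reproduce with standard IEEE fp16 values in NumPy. The snippet below is only an illustration of the effect described above; the variable names are hypothetical and the specific numbers are chosen so that the rounding is unambiguous.

```python
import numpy as np

# In fp16, representable values between 2048 and 4096 are spaced 2.0 apart.
big = np.float16(2048.0)

# Full swamping: 0.5 is less than half of that spacing, so it is lost entirely.
fully_swamped = big + np.float16(0.5)

# Partial swamping: 2048 + 2.5 = 2050.5 rounds to 2050, so only 2.0 of the
# 2.5 that was added actually survives in the result.
partially_swamped = big + np.float16(2.5)

print(fully_swamped)      # 2048.0
print(partially_swamped)  # 2050.0
```

In a long accumulation the partial sum plays the role of `big`, and each incoming product plays the role of the small addend.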

Using these insights, we derived a set of equations and introduced a new metric, the variance retention ratio (VRR) of a reduced-precision accumulation, in the context of the three deep learning GEMM functions. The VRR is a function only of the accumulation length and the number of bits used for accumulation, so it can be computed without any simulation (Fig. 2). The VRR can be used to assess the suitability, or lack thereof, of a precision configuration, and allows us to determine accumulation bit-widths for precise tailoring of deep learning computation hardware. From the VRR calculations in Fig. 2, it can be seen that chunk-based accumulations for typical deep learning computations can preserve accuracy down to 9 bits of accumulation precision, m_acc (which corresponds to fp16 accumulations), while traditional non-chunked additions need a much higher m_acc (of up to 15 bits, corresponding to fp32 accumulations). This reduced accumulation bit-width requirement translates directly to a 1.5–2.2× improvement in hardware energy efficiency (as indicated in Fig. 1).
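The difference between normal and chunk-based accumulation can be sketched in a few lines of plain Python. This is not the VRR formula from the paper, and it is not the authors' implementation: the helper `round_to_mantissa` is a hypothetical software model that rounds to m_acc explicit mantissa bits with round-half-to-even and assumes an unlimited exponent range, and the function names are chosen for illustration only.

```python
import math

def round_to_mantissa(x, m_acc):
    """Round x to m_acc explicit mantissa bits (round-half-to-even).

    Assumption: exponent range is unlimited, so only mantissa loss is modeled.
    """
    if x == 0.0:
        return 0.0
    frac, exp = math.frexp(x)   # x = frac * 2**exp with 0.5 <= |frac| < 1
    scale = 2 ** (m_acc + 1)    # implicit leading bit plus m_acc stored bits
    return math.ldexp(round(frac * scale) / scale, exp)

def accumulate(values, m_acc):
    """Normal accumulation: one running sum, rounded after every addition."""
    acc = 0.0
    for v in values:
        acc = round_to_mantissa(acc + v, m_acc)
    return acc

def chunked_accumulate(values, m_acc, chunk_size=64):
    """Chunk-based accumulation: sum each chunk, then sum the chunk results."""
    partials = [accumulate(values[i:i + chunk_size], m_acc)
                for i in range(0, len(values), chunk_size)]
    return accumulate(partials, m_acc)

ones = [1.0] * 4096
naive = accumulate(ones, m_acc=9)
chunked = chunked_accumulate(ones, m_acc=9)

print(naive)    # 1024.0: the running sum swamps every later addend
print(chunked)  # 4096.0: operands within each stage stay similar in magnitude
```

Chunking helps because no single addition combines a very large partial sum with a very small addend. The all-ones input is a worst case picked for an exact, hand-checkable result; it says nothing about which m_acc a real network needs, which is the question the VRR analysis answers.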

Figure 2. Normalized variance lost as a function of accumulation length for different values of accumulation bit-width, m_acc, for (a) a normal accumulation (no chunking) and (b) a chunk-based accumulation (chunk size of 64). The “knees” in each plot correspond to the maximum accumulation length for a given precision, which indicates how the VRR is to be used to select a suitable precision.

Using this analysis, we successfully predicted and experimentally verified the minimum accumulation precisions required by the three GEMM functions across three popular benchmarking networks (CIFAR-10 ResNet34, ImageNet ResNet18, and ImageNet AlexNet). Our results show that the method accurately pinpoints the minimum precision needed for these benchmark networks to converge to the full-precision baseline. On the practical side, this analysis is a useful tool for hardware designers implementing reduced-precision processing hardware. We believe this work addresses a critical missing link on the path to ultra-low-precision hardware for DNN training.

Research Staff Member, IBM Research

Chia-Yu Chen

Research Staff Member, IBM Research

Kailash Gopalakrishnan

Distinguished Research Staff Member, IBM Research
