State-of-the-art hardware platforms for training deep neural networks (DNNs) are moving from traditional single-precision (32-bit) computations towards 16-bit precision, in large part because reduced-precision representations are more energy efficient and need less storage. In the Conference on Neural Information Processing Systems (NeurIPS) paper “Training Deep Neural Networks with 8-bit Floating Point Numbers” (authors: Naigang Wang, Jungwook Choi, Daniel Brand, Chia-Yu Chen, Kailash Gopalakrishnan) we demonstrate, for the first time, the successful training of DNNs using 8-bit floating point numbers (FP8) while fully maintaining accuracy on a spectrum of deep learning models and datasets. The associated performance and energy improvements will serve as the next boost to AI model development, providing the hardware to rapidly train and deploy broad AI at the data center and the edge.
Digital AI accelerators with approximate computing
Historically, high-performance computing has relied on high precision 64- and 32-bit floating point arithmetic. This approach delivers accuracy critical for scientific computing tasks like simulating the human heart or calculating space shuttle trajectories. But do we need this level of accuracy for common perception and reasoning tasks such as speech recognition, image classification and language translation? The answer is that many of these tasks (accomplished today using deep learning) can be computed effectively with approximate techniques.
Since full precision is rarely required for common deep learning workloads, reduced precision is a natural direction. Computational building blocks with 16-bit precision engines are typically 4 times smaller than comparable blocks with 32-bit precision; this gain in area efficiency directly translates into a significant boost in performance and power efficiency for both AI training and inference workloads. Simply stated, in approximate computing, we can trade numerical precision for computational throughput, provided we also develop algorithmic improvements to preserve model accuracy. This approach also complements other approximate computing techniques, including our recent work on novel training compression approaches that cut communications overhead, leading to a 40–200x speedup over existing compression methods.
The next platform for AI training
Over the past five years, IBM Research has pioneered a number of key breakthroughs that enable reduced-precision techniques to work exceptionally well for both deep learning training and inference applications. In 2015, at the International Conference on Machine Learning (ICML), we first demonstrated that the precision of deep learning training systems could be reduced from 32 bits to 16 bits while still fully preserving model accuracy. We also demonstrated that dataflow chip architectures could be used to fully harness the benefits of scaled precision in hardware. In the ensuing years, the reduced-precision approach was quickly adopted as the industry standard, with 16-bit training and 8-bit inference systems now commonplace. Building on this leadership, IBM researchers have now broken through the next big barriers for low precision, achieving 8-bit precision for training and 4-bit precision for inference, across a range of deep learning datasets and neural networks.
Breaking through the 16-bit barrier for training
Unlike inference, training with numbers represented with fewer than 16 bits has been very challenging because the gradient computations and weight updates during back-propagation must retain their fidelity.
There are three primary challenges that make it difficult to scale precision below 16 bits while fully preserving model accuracy. First, when all the operands (i.e., weights, activations, errors, and gradients) for general matrix multiplication (GEMM) and convolution computations are simply reduced to 8 bits, most DNNs suffer noticeable accuracy degradation. Second, reducing the bit precision of accumulations in GEMM from 32 bits to 16 bits significantly impacts the convergence of DNN training. This is why commercially available hardware platforms exploiting scaled precision for training (including GPUs) continue to use 32 bits of precision for accumulation, even though reducing accumulation precision below 32 bits is critically important for reducing the area and power of 8-bit hardware. Finally, reducing the bit precision of weight updates to 16-bit floating-point impacts accuracy, while the 32-bit weight updates used in today’s systems require an extra copy of the high-precision weights and gradients to be kept in memory, which is expensive.
Although this was previously considered impossible, we have overcome all of these challenges (and orthodoxies) associated with reducing training precision below 16 bits. Specifically, we:
- Devised a new 8-bit floating-point (FP8) format that, in combination with DNN training insights on precision setting for the first and last layers of a deep network, allows GEMM and convolution computations for deep learning to work without loss in model accuracy.
- Developed a new technique called chunk-based computation that, when applied hierarchically, allows all matrix and convolution operations to be computed using only 8-bit multiplications and 16-bit additions (instead of 16 and 32 bits, respectively).
- Applied floating-point stochastic rounding in the weight update process, allowing these updates to be computed with 16 bits of precision (instead of 32 bits).
- Demonstrated the wide applicability of the combined effects of these techniques across a suite of deep learning models and datasets while fully preserving model accuracy.
Using the techniques listed above, we have achieved model accuracy on par with FP32 across a spectrum of models and datasets, as shown in Figure 2.
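To make the stochastic rounding idea from the list above more concrete, here is a minimal sketch, not the implementation used in the paper. It assumes IEEE half precision (via Python's `struct` module) as a stand-in for the 16-bit weight format, and the helper names `fp16` and `stochastic_round_fp16` are hypothetical. It applies a weight update that is smaller than half the spacing between neighbouring 16-bit values, once with round-to-nearest and once with stochastic rounding.

```python
import random
import struct


def fp16(x):
    """Round x to the nearest IEEE half-precision value (assumed stand-in format)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def stochastic_round_fp16(x, rng):
    """Round x to one of its two neighbouring FP16 values.

    The neighbour farther from zero is chosen with probability equal to the
    fractional position of |x| between the two neighbours, so the result is
    unbiased on average. Assumes |x| lies within the FP16 range.
    """
    sign = -1.0 if x < 0 else 1.0
    mag = abs(x)
    nearest = fp16(mag)
    if nearest == mag:
        return x  # already exactly representable
    # For non-negative halves, adjacent bit patterns are adjacent values.
    bits = struct.unpack("<H", struct.pack("<e", nearest))[0]
    if nearest > mag:
        hi = nearest
        lo = struct.unpack("<e", struct.pack("<H", bits - 1))[0]
    else:
        lo = nearest
        hi = struct.unpack("<e", struct.pack("<H", bits + 1))[0]
    p_up = (mag - lo) / (hi - lo)
    return sign * (hi if rng.random() < p_up else lo)


rng = random.Random(0)      # fixed seed so the run is repeatable
update = 2.0 ** -13         # one eighth of the FP16 spacing near 1.0
w_nearest = 1.0
w_stochastic = 1.0
for _ in range(4096):
    w_nearest = fp16(w_nearest + update)
    w_stochastic = stochastic_round_fp16(w_stochastic + update, rng)

print(w_nearest)     # 1.0: every update is rounded away
print(w_stochastic)  # close to the exact result, 1.5
```

With round-to-nearest, each small update is discarded and the weight never moves. With stochastic rounding, roughly one update in eight rounds the weight up by a full step, so the accumulated value tracks the exact sum in expectation.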
One of the biggest challenges in scaling down accumulation precision for training systems is the loss of information that occurs when we accumulate a dot-product in scaled precision. For example, when a large-valued floating-point number is added to a small-valued floating-point number in reduced-precision arithmetic, the resulting sum does not adequately capture information from the small-valued number, particularly if the ratio of the large number to the small number is beyond the representation limits of the reduced-precision floating-point format. As expected, this problem worsens as the length of the dot-product increases.
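The effect can be reproduced in a few lines. The sketch below is illustrative only: it assumes IEEE half precision (via Python's `struct` module) as the reduced-precision format, which is not necessarily the 16-bit format used in our hardware, and `fp16` and `naive_sum` are hypothetical helper names.

```python
import struct


def fp16(x):
    """Round x to the nearest IEEE half-precision value (assumed stand-in format)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


# Near 2048 the spacing between FP16 values is 2, so adding 1 has no effect.
print(fp16(2048.0 + 1.0))  # 2048.0


def naive_sum(values):
    """Sequential accumulation with every addition rounded to FP16."""
    acc = 0.0
    for v in values:
        acc = fp16(acc + v)
    return acc


# The exact answer is n, but the running sum stalls once it reaches 2048.
for n in (1024, 2048, 4096, 8192):
    print(n, naive_sum([1.0] * n))
```

Sums of up to 2048 ones are exact, while the longer sums stay stuck at 2048.0, which illustrates why the error grows with the length of the dot-product.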
Using this insight, we’ve broken up the accumulation component of deep learning dot-products into chunks: the products within each chunk are summed separately and the partial sums are then added together, so no running total grows far beyond the values being added to it. This allows us to drive precision scaling to FP8 for multiplications while using higher-precision FP16 for accumulations. When applied hierarchically, this approach allows all matrix and convolution operations to be computed using only 8-bit multiplications and 16-bit floating-point additions (instead of the 16 and 32 bits, respectively, that are widely used today). Furthermore, unlike sorting-based summation techniques, this technique requires little additional computational overhead.
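A two-level version of this idea can be sketched as follows. This is a simplified illustration rather than our hardware algorithm: it assumes IEEE half precision (via Python's `struct` module) for every product and addition, and the function name `chunked_dot` and the chunk size of 64 are hypothetical choices made for the example.

```python
import struct


def fp16(x):
    """Round x to the nearest IEEE half-precision value (assumed stand-in format)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def chunked_dot(a, b, chunk_size):
    """Dot product accumulated in two levels, rounding every step to FP16."""
    partial_sums = []
    for start in range(0, len(a), chunk_size):
        acc = 0.0
        chunk = zip(a[start:start + chunk_size], b[start:start + chunk_size])
        for x, y in chunk:
            acc = fp16(acc + fp16(x * y))  # intra-chunk accumulation
        partial_sums.append(acc)
    total = 0.0
    for p in partial_sums:
        total = fp16(total + p)  # inter-chunk accumulation
    return total


a = [1.0] * 4096
b = [1.0] * 4096
# A single chunk covering the whole vector is plain sequential accumulation.
naive = chunked_dot(a, b, chunk_size=len(a))
chunked = chunked_dot(a, b, chunk_size=64)
print(naive, chunked)  # 2048.0 4096.0
```

In this example the chunked version recovers the exact result because neither the intra-chunk sums nor the sum of partial sums ever adds a small value to a much larger one. The extra cost is one additional addition per chunk.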
Figure 3 shows the gains in model accuracy from employing our chunk-based accumulation approach during ResNet50 model training. Figure 4 further shows that we’ve implemented many of these algorithmic techniques in hardware (as shown in our 14 nm technology testchip layout), demonstrating that chunk accumulation engines can be used synergistically with reduced-precision dataflow engines to provide high performance without any significant increase in hardware overhead. The use of these novel techniques lays the foundation for a new generation of hardware training platforms with the potential for 2–4x improvement in throughput over today’s systems. Furthermore, the 2x reduction in training precision bit-width and the >2–4x improvement in training energy enable deep learning models to be trained, refined, and customized on a spectrum of edge devices while fully preserving model accuracy and maintaining privacy.
Path ahead for approximate computing
In addition to breakthroughs in training, we have published industry-first 4-bit inference accuracy results across deep learning workloads. We continue to drive innovations across algorithms, applications, programming models, and architecture to extend the scalability of our digital AI core accelerators. Using our approach, a single chip architecture can execute training and inference across a range of workloads and networks, large and small. The associated gains in performance and energy efficiency are paving the way for broad AI proliferation from the cloud to the edge.
Approximate computing is a central tenet of our approach to harnessing the physics of AI, in which highly energy-efficient computing gains are achieved by purpose-built architectures, initially using digital computations and later including analog and in-memory computing.