December 12, 2019 | Written by: Xiao Sun and Kailash Gopalakrishnan
Categorized: AI Hardware
Over the past few years, reduced precision techniques have proven exceptionally effective at accelerating deep learning training and inference on AI hardware. State-of-the-art hardware platforms for training deep neural networks (DNNs) have largely evolved from traditional single-precision floating point (FP32) computation toward FP16 precision, in large part due to the higher energy efficiency and smaller bit storage of reduced-precision representations.
IBM Research has played a leading role in developing reduced precision technologies and pioneered a number of key breakthroughs, including the first 8-bit training techniques (presented at NeurIPS 2018), and state-of-the-art 2-bit inference results (presented at SysML 2019).
For more computation-intensive training, the options below FP16 precision are limited. The most common 8-bit solutions, which adopt an INT8 format, are limited to inference only, not training. In addition, it is difficult to show that existing reduced-precision training and inference below 16 bits remain effective for deep learning domains other than common image classification networks like ResNet-50.
At this year’s NeurIPS conference, IBM Research continues to advance its 8-bit training platform to improve performance and maintain accuracy for the most challenging emerging deep learning models, as presented in the NeurIPS paper “Hybrid 8-bit Floating Point (HFP8) Training and Inference for Deep Neural Networks.”
Last year IBM Research demonstrated an FP8 scheme that worked robustly with convolutional networks such as ResNet-50. The new hybrid method fully preserves model accuracy across a broader spectrum of deep learning models during training. The Hybrid FP8 format also overcomes the training accuracy loss previously seen on models like MobileNet (vision) and Transformer (NLP), which are more susceptible to information loss from quantization. To overcome this challenge, the Hybrid FP8 scheme adopts one FP8 format in the forward path for higher resolution and a different FP8 format for gradients in the backward path for larger dynamic range.
Figure 1: The Hybrid FP8 scheme uses two different FP8 formats – (1,4,3) with an exponent bias for weights and activations, and (1,5,2) with loss scaling for gradients.
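The resolution-versus-range tradeoff between the two formats can be sketched with a small simulation of rounding a value into an arbitrary sign/exponent/mantissa format. This is an illustrative sketch only: the bias values used below are common textbook defaults, not necessarily the exact biases used in the HFP8 paper, and special values (NaN/Inf) are simply clamped.

```python
import math

def quantize_fp(x, exp_bits, man_bits, bias):
    """Round x to the nearest value representable with the given
    exponent/mantissa widths and exponent bias (simulated in float64)."""
    if x == 0.0:
        return 0.0
    sign = math.copysign(1.0, x)
    mag = abs(x)
    e = math.floor(math.log2(mag))        # unbiased exponent of mag
    e_min = 1 - bias                      # smallest normal exponent
    e_max = (1 << exp_bits) - 2 - bias    # largest exponent (all-ones reserved)
    if e < e_min:
        e = e_min                         # subnormal range: fixed scale
    scale = 2.0 ** (e - man_bits)
    q = round(mag / scale) * scale        # round mantissa to man_bits bits
    max_val = (2.0 - 2.0 ** -man_bits) * 2.0 ** e_max
    return sign * min(q, max_val)         # clamp overflow to the max magnitude

# (1,4,3) has fine steps near 1.0 but a small maximum magnitude;
# (1,5,2) is coarser but covers a far wider range for gradients.
print(quantize_fp(1.1, 4, 3, 7))    # fine resolution: 1.125
print(quantize_fp(1e6, 4, 3, 7))    # clamps to 240.0
print(quantize_fp(1e6, 5, 2, 15))   # clamps much later, at 57344.0
```

With illustrative biases of 7 and 15, the largest (1,4,3) magnitude is 240 while (1,5,2) reaches 57344 – which is why the gradient path, whose values span many orders of magnitude, gets the wider-range format plus loss scaling.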
This new hybrid format fully preserves model accuracy across a wide spectrum of deep learning models in image classification, natural language processing, and speech and object detection.
Baseline vs. Hybrid FP8 training on Image, Language, Speech, and Object-Detection Models
Figure 2: IBM Research’s HFP8 scheme achieves comparable accuracy to FP32 across a suite of complex models for vision, speech, and language.
This new scheme offers several benefits. For existing models already trained in higher-precision formats, the new FP8 formats allow straightforward mapping of such networks to inference deployments without re-training. Low-precision inference models typically require time-consuming re-training of the network for those formats; this 8-bit solution, however, requires no tuning or quantization-aware training prior to inference deployment.
Finally, once local computation has been accelerated by 8-bit training, the communication of weight gradients becomes the bottleneck in distributed learning. A new weight update protocol for distributed training broadcasts 8-bit weights to make better use of bandwidth, further reducing training time by 30–60 percent.
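To illustrate why broadcasting 8-bit weights saves bandwidth, here is a generic compress/decompress sketch using int8 with a hypothetical per-tensor scale. This is not the paper's exact FP8 broadcast protocol – just a minimal example of shipping one byte per weight instead of four:

```python
import numpy as np

def encode_int8(w):
    """Compress a float32 weight tensor to int8 plus one scale factor,
    cutting broadcast traffic to roughly 1/4 of FP32."""
    max_abs = float(np.max(np.abs(w)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.rint(w / scale), -127, 127).astype(np.int8)
    return q, scale

def decode_int8(q, scale):
    """Reconstruct approximate float32 weights on each receiving worker."""
    return q.astype(np.float32) * scale

w = np.array([0.5, -1.0, 0.25], dtype=np.float32)
q, s = encode_int8(w)          # 3 bytes of payload instead of 12
w_hat = decode_int8(q, s)      # within one quantization step of w
```

A single per-tensor scale keeps the metadata overhead negligible; finer-grained (e.g. per-channel) scales would trade a little extra traffic for lower quantization error.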
HFP8 is part of IBM Research's work on Digital AI Cores within the IBM Research AI Hardware Center, opened earlier this year, and part of the center's ambitious roadmap for AI acceleration. These advances address a critical need in AI hardware: delivering increased model processing power while managing energy consumption.