
Extending 8-bit training breakthroughs to the toughest deep learning models


Over the past few years, reduced-precision techniques have proven exceptionally effective in accelerating deep learning training and inference on AI hardware. State-of-the-art hardware platforms for training deep neural networks (DNNs) have largely moved from traditional single-precision floating point (FP32) computation toward 16-bit (FP16) precision, in large part because reduced-precision representations offer higher energy efficiency and need fewer bits of storage per value.

IBM Research has played a leading role in developing reduced-precision technologies and pioneered a number of key breakthroughs, including the first 8-bit training techniques (presented at NeurIPS 2018) and state-of-the-art 2-bit inference results (presented at SysML 2019).

For computation-intensive training, options below 16-bit precision are limited. The most common 8-bit solutions, which adopt an INT8 format, are limited to inference, not training. In addition, it is unclear whether existing reduced-precision training and inference techniques below 16 bits carry over to deep learning domains other than common image classification networks such as ResNet50.

At this year’s NeurIPS conference, IBM Research continues to advance its 8-bit training platform to improve performance and maintain accuracy for the most challenging emerging deep learning models, as presented in the NeurIPS paper “Hybrid 8-bit Floating Point (HFP8) Training and Inference for Deep Neural Networks.”

Last year IBM Research demonstrated an FP8 scheme that worked robustly with convolutional networks such as ResNet50. The new hybrid method preserves model accuracy across a broader spectrum of deep learning models. In particular, the Hybrid FP8 format overcomes the training accuracy loss previously seen on models such as MobileNet (vision) and Transformer (NLP), which are more susceptible to information loss from quantization. To do so, the Hybrid FP8 scheme uses one FP8 format in the forward path for higher resolution and a different FP8 format for gradients in the backward path for larger range.
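To make the range-versus-resolution trade-off concrete, the sketch below computes the extremes and step size of the two sign/exponent/mantissa splits given in the Figure 1 caption, (1,4,3) and (1,5,2). It assumes IEEE-style default exponent biases, subnormals at exponent code 0, and no codes reserved for inf/NaN. These are simplifications chosen for illustration, not necessarily the paper's exact encoding, and the function name `fp8_format_stats` is hypothetical.

```python
from typing import Optional


def fp8_format_stats(exp_bits: int, man_bits: int, bias: Optional[int] = None) -> dict:
    """Range and resolution of a float format with 1 sign bit.

    Assumed encoding (a simplification): every exponent code holds finite
    values (no inf/NaN codes) and exponent code 0 encodes subnormals.
    """
    if bias is None:
        bias = 2 ** (exp_bits - 1) - 1  # IEEE-style default bias
    max_exp = (2 ** exp_bits - 1) - bias  # largest unbiased exponent
    min_exp = 1 - bias                    # smallest normal exponent
    return {
        "max": (2 - 2.0 ** -man_bits) * 2.0 ** max_exp,
        "min_normal": 2.0 ** min_exp,
        "min_subnormal": 2.0 ** (min_exp - man_bits),
        "rel_step": 2.0 ** -man_bits,  # spacing relative to the binade's lower edge
    }


for name, e, m in [("forward (1,4,3)", 4, 3), ("backward (1,5,2)", 5, 2)]:
    print(name, fp8_format_stats(e, m))
```

Under these assumptions, (1,4,3) covers magnitudes from 2^-9 up to 480 with a relative step of 1/8, while (1,5,2) reaches from 2^-16 up to 114688 with a coarser step of 1/4: one extra exponent bit buys range at the cost of a mantissa bit. Reserving the top exponent code for inf/NaN, as IEEE formats do, would lower the maxima (to 57344 for (1,5,2)), and raising the bias by k shifts the whole window down by a factor of 2^k without changing its width.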


Figure 1: The Hybrid FP8 scheme uses two different FP8 formats: (1,4,3) with an exponent bias for weights and activations, and (1,5,2) for gradients with loss scaling.
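The caption's two ingredients, rounding to a narrow format and loss scaling, can be sketched in a few lines of Python. This is a simulated ("fake") quantizer that rounds a float to the nearest value of a given format rather than producing bit patterns. It assumes round-half-to-even, saturation on overflow, subnormals, no inf/NaN codes, finite inputs, and IEEE-style biases of 7 and 15, since the paper's exact bias values are not given here. The helper `quantize_gradients` is hypothetical, and multiplying gradients by the scale stands in for scaling the loss before backpropagation.

```python
import math
from typing import List


def fp8_quantize(x: float, exp_bits: int, man_bits: int, bias: int) -> float:
    """Round finite x to the nearest value of a (1, exp_bits, man_bits) format.

    Assumed encoding (a simplification): subnormals at exponent code 0,
    no inf/NaN codes, saturation on overflow, round-half-to-even.
    """
    max_val = (2 - 2.0 ** -man_bits) * 2.0 ** ((2 ** exp_bits - 1) - bias)
    a = abs(x)
    if a == 0.0:
        return x
    if a >= max_val:
        return math.copysign(max_val, x)  # saturate instead of overflowing
    _, e = math.frexp(a)                  # a = m * 2**e with 0.5 <= m < 1
    e = max(e - 1, 1 - bias)              # binade exponent, floored at the subnormal range
    step = 2.0 ** (e - man_bits)          # spacing of representable values there
    return math.copysign(min(round(a / step) * step, max_val), x)


FWD = (4, 3, 7)   # (1,4,3) for weights and activations; bias 7 is an assumption
BWD = (5, 2, 15)  # (1,5,2) for gradients; bias 15 is an assumption


def quantize_gradients(grads: List[float], loss_scale: float) -> List[float]:
    """Hypothetical helper: store gradients in (1,5,2) under loss scaling.

    Multiplying by loss_scale stands in for scaling the loss before backprop;
    the division back to true scale happens in full precision.
    """
    return [fp8_quantize(g * loss_scale, *BWD) / loss_scale for g in grads]


print(fp8_quantize(0.3, *FWD))             # 0.3125
print(quantize_gradients([1e-6], 1.0))     # [0.0]: underflows without scaling
print(quantize_gradients([1e-6], 1024.0))  # [9.5367431640625e-07], i.e. 2**-20
```

Without scaling, a gradient of 1e-6 falls below half of the smallest (1,5,2) subnormal, 2^-16, and rounds to zero. Scaling by 1024 first lifts it into the representable range, and dividing afterward recovers 2^-20 instead of losing the gradient entirely.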

This new hybrid format preserves model accuracy across a wide spectrum of deep learning models in image classification, natural language processing, speech, and object detection.

Baseline vs. Hybrid FP8 training on image, language, speech, and object-detection models


Figure 2: IBM Research’s HFP8 scheme achieves accuracy comparable to FP32 across a suite of complex models for vision, speech, and language.

This new scheme offers several benefits. For existing models already trained in higher-precision formats, the new FP8 formats allowed such networks to be mapped directly to inference deployments without retraining. Low-precision inference models typically require time-consuming retraining of the network for those formats; this 8-bit solution, by contrast, requires no tuning or quantization-aware training before inference deployment.
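Mapping an already-trained model to FP8 for inference can be as simple as rounding each weight once, with no calibration pass or retraining. The following is a minimal sketch assuming a (1,4,3) format with an IEEE-style bias of 7, subnormals, saturation, and no inf/NaN codes. The weights are random stand-ins rather than a trained network, and `map_to_fp8` is a hypothetical helper. The code checks only per-weight rounding error, which for weights in the normal range is at most 2^-4 relative; whether a model's accuracy survives is the empirical result reported in the paper, not something this sketch shows.

```python
import math
import random


def fp8_quantize(x: float, exp_bits: int = 4, man_bits: int = 3, bias: int = 7) -> float:
    """Round finite x to the nearest (1, exp_bits, man_bits) value.

    Assumed encoding: subnormals at exponent code 0, no inf/NaN codes,
    saturation on overflow, round-half-to-even.
    """
    max_val = (2 - 2.0 ** -man_bits) * 2.0 ** ((2 ** exp_bits - 1) - bias)
    a = abs(x)
    if a == 0.0:
        return x
    if a >= max_val:
        return math.copysign(max_val, x)
    _, e = math.frexp(a)          # a = m * 2**e with 0.5 <= m < 1
    e = max(e - 1, 1 - bias)      # binade exponent, floored at the subnormal range
    step = 2.0 ** (e - man_bits)
    return math.copysign(min(round(a / step) * step, max_val), x)


def map_to_fp8(weights: dict) -> dict:
    """Hypothetical post-training mapping: round every weight once, no retraining."""
    return {name: [fp8_quantize(w) for w in ws] for name, ws in weights.items()}


rng = random.Random(0)
trained = {"layer1": [rng.uniform(-1.0, 1.0) for _ in range(1000)]}  # stand-in weights
deployed = map_to_fp8(trained)

# For |w| >= 2**-6 (the smallest normal), rounding error is at most 2**-4 of |w|.
worst = max(
    abs(q - w) / abs(w)
    for w, q in zip(trained["layer1"], deployed["layer1"])
    if abs(w) >= 2.0 ** -6
)
print(f"worst relative error on normal-range weights: {worst:.4f}")
```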

Finally, once local computation is accelerated by 8-bit training, communicating weight gradients becomes the bottleneck in distributed learning. The new weight-update protocol for distributed training broadcasts 8-bit weights to make better use of bandwidth, which further reduces training time by 30 to 60 percent.
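The bandwidth saving from broadcasting 8-bit weights comes from sending one byte per weight instead of four. The following sketch packs a list of FP32 master weights into (1,4,3) bytes and decodes them on the receiving side. It models only the broadcast leg, not the gradient reduction or the paper's actual protocol. The encoding (bias 7, subnormals, no inf/NaN codes, nearest-even rounding with saturation) and all helper names are assumptions for illustration.

```python
import bisect
import struct
from typing import List

BIAS = 7  # assumed IEEE-style bias for (1,4,3); the paper's bias may differ


def decode_fp8(code: int) -> float:
    """Decode one (1,4,3) byte: sign bit, 4 exponent bits, 3 mantissa bits."""
    sign = -1.0 if code & 0x80 else 1.0
    exp = (code >> 3) & 0xF
    man = code & 0x7
    if exp == 0:  # subnormal: no implicit leading 1
        mag = man * 2.0 ** (1 - BIAS - 3)
    else:
        mag = (1 + man / 8) * 2.0 ** (exp - BIAS)
    return sign * mag


# Magnitudes of the 128 non-negative codes; ascending because every code is finite here.
MAGS = [decode_fp8(c) for c in range(128)]


def encode_fp8(x: float) -> int:
    """Nearest (1,4,3) code for a finite x, saturating at the largest magnitude."""
    a = abs(x)
    i = bisect.bisect_left(MAGS, a)
    if i == len(MAGS):
        code = len(MAGS) - 1  # overflow: saturate
    elif i == 0 or MAGS[i] == a:
        code = i  # exact hit (including zero)
    else:
        lo, hi = MAGS[i - 1], MAGS[i]
        if a - lo != hi - a:
            code = i - 1 if a - lo < hi - a else i
        else:
            code = i if i % 2 == 0 else i - 1  # tie: pick the even mantissa
    return code | (0x80 if x < 0 else 0)


def pack_weights_fp8(weights: List[float]) -> bytes:
    """Hypothetical broadcast payload: one byte per weight."""
    return bytes(encode_fp8(w) for w in weights)


def unpack_weights_fp8(payload: bytes) -> List[float]:
    """Receiver side: turn each byte back into a float."""
    return [decode_fp8(b) for b in payload]


master = [0.5, -0.25, 0.3, 1e-4, 1000.0]  # FP32 master weights kept by the sender
payload = pack_weights_fp8(master)
fp32_payload = struct.pack(f"<{len(master)}f", *master)
print(len(payload), len(fp32_payload))  # 5 20
print(unpack_weights_fp8(payload))
```

The byte count shows a 4x reduction for the broadcast message alone; the end-to-end training-time saving quoted above is a measured result from the paper, not something this arithmetic predicts.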

HFP8 is part of IBM Research’s work on Digital AI Cores within the IBM Research AI Hardware Center, opened earlier this year, and part of the center’s ambitious roadmap for AI acceleration. These advances address a critical need for AI hardware to keep pace with the growing compute demands of models while managing energy consumption.

Research Staff Member

Kailash Gopalakrishnan

Distinguished Research Staff Member, IBM Research
