Demo of event-based gesture recognition system IBM researchers will show at CVPR 2017
Event-based computation is a biologically inspired paradigm that represents data as asynchronous events, much like neuron spikes in the brain. The Brain-Inspired Computing group at IBM Research – Almaden has built the first gesture-recognition system implemented end-to-end on event-based hardware. Combining the IBM TrueNorth neurosynaptic processor with an iniLabs Dynamic Vision Sensor (DVS), we trained a spiking neural network to recognize 10 hand gestures in real time with 96.5 percent accuracy, within a tenth of a second of the start of each gesture, while consuming under 200 mW, far less power than frame-based systems built on traditional processors.
Event-based devices like the TrueNorth processor and the DVS event camera are modeled on the brain and retina, whose architectures differ fundamentally from those of a CPU and a digital camera. The brain solves complex vision problems faster and at lower power than conventional computers, even though its neurons and synapses are individually much slower than silicon transistors. In part, this is because biological neurons communicate with sparse, asynchronous events called spikes, which are transmitted only when a neuron detects enough input.
Because spikes are sent on-demand, biological neurons communicate more efficiently than conventional digital devices. For example, a digital camera scans out every pixel at a fixed frame rate, even when nothing changes in the scene, wasting power and bandwidth to send redundant information. By contrast, the eye’s retinal ganglion cells transmit spikes down the optic nerve only when they sense an actual stimulus in their local region of the visual field.
Figure 1: Unlike a conventional digital camera that samples a dense frame of pixel values at a periodic rate (top), the DVS event camera represents the spatiotemporal trajectory of a waving hand as a sparse stream of pixel events (bottom).
Mimicking the retina, an event camera like the iniLabs DVS transmits a data packet whenever a pixel detects a change in its own illumination (Figure 1). Since a pixel does not have to wait for the next frame readout to trigger a transmission event, the DVS reacts to stimuli within microseconds – tens of thousands of times faster than a consumer digital camera capturing video at 30 frames per second.
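The DVS pixel behavior described above can be sketched in a few lines of Python. This is an illustrative model, not the sensor's actual circuit: each pixel tracks a reference level and emits an ON or OFF event whenever its (log) intensity drifts past a fixed contrast threshold; the threshold value and the event representation here are assumptions for the sketch.

```python
import numpy as np

def dvs_events(log_intensity, timestamps, threshold=0.2):
    """Illustrative DVS pixel model: emit a (time, polarity) event whenever
    the pixel's log-intensity moves more than `threshold` away from the
    reference level set at the last event. Polarity is +1 (ON, brighter)
    or -1 (OFF, darker). No events are emitted while the scene is static."""
    events = []
    ref = log_intensity[0]  # reference level, updated at each event
    for t, x in zip(timestamps[1:], log_intensity[1:]):
        while x - ref > threshold:   # brightness rose past the threshold
            ref += threshold
            events.append((t, +1))
        while ref - x > threshold:   # brightness fell past the threshold
            ref -= threshold
            events.append((t, -1))
    return events

# A brightness ramp up then down yields a burst of ON events followed
# by a burst of OFF events; a constant signal would yield none.
trace = np.concatenate([np.linspace(0.0, 1.0, 50), np.linspace(1.0, 0.0, 50)])
ts = np.arange(len(trace))
evs = dvs_events(trace, ts)
```

The key property this sketch demonstrates is data-driven sparsity: output volume scales with scene activity rather than with a fixed frame clock.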
For best results, the DVS event stream should be processed by another natively event-based device. IBM’s TrueNorth processor is ideal for this purpose. TrueNorth is a massively parallel spiking-neural-network chip that can be configured with a network containing up to a million spiking neurons distributed across 4,096 neurosynaptic cores. TrueNorth programs, called corelets, are written in the Corelet Programming Language (CPL), a hierarchical, compositional, dataflow language implemented in MATLAB.
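TrueNorth neurons are integrate-and-fire units operating on integer arithmetic. The sketch below, in Python rather than CPL, shows the basic dynamic a single such neuron implements; the parameter names and the zero-floor simplification are illustrative assumptions, not the chip's actual (richer) configurable neuron model.

```python
def run_neuron(spike_trains, weights, leak, threshold, ticks):
    """Simplified integrate-and-fire neuron in the TrueNorth style:
    each discrete tick, add the weights of arriving input spikes plus a
    constant leak to the membrane potential; when the potential reaches
    the threshold, emit an output spike and reset. Integer arithmetic
    only, as on the chip."""
    v = 0
    fires = []
    for t in range(ticks):
        # integrate weighted input spikes arriving this tick
        v += sum(w for train, w in zip(spike_trains, weights) if t in train)
        v += leak          # constant leak applied every tick
        v = max(v, 0)      # simplification: floor the potential at zero
        if v >= threshold: # threshold crossing: spike, then reset
            fires.append(t)
            v = 0
    return fires

# One input axon firing every tick with weight 10 and leak -1 charges
# the neuron by 9 per tick, so it fires every 6 ticks once started.
spikes = [set(range(20))]
out = run_neuron(spikes, weights=[10], leak=-1, threshold=50, ticks=20)
```

Because all communication between such neurons is spikes, a network of them consumes power roughly in proportion to its spike traffic, which is what makes the sparse DVS input such a good match.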
Figure 2: The DVS event camera sends pixel events over USB to a board with a single TrueNorth processor, which filters the DVS event stream for recognized gestures and sends them over Ethernet to an external display. The system draws DC power from an AC adapter.
Our end-to-end gesture-recognition system connects a DVS event camera via USB to a board with a single TrueNorth processor, which sends output via Ethernet to an external display (Figure 2). To recognize gestures in the DVS event stream, we configure the TrueNorth processor using a spiking convolutional neural network (CNN) corelet that we train offline using our Eedn algorithm for training efficient hardware-constrained CNNs. Additional corelets filter the input and output of the CNN (Figure 3). Upon recognizing a gesture, the TrueNorth processor transmits the corresponding output event to a laptop for visualization.
Figure 3: A single 4,096-core TrueNorth processor recognizes gestures using a spiking neural network that implements a temporal filter cascade to preprocess the DVS event stream (a-b), a 16-layer convolutional neural network to detect and classify gestures (c-g), and a pair of output filters to smooth the prediction stream (h-j).
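To see what the output-filter stage accomplishes, consider the raw, per-tick class predictions emitted by the CNN: they can flicker briefly to a wrong class. A simple way to smooth such a prediction stream, sketched here in plain Python rather than as a spiking corelet (the window size and majority-vote rule are illustrative assumptions, not the actual filters h–j), is a sliding-window majority vote:

```python
from collections import Counter, deque

def smooth_predictions(raw_labels, window=8):
    """Replace each per-tick predicted label with the majority label over
    the last `window` ticks, suppressing single-tick misclassifications."""
    buf = deque(maxlen=window)
    smoothed = []
    for label in raw_labels:
        buf.append(label)
        smoothed.append(Counter(buf).most_common(1)[0][0])
    return smoothed

# A one-tick glitch in an otherwise steady prediction stream is removed.
raw = ["wave"] * 10 + ["drums"] + ["wave"] * 10
out = smooth_predictions(raw, window=8)
```

The trade-off in any such filter is latency versus stability: a longer window suppresses more noise but delays the report of a genuine gesture change.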
To train the TrueNorth gesture classifier, we collected a new DVS dataset containing hundreds of gesture samples (Figure 4). Gestures included various single-hand waves, both clockwise and counter-clockwise; random distractor gestures invented by each subject; and – just for fun – air drums and air guitar.
Figure 4: Examples of five gestures: right-hand wave, left-hand wave, arm roll, air drums, and air guitar. The bottom row shows the actual DVS events, binned over 5 ms. DVS pixels emit separate ON (magenta) and OFF (cyan) events to encode an increase or decrease in pixel illumination. Notice that the static background is absent from the DVS event stream, because unchanging pixels emit no events, which makes gesture recognition much easier.
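Visualizations like the bottom row of Figure 4 come from accumulating events into short time bins. A sketch of that binning step, assuming a simple `(time_ms, x, y, polarity)` event tuple (the actual DVS packet format differs):

```python
import numpy as np

def bin_events(events, width, height, bin_ms=5.0):
    """Accumulate (t_ms, x, y, polarity) events into two-channel frames:
    channel 0 counts ON (+1) events and channel 1 counts OFF (-1) events
    within consecutive windows of `bin_ms` milliseconds."""
    t_end = max(t for t, _, _, _ in events)
    n_bins = int(t_end // bin_ms) + 1
    frames = np.zeros((n_bins, 2, height, width), dtype=np.int32)
    for t, x, y, pol in events:
        b = int(t // bin_ms)            # which 5 ms window
        ch = 0 if pol > 0 else 1        # ON vs OFF channel
        frames[b, ch, y, x] += 1
    return frames

# Two ON events at one pixel in the first window, one OFF event in the next.
events = [(0.5, 3, 4, +1), (1.2, 3, 4, +1), (6.0, 7, 2, -1)]
frames = bin_events(events, width=16, height=16)
```

Binning is only needed for rendering and frame-based baselines; the TrueNorth pipeline itself consumes the raw event stream directly.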
We will present this work at the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017) in Honolulu, Hawaii, on July 23rd (demo) and July 25th (poster), where we will describe how we overcame challenges like the very sparse, asynchronous representation of the sensory data, in which each event carries just a single bit of information; training a recognition network for hardware-constrained, low-precision computation; handling noisy event data; and more. This work is the culmination of a very fruitful collaboration between IBM Research and iniLabs that began in 2013 under the auspices of DARPA’s SyNAPSE program.
The DVS dataset will be available for public download here, and the software workflow for building and training this system will be released to the TrueNorth developer community.