Delta-encoder is a novel approach for few- and one-shot object recognition, in which a modified auto-encoder (called delta-encoder) extracts transferable intra-class deformations (deltas) between same-class pairs of training examples, then applies them to a few examples of a new class (unseen during training) to efficiently synthesize samples from that class. The synthesized samples are then used to train a classifier. The result is a system that can learn to recognize new objects from only a few training examples.
While AI has reached the state where it can solve specific tasks at a super-human level, the path to the next step, where AI can generalize, is far from trivial and poses many challenges that our research aims to address. Deep learning models need enormous amounts of training data and have little ability to transfer knowledge gained from one task to another, which is very different from the way humans learn. Imagine a little girl shown a photo of an animal she didn’t know before and told its name. She easily grasps its nature, based on what she already knows about animals. She will now probably be able to identify this animal in other photos, in the wild, and even when it is presented to her as a sketch.
In recent work from IBM Research AI, in collaboration with researchers from the Technion and Tel Aviv University, we present a new methodology for few-shot and one-shot object recognition—that is, the ability to learn to classify an image from a new category with only a limited number of examples from that category.
Below you can see how a single image of a golden retriever and a single image of an African wild dog, both with a firework blue frame, were sufficient for the system to correctly classify most of the images presented. In pink are correct classifications of golden retrievers, and in yellow are the correct classifications of African wild dogs.
If your eyes are sharp, you can even see that the African wild dog prototype shows mainly its head. So how can a computer correctly classify these images based on a single example, when in the famous ImageNet dataset the number of images per class is usually in the hundreds? The secret is in the ability to synthesize, a bit like imagining, a much larger set of images of how an African wild dog would look in various scenarios. The system learns how to synthesize such a dataset by looking at pairs of images it already knows.
Let’s say we have two images of beagle dogs. If the system could correctly generate the second image, just by looking at the first image, it means that it learned how to synthesize a new image from a given one. Luckily, for the beagles we have quite a few images, as for other classes, so we can train the system to create this synthesis in a supervised manner by tweaking an encoder-decoder architecture.
In the standard setting, the encoder takes one image as input and learns to encode it in a compact representation (or as a distribution in a variational or denoising auto-encoder). Then comes the decoder, which receives this representation and reconstructs the image from it. Training the two together gives us a good representation of an image. But what we want to do is to represent the conceptual difference between two images, say between two images of beagles, and then use a decoder to create a new representation of an African wild dog out of our single image of an African wild dog, applying the same conceptual difference, or delta, as we call it. The conceptual difference can be seen as the “additional information” needed to reconstruct X from Y.
Have a look at the delta-encoder:
XS is our beagle #1
YS is our beagle #2
The encoder learns the representation of the difference between the two images, which is Z. It then needs to learn how to decode, or apply, this difference to our beagle #2 (YS) to get the original beagle #1 (XS). It actually outputs X̂S and the objective of the neural net is to minimize the difference between XS and X̂S.
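To make the training objective concrete, here is a minimal sketch in NumPy. It is not the model from our paper: it assumes the images are already reduced to feature vectors, uses purely linear encoder and decoder layers, and minimizes a mean squared error with hand-written gradients. The function names, the toy data, and all hyperparameters (delta_dim, steps, lr) are illustrative assumptions.

```python
import numpy as np

def encode(x, y, w_enc):
    """Map a same-class pair (x, y) to a low-dimensional delta z."""
    return np.concatenate([x, y], axis=1) @ w_enc

def decode(z, y, w_dec):
    """Reconstruct x from the delta z and the anchor y."""
    return np.concatenate([z, y], axis=1) @ w_dec

def train_delta_encoder(x, y, delta_dim=2, steps=300, lr=0.1, seed=0):
    """Fit linear encoder/decoder weights by gradient descent on the MSE.

    ASSUMPTION: linear layers and squared error are simplifications
    chosen so the sketch stays short and dependency-free.
    """
    rng = np.random.default_rng(seed)
    n, d = x.shape
    w_enc = 0.1 * rng.standard_normal((2 * d, delta_dim))
    w_dec = 0.1 * rng.standard_normal((delta_dim + d, d))
    losses = []
    for _ in range(steps):
        # Forward pass: the encoder sees both images, the decoder sees z and y.
        enc_in = np.concatenate([x, y], axis=1)
        z = enc_in @ w_enc
        dec_in = np.concatenate([z, y], axis=1)
        x_hat = dec_in @ w_dec
        err = x_hat - x
        losses.append(float(np.mean(err ** 2)))
        # Backward pass for the mean squared reconstruction error.
        grad_out = 2.0 * err / err.size
        grad_w_dec = dec_in.T @ grad_out
        grad_z = (grad_out @ w_dec.T)[:, :delta_dim]
        grad_w_enc = enc_in.T @ grad_z
        w_dec -= lr * grad_w_dec
        w_enc -= lr * grad_w_enc
    return w_enc, w_dec, losses

# Toy "seen" classes: each pair (x_s, y_s) is drawn from the same class.
rng = np.random.default_rng(1)
centers = rng.standard_normal((5, 8))
labels = rng.integers(0, 5, size=200)
x_s = centers[labels] + 0.3 * rng.standard_normal((200, 8))
y_s = centers[labels] + 0.3 * rng.standard_normal((200, 8))

w_enc, w_dec, losses = train_delta_encoder(x_s, y_s)
print(f"reconstruction loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

Keeping the delta much smaller than the feature vector (here 2 dimensions against 8) is a deliberate choice in this sketch: it pushes the decoder to rely on the second image for most of the content, so that Z only has to carry what differs between the two.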
When we have a good encoder that can represent the deltas and a good decoder that can generate a new representation given another image and a delta, we can use them to synthesize many more training examples:
This time, we feed the decoder with Yu, a single image of our African wild dog, to get X̂u, a synthesized representation of a new African wild dog that applies the conceptual difference between two known images of dogs, say beagle #1 (XS) and beagle #2 (YS), as captured by the encoder. Since we have many beagle images, we can create many pairs and synthesize a different African wild dog for each pair. We also have other dogs in our original dataset, so we can use pairs of images from other classes as well, as long as both images of the pair belong to the same class. Hence, from a single image of an African wild dog we now have plenty of synthesized representations, and we use them to train a classifier.
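The synthesis step can be sketched in the same spirit. To keep the example self-contained, the encode and decode functions below are stand-ins, not trained networks: they assume the delta is a plain vector difference that is applied additively, which the learned delta-encoder does not require. The nearest-centroid classifier, the class names, and the toy feature vectors are likewise assumptions made for illustration.

```python
import numpy as np

def encode(x, y):
    # ASSUMPTION: stand-in for a trained encoder; the delta is a plain difference.
    return x - y

def decode(z, y):
    # ASSUMPTION: stand-in for a trained decoder; applies the delta additively.
    return y + z

def synthesize(x_seen, y_seen, y_novel):
    """Apply the delta of every seen same-class pair to one novel example."""
    z = encode(x_seen, y_seen)                 # one delta per pair, shape (n_pairs, d)
    anchors = np.tile(y_novel, (len(z), 1))    # repeat the one-shot example
    return decode(z, anchors)

def fit_centroids(samples_by_class):
    """Toy classifier: store the mean synthesized sample of each class."""
    return {label: s.mean(axis=0) for label, s in samples_by_class.items()}

def predict(centroids, x):
    """Assign each row of x to the class with the closest centroid."""
    labels = list(centroids)
    dists = np.stack(
        [np.linalg.norm(x - centroids[label], axis=1) for label in labels], axis=1
    )
    return [labels[i] for i in dists.argmin(axis=1)]

rng = np.random.default_rng(0)
d = 8

# Pairs of beagle feature vectors (seen class, many examples).
beagle_center = rng.standard_normal(d)
x_seen = beagle_center + 0.3 * rng.standard_normal((50, d))
y_seen = beagle_center + 0.3 * rng.standard_normal((50, d))

# One example per novel class (the one-shot setting).
novel_centers = {
    "african_wild_dog": np.full(d, 3.0),
    "golden_retriever": np.full(d, -3.0),
}
one_shot = {name: c + 0.3 * rng.standard_normal(d) for name, c in novel_centers.items()}

# 50 synthesized samples per novel class, one for each beagle pair.
synthetic = {name: synthesize(x_seen, y_seen, y_u) for name, y_u in one_shot.items()}
centroids = fit_centroids(synthetic)

queries = novel_centers["african_wild_dog"] + 0.3 * rng.standard_normal((5, d))
print(predict(centroids, queries))
```

Each beagle pair contributes one synthesized sample per novel class, so the classifier trains on as many samples as there are pairs even though only one real example of each new class was available.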