October 25, 2019 | Written by: Inkit Padhi and Youssef Mroueh
Learning Implicit Generative Models by Matching Perceptual Features
The computer vision community has found great success in training deep convolutional neural networks (DCNNs) on large datasets and reusing them to achieve state-of-the-art performance on object detection, style transfer, video recognition, and super-resolution. The features these networks learn, referred to as perceptual features (PFs), are exploited to solve other problems through fine-tuning or transfer learning. However, there is one sub-problem where the richness of these PFs has not been much explored: implicit generative models.
Can we use PFs to learn implicit generative models? This is the question we tried to answer in our work, “Learning Implicit Generative Models by Matching Perceptual Features,” which is being presented as an oral at ICCV 2019 in Seoul, Korea, on October 31 at 9:18AM (Location: Oral 3.1A, Hall D1). The code is available on GitHub: https://github.com/IBM/gfmn.
In particular, we proposed a new “moment matching” method that learns implicit generative models by matching statistics of PFs extracted from pre-trained convolutional neural networks. We call this framework Generative Feature Matching Networks (GFMN); it learns implicit generative models by matching mean and covariance statistics extracted from all convolutional layers of a pre-trained DCNN.
Maximum Mean Discrepancy (MMD) based methods capture the difference between two distributions by embedding them, via a kernel, into a (possibly infinite-dimensional) feature space. Defining a kernel (or similarity measure) that discriminates between real and machine-generated samples is challenging. One existing solution uses adversarial training to learn the kernel function online. However, adversarial training involves min-max optimization, which can lead to instability. Our proposed method overcomes the weaknesses of existing min-max strategies in the following ways:
- Non-adversarial: avoids the challenges of min-max optimization
- Perceptual feature (PF) and fixed feature matching: does not require online learning of kernel functions; instead, it leverages the richness of perceptual features and their ability to discriminate between real and machine-generated data
- Scalable: uses an ADAM-based moving average, which accommodates small batch sizes
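For reference, the fixed-kernel MMD that the discussion above builds on can be estimated directly from samples. The sketch below is a standard biased estimator of squared MMD under a Gaussian kernel (the `sigma` bandwidth is an illustrative choice, not from the paper):

```python
import numpy as np

def mmd2(x, y, sigma=1.0):
    """Biased estimator of squared MMD between two sample sets x and y
    (each of shape (n, d)) under a Gaussian kernel -- the fixed-kernel
    baseline that adversarial kernel-learning methods try to improve on."""
    def k(a, b):
        # Pairwise squared Euclidean distances, then Gaussian kernel.
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()
```

Identical sample sets give an estimate of zero, while samples drawn from well-separated distributions give a clearly positive value.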
A Closer Look
- E = pre-trained feature extractor (PF network)
- z_i = noise vector
- μ_j^(p-data) = mean of the real-data features at layer j
- x̂_i = generated image = G(z_i; θ)
To train GFMN, we sample noise vectors from a normal distribution and pass them through a neural network generator. We then extract the PFs of the generated images and match their statistics (mean and variance) to the statistics of the real training data. Due to GPU scalability issues, we match only the diagonal of the covariance rather than the full covariance. The statistics of the training data can be precomputed before training starts.
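The matching objective above can be sketched as follows. This is a minimal NumPy illustration of the idea (per-layer mean and diagonal-variance matching), not the authors' implementation; the function names are ours:

```python
import numpy as np

def feature_stats(features):
    """Mean and diagonal (per-dimension) variance of a batch of PFs.
    features: array of shape (batch, dim) from one layer of the fixed
    extractor E."""
    return features.mean(axis=0), features.var(axis=0)

def gfmn_loss(real_stats, fake_features_per_layer):
    """Squared distance between real and generated feature statistics.
    real_stats: list of (mean, var) pairs, precomputed once over the
    whole training set; fake_features_per_layer: list of (batch, dim)
    arrays, one per matched layer."""
    loss = 0.0
    for (mu_real, var_real), fake in zip(real_stats, fake_features_per_layer):
        mu_fake, var_fake = feature_stats(fake)
        loss += np.sum((mu_real - mu_fake) ** 2)   # mean matching
        loss += np.sum((var_real - var_fake) ** 2) # diagonal-covariance matching
    return loss
```

In practice the generator's parameters would be updated by backpropagating this loss through the frozen extractor.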
ADAM Moving Average
Good estimates of the statistics of generated images require a large minibatch, which is difficult with limited GPU capabilities. To address this, we apply moving averages (MAs) to the differences between the statistics of real and generated images.
v_j = moving average of the difference of statistics at layer j
During training, we obtain better MA estimates by updating them with the ADAM optimizer applied to a loss defined over the MAs.
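One way to realize this idea: treat 0.5·||v − Δ||² as a loss in the moving average v (where Δ is the current minibatch difference of statistics) and take an Adam step on it, so v tracks Δ smoothly. The sketch below illustrates this; the hyperparameter values are illustrative, not the paper's:

```python
import numpy as np

class AdamMovingAverage:
    """Moving average of the real-vs-generated statistic gap, updated
    with Adam-style steps (a sketch of the ADAM-moving-average idea)."""
    def __init__(self, dim, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        self.v = np.zeros(dim)   # current moving average
        self.m = np.zeros(dim)   # Adam first-moment estimate
        self.s = np.zeros(dim)   # Adam second-moment estimate
        self.t = 0
        self.lr, self.b1, self.b2, self.eps = lr, beta1, beta2, eps

    def update(self, delta):
        """delta: current minibatch difference of statistics.
        The gradient of 0.5 * ||v - delta||^2 in v is (v - delta),
        so an Adam step pulls v toward delta at a bounded rate."""
        g = self.v - delta
        self.t += 1
        self.m = self.b1 * self.m + (1 - self.b1) * g
        self.s = self.b2 * self.s + (1 - self.b2) * g ** 2
        m_hat = self.m / (1 - self.b1 ** self.t)
        s_hat = self.s / (1 - self.b2 ** self.t)
        self.v -= self.lr * m_hat / (np.sqrt(s_hat) + self.eps)
        return self.v
```

Because Adam's step size is bounded, the estimate is far less sensitive to a single noisy minibatch than the raw per-batch statistic.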
Types of Feature Extractors
In order to study the impact of the richness of PFs on learning implicit generative models, we mainly tried two types of feature extractors:
- PFs from an autoencoder: here, we train an autoencoder whose decoder has an architecture similar to the generator (DCGAN). Once trained, we use the encoder as the feature extractor.
i) Encoder: DCGAN discriminator / VGG19
ii) Decoder: DCGAN / ResNet
- PFs from classifiers: we use DCNN models (VGG19, ResNet18) pre-trained in a supervised way on a large-scale dataset as feature extractors. Due to the nature of the classification task, these features appear richer in information than the autoencoder ones.
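Whichever extractor is used, training needs the activations of every matched layer, not just the final output. A toy stand-in for a frozen extractor (a small ReLU network in place of a pre-trained VGG19/ResNet18 encoder; names and shapes are ours) makes the interface concrete:

```python
import numpy as np

def extract_features(x, weights):
    """Forward pass through a fixed (frozen) extractor E, returning the
    activations of every layer so each layer's statistics can be matched
    separately. x: (batch, d_in); weights: list of (d_in, d_out) arrays,
    never updated during generator training."""
    feats = []
    h = x
    for w in weights:
        h = np.maximum(h @ w, 0.0)  # ReLU layer with frozen parameters
        feats.append(h)
    return feats
```

With a real pre-trained DCNN, the same role is typically played by forward hooks that collect each convolutional layer's output.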
We benchmarked GFMN, with either a pre-trained autoencoder or cross-domain classifiers as the feature extractor, on the CIFAR-10 dataset. The generated images are evaluated with two metrics: Inception Score (IS, higher is better) and Fréchet Inception Distance (FID, lower is better). We obtain the best performance when using PFs from both pre-trained VGG19 and ResNet18 together with a ResNet-like generator architecture.
Further, we also benchmarked GFMN against various existing adversarial and non-adversarial generative models on CIFAR-10 and STL-10. GFMN performs comparably to or better than most methods, including the state-of-the-art spectrally normalized GANs (SN-GANs).