#### AI

# Estimating Information Flow in Deep Neural Networks

June 12, 2019 | Written by: Kristjan Greenewald, Igor Melnyk, and Brian Kingsbury

Categorized: AI

Share this post:

While there is a growing body of work studying the theoretical properties of neural networks, our understanding of the macroscopic behavior of deep learning still leaves much to be desired. Questions persist around what drives the evolution of internal representations during training, the properties of learned representations, and how fully trained networks process information. In addition, much of what we do know is informed by anecdotal results.

The Information Bottleneck theory attempts to address these questions. Working closely with collaborators from MIT as part of the MIT-IBM Watson AI Lab, our ICML 2019 paper, “Estimating Information Flow in Deep Neural Networks,” analyzes Information Bottleneck theory both mathematically and empirically, with a particular focus on the “information compression” phenomenon it predicts.

## Information Bottleneck Theory

The Information Bottleneck theory ([Schwartz-Ziv & Tishby ‘17] and others) attempts to explain neural network generalization as it relates to information compression, i.e. the notion that mutual information between the input *X* and a hidden layer *T* (see Figure 1) quickly rises during training as the network learns to encode the input, and then slowly falls (compresses) as the network learns to discard information that is irrelevant to the task (Figure 2). Each successive layer is viewed as increasingly compressing the input. It is then argued that this elimination of irrelevant information makes the classifier generalize better because given a new unseen input, it should only extract the relevant information and not be misled by the irrelevant.

Figure 1: A feedforward deep neural network (notional).

Figure 2. Information bottleneck. Shown are trajectories of the mutual information in five hidden layers as training progresses.

While this is a somewhat tantalizing perspective, unfortunately the mutual information between the input *X* and a hidden layer *T* does not depend on the parameters of the network when the network is deterministic (as nearly all neural networks are in practice). To get around this, prior work has computed an estimate of mutual information formed by binning (quantizing) each neuron and computing the mutual information (which becomes the discrete entropy of the binned hidden layer). Figure 3 demonstrates that this measure is highly sensitive to bin size, confirming that it is not measuring the mutual information.

Figure 3: Inconsistency of binning estimators.

## Noisy Neural Networks and Mutual Information Estimation

While mutual information is non-informative when the network is deterministic, it is quite informative when the network is stochastic. We therefore defined noisy neural networks formed by adding Gaussian noise *Z* at the output of each of the neurons (Figure 4). This noise is present in both training and testing, making associated mutual information estimates meaningful. We proposed an efficient estimator for the mutual information in this setting that provably converges to the true mutual information at the minimax optimal rate (and does not rely on binning).

Figure 4. Noisy neural networks.

## Clustering as the Driver of Compression

By relating single-neuron classification to information transmission over a noisy channel, our paper was able to develop a mathematical intuition that information compression (observed either rigorously in a stochastic network or using binned estimates in a deterministic network) should always be caused by clustering of the internal representations. Specifically, the hidden layers map different inputs *X* of the same class *Y *increasingly close together in the hidden representation *T*.

To evaluate this empirically, consider the data and model of [Shwartz-Ziv & Tishby ‘17] for binary classification of 12-dimensional inputs using a fully connected 12–10–7–5–4–3–2 architecture with tanh activations. Figure 5 shows results with additive noise with standard deviation 0.005 (test accuracy 97 percent), illustrating the relation across training epochs between the estimated mutual information, train/test losses, and the evolving internal representations. The rise and fall of mutual information correspond to how spread out or clustered the representations in each layer are. For example, the mutual information grows until epoch 28, when the Gaussians start to move away from each other along a curve (see scatter plots of the Layer 5 hidden representations on the top). Around epoch 80 they start clustering and the mutual information drops. As training progresses, the saturating *tanh* units push the Gaussians to opposite corners of the cube, reducing the mutual information even more.

Figure 5. Compression of I(X;T) as training progresses. Shown on the top row are scatter plots of the final layer hidden representations at selected epochs, color coded by class label.

Using orthonormal regularization of the weights [Cisse et al.’17], we can eliminate this compression while actually improving generalization, as shown in Figure 6. The hidden representations no longer cluster together, which directly parallels the lack of information compression. More experiments in this vein are in our paper, building strong confirmation that information compression is caused by clustering.

Figure 6: Removing compression with orthonormal regularization.

## Takeaways

This notion of “compression” being driven by clustering is important for two reasons. First, it demystifies the notion of “information compression,” replacing it with a more concrete formulation. Secondly, it opens the door to studying clustering directly, which may not suffer from the extreme curse of dimensionality associated with mutual information estimation (we proved the sample complexity scales exponentially in dimension). In fact, we were able to scale up several (preliminary) measures of clustering to full-scale convolutional neural networks designed for classification on the MNIST scanned digits task, observing similar “compression” behavior during the training process.

Furthermore, contrary to the Information Bottleneck Theory, we found that compression is not necessary for generalization, though it is still an open problem as to whether encouraging compression (through geometric clustering) is a path forward towards encouraging better generalization performance.

We will be presenting this work at ICML (Wednesday Jun 12 from 11:40-12:00 in the Deep Learning Theory session). Come see us then.

[Shwartz-Ziv, R. and Tishby, N. Opening the black box of deep neural networks via information. arXiv:1703.00810, 2017]

[Cisse, M., Bojanowski, P., Grave, E., Dauphin, Y., and Usunier, N. Parseval networks: Improving robustness to adversarial examples. In Proceedings of the International Conference on Machine Learning (ICML), 2017]

Estimating Information Flow in Deep Neural Networks, *Ziv Goldfeld, Ewout van den Berg, Kristjan Greenewald, Igor Melnyk, Nam Nguyen, Brian Kingsbury, Yury Polyanskiy*

**Kristjan Greenewald**

Research Staff Member, IBM Research

**Igor Melnyk**

Research Staff Member, IBM Research

**Brian Kingsbury**

Research Staff Member, IBM Research

### Text2Scene: Generating Compositional Scenes from Textual Descriptions

At CVPR 2019, IBM researchers introduce techniques to interpret visually descriptive language to generate compositional scene representations from textual descriptions.

### Overcoming Challenges In Automated Image Captioning

At CVPR 2019, IBM researchers introduce an improved method to bridge the semantic gap between visual scenes and language to produce diverse, creative and human-like captions.

### Label Set Operations (LaSO) Networks for Multi-Label Few-Shot Learning

Data augmentation is one of the leading methods to tackle the problem of few-shot learning, but current synthesis approaches only address the scenario of a single label per image, when in reality real life images may contain multiple objects. The IBM team came up with a novel technique for synthesizing samples with multiple labels.