“To think is to forget a difference, to generalize, to abstract. In the overly replete world of Funes there were nothing but details, almost contiguous details” / “Funes, The Memorious”, Jorge Luis Borges.
In my previous blog post, I tried to present the case for information theory in machine learning. We discussed the most important tools and theorems of the field and saw some examples of their uses in machine learning. In this post, I will use these tools to explain the information bottleneck problem and its possible connection to the reasons deep neural networks work so well. Looking at some recently published papers, we will see that the process of forgetting details is an essential part of the training process, and we will explain how this process relates to the information bottleneck problem.
The main source for this post is a paper published last year by Ravid Shwartz-Ziv and Naftali Tishby, attempting to explain the magic behind deep neural networks. The paper, “Opening the Black Box of Deep Neural Networks via Information”, was received with excitement almost immediately by the community, thanks to its insightful theory and examples. The main objective of the paper was to present a long-established information-theoretic problem, known as the information bottleneck, as a theoretical equivalent to deep neural nets. This may provide a better understanding of the training process and the reasons deep neural networks work so well.
The paper attracted a lot of attention: You can go read some great insights here and here, to name a few. As is to be expected, the paper also generated a fair amount of controversy. A submission to ICLR demonstrated that the results may be quite specific to the architecture chosen by Shwartz-Ziv and Tishby, and cannot be observed when small changes are made to it. This blog post also gives some interesting food for thought.
The Information Bottleneck Problem
Before tackling the information bottleneck itself, here’s just a reminder of the relevant concepts from my previous post:

 The entropy of a random variable X, defined as: $latex H(X) = -\sum\limits_{x} P_X(x)\log P_X(x)$, represents the “mess” inherent in the variable.
 The mutual information between two random variables, defined as: $latex I(X;Y) = \sum\limits_{x,y} P_{XY}(x,y) \log \frac {P_{XY}(x,y)}{P_X(x)P_Y(y)} = H(X) - H(X|Y)$, subtracts the amount of “mess” left in X when Y is known from the “mess” inherent in X. It is thus a way to measure the amount of information Y carries about X.
 The Data Processing Theorem states that given a Markov chain $latex X \leftrightarrow Y \leftrightarrow Z$, the following is true: $latex I(X;Z) \leq \min \{I(X;Y), I(Y;Z)\}$. In other words, if the only connection between X and Z is through Y, the information that Z gives about X cannot be bigger than the information that Y gives about X.
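These quantities are straightforward to compute for small discrete distributions. As a quick illustration (the joint distribution below is invented for the example, not taken from the paper), here is a minimal sketch in Python:

```python
import numpy as np

def entropy(p):
    """Shannon entropy, in bits, of a probability vector."""
    p = p[p > 0]                      # 0 * log 0 is taken as 0
    return -np.sum(p * np.log2(p))

def mutual_information(pxy):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for a joint distribution matrix."""
    return (entropy(pxy.sum(axis=1)) + entropy(pxy.sum(axis=0))
            - entropy(pxy.ravel()))

# Toy joint distribution: a fair bit X observed through a noisy channel
# that flips it with probability 0.1.
pxy = np.array([[0.45, 0.05],
                [0.05, 0.45]])

print(entropy(pxy.sum(axis=1)))   # H(X) = 1.0 bit
print(mutual_information(pxy))    # ~0.53 bits: the noise destroyed information
```

Note that the mutual information comes out strictly smaller than H(X), exactly as the data processing intuition suggests: the noisy channel between X and Y can only lose information.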
The Information Bottleneck problem was first introduced by Tishby himself in 1999 as a way to quantify the tradeoff between rate and the preservation of relevant information. Up until that point, information theory didn’t usually consider the relevance of information. While rate-distortion theory allowed the consideration of tradeoffs between communication rate and the amount of data that would survive it, Tishby et al. noticed that it was not adequate for considering the relevance of information, for several reasons: Firstly, a distortion measure has to be defined before a rate-distortion problem can be posed. This poses a big problem, since the answer to the question ‘how much relevant information survives the communication?’ becomes dependent upon the choice of that distortion measure. Secondly, rate-distortion theory is especially powerful in cases where the distortion measure is assumed to be additive: $latex d(\bold{x}, \bold{\hat{x}}) = \sum\limits_{i} d_i(x_i, \hat{x}_i)$, where $latex \bold{x}$ and $latex \bold{\hat{x}}$ are vectors whose components are $latex x_i$ and $latex \hat{x}_i$, respectively. This of course excludes many possible cases, such as when relevant information exists only in one part of the vector. There are ample sources about the fascinating theory of rate-distortion; if you are interested, this Wikipedia page may be a good place to start.
The introduction of the information bottleneck aimed to remedy these problems. The main premise is as follows: Consider a random variable X that has some hidden characteristic Y. Transmitting X to a different location, we are not interested in recreating X, but only in the characteristic Y. A simple example of this is spoken language and natural language understanding – while it is possible to transmit the spoken sound waves to a central processor, we could save a lot of resources by transmitting a transcript, while not losing a lot of relevant information. Note that a lot of information will be lost, making it harder to decipher intonation or to identify the speaker, but most of the relevant information would stay intact. It would be very hard to find an additive distortion measure that would indicate good performance if we tried to analyze this approach through rate-distortion theory.
With that we are ready to present the information bottleneck problem: Given an information source X, we would like to create a compressed representation of the information, T, such that the rate is minimal, so that communication is facilitated, while keeping the maximal amount of information about a specific characteristic of X. This characteristic is represented by a random variable named Y. Mathematically, the problem takes the following form:
$latex \min\limits_{P_{T|X}}\{I(X;T) - \beta I(T;Y)\}$
The coefficient $latex \beta$ controls the tradeoff between the importance of “forgetting” irrelevant parts of X while creating T and the importance of maintaining the parts relevant to Y. When $latex \beta = 0$ only compression is important, and the minimum can easily attain the value 0 by choosing a constant T. When $latex \beta$ is very large, compression is no longer important and the minimum is achieved by choosing T = X.
The only remaining mystery, then, is why this problem is called the information bottleneck. The answer is quite simple – creating T as a compressed version of X guarantees that the Markov chain $latex T \leftrightarrow X \leftrightarrow Y$ holds. The data processing theorem then tells us that the value we wish to maximize, I(T;Y), is bounded from above by I(T;X), which is the value we would like to minimize! Thus, trying to maximize I(T;Y) while minimizing I(T;X) at the same time creates a bottleneck effect – the information relevant to Y must be squeezed through the bottleneck created by the compression of X.
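To make the objective concrete, here is a small numerical sketch (the joint distribution and the two candidate encoders are invented for the example). It evaluates $latex I(X;T) - \beta I(T;Y)$ for two deterministic encoders $latex P_{T|X}$, and shows that an encoder merging input symbols with identical label statistics keeps all of the relevant information at half the rate, thus achieving a better (lower) objective:

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mi(pab):
    """Mutual information of a joint distribution matrix, in bits."""
    return entropy(pab.sum(1)) + entropy(pab.sum(0)) - entropy(pab.ravel())

def ib_objective(pxy, pt_given_x, beta):
    """Evaluate I(X;T) - beta * I(T;Y) for an encoder p(t|x),
    using the Markov chain T <-> X <-> Y."""
    px = pxy.sum(axis=1)
    pxt = px[:, None] * pt_given_x        # joint p(x, t)
    pty = pt_given_x.T @ pxy              # p(t, y) = sum_x p(t|x) p(x, y)
    return mi(pxt) - beta * mi(pty)

# Toy source: 4 input symbols, binary label; symbols 0/2 and 1/3 carry
# identical information about Y.
pxy = np.array([[0.20, 0.05],
                [0.05, 0.20],
                [0.20, 0.05],
                [0.05, 0.20]])

identity = np.eye(4)                      # T = X: no compression
merge = np.array([[1., 0.], [0., 1.],
                  [1., 0.], [0., 1.]])    # merge the equivalent symbols

print(ib_objective(pxy, identity, beta=5.0))  # I(X;T) = 2 bits
print(ib_objective(pxy, merge, beta=5.0))     # I(X;T) = 1 bit, same I(T;Y)
```

The merged encoder forgets exactly the bit of X that carries no label information – which is the kind of solution the bottleneck objective rewards.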
The Black Box of Deep Neural Networks
In their paper “Opening the Black Box of Deep Neural Networks via Information”, Shwartz-Ziv and Tishby claim to explain the success of deep neural networks through the information bottleneck problem. The main idea is as follows: In any neural network, the output of each layer can be viewed as a random variable T. This random variable is fabricated from the network’s input X, and thus the Markov chain $latex T \leftrightarrow X \leftrightarrow Y$ is respected. As this can be claimed for any layer in the network, each layer can be treated simultaneously as the output of an encoder whose input is X, and the input to a decoder whose output is the prediction $latex \hat{Y}$.
Denoting the outputs of each of the layers as $latex T_1, T_2, \ldots$, we can deduce from the data processing theorem that:
$latex I(X;Y) \geq I(T_1; Y) \geq I(T_2; Y) \geq \cdots \geq I(\hat{Y}; Y)$.
We clearly aim to make the final mutual information in this chain as large as possible. Each of the representations T can thus be seen from the information-bottleneck perspective. The principle of viewing a DNN as a chain of information bottlenecks is not entirely new; this approach was mentioned before, including by some of the same authors. The novelty of the newer paper is, in my opinion, mostly in the demonstrations. The authors start by building a simple DNN model, comprising a total of 7 layers of sizes 12, 10, 7, 5, 4, 3 and 2 neurons. They use artificial data points X with binary labels Y, such that $latex I(X;Y) \simeq 0.99$. That is, they ensure that the information about the label Y is indeed present in X, with very little noise. Training the chosen network on the data, I found the following observations exciting:
Looking at any layer, its outputs T can be drawn on the ‘information plane’ defined by the mutual information with both the input X and the expected label Y. Doing so, the observation is that the training process can clearly be divided into two parts. In the first part, called the ‘drift’ phase, the network learns to extract the correct label Y from the input X relatively well. It does so by increasing the mutual information of each of the layers T with both the input X and the label Y. In the second, slower part of the training process, called the ‘diffusion’ phase, the network maintains the mutual information with Y while decreasing the mutual information between X and T.
The same effect can be seen in this video, where points of different colors represent the positions of the different layers of the network on the information plane through time.
Note that, rather unintuitively, in both graphs the layers closest to the network’s input are on the right-hand side, as they predictably present the highest mutual information with X.
Looking at the training process from different angles, such as the mean and variance of the changes to coefficients induced by the stochastic gradient descent process, yields similar observations. Intuitively, it is easy to understand why the network would seek to increase the mutual information with the labels Y, but the second step, where the mutual information with X is decreased, is a bit harder to understand – what can be gained by decreasing the information of each layer T about the inputs, when we keep in mind that there are no communication constraints? It turns out that this step helps generalization – by forgetting specific characteristics of the inputs that have little to do with the labels, the network can generalize better to examples that would not carry the same characteristics. This also explains why this phase of training is slower than the first – it happens randomly as the network sees more and more examples that may or may not carry these “unimportant” characteristics.
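The mutual-information values behind these observations are not computed analytically but estimated by discretizing each layer’s activations into bins, as done in the paper. Here is a rough sketch of such an estimator (the bin count and the [-1, 1] range are assumptions matching tanh activations; for a deterministic network, the binned layer output T is a function of X, so with distinct input samples I(T;X) reduces to H(T)):

```python
import numpy as np

def binned_entropy_bits(patterns):
    """Entropy, in bits, of the empirical distribution over rows."""
    _, counts = np.unique(patterns, axis=0, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def layer_information(activations, labels, n_bins=30):
    """Estimate (I(T;X), I(T;Y)) for one layer by discretizing activations.

    activations: (n_samples, n_neurons) array, assumed to lie in [-1, 1].
    labels:      (n_samples,) array of class labels.
    """
    bins = np.linspace(-1, 1, n_bins + 1)
    t = np.digitize(activations, bins)      # discretized layer output
    h_t = binned_entropy_bits(t)
    # I(T;X) = H(T) - H(T|X) = H(T), since T is a function of X.
    i_tx = h_t
    # I(T;Y) = H(T) - H(T|Y), with H(T|Y) averaged over the label groups.
    h_t_given_y = 0.0
    for y in np.unique(labels):
        mask = labels == y
        h_t_given_y += mask.mean() * binned_entropy_bits(t[mask])
    return i_tx, h_t - h_t_given_y
```

Tracking these two numbers for every layer over the course of training is what produces the information-plane trajectories described above. The binning itself is a known point of contention: the estimates depend heavily on the bin count and range.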
Conclusions and controversy
While these results are impressive and exciting, I think some questions remain unanswered. I, for example, wonder about the diffusion phase: As it is shown to be largely beneficial in helping the network generalize, it would seem to be a good idea to let it run its course, even if it is a relatively long process. This stands in contrast to the fact that in many cases, we know early stopping to be a good regularization strategy in neural network training. This raises the concern that the results presented are not completely general and don’t apply in some cases. It would be interesting to know, then, in which cases training should be stopped for fear of overfitting, and in which cases longer training helps generalization. Can this difference in the effect of long training be explained analytically through the information bottleneck?
Another concern I have with the paper, which also relates to the generality of the results, is some of the choices made by the authors. Since the classification problem was not taken from real data but fabricated artificially to have special characteristics, it is at least worthwhile to check whether the same observations can be made on real data. The choice of network architecture, with layers of decreasing size from input to output, also seems quite “convenient” for the hypotheses the authors are trying to prove.
These concerns and many more were addressed in a recent submission to ICLR. The main results of that paper were also recently summarized very nicely in a blog post. Its authors reproduced Shwartz-Ziv and Tishby’s results and then proceeded to test the approach on different network architectures. They noticed that changing the activation function from a tanh nonlinearity (except for the final layer) to a ReLU nonlinearity changed the observed results drastically:
Investigating further, the authors consider a simple 3-node network, which is in fact nothing more than a perceptron with a nonlinearity. They pass a Gaussian signal through this network, which multiplies it by a learned coefficient and then passes the result through a nonlinearity to create a binary prediction. Even in this simple system, it turns out that a decrease in the mutual information with the input X can be observed in later parts of the training process only with the tanh nonlinearity; the ReLU nonlinearity does not exhibit this kind of behaviour.
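The intuition behind this difference can be glimpsed in a few lines of code (a simplified sketch, not the authors’ exact experiment): as the learned weight grows during training, a tanh unit saturates, so its binned output concentrates in the extreme bins and the measured entropy — and with it the binned estimate of I(T;X) — drops. A ReLU output merely scales with the weight, so binning it over its own range yields the same entropy regardless of the weight:

```python
import numpy as np

def binned_entropy(values, n_bins=30):
    """Entropy, in bits, of values discretized into n_bins equal bins
    spanning their observed range."""
    edges = np.linspace(values.min(), values.max(), n_bins + 1)
    counts, _ = np.histogram(values, bins=edges)
    p = counts[counts > 0] / counts.sum()
    return -np.sum(p * np.log2(p))

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)          # Gaussian input to the single unit

for w in (1.0, 10.0):                 # stand-in for a weight growing in training
    print(f"w={w:4.1f}  "
          f"H(tanh(wx))={binned_entropy(np.tanh(w * x)):.2f} bits  "
          f"H(relu(wx))={binned_entropy(np.maximum(0, w * x)):.2f} bits")
# tanh entropy collapses as the unit saturates; relu entropy is unchanged.
```

This suggests the observed “compression” is at least partly an artifact of how binning interacts with a saturating nonlinearity, which is exactly the critique raised by the ICLR submission.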
These results demonstrate clearly that more work is required to achieve a complete understanding of the relationship between the information bottleneck and deep neural networks. While the idea itself is very interesting and has theoretical merit, the empirical results are ambiguous. In time, I hope more interesting results will show up, as well as new approaches to analyzing and understanding them.
What do you think? Do you find the information bottleneck to be a useful insight into deep neural networks? Does it help your understanding of them? Leave your comments below.