What does Bayesian Inference mean for Neural Nets?
Now that we understand Bayes’s Theorem, let’s see how it applies to regularizing Neural Networks. In the past few posts, we learnt how Neural Nets overfit data and looked at techniques to regularize the network towards reducing bias and variance. (A high-variance state is one in which the network has overfitted.)
One of the techniques to reduce variance and improve generalization is to apply weight decay and weight constraints. If we can keep the weights of a Neural Network from growing unchecked, we can control the variance of the network and avoid overfitting.
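As a concrete (if simplified) illustration, here is a minimal sketch of how L2 weight decay can enter an ordinary gradient-descent update; the function name and the learning-rate and decay values are purely illustrative.

```python
import numpy as np

# Minimal sketch of L2 weight decay in a gradient-descent step.
# `grad` is the gradient of the data loss w.r.t. the weights;
# `decay` is an illustrative weight-decay coefficient.
def sgd_step_with_weight_decay(w, grad, lr=0.01, decay=1e-4):
    # The decay term shrinks every weight towards zero on each update,
    # which keeps the weights from growing without bound.
    return w - lr * (grad + decay * w)

w = np.array([0.5, -1.2, 3.0])
grad = np.array([0.1, -0.2, 0.05])
w = sgd_step_with_weight_decay(w, grad)
```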
So let’s focus on the probability distribution of the weight vector given a set of training data. First, let’s relook at what happens in a Neural Network.
- We initialize the weight vector of the Neural Network to some initial state.
- We have a set of training data that is run through the network repeatedly; during training, each pass adjusts the weight vector so that the output moves towards the stated target.
- Every time we train on a new input (from the training data set), we have a prior distribution of the weight vector and a probability of an output for the given input based on that weight vector.
- Based on the new output, a cost function calculates the error deviation.
- Back-propagation is used to update the prior weights to reduce the error.
- We obtain a posterior distribution of the weight vector given the training data. (A toy sketch of one such training step follows below.)
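To make the steps above concrete, here is a toy sketch of a single training step for a one-layer linear “network”; the function train_step and all of the numbers are hypothetical, chosen only to illustrate the forward pass, the squared-error cost, the back-propagated gradient and the weight update.

```python
import numpy as np

# Toy single-layer "network": y = w . x, trained on one case at a time.
def train_step(w, x, t, lr=0.1):
    y = w @ x                    # forward pass: output for this input
    error = t - y                # local error E = (t - y)
    cost = 0.5 * error ** 2      # squared-error cost for this training case
    grad = -error * x            # dCost/dw, obtained by back-propagation
    return w - lr * grad, cost   # update the weights to reduce the error

w = np.array([0.1, -0.3])
w, cost = train_step(w, x=np.array([1.0, 2.0]), t=1.5)
```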
The question we ask here is twofold:
- Can we use Bayesian Inference in such a way that the weight distribution becomes optimal for learning the correct function that maps the input to the output?
- Can we ensure that the network is NOT overfitting?
To recap, mathematically, if ‘t’ is the expected target output and ‘y’ is the output of the Neural Net, then the local error is simply E = (t − y). The global error, meanwhile, can be the Mean Squared Error (MSE), as follows:

E_MSE = (1/N) · Σ_c (t_c − y_c)²

or the Error Sum of Squares (ESS), as follows:

E_SS = Σ_c (t_c − y_c)²
- Note that the dominant part of each equation is the squared error.
- We are trying to find the weight vector that minimizes the squared errors.
- In likelihood terms, we can also state that we want to find the weight vector that maximizes the log probability density of the correct answer.
- Minimizing the squared error is the same as maximizing the log probability density of the correct answer (assuming Gaussian noise on the output). This is called Maximum Likelihood Estimation; a small numeric check of this equivalence follows below.
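Here is that check, assuming Gaussian noise with standard deviation sigma on the output: over a grid of candidate outputs, the output that minimizes the squared error is exactly the output that maximizes the Gaussian log probability density of the target.

```python
import numpy as np

# The candidate output y that minimizes (t - y)^2 is the same y that
# maximizes the Gaussian log density of the target t centred on y.
t = 2.0                             # target value
sigma = 1.0                         # assumed noise standard deviation
y_grid = np.linspace(-5, 5, 1001)   # candidate outputs

squared_error = (t - y_grid) ** 2
log_density = -0.5 * np.log(2 * np.pi * sigma ** 2) - squared_error / (2 * sigma ** 2)

assert np.argmin(squared_error) == np.argmax(log_density)
```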
Maximum Likelihood Learning
First, let us look at Maximum Likelihood learning before we apply Bayesian Inference. To do so, let’s assume that we are applying Gaussian noise to the output of the Neural Network to regularize it.
In the previous post titled “Mathematical foundation for Noise, Bias and Variance”, we used noise as a regularizer on the input. Note that we can apply noise to the output as well.
Again, mathematically:
y_c = f(x_c, w)
In other words, let the output for a given training case y_c be some function of an input x_c and the weight vector w.
Now, assuming that we are applying Gaussian noise to the output, we get:

P(t_c | y_c) = (1 / √(2πσ²)) · exp( −(t_c − y_c)² / (2σ²) )

We are simply stating that the probability density of the target value, given the output after applying Gaussian noise, is a Gaussian distribution centred around the output.
Let’s use the negative log probability as the cost function, since we want to minimize the cost. So we get:

−log P(t_c | y_c) = (t_c − y_c)² / (2σ²) + ½ · log(2πσ²)
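As a minimal sketch, this per-case cost can be computed as follows, assuming Gaussian output noise with standard deviation sigma (the function name is illustrative):

```python
import numpy as np

# Negative log probability of a target t_c given the network output y_c,
# assuming Gaussian noise with standard deviation sigma on the output.
def neg_log_prob(t_c, y_c, sigma=1.0):
    # (t_c - y_c)^2 / (2 sigma^2) plus a term that does not depend on y_c
    return (t_c - y_c) ** 2 / (2 * sigma ** 2) + 0.5 * np.log(2 * np.pi * sigma ** 2)

cost = neg_log_prob(t_c=1.0, y_c=0.8)
```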
When we are working on multiple training cases ‘c’ in the dataset ‘D’, we want to maximize the product of the probabilities over every training case ‘c’ in the dataset ‘D’, so that every output is close to its target. Since the output error for one training case does not depend on any other training case, we can state this mathematically as:

P(D | w) = Π_c P(t_c | y_c)

In other words, the probability of the observed data given a weight vector ‘w’ is the product, over all training cases, of the probability of the target given the output. (Note that the output y_c is a function of the inputs x_c and the weight vector ‘w’.)
But instead of working with the product of these probabilities directly, we stated that we can work in the log domain. So we can instead maximize the sum of log probabilities, as shown:

log P(D | w) = Σ_c log P(t_c | y_c) = −Σ_c (t_c − y_c)² / (2σ²) + k
The above is the log probability of the observed data given a weight vector; maximizing it drives each output to be closer to its target value (assuming we are adding Gaussian noise to the output).
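Under these assumptions, the dataset log likelihood is just a sum of per-case Gaussian log probabilities; the toy linear “network” f and the sample data below are purely illustrative.

```python
import numpy as np

# log P(D | w): because the training cases are independent, the product of
# per-case probabilities becomes a sum of per-case log probabilities.
def log_likelihood(w, xs, ts, f, sigma=1.0):
    ys = np.array([f(x, w) for x in xs])          # y_c = f(x_c, w)
    per_case = -0.5 * np.log(2 * np.pi * sigma ** 2) - (ts - ys) ** 2 / (2 * sigma ** 2)
    return per_case.sum()

f = lambda x, w: w @ x                            # toy linear "network"
xs = [np.array([1.0, 2.0]), np.array([0.5, -1.0])]
ts = np.array([1.5, 0.2])
ll = log_likelihood(np.array([0.3, 0.4]), xs, ts, f)
```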
Bayesian Inference and Maximum A Posteriori (MAP)
We worked out an equation for Maximum Likelihood learning, but can we use Bayesian Inference to regularize the Maximum Likelihood estimate?
Indeed, the solution seems to lie in applying a Maximum A Posteriori estimate, or MAP for short. MAP tries to find the mode of the posterior distribution by employing Bayes’s Theorem. So for Neural Networks, this can be written as:

P(w | D) = P(w) · P(D | w) / P(D),  where  P(D) = ∫ P(w′) · P(D | w′) dw′
Where,
- P(w|D) is the posterior probability of the weight vector ‘w’ given the training data set D.
- P(w) is the prior probability of the weight vector.
- P(D|w) is the probability of the observed data given weight vector ‘w’.
- And the denominator, P(D), is an integral over all possible weight vectors; a toy numeric version of this is sketched below.
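To see Bayes’s Theorem at work on a small scale, here is a toy sketch that computes the posterior over a single scalar weight on a grid, for an assumed model y = w · x with made-up data; the normalizing integral in the denominator is approximated numerically.

```python
import numpy as np

# Posterior P(w | D) for one scalar weight in the model y = w * x:
# prior times likelihood, divided by the integral over all candidate weights.
xs = np.array([1.0, 2.0, 3.0])
ts = np.array([1.1, 1.9, 3.2])                    # made-up training targets
sigma_noise, sigma_prior = 0.5, 1.0

w_grid = np.linspace(-3, 3, 2001)
prior = np.exp(-w_grid ** 2 / (2 * sigma_prior ** 2))                 # P(w)
residuals = ts[None, :] - w_grid[:, None] * xs[None, :]
likelihood = np.exp(-(residuals ** 2).sum(axis=1) / (2 * sigma_noise ** 2))  # P(D | w)

unnormalized = prior * likelihood
posterior = unnormalized / np.trapz(unnormalized, w_grid)             # divide by the integral
w_map = w_grid[np.argmax(posterior)]                                  # mode of the posterior
```

On a one-dimensional grid this normalizing integral is easy to approximate; for the full weight vector of a real network it is not, which is why MAP works with the mode of the posterior rather than the full distribution.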
We can convert the posterior equation above into a cost function by again taking the negative log probability, as follows:

Cost = −log P(w | D) = −log P(D | w) − log P(w) + log P(D)
Here,
- P(D) is an integral over all possible weight vectors and does not depend on ‘w’, so log P(D) is just a constant.
- From Maximum Likelihood learning, we already have the equation for log P(D|w).
Let’s look at log P(w), which is the log probability of the prior weights. This depends on how we initialize the weights. In the post titled “Is Optimizing your Neural Network a Dark Art ?” we learnt that the best way to initialize the weights is to sample them from a zero-mean Gaussian.
So, mathematically:

log P(w) = −Σ_i w_i² / (2σ_w²) + k
So, the Bayesian Inference for MAP is as follows:
Cost = (1/(2σ²)) · Σ_c (t_c − y_c)² + (1/(2σ_w²)) · Σ_i w_i² + k

where σ is the standard deviation of the Gaussian output noise and σ_w that of the Gaussian weight prior.
Again, notice the similarity of the loss function to L2 regularization.
Also note that for MAP we started with a randomly initialized zero-mean-Gaussian weight vector and then worked towards adjusting it to improve P(w|D). This has the same side effect as L2 regularizers, which can get stuck in local minima.
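For completeness, here is a toy sketch of the resulting MAP cost for a linear model: a data term from the Gaussian output noise plus the L2 penalty that comes from the zero-mean Gaussian prior. The function and data are illustrative only.

```python
import numpy as np

# MAP cost for a toy linear model y = w . x: squared-error data term
# plus an L2 (weight-decay) penalty from the zero-mean Gaussian prior.
def map_cost(w, xs, ts, sigma_noise=1.0, sigma_prior=1.0):
    ys = np.array([w @ x for x in xs])
    data_term = ((ts - ys) ** 2).sum() / (2 * sigma_noise ** 2)
    prior_term = (w ** 2).sum() / (2 * sigma_prior ** 2)   # L2 penalty on the weights
    return data_term + prior_term

xs = [np.array([1.0, 2.0]), np.array([0.5, -1.0])]
ts = np.array([1.5, 0.2])
cost = map_cost(np.array([0.3, 0.4]), xs, ts)
```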
We take the MAP approach because a full Bayesian approach over all possible weight vectors is computationally intensive and not tractable. There are tricks with MCMC that can help draw approximately unbiased samples from the true posterior over the weights. I may cover this later in another post.
Maybe now, you are equipped to validate the belief in God…