# A Primer for Bayesian Deep Learning — Part 2

# Inference approximations

We cannot evaluate the true posterior $p(\boldsymbol{\omega} \mid \mathbf{X}, \mathbf{Y})$ analytically, since it is intractable. Instead, we specify a structure that is easy to evaluate, i.e., an approximating *variational* distribution $q_\theta(\boldsymbol{\omega})$, parametrised by $\theta$. In other words, we use $q_\theta(\boldsymbol{\omega})$ as a proxy for $p(\boldsymbol{\omega} \mid \mathbf{X}, \mathbf{Y})$ to make predictions or to investigate the posterior distribution of the hidden variables.

**Ideally, $q_\theta(\boldsymbol{\omega})$ should be very close to $p(\boldsymbol{\omega} \mid \mathbf{X}, \mathbf{Y})$.**

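Therefore, we measure the closeness between the two distributions with the Kullback–Leibler (KL) divergence [6], which we minimise with regard to $\theta$:

$$\mathrm{KL}\big(q_\theta(\boldsymbol{\omega}) \,\|\, p(\boldsymbol{\omega} \mid \mathbf{X}, \mathbf{Y})\big) = \int q_\theta(\boldsymbol{\omega}) \log \frac{q_\theta(\boldsymbol{\omega})}{p(\boldsymbol{\omega} \mid \mathbf{X}, \mathbf{Y})} \, \mathrm{d}\boldsymbol{\omega}$$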

Keep in mind that this integral is only defined when $q_\theta(\boldsymbol{\omega})$ is absolutely continuous with regard to $p(\boldsymbol{\omega} \mid \mathbf{X}, \mathbf{Y})$ (i.e. for every measurable set $A$, $p(A \mid \mathbf{X}, \mathbf{Y}) = 0$ implies $q_\theta(A) = 0$).

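KL divergence minimisation enables us to approximate the predictive distribution for a new input $\mathbf{x}^*$ with output $\mathbf{y}^*$ as

$$p(\mathbf{y}^* \mid \mathbf{x}^*, \mathbf{X}, \mathbf{Y}) \approx \int p(\mathbf{y}^* \mid \mathbf{x}^*, \boldsymbol{\omega}) \, q_\theta^*(\boldsymbol{\omega}) \, \mathrm{d}\boldsymbol{\omega} =: q_\theta^*(\mathbf{y}^* \mid \mathbf{x}^*),$$

where $q_\theta^*(\boldsymbol{\omega})$ denotes a minimiser of the KL divergence.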

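Minimising the KL divergence is equivalent to maximising the *evidence lower bound* (ELBO) with respect to the variational parameters defining $q_\theta(\boldsymbol{\omega})$:

$$\mathcal{L}_{\mathrm{VI}}(\theta) := \int q_\theta(\boldsymbol{\omega}) \log p(\mathbf{Y} \mid \mathbf{X}, \boldsymbol{\omega}) \, \mathrm{d}\boldsymbol{\omega} - \mathrm{KL}\big(q_\theta(\boldsymbol{\omega}) \,\|\, p(\boldsymbol{\omega})\big) \le \log p(\mathbf{Y} \mid \mathbf{X}), \tag{2.4}$$

which defines the objective we will refer to hereafter.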
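To see why the two problems are equivalent, the following short derivation may help. It is a sketch that assumes Bayes' theorem in the form $p(\boldsymbol{\omega} \mid \mathbf{X}, \mathbf{Y}) = p(\mathbf{Y} \mid \mathbf{X}, \boldsymbol{\omega})\, p(\boldsymbol{\omega}) / p(\mathbf{Y} \mid \mathbf{X})$, with a prior $p(\boldsymbol{\omega})$ that does not depend on the inputs, and it writes $\mathcal{L}_{\mathrm{VI}}(\theta)$ for the ELBO, i.e. the expected log likelihood minus the prior KL.

$$
\begin{aligned}
\mathrm{KL}\big(q_\theta(\boldsymbol{\omega}) \,\|\, p(\boldsymbol{\omega} \mid \mathbf{X}, \mathbf{Y})\big)
&= \int q_\theta(\boldsymbol{\omega}) \log \frac{q_\theta(\boldsymbol{\omega})\, p(\mathbf{Y} \mid \mathbf{X})}{p(\mathbf{Y} \mid \mathbf{X}, \boldsymbol{\omega})\, p(\boldsymbol{\omega})} \, \mathrm{d}\boldsymbol{\omega} \\
&= \log p(\mathbf{Y} \mid \mathbf{X}) - \int q_\theta(\boldsymbol{\omega}) \log p(\mathbf{Y} \mid \mathbf{X}, \boldsymbol{\omega}) \, \mathrm{d}\boldsymbol{\omega} + \mathrm{KL}\big(q_\theta(\boldsymbol{\omega}) \,\|\, p(\boldsymbol{\omega})\big) \\
&= \log p(\mathbf{Y} \mid \mathbf{X}) - \mathcal{L}_{\mathrm{VI}}(\theta)
\end{aligned}
$$

The log evidence $\log p(\mathbf{Y} \mid \mathbf{X})$ does not depend on $\theta$, so lowering the KL divergence is the same as raising $\mathcal{L}_{\mathrm{VI}}(\theta)$. Because the KL divergence is non-negative, $\mathcal{L}_{\mathrm{VI}}(\theta)$ can never exceed the log evidence, which is where the name "lower bound" comes from.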

Maximising the first term in Eq. (2.4), known as the *expected log likelihood*, encourages $q_\theta(\boldsymbol{\omega})$ to describe the data well. Minimising the second term, the *prior KL*, encourages $q_\theta(\boldsymbol{\omega})$ to stay as close as possible to the prior. This trade-off illustrates the Bayesian justification of Occam's razor: among approximating distributions that explain the data, the simpler one is preferred over the more complex one.
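The two terms can be made concrete with a minimal sketch in plain Python. It assumes a toy one-parameter model that is not part of the original text: observations drawn from N(ω, 1), a prior p(ω) = N(0, 1), and a Gaussian approximating distribution with θ = (`mu`, `sigma`). All function names are hypothetical. The expected log likelihood is estimated by Monte Carlo and the prior KL is computed in closed form.

```python
import math
import random


def log_gaussian(x, mean, std):
    """Log density of N(mean, std^2) evaluated at x."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)


def kl_gaussians(mu_q, sigma_q, mu_p=0.0, sigma_p=1.0):
    """Closed-form KL(N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2))."""
    return (math.log(sigma_p / sigma_q)
            + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2 * sigma_p ** 2)
            - 0.5)


def elbo(data, mu, sigma, num_samples=2000, seed=0):
    """Monte Carlo estimate of the ELBO for the toy model y ~ N(omega, 1)."""
    rng = random.Random(seed)
    expected_log_lik = 0.0
    for _ in range(num_samples):
        omega = rng.gauss(mu, sigma)  # omega ~ q_theta(omega)
        expected_log_lik += sum(log_gaussian(y, omega, 1.0) for y in data)
    expected_log_lik /= num_samples
    # ELBO = expected log likelihood - prior KL
    return expected_log_lik - kl_gaussians(mu, sigma)


data = [1.8, 2.1, 2.4, 1.9, 2.2]
good = elbo(data, mu=1.7, sigma=0.4)   # q placed near the data
bad = elbo(data, mu=-1.0, sigma=0.4)   # q placed far from the data
```

In this toy setting the ELBO is higher for the distribution centred near the data, even though its prior KL is larger, because the gain in expected log likelihood outweighs the penalty.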

The entire process is known as *variational inference* (VI) [23]. With variational inference, we replace the calculation of integrals with the calculation of derivatives. Unlike other approaches in deep learning, we optimise over distributions instead of point estimates. This lets us keep the benefits of Bayesian modelling, such as the balance between complex models and models that describe the data well, and it yields probabilistic models that capture model uncertainty.

The approximation becomes tractable because calculating derivatives is much simpler than calculating integrals. However, this method does not scale well to large data (evaluating the expected log likelihood $\int q_\theta(\boldsymbol{\omega}) \log p(\mathbf{Y} \mid \mathbf{X}, \boldsymbol{\omega}) \, \mathrm{d}\boldsymbol{\omega}$ requires calculations over the entire dataset), and it does not adapt to complex models (models in which this last integral cannot be evaluated analytically). Recent advances in variational inference research enable us to address these problems.
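One common remedy for the first problem is data sub-sampling: the sum over all N data points is estimated from a random mini-batch of size M and rescaled by N / M. The sketch below illustrates this under stated assumptions; the toy likelihood N(ω, 1) and the function names are hypothetical and chosen only for illustration.

```python
import math
import random


def log_lik(y, omega):
    """Log density of y under the toy likelihood N(omega, 1)."""
    return -0.5 * math.log(2 * math.pi) - 0.5 * (y - omega) ** 2


def full_log_lik(data, omega):
    """Exact sum over all N data points (cost grows with N)."""
    return sum(log_lik(y, omega) for y in data)


def minibatch_log_lik(data, omega, batch_size, rng):
    """Unbiased estimate of the full sum from a random mini-batch of size M."""
    batch = rng.sample(data, batch_size)
    scale = len(data) / batch_size  # N / M
    return scale * sum(log_lik(y, omega) for y in batch)


rng = random.Random(0)
data = [rng.gauss(2.0, 1.0) for _ in range(10_000)]
exact = full_log_lik(data, omega=2.0)
estimates = [minibatch_log_lik(data, 2.0, 100, rng) for _ in range(500)]
average_estimate = sum(estimates) / len(estimates)
```

Each individual estimate is noisy, but it touches only 100 of the 10,000 points, and the average of many estimates approaches the exact sum.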

# Bayesian Deep Learning

Bayesian neural networks (BNNs) were first proposed in the ’90s [7] [8]. They offered a probabilistic interpretation of deep learning models by inferring distributions over the models’ weights. The benefits of such models include robustness to overfitting, uncertainty estimates, and the ability to learn from small datasets.

Bayesian deep learning places a prior distribution over a neural network’s weights, which induces a distribution over a parametric set of functions. Given weight matrices $\mathbf{W}_i$ and bias vectors $\mathbf{b}_i$ for layer $i$, we often place standard matrix Gaussian prior distributions over the weight matrices, $p(\mathbf{W}_i) = \mathcal{N}(\mathbf{0}, \mathbf{I})$, and assume a point estimate for the bias vectors for simplicity. The likelihood specification usually follows the standard Bayesian literature (such as the softmax likelihood or the Gaussian likelihood).
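As a rough illustration of how a prior over weights induces a distribution over functions, the sketch below draws several weight settings from a standard Gaussian prior and evaluates the resulting network on a few inputs. The architecture (one hidden layer of 10 tanh units, scalar output), the zero-valued bias point estimates, and all names are assumptions made for this example.

```python
import math
import random


def sample_weights(n_in, n_hidden, rng):
    """Draw W1 and W2 from the prior N(0, I); biases are point estimates (zero here)."""
    w1 = [[rng.gauss(0.0, 1.0) for _ in range(n_hidden)] for _ in range(n_in)]
    w2 = [rng.gauss(0.0, 1.0) for _ in range(n_hidden)]
    b1 = [0.0] * n_hidden
    return w1, b1, w2


def forward(x, weights):
    """One-hidden-layer network with tanh units and a scalar output."""
    w1, b1, w2 = weights
    hidden = [
        math.tanh(sum(x[i] * w1[i][j] for i in range(len(x))) + b1[j])
        for j in range(len(b1))
    ]
    return sum(h * w for h, w in zip(hidden, w2))


rng = random.Random(0)
inputs = [[-1.0], [0.0], [1.0]]
functions = []
for _ in range(5):
    weights = sample_weights(1, 10, rng)  # one draw from the prior = one function
    functions.append([forward(x, weights) for x in inputs])
```

Each entry of `functions` holds the outputs of a different random function on the same inputs. Inference then amounts to reweighting these functions according to how well they explain the observed data.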

Formulating these models is simple; performing inference in them is not. For many interesting models it can be very **hard**.

In 1987, Denker et al. [9] examined the general problem of learning from examples. They reviewed the existing literature and extended it by proposing a new way of training: placing a prior distribution over the space of weights.

They defined a set of inputs $\{\mathbf{x}_1, \dots, \mathbf{x}_N\}$ and then mapped each weight configuration to a set of corresponding network outputs $\{\mathbf{y}_1, \dots, \mathbf{y}_N\}$, enabling them to integrate over the weights and obtain a marginal probability for each output.

Given the observed training data, the marginal probabilities were then updated, i.e., the probabilities of weight configurations inconsistent with the data were set to zero. The updated marginal probabilities were then used to calculate the corresponding entropy.

Tishby et al. [10] extended the ideas of Denker et al. [9] and developed a statistical framework to reason about network generalisation error. This could be the earliest reference to a “Bayesian neural network”. They showed that the only probabilistic interpretation of a neural network’s Euclidean loss is as maximum likelihood with respect to a Gaussian likelihood over the network outputs. By defining a prior distribution over the network weights, they showed that inference could be performed by invoking Bayes’ theorem. They also proposed a quantity, dependent on the training set alone, that would correlate with the network’s generalisation error on a test set.

Denker and LeCun [11] built on the work of Tishby et al. and suggested using Laplace’s method to approximate the posterior of the Bayesian NN. They optimised the neural network with backpropagation to find a mode and then fitted a Gaussian to it, with a width determined by the Hessian of the likelihood.
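The two steps of Laplace's method (find a mode, then fit a Gaussian using the curvature at that mode) can be sketched on a one-dimensional toy problem. This is only an illustration under stated assumptions: the toy posterior, the finite-difference derivatives, and the function names are hypothetical, and the curvature used here is that of the full log posterior rather than of a neural network likelihood.

```python
def log_posterior(w):
    """Unnormalised log posterior: N(0, 1) prior and three observations with unit noise."""
    data = [1.2, 0.8, 1.0]
    log_prior = -0.5 * w ** 2
    log_lik = sum(-0.5 * (y - w) ** 2 for y in data)
    return log_prior + log_lik


def derivative(f, w, eps=1e-4):
    """Central finite-difference estimate of f'(w)."""
    return (f(w + eps) - f(w - eps)) / (2 * eps)


def second_derivative(f, w, eps=1e-3):
    """Central finite-difference estimate of f''(w)."""
    return (f(w + eps) - 2 * f(w) + f(w - eps)) / eps ** 2


def laplace_approximation(f, w0=0.0, lr=0.1, steps=500):
    """Return the mean and variance of the Gaussian fitted at the mode of f."""
    w = w0
    for _ in range(steps):
        w += lr * derivative(f, w)          # gradient ascent towards the mode
    curvature = -second_derivative(f, w)    # Hessian of the negative log posterior
    return w, 1.0 / curvature


mode, variance = laplace_approximation(log_posterior)
```

Because this toy posterior is itself Gaussian (precision 4, mean 0.75), the Laplace approximation recovers it up to numerical error. For a neural network the posterior is multi-modal and the fitted Gaussian only describes the neighbourhood of the mode that was found.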

Another extensive study of Bayesian NNs was carried out by MacKay [7]. MacKay used the approximation of Denker and LeCun to obtain the model evidence for model comparison. Through experiments with different model sizes and configurations, MacKay showed that model evidence correlates with generalisation error and can therefore be used to select the model size.

He further illustrated situations where model misspecification could cause Bayesian model comparison to fail, so that model evidence no longer indicates generalisation. This could happen when the priors over input weights of large and small magnitude are tied together.

Hinton and Van Camp [12] proposed using the minimum description length (MDL) principle to regularise the network weights by penalising large amounts of information in them. Their work was the first variational inference approximation to Bayesian NNs. For a NN with a single hidden layer, they showed that the objective could be computed analytically.

Neal [13] suggested other forms of inference approximation for Bayesian NNs based on *Monte Carlo* (MC) techniques; for example, Hamiltonian Monte Carlo (HMC) was used for posterior inference. HMC does not depend on any prior assumption about the form of the posterior distribution. Neal tried to reproduce MacKay’s experiments. He could reproduce some but not others, possibly because of the approximation error of Laplace’s method, on which MacKay relied. Neal also showed that, depending on the prior used, models would converge to various stable processes.
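For readers unfamiliar with HMC, the following is a minimal sketch of a single-variable sampler. It assumes a standard Gaussian target in place of a real network posterior, and the step size, number of leapfrog steps, and function names are hypothetical choices for illustration, not values from Neal's work.

```python
import math
import random


def neg_log_density(w):
    """Potential energy U(w) = -log p(w) up to a constant (standard Gaussian target)."""
    return 0.5 * w ** 2


def grad_neg_log_density(w):
    """Gradient of the potential energy."""
    return w


def hmc_step(w, rng, step_size=0.2, num_leapfrog=10):
    """One HMC transition: sample a momentum, simulate, then accept or reject."""
    momentum = rng.gauss(0.0, 1.0)
    w_new, p_new = w, momentum
    # Leapfrog integration: half step in momentum, alternating full steps, half step.
    p_new -= 0.5 * step_size * grad_neg_log_density(w_new)
    for i in range(num_leapfrog):
        w_new += step_size * p_new
        if i < num_leapfrog - 1:
            p_new -= step_size * grad_neg_log_density(w_new)
    p_new -= 0.5 * step_size * grad_neg_log_density(w_new)
    # Metropolis correction based on the change in total energy.
    current_h = neg_log_density(w) + 0.5 * momentum ** 2
    proposed_h = neg_log_density(w_new) + 0.5 * p_new ** 2
    if rng.random() < math.exp(min(0.0, current_h - proposed_h)):
        return w_new
    return w


rng = random.Random(0)
w = 3.0
samples = []
for i in range(3000):
    w = hmc_step(w, rng)
    if i >= 500:  # discard the first 500 transitions as burn-in
        samples.append(w)
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
```

The sampler only needs the log density and its gradient, which is why it makes no assumption about the shape of the posterior. The price is that every transition requires several gradient evaluations.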

Barber and Bishop [14] developed Hinton and Van Camp’s MDL approximation under a VI interpretation and used full covariance matrices instead of diagonal covariance matrices. They showed that the objective forms a lower bound on the model evidence. They placed gamma priors over the network hyper-parameters, performed VI with free-form variational distributions over them, and derived their optimal form. The model evidence remained constant with respect to the hyper-parameters, because these are averaged out with respect to their approximating distribution.

Recent research in BDL uses either variants of variational inference or sampling-based techniques. Graves [15] used *data sub-sampling* techniques in a fully factorised VI objective and **Monte Carlo estimates** [16] to approximate the intractable expected log likelihood, which enabled efficient scaling to large data and complex models.

Graves’ method [15] scaled to large amounts of data and complex models. However, it did not perform well in practice. Later on, Blundell et al. [18] built on the work of Graves and of Kingma and Welling [19] by re-parametrising the MC estimates of the expected log likelihood. Blundell et al. also modified the BNN model itself by placing a mixture-of-Gaussians prior over each weight and optimising it. This approach performed well and matched the state of the art in the literature. However, it can be computationally expensive: the number of model parameters increases substantially without much increase in model capacity. For example, using Gaussian distributions to approximate the Bayesian NN posterior doubles the number of model parameters, since every weight now has both a mean and a variance. This makes the methodology difficult to use in practice.
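The re-parametrisation idea can be sketched for a single weight. Instead of sampling the weight directly, we sample noise that does not depend on the variational parameters and write the weight as a deterministic function of both, so that gradients can flow through the sample. The example below is a simplified assumption-laden sketch: the test function f(w) = w², the direct use of `sigma` as a parameter, and the function names are illustrative choices and not the setup of Blundell et al.

```python
import random


def reparam_gradients(mu, sigma, num_samples=20000, seed=0):
    """MC estimate of the gradients of E_q[f(w)] for f(w) = w^2, with w = mu + sigma * eps."""
    rng = random.Random(seed)
    grad_mu = 0.0
    grad_sigma = 0.0
    for _ in range(num_samples):
        eps = rng.gauss(0.0, 1.0)   # noise independent of (mu, sigma)
        w = mu + sigma * eps        # w is distributed as N(mu, sigma^2)
        df_dw = 2.0 * w             # derivative of f(w) = w^2
        grad_mu += df_dw            # chain rule: dw/dmu = 1
        grad_sigma += df_dw * eps   # chain rule: dw/dsigma = eps
    return grad_mu / num_samples, grad_sigma / num_samples


g_mu, g_sigma = reparam_gradients(mu=1.5, sigma=0.5)
```

For this test function the expectation is mu² + sigma², so the exact gradients are 2·mu and 2·sigma, and the estimates should land close to those values. The sketch also makes the parameter doubling visible: one weight requires two trainable numbers, `mu` and `sigma`.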

An alternative to VI is expectation propagation. Hernandez-Lobato and Adams [17] proposed probabilistic backpropagation (PBP), which considerably improved on Graves’ approach in both root mean square error (RMSE) and uncertainty estimation. Other, similar methods are approximate inference techniques based on *α*-divergence minimisation [17], [20], [21]. Although most of these are not yet practical to implement, some [17] have been used in reinforcement learning [22]. These approximate inference techniques minimise various forms of Rényi’s *α*-divergence [23], which yields objectives similar to VI’s ELBO. They are used to avoid VI’s mode-seeking behaviour, but in doing so they seem to sacrifice their estimation of the dominant modes of the posterior.

During prediction, we are usually more concerned with finding modes and investigating them than with interpolating between different modes. This is where stochastic regularisation techniques (SRTs) can be used: they regularise deep learning models by injecting stochastic noise into the model. SRTs include *dropout* [24], [25], multiplicative Gaussian noise (MGN, also referred to as *Gaussian dropout*) [25], and dropConnect [26], among many others [27], [28].

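Dropout is a regularisation technique used to avoid overfitting in neural networks. During training, it multiplies the units of each layer by a random binary vector, which switches off a random subset of the units. It was introduced by Hinton et al. [24] and studied more extensively by Srivastava et al. [25].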

Multiplicative Gaussian noise is very similar to dropout. The only difference is that the binary vectors are replaced by vectors of draws from a Gaussian distribution with mean 1, $\mathcal{N}(1, 1)$, rather than draws from a Bernoulli distribution.
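The difference between the two noise types can be sketched in a few lines. The example assumes a keep probability of 0.5 and a flat list of activations; the function names are hypothetical, and the rescaling of kept units that many dropout implementations apply is left out for brevity.

```python
import random


def bernoulli_mask(size, keep_prob, rng):
    """Dropout: each unit is kept (1.0) with probability keep_prob, else dropped (0.0)."""
    return [1.0 if rng.random() < keep_prob else 0.0 for _ in range(size)]


def gaussian_mask(size, rng, std=1.0):
    """Multiplicative Gaussian noise: each unit is scaled by a draw from N(1, std^2)."""
    return [rng.gauss(1.0, std) for _ in range(size)]


def apply_noise(activations, mask):
    """Multiply every activation by its noise value."""
    return [a * m for a, m in zip(activations, mask)]


rng = random.Random(0)
activations = [0.5, -1.2, 3.0, 0.7]
dropped = apply_noise(activations, bernoulli_mask(len(activations), 0.5, rng))
noisy = apply_noise(activations, gaussian_mask(len(activations), rng))
```

With the Bernoulli mask every activation is either passed through unchanged or zeroed, whereas the Gaussian mask perturbs every activation by a random factor centred on 1.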

# Conclusion

In summary, standard deep learning relies on point estimates of parameters and predictions, which limits our ability to answer questions about model confidence and eventually leads to situations where we cannot tell whether a model is producing sensible predictions or simply guessing. NNs work well on large datasets. However, labelled data is hard to collect, and in some applications large amounts of data are not available. The problem then is how to use neural networks with small data, since NNs are known to overfit quickly. By adopting a Bayesian interpretation, NNs can draw on probabilistic methods and combine the benefits of both neural networks and Bayesian modelling. They are able to make use of any kind of available information.

These methods draw on Gaussian processes to learn which ways of generalising from the data are more likely and which are less likely. This approach therefore allows us to view confidence bounds for decision-making in our model. It becomes very useful in situations where we need to know whether a model is certain about its output, and how confident it is. Questions like “Do we need more diverse data? Should we modify the model? Or should we simply be more careful when making decisions?” are core concerns in Bayesian deep learning.

Bayesian deep learning (BDL) has gained traction in recent years, mainly due to the rapid growth of research in machine learning models such as neural networks and probabilistic models. What makes BDL so interesting is that it introduces model uncertainty to neural networks. BDL models are able to combine the perception quality of neural networks with the inference quality of Bayesian models.

Its popularity has given rise to various applications, such as active learning, link prediction, and Bayesian reinforcement learning, in which it improves performance. With more advances in BDL, variants of deep learning architectures such as Bayesian deep convolutional neural networks (BCNNs) could be used for efficient representation of visual information. BDL is also expected to become more scalable.

# References

1. J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. V. Le, M. Z. Mao, M. Ranzato, A. W. Senior, P. A. Tucker, K. Yang and A. Y. Ng, “Large scale distributed deep networks,” *NIPS*, pp. 1232–1240, 2012.
2. Y. LeCun, L. Bottou, Y. Bengio and P. Haffner, “Gradient-based learning applied to document recognition,” *Proceedings of the IEEE*, vol. 86, no. 11, pp. 2278–2324, 1998.
3. A. Krizhevsky, I. Sutskever and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” *Advances in Neural Information Processing Systems*, pp. 1097–1105, 2012.
4. P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus and Y. LeCun, “Overfeat: Integrated recognition, localization and detection using convolutional networks,” *CoRR*, vol. abs/1312.6229, 2013.
5. R. Girshick, J. Donahue, T. Darrell and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” *CoRR*, vol. abs/1311.2524, 2013.
6. S. Kullback, *Information theory and statistics*, John Wiley & Sons, 1959.
7. D. J. C. MacKay, “A practical Bayesian framework for backpropagation networks,” *Neural Computation*, vol. 4, no. 3, pp. 448–472, 1992.
8. N. Houlsby, F. Huszár, Z. Ghahramani and M. Lengyel, “Bayesian active learning for classification and preference learning,” *arXiv preprint arXiv:1112.5745*, 2011.
9. J. Denker, D. Schwartz, B. Wittner, S. Solla, R. Howard, L. Jackel and J. Hopfield, “Large automatic learning, rule extraction, and generalization,” *Complex Systems*, vol. 1, no. 5, pp. 877–922, 1987.
10. N. Tishby, E. Levin and S. A. Solla, “Consistent inference of probabilities in layered networks: Predictions and generalizations,” *International Joint Conference on Neural Networks, IEEE*, pp. 403–409, 1989.
11. J. Denker and Y. LeCun, “Transforming neural-net output levels to probability distributions,” in *Advances in Neural Information Processing Systems 3*, 1991.
12. G. E. Hinton and D. van Camp, “Keeping the neural networks simple by minimizing the description length of the weights,” *COLT*, pp. 5–13, 1993.
13. R. M. Neal, “Bayesian learning for neural networks,” PhD thesis, University of Toronto, 1995.
14. D. Barber and C. M. Bishop, “Ensemble learning in Bayesian neural networks,” *NATO ASI Series F: Computer and Systems Sciences*, vol. 168, pp. 215–238, 1998.
15. A. Graves, “Practical variational inference for neural networks,” *NIPS*, 2011.
16. M. Opper and C. Archambeau, “The variational Gaussian approximation revisited,” *Neural Computation*, vol. 21, no. 3, pp. 786–792, 2009.
17. J. M. Hernández-Lobato and R. P. Adams, “Probabilistic backpropagation for scalable learning of Bayesian neural networks,” *ICML*, 2015.
18. C. Blundell, J. Cornebise, K. Kavukcuoglu and D. Wierstra, “Weight uncertainty in neural network,” *ICML*, 2015.
19. D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” *arXiv preprint arXiv:1312.6114*, 2013.
20. Y. Li and R. E. Turner, “Variational inference with Rényi divergence,” *arXiv preprint arXiv:1602.02311*, 2016.
21. T. Minka, “Divergence measures and message passing,” 2005.
22. S. Depeweg, J. M. Hernández-Lobato, F. Doshi-Velez and S. Udluft, “Learning and policy search in stochastic dynamical systems with Bayesian neural networks,” *arXiv preprint arXiv:1605.07127*, 2016.
23. A. Rényi, “On measures of entropy and information,” in *Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability*, 1961.
24. G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever and R. R. Salakhutdinov, “Improving neural networks by preventing co-adaptation of feature detectors,” *arXiv preprint arXiv:1207.0580*, 2012.
25. N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” *JMLR*, 2014.
26. L. Wan, M. Zeiler, S. Zhang, Y. LeCun and R. Fergus, “Regularization of neural networks using dropconnect,” *ICML*, 2013.
27. G. Huang, Y. Sun, Z. Liu, D. Sedra and K. Weinberger, “Deep networks with stochastic depth,” *arXiv preprint arXiv:1603.09382*, 2016.
28. D. Krueger, T. Maharaj, J. Kramár, M. Pezeshki, N. Ballas, N. R. Ke, A. Goyal, Y. Bengio, H. Larochelle and A. Courville, “Zoneout: Regularizing RNNs by randomly preserving hidden activations,” *arXiv preprint arXiv:1606.01305*, 2016.