by Carles Gelada and Jacob Buckman
WARNING: This is an old version of this blogpost, and if you are a Bayesian, it might make you angry. Click here for an updated post with the same content.
Context: About a month ago Carles asserted on Twitter that Bayesian Neural Networks make no sense. This generated lots of good discussion, including a thorough response from Andrew Gordon Wilson defending BNNs. However, we feel that most responses missed the point of our critique. This blog post is a more thorough justification of our original arguments.
Proponents of Bayesian neural networks often claim that trained BNNs output distributions which capture epistemic uncertainty. Epistemic uncertainty is incredibly valuable for a wide variety of applications, and we agree with the Bayesian approach in general. However, we argue that BNNs require highly informative priors to handle uncertainty. We show that if the prior does not distinguish between functions that generalize and functions that don’t, Bayesian inference cannot provide useful uncertainties. This calls into question the standard argument that “uninformative priors” are appropriate when the true prior distribution is unknown.
What is Bayesian Inference?
In discussions on Twitter, many researchers seem to believe that “Bayesian” is synonymous with “uncertainty-aware”, or that any algorithm that uses sets or distributions of outcomes must be Bayesian. We would like to make it clear that in our view, this is not a fair characterization. The Bayesian approach to uncertainty, which involves updating prior distributions into posterior distributions using Bayes’ Rule, is certainly one of the most popular approaches. But there are other, non-Bayesian approaches as well; for example, concentration inequalities are clearly non-Bayesian, but they allow us to compute confidence intervals and uncertainty sets. (For reference, the word “Bayesian” in Bayesian Neural Network is, in fact, a reference to Rev. Bayes. Surprising but true!)
At its core, Bayes’ Rule is a relationship between conditional probability distributions:
\[Pr(A=a \mid B=b)=\frac{Pr(B=b \mid A=a) Pr(A=a)}{Pr(B=b)}\]This is a powerful, fundamental relationship, to be sure; but any conceptions of “belief updating” or “distributions over possible worlds” are post-hoc interpretations. Bayes’ Rule simply says that for any two non-independent random variables \(A\) and \(B\), seeing that \(B\) took a specific value \(b\) changes the distribution of the random variable \(A\). In standard lingo, the term \(Pr(A=a)\) is called the prior, \(Pr(B=b \mid A=a)\) is the likelihood, and \(Pr(A=a \mid B=b)\) is the posterior. This wording stems from the fact that we have an original (prior) distribution for the random variable \(A\), and then use the observed \(b\) to provide an updated distribution (the posterior).
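To make the relationship concrete, here is a minimal numeric sketch: two discrete, non-independent random variables with an arbitrary, purely illustrative joint distribution, where the posterior over \(A\) is computed directly via Bayes’ Rule.

```python
# A minimal worked example (illustrative numbers): Bayes' Rule for two discrete,
# non-independent random variables A and B, computed from their joint distribution.
import numpy as np

# Assumed joint distribution Pr(A = a, B = b); rows index a, columns index b.
joint = np.array([[0.30, 0.10],
                  [0.15, 0.45]])

prior_A = joint.sum(axis=1)            # Pr(A = a)
prior_B = joint.sum(axis=0)            # Pr(B = b)
likelihood = joint / prior_A[:, None]  # Pr(B = b | A = a)

b = 1                                                    # suppose we observe B = 1
posterior_A = likelihood[:, b] * prior_A / prior_B[b]    # Bayes' Rule
print("prior     Pr(A=a)       :", prior_A)
print("posterior Pr(A=a | B=1) :", posterior_A)          # observing b shifts the distribution of A
```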
Let’s consider how we might apply the Bayesian framework to a classification problem. We have some input space \(\chi\) and some output space \(Y\), which we assume are discrete (for the sake of simplicity). There is a family of functions \(f: \chi \to Y\) mapping between them; it’s useful to think of each \(f\) as a vector \(f\in Y^\chi\), where indexing the vector at \(x\in\chi\) equates to evaluating the function, \(f_x = f(x)\). There exists some ground-truth function \(f^*: \chi \to Y\) that we are interested in. A Bayesian approach to the problem says that in the real world there is a random variable \(F^*\) over classification tasks, and that \(f^*\) is a sample from it. We will use \(Pr(F^*=f)\) to denote the distribution of \(F^*\). (From now on, we will just abbreviate it to \(Pr(f)\).) Since a dataset of input-output pairs \(D=\{(x_i, f^*(x_i))\}\) is most definitely not independent of \(F^*\), we can use Bayes’ Rule to compute the distribution of \(F^*\) given that we have observed \(D\):
\[Pr(f \mid D) = \frac{Pr(D \mid f)Pr(f)}{Pr(D)}\]The likelihood term \(Pr(D \mid f) = \prod_{(x,y)\in D} 1_{f(x) = y}\) simply says that the dataset must be consistent with \(f\): it equals \(1\) if \(f\) reproduces every label in \(D\), and \(0\) otherwise. Why is this conditional distribution interesting? Because if the dataset is informative enough, the distribution of \(F^*\) might collapse to a single point, leaving us with no uncertainty over what \(f^*\) is. Even if the distribution does not collapse to a single point, we could still do many interesting things with \(Pr(f \mid D)\). For example, we can provide a point estimate by taking the posterior mean,
\[\hat{f} = \int f \, Pr(f \mid D) \, df\]Or by finding the maximum a posteriori (MAP) estimator,
\[\dot{f}=\arg\max_f Pr(f \mid D)\]But even more interestingly, we can use the distribution to provide uncertainty: the distribution over what the particular output \(f^*(x)\) might be. Given a test point \(x\), we can output the posterior predictive probability \(Pr(F^*(x)=y \mid D)\) for each label \(y\). This can be very important; for example, in many sensitive applications, it is essential to abstain from making predictions when uncertain. Up to this point, Bayesian methods look very appealing.
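To make this pipeline concrete, here is a minimal sketch of exact Bayesian inference over a tiny, fully-enumerable function class. Nothing here is what a real BNN does: the input space, the smoothness-flavored prior, the dataset, and the 0.9 abstention threshold are all illustrative assumptions.

```python
# A minimal sketch (illustrative assumptions throughout): exact Bayesian inference
# over a tiny, fully-enumerable function class, ending in a posterior predictive
# and a simple abstention rule.
import itertools
from collections import defaultdict

X = [0, 1, 2, 3]                                        # input space (chi)
Y = [0, 1]                                              # output space
functions = list(itertools.product(Y, repeat=len(X)))   # every f: X -> Y, stored as a tuple of labels

def prior(f):
    # Assumed prior Pr(f): favor functions with fewer label changes between neighboring inputs.
    changes = sum(f[i] != f[i + 1] for i in range(len(f) - 1))
    return 0.5 ** changes

Z = sum(prior(f) for f in functions)                    # normalizer for the prior

D = {0: 0, 1: 0, 2: 1}                                  # observed labels; x = 3 is the test point

def likelihood(D, f):
    # Pr(D | f) = product over (x, y) in D of the indicator 1[f(x) = y]
    return 1.0 if all(f[x] == y for x, y in D.items()) else 0.0

# Posterior Pr(f | D) via Bayes' Rule, normalized over the whole function class.
unnorm = {f: likelihood(D, f) * prior(f) / Z for f in functions}
evidence = sum(unnorm.values())                         # Pr(D)
posterior = {f: w / evidence for f, w in unnorm.items()}

# Posterior predictive Pr(F*(x_test) = y | D), plus an abstention rule.
x_test = 3
predictive = defaultdict(float)
for f, p in posterior.items():
    predictive[f[x_test]] += p

print(dict(predictive))
if max(predictive.values()) < 0.9:                      # illustrative confidence threshold
    print(f"abstain: too uncertain at x = {x_test}")
```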
But there is one core problem with the Bayesian framework. In practice, we never have access to the prior distribution \(Pr(f)\)! Who could ever claim to know the real-world distribution of functions that solve classification tasks? Not us, and certainly not Bayesians. Instead, BNNs simply choose an arbitrary prior distribution \(q(f)\) over functions, and Bayesian inference is performed to compute \(q(f \mid D)\). The question of whether \(q(f)\) is close to the true distribution \(Pr(f)\) is swept under the rug. In the face of this issue, some Bayesians justify the validity of BNN methods by claiming that choosing “uninformative” distributions for \(q(f)\) is sound when the true distribution \(Pr(f)\) is unknown. However, the quality of the uncertainties output by BNNs depends entirely on the prior \(q(f)\), so, as we will show, questions about the mismatch between \(Pr(f)\) and \(q(f)\) should not be dismissed so quickly. Others take the opposite perspective, and claim that because neural networks convert the uninformative prior in weight space into a structured prior in function space, the prior is actually close enough to be good. But as we will also discuss, known properties of neural networks call that into question.
Uncertainties from Bayesian Neural Nets with Generalization-Agnostic Priors
In order to show the profound importance of priors in Bayesian neural networks, we introduce generalization-agnostic priors. Performing Bayesian inference with such priors cannot reduce the uncertainty of \(f^*(x)\) for any \(x\not\in D\). This will show that, for the Bayesian framework to be useful for deep learning, the priors used must be connected to the generalization properties of neural networks: they must assign higher probability to functions that generalize well than to those that don’t. To our knowledge, there is no existing work determining whether the priors currently used satisfy this necessary condition, and in fact we provide some informal arguments for why it’s likely that they don’t.
Consider a dataset \(C=\{(x_i,y_i)\}\) which contains all the pairs in \(D\) (i.e. \(D \subset C\)), but also contains some “corrupted” input-output pairs: \(\exists (x,y)\in C\) s.t. \(y\neq f^*(x)\). Zhang et al. (2017) showed that we can train a neural network to perfectly fit \(C\). In other words, \(\exists \theta\) s.t. \(f_\theta(x) = y\ \forall (x,y) \in C\). Thus, our networks have so much capacity that not only can they fit the correct labels, they can fit arbitrary corrupted labels! Of course, even though any network \(f_\theta\) trained on \(C\) will achieve \(Pr(C \mid f_\theta) = 1\) (and hence \(Pr(D \mid f_\theta) = 1\), since \(D \subset C\)), its performance on any test set \(D_\text{test}\) is going to be terrible. Define a prior \(q\) to be “generalization-agnostic” if \(q(f^*) \approx q(f_\theta)\): in other words, if it assigns similar probability to functions that generalize well (\(f^*\) or functions close to it) and to functions that generalize poorly (like \(f_\theta\)). What is the problem with these priors? Since the likelihood of \(D\) under both \(f^*\) and \(f_\theta\) is \(1\), and since the prior probabilities are similar, the posterior probabilities must also be similar. This can be easily seen:
\[q(f^* \mid D) = \frac{q(D \mid f^*) \cdot q(f^*)}{q(D)} = \frac{1\cdot q(f^*)}{q(D)} \approx \frac{1\cdot q(f_\theta)}{q(D)} = q(f_\theta \mid D)\]By construction, \(f_\theta\) yields the wrong output for some test point \(x\): \(f^*(x) \neq f_\theta(x)\). Thus, under a generalization-agnostic prior, no matter how big the dataset \(D\) is, we will never be able to reduce the uncertainty about the right output at \(x\). Clearly, for Bayesian inference to make sense, it’s crucial that our priors are capable of distinguishing between functions that generalize well and functions that don’t.
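Here is a tiny numerical illustration of this argument, using a variant of the earlier toy setup; the assumed ground-truth function and the uniform prior are illustrative choices. With a generalization-agnostic (uniform) prior, labeling almost the entire input space still leaves the posterior predictive at the held-out point at chance.

```python
# An illustrative variant of the earlier toy example: a uniform prior is
# generalization-agnostic by construction, so conditioning on lots of data
# reduces no uncertainty at the held-out input.
import itertools
from collections import defaultdict

X = list(range(8))
Y = [0, 1]
functions = list(itertools.product(Y, repeat=len(X)))    # every f: X -> Y

f_star = tuple(1 if x >= 4 else 0 for x in X)            # assumed ground-truth function
D = {x: f_star[x] for x in X[:-1]}                       # observe every input except the last one

def likelihood(D, f):
    return 1.0 if all(f[x] == y for x, y in D.items()) else 0.0

prior = 1.0 / len(functions)                             # uniform prior over all functions
unnorm = {f: likelihood(D, f) * prior for f in functions}
evidence = sum(unnorm.values())
posterior = {f: w / evidence for f, w in unnorm.items()}

x_test = X[-1]
predictive = defaultdict(float)
for f, p in posterior.items():
    predictive[f[x_test]] += p

# Prints {0: 0.5, 1: 0.5}: 7 of the 8 inputs are labeled, yet the posterior says
# nothing about the held-out point, exactly as the argument above predicts.
print(dict(predictive))
```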
Are Current BNNs Generalization-Agnostic?
It’s common to use simple priors for BNNs, e.g. independent Gaussian distributions over the weights. Combined with the architecture of the neural network, this induces a structural prior in function space. Are we really to believe that such a simple procedure is capable of distinguishing between nicely-generalizing networks and poorly-generalizing networks? The following two facts provide an intuitive argument suggesting that priors that are simple in weight space (like Gaussians) induce structural priors which are, in fact, generalization-agnostic: 1) Gaussian priors \(q\) are smooth, in the sense that they assign similar probability to nearby points, and 2) training a neural network on any dataset, corrupted or not, almost always results in only a tiny change in the weights from their initialization. Thus, it seems reasonable to expect the prior probability \(q\) assigned to good weights (those which represent \(f^*\)) and bad weights (those which represent \(f_\theta\)) to be similar.
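A small numerical sketch of fact (1), under purely illustrative assumptions about the parameter count, the prior scale, and the size of the training update; the “corrupted” weights below are just a small random perturbation standing in for the result of training on corrupted labels.

```python
# Illustrative sketch: an isotropic Gaussian weight prior assigns nearly identical
# per-parameter density to an initialization and to a nearby weight vector, no
# matter what labels those nearby weights happen to fit.
import numpy as np

rng = np.random.default_rng(0)
dim, sigma = 10_000, 0.1                                   # assumed number of weights and prior std

w_init = rng.normal(0.0, sigma, size=dim)                  # weights drawn from the prior
w_corrupt = w_init + rng.normal(0.0, 1e-3, size=dim)       # stand-in for weights fitting corrupted labels

def log_prior(w):
    # log N(w; 0, sigma^2 I), dropping the constant shared by both evaluations
    return -0.5 * np.sum(w ** 2) / sigma ** 2

diff = log_prior(w_init) - log_prior(w_corrupt)
print(f"log q(w_init) - log q(w_corrupt) : {diff:.2f}")                # a handful of nats across 10,000 weights
print(f"per-parameter density ratio      : {np.exp(diff / dim):.6f}")  # ~1.0: the prior barely tells them apart
```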
But in practice, BNNs do generalize to test points, and do seem to output reasonable uncertainty estimates. (Although it’s worth noting that simpler approaches, like ensembles, consistently outperform BNNs.) Wouldn’t that be impossible if their priors were generalization-agnostic? Well, there is another piece to the puzzle: approximation. Computing \(q(f \mid D)\) is a highly non-trivial task called Bayesian inference; a large community studies tractable approximations to this quantity. (For example, variational inference formulates the problem of computing \(q(f \mid D)\) as an optimization problem.) The trickiness of computing \(q(f \mid D)\) could actually be the key to why BNNs with generalization-agnostic priors do something reasonable, despite their true posteriors being useless. They might not be learning anything close to the true posterior! In other words, the B in BNN might…not be doing much.
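For concreteness, here is a minimal sketch of the variational framing on a toy conjugate model; this is not a BNN, and the model, the crude grid-search optimizer, and all numbers are illustrative. The point is simply that the posterior is obtained by optimizing the ELBO over a family of Gaussians, and nothing in the procedure itself guarantees the result lands near the true posterior when the family or the optimization is inadequate.

```python
# A minimal sketch of variational inference as optimization (toy conjugate model,
# illustrative assumptions): maximize ELBO(q) = E_q[log p(D | theta)] - KL(q || prior)
# over a family of Gaussians q = N(mu, sigma^2).
import numpy as np

rng = np.random.default_rng(0)
theta_true = 1.5
D = rng.normal(theta_true, 1.0, size=20)          # observed data: y_i ~ N(theta, 1), prior theta ~ N(0, 1)

def elbo(mu, sigma, n_samples=4000):
    theta = rng.normal(mu, sigma, size=n_samples)                         # Monte Carlo samples from q
    log_lik = np.sum(-0.5 * (D[None, :] - theta[:, None]) ** 2, axis=1)   # log p(D | theta), up to constants
    kl = np.log(1.0 / sigma) + 0.5 * (sigma ** 2 + mu ** 2 - 1.0)         # KL(N(mu, sigma^2) || N(0, 1))
    return log_lik.mean() - kl

# Crude "inference as optimization": grid search over the variational parameters.
mus = np.linspace(0.0, 2.5, 51)
sigmas = np.linspace(0.05, 1.0, 40)
_, mu_hat, sigma_hat = max((elbo(m, s), m, s) for m in mus for s in sigmas)

# This toy model is conjugate, so the exact posterior is known: N(sum(D)/(n+1), 1/(n+1)).
n = len(D)
print(f"VI approximation : mu = {mu_hat:.3f}, sigma = {sigma_hat:.3f}")
print(f"exact posterior  : mu = {np.sum(D) / (n + 1):.3f}, sigma = {np.sqrt(1.0 / (n + 1)):.3f}")
```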
We do want to emphasize, though, that this section contains pure speculation. Also, it could be the case that there exists work already addressing these questions that we are unaware of. If that is the case, we would highly appreciate any references, and we will update this blog post accordingly.
A Sober Look at Bayesian Neural Networks
Good uncertainty estimates must be centered around the generalization properties of NNs. To have any guarantees that the uncertainties provided by BNNs are useful, we first need to understand what makes a specific neural network \(f\) generalize well or generalize badly. That would allow us to define priors with which to perform Bayesian inference. But we simply don’t have that understanding yet.
So viewed through this lens, a BNN with an arbitrary prior is just a neural network with a particular architectural decision: a network mapping each input to a distribution over outputs, with the prior as a hyperparameter of the model. Making the network Bayesian bought us nothing. It will only be helpful if we find a good prior and validate that we are actually doing accurate inference. If you personally believe that exploring this space of priors (similar to exploring the space of architectures or hyperparameters) is particularly promising, then that is a good reason to keep working on BNNs.
But when a Bayesian tells you that BNNs provide good uncertainty estimates, that is equivalent to claiming that they have access to a good prior in weight or function space. We should ask, “what evidence are you providing that your priors are any good?” The onus is on the Bayesian community to demonstrate that they are.
Regardless of whether you believe that we can find good generalization-aware priors, it’s important that we, as a field, stop ignoring the crucial role the prior plays in the Bayesian framework. We need to think critically and not be swayed by sloppy arguments like “uninformative priors are good under uncertainty.”
A note on the Neural Tangent Kernel line of work. NTK is interesting because we actually have gained something from thinking about infinite-width nets as Bayesian models. They become linear! And thus, simple to analyze. They provide insights into how to think about standard neural network learning. We think the likelihood that NTK will be of practical use is low, since these models are much more computationally expensive and much less scalable than standard deep nets, but we do expect this research direction to have long-term impact on the field from a theoretical standpoint. We are personally very excited about this line of work. For example, we think it can provide valuable insights into solving the notorious stability issues of deep RL methods.
A Final Note
A little bit of context for this blog post. We’ve been thinking about uncertainty in NNs because our current research centers on showing how taking uncertainty into account is essential for sample-efficient RL. The tweet thread linked at the beginning summarized our opinions about uncertainties and BNNs, and it received a fair bit of attention. Overall, it led to a set of very lively and interesting discussions, a blog post reply, and an enormous number of interesting references. But surprisingly, some senior members of the community responded with personal attacks on Carles in front of an audience of tens of thousands. This behaviour strongly discourages young researchers from thinking and discussing ideas publicly. We all understand that putting your ideas out in the world means that they will get critiqued and picked apart, especially when you are critiquing the foundations of a whole field, but the discussion should stay exclusively on the science. Even more so when the two parties have different levels of seniority and influence. The right to talk about which research directions are promising and which aren’t should not be reserved for professors and well-established researchers. If young people don’t feel comfortable talking openly, we will all miss out on important ideas.
Hit us up on Twitter (Carles and Jacob) to continue the discussion! Come at us, Bayesians ;-)