Most General Mixture Prior
A lot of people use latent variable models with a location-scale prior. This post explores the most general, and in a sense most “objective”, prior one might employ when generalizing to a multi-modal distribution.
This image will make sense after you’ve read the post, but I’m putting it up top as a fun thing to look at in an otherwise quite dry post.
To set the stage, imagine you are using a VAE, GAN, or other latent variable model. It’s working great, but there’s a thought gnawing at the back of your head.
“What if this distribution and all of the interpolation assumptions I am using to train my models are forcing a kind of unimodality?”
So you might reach for a trainable mixture prior, randomly initialized, or maybe you have read some of the literature and go for a VampPrior. But now you have a new worry.
“My new prior is kind of a black box. I just wanted to express the fact that I expect a potentially multimodal distribution and I ended up with something so general that I’m worried the prior might be overfitting to the data now.”
A lot of papers talk about the amortized approximate posterior as the “optimal prior” for a VAE. This is a misnomer, except in a very limited sense. It is true that, given the set of approximate posteriors, the prior which maximizes the ELBO on the training data is the aggregated (amortized) approximate posterior. However, when the prior is allowed to track the approximate posterior dynamically, it will overfit the training data so long as the posterior is expressive enough and the optimization good enough. The scenario is as follows: the approximate posterior converges to a Dirac delta for each data point such that no two overlap. The amortized approximate posterior is then a distribution which assigns mass $1/n$ to each of these codes, incurring a KL cost of $\log(n)$ per data point in the ELBO. We are now training a pure autoencoder with no regularization, and our generative model amounts to sampling a point from the training set and passing it through that autoencoder, which likely achieves a perfect fit (for any good decoder this should be true). With a more “suboptimal” prior, one that is fixed or is not allowed maximum flexibility to vary as the approximate posterior varies, one cannot end up in this situation. In this way, having a “suboptimal” prior is actually a requirement, since the prior represents the assumed generative process, which should cover data not in the training set.
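To spell out where the $\log(n)$ comes from: write the aggregated (amortized) approximate posterior as $p(z) = \frac{1}{n}\sum_{j=1}^{n} q(z \mid x_j)$ and treat the collapsed, non-overlapping posteriors as point masses at distinct codes $z_1, \dots, z_n$. This is a heuristic limiting computation (KL divergences involving Dirac deltas need care), but it makes the bookkeeping clear:

$$
\mathrm{KL}\big(q(z \mid x_i)\,\|\,p(z)\big) \;=\; \mathrm{KL}\Big(\delta_{z_i}\,\Big\|\,\tfrac{1}{n}\textstyle\sum_{j=1}^{n}\delta_{z_j}\Big) \;=\; -\log\tfrac{1}{n} \;=\; \log n .
$$

The KL term of the ELBO therefore contributes a constant $\log n$ per data point and no longer regularizes the reconstruction term.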
Now that the above elaboration is over, let’s try to frame the problem we’re trying to solve. We want a family of distributions that are essentially the same as a basic Gaussian (or other location-scale family), but are not necessarily unimodal. A natural choice for this would be a mixture distribution where each mixture component is equally likely, because there is no prior knowledge about the distribution of frequencies for the modes. We also probably want the underlying distributions we are taking a mixture of to be the same distribution as in the unimodal case (for the sake of exposition, let’s say Gaussian). The last condition is the trickiest one to model. From the perspective of each mode, we do not want to assume any kind of structure in the relationship to the other modes. This is important because we want something that has a truly blank slate and does not bias us to consider certain modes as more central in the data and others as more exotic. To do this, we will use the Fisher-Rao distance from Information Geometry, and require the distance from each mixture component to every other mixture component to be equal. The significance of this choice of distance function is that it is the natural metric in Information Geometry, particularly where the exponential connection between distributions is concerned. To illustrate this naturalness, Chentsov’s theorem states that it is, up to rescaling, the unique metric on a statistical manifold that is invariant under sufficient statistics for the distributions in question. Lastly, because the overall scale and location of the mixture will not impact the ultimate interpretation, we’d like to define the family only up to simultaneously translating or scaling the components therein.
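One way to write these requirements down concretely (the symbols $K$, $d_{\mathrm{FR}}$, and $r$ are shorthand introduced here, not notation used elsewhere in the post): the prior is

$$
p(z) \;=\; \frac{1}{K}\sum_{k=1}^{K} \mathcal{N}\big(z;\, \mu_k,\, \sigma_k^2 I\big),
\qquad
d_{\mathrm{FR}}\big(\mathcal{N}(\mu_i, \sigma_i^2 I),\, \mathcal{N}(\mu_j, \sigma_j^2 I)\big) \;=\; r \quad \text{for all } i \neq j,
$$

with the whole configuration considered only up to a simultaneous translation and rescaling of all the components.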
What does this family of distributions look like? Is there a natural parametrization? Luckily we can easily answer both of those questions. To answer the first question, consider the case in $d$ dimensions. Let’s try to “fit” the maximum number of mixture components into this space such that our requirements are met. It turns out that when you do this, you end up with a regular simplex (think equilateral triangle) in the $(d+1)$-dimensional statistical manifold of the base location-scale distribution. Wait, back up. First we have to define what a statistical manifold is, informally. A statistical manifold is the continuous space of valid parameters for distributions in a family, each point corresponding to a distribution. In the case of isotropic Gaussians (or any location-scale distribution in $d$ dimensions), the statistical manifold has $d$ dimensions for the location and $1$ dimension for the scale parameter. A good resource to understand Information Geometry and study this in greater depth is Frank Nielsen’s intro to Information Geometry [1]. An important thing to note is that under the Fisher-Rao metric, the statistical manifold of a location-scale family is hyperbolic: for $d=1$ it is equivalent to the half-plane model of hyperbolic geometry, and in general to the $(d+1)$-dimensional half-space model. That means the regular simplex from standard Euclidean space has an analogue with the same number of vertices, albeit in a negatively curved space. As a result, the maximum number of points such that each is equidistant to all the others is $d+2$, and the remaining parameter of interest is simply how large that common distance is. In the limiting case where the distance shrinks to zero and all points coincide, it degenerates into no longer being a mixture distribution.
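To make the half-plane picture concrete, here is a small numerical sketch (the helper function, its name, and the particular $\sqrt{2d}$ scaling convention are mine rather than anything from the post): the closed form follows from embedding $\mathcal{N}(\mu, \sigma^2 I)$ at the point $(\mu/\sqrt{2d}, \sigma)$ of a rescaled hyperbolic upper half-space.

```python
import numpy as np


def fisher_rao_isotropic(mu1, sigma1, mu2, sigma2):
    """Fisher-Rao distance between N(mu1, sigma1^2 I) and N(mu2, sigma2^2 I) in R^d.

    Identifies the (d+1)-dimensional statistical manifold with a rescaled hyperbolic
    upper half-space: embed each Gaussian at (mu / sqrt(2d), sigma) and multiply the
    standard half-space distance by sqrt(2d).
    """
    mu1 = np.atleast_1d(np.asarray(mu1, dtype=float))
    mu2 = np.atleast_1d(np.asarray(mu2, dtype=float))
    d = mu1.shape[0]
    u1, u2 = mu1 / np.sqrt(2 * d), mu2 / np.sqrt(2 * d)
    delta = np.sum((u1 - u2) ** 2) + (sigma1 - sigma2) ** 2
    return np.sqrt(2 * d) * np.arccosh(1.0 + delta / (2.0 * sigma1 * sigma2))


# The distance is unchanged by translating and scaling all locations and scales
# together -- the invariance used above to define the family only up to
# simultaneous translation and scaling.
a, b = 3.0, 5.0  # an arbitrary common scale and shift
print(fisher_rao_isotropic([0.0], 1.0, [2.0], 0.5))
print(fisher_rao_isotropic([b + a * 0.0], a * 1.0, [b + a * 2.0], a * 0.5))  # same value
```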
As a result, the answer to the first question is: the family of distributions looks like a (rotated, scaled) regular simplex in the hyperbolic statistical manifold attached to a specified data dimension $d$. A natural parameterization is therefore to specify $d$ alongside a member of the group of rotations/scalings of that statistical manifold. Scalings remain simply parameterized by a scalar, whereas rotations are members of the rotation group of the $(d+1)$-dimensional hyperbolic space. If we want a quick and dirty parameterization of a subset that does not have the flexibility of the more general one but is easy to intuit, imagine placing the location of one mixture component at the point $\left[ c, -c/d, -c/d, \dots, -c/d \right]$. Do the same for each dimension, so that each of the $d$ resulting components has exactly one coordinate equal to $c$ and the remaining $d-1$ coordinates equal to $-c/d$. They will all be equidistant from each other regardless of the scale parameter you give them, so long as it is the same for all. Then place a distribution centered at zero with either a higher or lower scale than the others, chosen so that its distance to each of them equals the common pairwise distance. This construction can then be scaled while maintaining this property. The overall mean of the mixture is approximately zero (it can be recentered exactly without changing any pairwise distances), and its variance depends on the scales chosen. This configuration can be thought of as a canonical orientation, so any member of the family can be obtained by rotating it in hyperbolic space.
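As a sanity check on this construction, the sketch below (my own illustration, reusing the `fisher_rao_isotropic` helper from the previous sketch and assuming SciPy is available for the root find) builds the $d$ shifted components plus the zero-centered one for $d = 3$, solves for the center component’s scale, and prints the pairwise distance matrix, which should be constant off the diagonal.

```python
import numpy as np
from scipy.optimize import brentq


def fisher_rao_isotropic(mu1, sigma1, mu2, sigma2):
    """Fisher-Rao distance between isotropic Gaussians N(mu, sigma^2 I) in R^d."""
    mu1, mu2 = np.asarray(mu1, dtype=float), np.asarray(mu2, dtype=float)
    d = mu1.shape[0]
    delta = np.sum((mu1 - mu2) ** 2) / (2 * d) + (sigma1 - sigma2) ** 2
    return np.sqrt(2 * d) * np.arccosh(1.0 + delta / (2.0 * sigma1 * sigma2))


d, c = 3, 2.0

# d "shifted" components: coordinate i equals c, every other coordinate equals -c/d,
# all with a common scale of 1.
locations = [np.full(d, -c / d) for _ in range(d)]
for i in range(d):
    locations[i][i] = c
scales = [1.0] * d

# The pairwise distance between shifted components is identical for every pair by symmetry.
target = fisher_rao_isotropic(locations[0], 1.0, locations[1], 1.0)

# Solve for the scale of the zero-centered component so that its distance to each
# shifted component matches `target`. A root with scale > 1 lies in this bracket;
# a second root with scale < 1 exists as well ("either a higher or lower scale").
center = np.zeros(d)


def gap(s):
    return fisher_rao_isotropic(center, s, locations[0], 1.0) - target


s_center = brentq(gap, 1.0, 100.0)
locations.append(center)
scales.append(s_center)

# All pairwise Fisher-Rao distances should now agree.
K = len(locations)
D = np.array([[fisher_rao_isotropic(locations[i], scales[i], locations[j], scales[j])
               for j in range(K)] for i in range(K)])
print("center scale:", round(s_center, 4))
print(np.round(D, 4))
```

Note that this canonical configuration has $d+1$ components (the $d$ shifted ones plus the center), which is fine for the “subset” parameterization even though the maximal equidistant configuration from the previous paragraph has $d+2$ points.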
We can take a moment to refer back to the visualization at the top of the page. The wavelike motion of the three Gaussian mixture components is generated by considering $d=1$. The univariate Gaussian then leads to an equilateral triangle in hyperbolic space. We rotate that triangle, which keeps the pairwise distances between components the same, and plot the resulting mixture distribution on the 1d domain.
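The sketch below is a guess at how such a figure could be produced, not the code behind the actual animation: it places three points $120^\circ$ apart around the point $i$ in the hyperbolic half-plane using elliptic Möbius transformations (the rotations of that model), maps them back to $(\mu, \sigma)$ pairs, checks that the pairwise Fisher-Rao distances stay constant as the configuration rotates, and evaluates the equal-weight mixture density that each frame would plot.

```python
import numpy as np

SQRT2 = np.sqrt(2.0)


def rotate_about_i(z, t):
    """Elliptic Mobius transform fixing i; rotates the hyperbolic half-plane about i by angle 2t."""
    return (z * np.cos(t) - np.sin(t)) / (z * np.sin(t) + np.cos(t))


def to_gaussian(z):
    """Half-plane point x + iy  ->  (mu, sigma) for N(mu, sigma^2), via mu = sqrt(2) x, sigma = y."""
    return SQRT2 * z.real, z.imag


def fisher_rao_1d(m1, s1, m2, s2):
    """Closed-form Fisher-Rao distance between univariate Gaussians."""
    delta = (m1 - m2) ** 2 / 2.0 + (s1 - s2) ** 2
    return SQRT2 * np.arccosh(1.0 + delta / (2.0 * s1 * s2))


rho = 1.0                # common hyperbolic distance of each component from the center point i
base = 1j * np.exp(rho)  # a point at hyperbolic distance rho straight "above" i
xs = np.linspace(-6.0, 6.0, 400)

for phi in np.linspace(0.0, np.pi / 3, 5):  # a few animation frames
    # Three points 120 degrees apart: the Mobius parameter t rotates by 2t,
    # so offsets of pi/3 space the components by 2*pi/3.
    zs = [rotate_about_i(base, phi + k * np.pi / 3) for k in range(3)]
    comps = [to_gaussian(z) for z in zs]

    # Rotations are isometries, so the components stay mutually equidistant.
    dists = [fisher_rao_1d(*comps[i], *comps[j]) for i, j in [(0, 1), (0, 2), (1, 2)]]

    # Equal-weight mixture density on a 1d grid -- this is what each frame would plot.
    pdf = np.mean([np.exp(-0.5 * ((xs - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))
                   for m, s in comps], axis=0)

    print(f"phi={phi:.2f}  pairwise FR distances={np.round(dists, 4)}  peak density={pdf.max():.3f}")
```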
[1] F. Nielsen, An elementary introduction to information geometry