I have learned a lot about generative modeling through my research at Ericsson Research and through a side project in which I used it to generate small molecules.

These thoughts are slanted towards VAEs, as those are the models with which I have the most experience. The first post is best viewed as a highly opinionated refresher on what we are trying to do, why, and common misconceptions. Feel free to skip to the misconceptions, as the rest of the post treads old ground and there are likely a hundred thousand summaries just like it.

What we are trying to do

What are generative models? They are parametric statistical models that map from an input space (with points sampled from an assumed “prior” distribution) to an output space, sometimes stochastically, sometimes implicitly. To give an example, this might look like drawing a sample from a p-dimensional isotropic Normal distribution and passing it through a neural network to map to a k-dimensional image in pixel space. Or it could be that we sample from a mixture distribution with learned parameters and map to logits defining a categorical distribution for each token in a sequence, which we then sample from to arrive at the final samples from our model.
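
As a minimal sketch of the first example, the snippet below draws latent points from an isotropic Normal prior and pushes them through a tiny network. Everything named here (`sample_images`, the layer sizes, the random untrained weights) is a hypothetical stand-in; a real model would learn the weights from data.

```python
import numpy as np

def sample_images(n, p=8, k=16, hidden=32, seed=0):
    """Draw n samples by pushing prior noise through a small untrained MLP."""
    rng = np.random.default_rng(seed)
    # Hypothetical decoder weights; a trained model would learn these.
    w1 = rng.normal(scale=p ** -0.5, size=(p, hidden))
    w2 = rng.normal(scale=hidden ** -0.5, size=(hidden, k))
    z = rng.standard_normal((n, p))        # z ~ N(0, I_p), the prior
    h = np.tanh(z @ w1)                    # hidden layer
    x = 1.0 / (1.0 + np.exp(-(h @ w2)))    # sigmoid -> pixel intensities in (0, 1)
    return z, x

z, x = sample_images(5)
print(z.shape, x.shape)
```

Here the map from z to x is deterministic; a stochastic decoder would instead treat the network output as the parameters of a distribution over x and sample from it.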


There are multiple explanations of why this is an important approach to modeling data. From one perspective, generative models are just compression algorithms. This perspective used to be less popular, but since the advent of LLMs it has been gaining traction. From another, generative models are defined by a functional use case: if successful, they allow you to quickly gather a massive number of typical (for some definition of typical) candidates from the set that your training data comes from. Lastly, generative models can be thought of as a way to arrive at some lower-dimensional manifold that is easier to interpret.

Training problems

Why might we choose one generative modeling framework or lens over another? I will briefly discuss some pros and cons. Directly optimizing negative loglikelihood (NLL) is usually not possible outside of very simple models that are more in the domain of classical statistics.

For VAEs, which maximize a loglikelihood lower bound by using a stochastic encoder/decoder, multiple problems arise naturally. One is posterior collapse, where the encoder converges to the same distribution for all samples, and that distribution equals the prior. Another, not discussed in the literature, is mixture posterior collapse, which is the same concept except that the collapse occurs to a mode of the prior mixture. A third, also not discussed as much as it should be, is underfitting caused by the high-variance gradients that come from stochasticity.
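
One rough way to spot posterior collapse, assuming a diagonal Gaussian encoder and a standard Normal prior, is to look at the KL term of the lower bound per latent dimension: a dimension whose KL is near zero for every input carries no information about the input. The function names and the threshold below are hypothetical choices for illustration, not a standard diagnostic.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """Per-dimension KL( N(mu, exp(log_var)) || N(0, 1) ), shape (batch, dims)."""
    return 0.5 * (mu ** 2 + np.exp(log_var) - 1.0 - log_var)

def collapsed_dims(mu, log_var, threshold=1e-2):
    """Indices of latent dimensions whose batch-averaged KL is below threshold."""
    mean_kl = kl_to_standard_normal(mu, log_var).mean(axis=0)
    return np.flatnonzero(mean_kl < threshold)

# Fake encoder outputs for a batch of 256 inputs and 4 latent dimensions.
rng = np.random.default_rng(0)
mu = rng.normal(size=(256, 4))
log_var = np.full((256, 4), -2.0)
mu[:, 2] = 0.0        # dimension 2 ignores the input...
log_var[:, 2] = 0.0   # ...and matches the prior exactly
print(collapsed_dims(mu, log_var))  # -> [2]
```

Full posterior collapse in the sense above corresponds to every dimension being flagged.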

GANs, another type of generative model, have ridden the wave of popularity and right now are comparatively unpopular. Optimizing GANs involves solving a bi-level optimization problem where the generator and discriminator functions are optimized to beat each other at the “game” of discerning training data from generated data. They have a problem known as mode collapse (or more generally dropping modes). This is because they must optimize symmetric divergences, and these essentially treat the training data as the other side of the same coin as the generated data. So in the process of optimization, entire modes of the training data may end up not being generated if the discriminator is more successful at discerning them.
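
For concreteness, here is the game and the divergence it induces in the original GAN formulation, which is one instance among many variants. The symbols $D$ (discriminator), $G$ (generator), and $p_G$ (the distribution of generated data) are notation introduced here for illustration.

```latex
\[
\min_G \max_D \; V(D, G)
  = \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p(z)}\big[\log\big(1 - D(G(z))\big)\big]
\]
% With the discriminator at its optimum,
% D^*(x) = p_data(x) / (p_data(x) + p_G(x)), the generator's objective becomes
\[
V(D^*, G) = 2\,\mathrm{JSD}\big(p_{\text{data}} \,\|\, p_G\big) - \log 4
\]
```

The Jensen-Shannon divergence is symmetric in its two arguments, which is the sense in which training data and generated data are treated as two sides of the same coin.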

EBMs define an implicit distribution by mapping to real numbers that correspond to log probabilities up to a constant (and computing that constant is in general intractable for the same models where direct NLL optimization is intractable). EBMs have no easy sampler the way that the more direct models do. Additionally, because they are so flexible, EBMs do not really define a model class in the same way. Some of the literature trains EBMs similarly to GANs, some extracts conditional EBMs from classifiers, and yet more trains EBMs by using whatever sampling method is available and “pushing” the energy of real samples down and the energy of fake samples up. I will not focus much on EBMs as they are the model type I have the least experience with.
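
Written out under the usual sign convention, where the energy $E_\theta(x)$ is the negative unnormalized log probability (the symbols $E_\theta$ and $Z(\theta)$ are notation introduced here), the model and its NLL gradient are roughly:

```latex
\[
p_\theta(x) = \frac{\exp(-E_\theta(x))}{Z(\theta)},
\qquad
Z(\theta) = \int \exp(-E_\theta(x)) \, dx
\]
\[
\nabla_\theta \big( -\log p_\theta(x) \big)
  = \nabla_\theta E_\theta(x)
  - \mathbb{E}_{x' \sim p_\theta}\big[ \nabla_\theta E_\theta(x') \big]
\]
```

The first term lowers the energy at a training point, and the second raises it at points sampled from the model. The second term is the reason a sampler is needed at all, and the integral $Z(\theta)$ is the constant that is intractable in general.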

There are many other kinds of models, and I have left out issues with Diffusion Models and LLMs (autoregressive transformers) because those probably deserve their own post. I hope you now have a high-level idea of the kinds of topics that might be covered in this series.


  • Common Misconception 1: VAEs generate “blurry” samples.

You may already know this, but VAEs generate the exact opposite of blurry samples. VAEs generate highly noisy samples for image data when you train them the standard way with the standard decoder parameterization $p_\theta(x|z) = \mathcal{N}(x; \mu(z), \mathrm{diag}(\sigma^2(z)))$. This is because the model leaves a large amount of the loss in the reconstruction term. As a result, the variance is high, and a high variance leads to a lot of noise in the resulting image. Images with a lot of uncorrelated Gaussian noise in pixel space have an uncharacteristically large high-frequency component because the noise is spread evenly across all frequencies, whereas most image information is low frequency.

So why do people think VAEs generate blurry samples (which have uncharacteristically small high-frequency components)? Because people treat $\mu(z)$ as a sample from a VAE. VAEs are not trained to optimize the distribution of $\mu(z)$; they are trained to optimize the full reconstruction distribution, and that includes the noise term. Please stop discarding the noise and pretending the problem is blur. And if you’re not generating image data, please do not evaluate the quality of the most likely sample (mode) from $p(x|z)$. Please draw a sample from $p(x|z)$ when evaluating samples.
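
The frequency argument is easy to check numerically. The sketch below uses a smooth synthetic image as a stand-in for $\mu(z)$ and an assumed decoder standard deviation of 0.5; the `high_freq_fraction` metric and its cutoff are hypothetical choices made only for this illustration.

```python
import numpy as np

def high_freq_fraction(img, cutoff=0.25):
    """Fraction of spectral energy above `cutoff` cycles per pixel."""
    power = np.abs(np.fft.fft2(img - img.mean())) ** 2
    fy = np.fft.fftfreq(img.shape[0])[:, None]
    fx = np.fft.fftfreq(img.shape[1])[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)
    return power[radius > cutoff].sum() / power.sum()

rng = np.random.default_rng(0)
n = 64
yy, xx = np.mgrid[0:n, 0:n] / n
# Stand-in for mu(z): a smooth image dominated by low frequencies.
mean_img = np.sin(2 * np.pi * xx) + np.cos(4 * np.pi * yy)
sigma = 0.5  # assumed decoder standard deviation
# An actual draw x ~ N(mu(z), sigma^2 I): the mean plus uncorrelated noise.
sample = mean_img + sigma * rng.standard_normal((n, n))

hf_mean = high_freq_fraction(mean_img)
hf_sample = high_freq_fraction(sample)
print(f"mean image: {hf_mean:.4f}, sample: {hf_sample:.4f}")
```

The mean image has essentially no energy above the cutoff, while the sample carries a substantial share there, so judging the mean instead of the sample hides exactly the failure the model exhibits.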

  • Common Misconception 2: Posterior Collapse Occurs because X.

Right now there is no single explanation for why posterior collapse occurs that satisfactorily explains all observed cases. That is because there are actually a lot of things that cause posterior collapse. And as I will describe later, there are ways to reliably avoid posterior collapse and train VAEs to much higher fidelity than what most people are getting. Because I will devote an entire post to this, I will leave it at that for now.

  • Common Misconception 3: Frechet X Distance is a perfect metric and we don’t need to report marginal likelihood.

If your generative model supports obtaining an ELBO or marginal likelihood estimate, you should report that at a minimum. It is the gold standard for models that support it and, unlike the Frechet distance between generated data and your dataset after both are fed through some arbitrary architecture, it objectively measures something (compression performance). There is no objective use for Frechet Neural Network Distances; they don’t even guarantee that the generated data will obtain similar classification accuracies. A Frechet distance on activations does not bound distances in output space. Report FID/FCD/FXD because it is a standard metric, but know that it is not as useful as more objective metrics that actually tell you something about what your model is guaranteeing. It’s also certainly less useful than contextualizing your model with a downstream task performance metric, although that can be tricky.
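
To make clear how little these metrics look at, here is a sketch of the Frechet distance between two Gaussians fitted to feature activations. The feature arrays are random placeholders; in FID and its relatives they would come from a pretrained network, which is the arbitrary part. The helper name is hypothetical, and the matrix square root is taken through eigenvalues to keep the example NumPy-only.

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    """Squared Frechet distance between Gaussians fitted to two feature sets."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((cov_a cov_b)^(1/2)) equals the sum of square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = np.sum((mu_a - mu_b) ** 2)
    return float(mean_term + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)

rng = np.random.default_rng(0)
feats = rng.normal(size=(500, 4))            # placeholder "activations"
print(frechet_distance(feats, feats))        # close to 0
print(frechet_distance(feats, feats + 3.0))  # close to 36 (4 dims * 3^2)
```

Only the mean and covariance of the activations enter the computation, so any two sets of samples that agree on those two moments in feature space score as identical.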

  • Common Misconception 4: KL divergence is just another divergence and slotting in something else without justifying the decision to do so is valid.

KL divergence holds a unique place among divergences. I may later go over my top 4 divergences and why they’re above the rest, but KL is truly the most important. To keep things concise, I’ll just give a skeleton answer as to why you should not treat it like just any divergence. It is true that KL divergence is the only f-divergence that’s also a Bregman divergence, but that doesn’t mean much if you don’t need qualities from both. Instead, the best justifications for KL divergence come from Sanov’s theorem and likelihoodist thought. Sanov’s theorem says roughly that KL divergence sets the exponential rate at which it becomes unlikely, as the sample grows, for an observed frequency distribution (sample) to deviate from the underlying distribution (population). Minimizing KL divergence from the data distribution to the model also corresponds to maximizing likelihood, and therefore there are tons of proofs of nice properties, one of which is that it’s parameterization-invariant. Again, this will be covered in more depth in a later post, but hopefully readers of this post will at least look up Sanov’s theorem.
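
The link to maximum likelihood takes only two lines. Here $p_{\text{data}}$ denotes the data distribution and $p_\theta$ the model, both notation introduced for this sketch.

```latex
\[
\mathrm{KL}\big(p_{\text{data}} \,\|\, p_\theta\big)
  = \mathbb{E}_{x \sim p_{\text{data}}}\big[\log p_{\text{data}}(x)\big]
  - \mathbb{E}_{x \sim p_{\text{data}}}\big[\log p_\theta(x)\big]
\]
% The first term does not depend on \theta, so
\[
\arg\min_\theta \, \mathrm{KL}\big(p_{\text{data}} \,\|\, p_\theta\big)
  = \arg\max_\theta \, \mathbb{E}_{x \sim p_{\text{data}}}\big[\log p_\theta(x)\big]
\]
```

Replacing the expectation with an average over the training set gives the usual maximum likelihood objective. Swapping in a different divergence breaks this correspondence, which is why the swap needs a justification.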

As to the need for justification, it’s generally bad practice to make a decision to modify something without an explanation as to how said decision might be helpful. This is common sense, but it is not always obvious and sometimes goes overlooked.

  • Common Misconception 5: Flows, GANs, and VAEs are dead, we should all move on to only studying Diffusion Models and Autoregressive Transformers (LMs).

This is not a technical misconception, but I would just like to point out that: 1) Diffusion Models have a lot of overlap with VAEs and Flows. 2) At peak GAN hype, people were telling me that GANs were the be-all and end-all of generative modeling and that we should all abandon likelihood-based models. Basically, my point is to study what you want to study, ignore the sentiment of beginners, and don’t be surprised if conventional wisdom is wrong about the limits of a method’s effectiveness. As long as you try to be objective about evaluating methods, you will learn things that are not common knowledge.

  • Common Misconception 6: ReLU is a good activation for generative models.

None of the best generative models use default ReLU. The closest is leaky ReLU, which can be quite good. Many use GELU, some use tanh, some ELU, etc. You can try ReLU, just don’t be surprised if it is inferior to other activations. This is more on the art side of things, but usually if a paper presents, e.g., a VAE model using ReLU, you can be sure that it isn’t well tuned.

  • Common Misconception 7: If your generative model is biased, the problem is in the data or randomness and not your modeling.

Ok, I just can’t resist this one. GANs are known to drop modes. If you use GANs and find that you’ve dropped a mode for a minority group, you can’t claim that there was no way you could expect that to happen, or that all generative models would have the same problem, or even that more/cleaner data from the same source would necessarily fix it. This is a known problem with GANs so don’t use them if you’re doing something with social ramifications.

This isn’t to say that GANs are always bad; they’re just not always fit for purpose, and their weaknesses are glaring enough that it is clear where they will fail. It’s harder to pick on people for blaming data/randomness for the failures of other model categories because those failures are less well known.

  • Common Misconception Last: This list is not going to be updated as I encounter new misconceptions.

Please check back or subscribe to my RSS feed for updates to this list. I hope you enjoyed it, and see you in the next part of the series wherein I get deep into the weeds on generalization in generative modeling.