Grammar VAE [1] is an approach to string generation that leverages grammatical rules to produce only syntactically valid strings; such rules can be written down whenever the underlying language is context-free. The most interesting example from the paper, and the one that has been built upon [2], is de-novo drug design, where the result is syntactically valid SMILES strings representing molecules. However, if you attempt to use the grammar the authors used to parse SMILES strings that aren't in their dataset, you will find it is not fit for purpose: it was designed to handle the relatively small subset of ZINC that the authors trained on. Starting from the Balsa [3] grammar and the OpenSMILES spec [4], I have produced a new grammar that is able to parse all SMILES strings in BindingDB [5] after conversion to canonical form by RDKit [6]. Note that some of these strings have features that are discarded by RDKit, and these features may be missing from this grammar (e.g. " |r|" appeared at the end of some of them for some reason). Here is a link to a tiny github repo that has the most up-to-date version of the file.

Philosophically, I did not follow the approach others have taken in constructing this grammar. Other SMILES grammars online cannot parse everything this grammar can because they try to model molecules directly, with explicit constraints on valence and other chemistry baked in. The grammar is the wrong layer of abstraction for these constraints. The grammar should ensure that strings that do not represent something recognizable in the language are rejected; constraints on what counts as a sensible input to the pipeline, in this case the laws of chemistry, should be imposed after the parse has completed. The grammar is about hierarchy and syntactic correctness in expressing the discrete structure being represented, so it is certain to be overly constrained if its role is broadened. This can be seen in the fact that hypervalent molecules appear in more than a couple of binding interaction experiments in BindingDB. A natural counter would be to broaden what is acceptable as counterexamples occur in the dataset, but that is not the approach I have taken.

A brief note on preprocessing. When retrieving canonical SMILES strings, I optimistically try to convert the original SMILES to molecule objects using RDKit with sanitize=True, and if that fails I try again with sanitize=False. This way I get sanitized canonical SMILES when available, but fall back to canonical SMILES that have, e.g., not been kekulized when sanitization fails. This maximizes niceness for downstream tasks while minimizing the number of molecules I have to exclude. The only reason this step is mentioned is that it happens to eliminate some non-standard SMILES from the language.
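As a sketch, the fallback logic might look like the following (assuming RDKit is installed; `canonical_smiles` is a name I'm introducing for illustration, not something from the repo):

```python
from rdkit import Chem

def canonical_smiles(smiles: str):
    """Canonicalize a SMILES string, preferring sanitized output."""
    mol = Chem.MolFromSmiles(smiles, sanitize=True)
    if mol is None:
        # Fall back: skip sanitization (e.g. when kekulization fails).
        mol = Chem.MolFromSmiles(smiles, sanitize=False)
    if mol is None:
        return None  # could not parse at all: exclude this molecule
    return Chem.MolToSmiles(mol)

print(canonical_smiles("C1=CC=CC=C1"))  # benzene -> "c1ccccc1"
```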

SELFIES [7] is regarded by many as the current SOTA for representing molecules as strings for the purpose of downstream processing. The reason is that it is impossible to generate a syntactically invalid SELFIES string from its alphabet, and all syntactically valid SELFIES strings represent molecules. This is achieved by reducing the description of self-reference to a wrapped offset, making every self-reference a valid description of a graph. This dramatically expands the number of tokens in the language, but comes with the advantage of the aforementioned properties.

The reason SMILES remains useful is as a branching-off point. Yes, it has problems with the binding problem, more informally known as the problem where you have to give things names in order to reference them; these problems lead to syntactically valid SMILES that don't describe real graphs. No, these are not actually the biggest problems when generating SMILES strings, because it is simple to always produce correct self-references by just dropping unreferenced ring-closure labels (the dangling half of a self-reference). The biggest advantage of SMILES over SELFIES is deferred branching. SELFIES does not use parentheses or other tokens that create constraints later in the string; it instead uses lengths for rings and bundles the attributes of atoms with the atoms themselves. SMILES, on the other hand, can use a context-free grammar with sane defaults. This reduces the number of branches that SMILES can go down, relative to SELFIES, at every token. In the future I may write more about the pros/cons of these representations, but this should be a big enough reason not to discard SMILES immediately.
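One way to realize that repair in code is the following hedged sketch (my own simplification, not from any paper): delete ring-closure digits that were opened but never closed. It deliberately ignores `%nn` labels and digits inside bracket atoms to stay short.

```python
def drop_dangling_ring_bonds(smiles: str) -> str:
    """Remove ring-bond digits that were opened but never closed.

    Simplified sketch: handles only single-digit labels and assumes no
    digits appear inside [...] bracket atoms or %nn labels.
    """
    open_pos = {}  # digit -> index in `out` where the bond was opened
    out = []
    for ch in smiles:
        if ch.isdigit():
            if ch in open_pos:
                del open_pos[ch]          # this digit closes its bond
            else:
                open_pos[ch] = len(out)   # this digit opens a new bond
        out.append(ch)
    # Delete still-open digits from right to left so indices stay valid.
    for i in sorted(open_pos.values(), reverse=True):
        del out[i]
    return "".join(out)

print(drop_dangling_ring_bonds("C1CC"))      # -> "CCC"
print(drop_dangling_ring_bonds("C1CCCCC1"))  # -> "C1CCCCC1" (unchanged)
```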

To make this post complete I will also render the grammar as code, so you don't have to jump off this page just to read it. The grammar is written in Lark format (essentially EBNF).

```
smiles: sequence+
sequence: atom ( union | branch | gap )*
union: bond? ( bridge | sequence )
branch: "(" ( dot | bond )? sequence ")" ( union | branch )
gap: dot sequence
atom: star | shortcut | selection | bracket
bracket: "[" isotope? symbol parity? virtual_hydrogen? charge? "]"
isotope: nonzero digit? digit?
symbol: star | element | selection
virtual_hydrogen: "H" nonzero?
charge: ( "+" | "-" ) nonzero?
bridge: nonzero | "%" nonzero digit
parity: "@" "@"?
star: "*"
dot: "."
shortcut: "B" | "Br" | "C" | "Cl" | "N" | "O" | "P" | "S" | "F" | "I" | "Sc" | "Sn"
selection: "b" | "c" | "n" | "o" | "p" | "s" | "se" | "as" | "te"
element: "Ac" | "Ag" | "Al" | "Am" | "Ar" | "As" | "At" | "Au"
| "B" | "Ba" | "Be" | "Bi" | "Bk" | "Br"
| "C" | "Ca" | "Cd" | "Ce" | "Cf" | "Cl" | "Cm" | "Co"
| "Cr" | "Cs" | "Cu"
| "Dy"
| "Er" | "Es" | "Eu"
| "F" | "Fe" | "Fm" | "Fr"
| "Ga" | "Gd" | "Ge"
| "H" | "He" | "Hf" | "Hg" | "Ho"
| "I" | "In" | "Ir"
| "K" | "Kr"
| "La" | "Li" | "Lr" | "Lu"
| "Mg" | "Mn" | "Mo"
| "N" | "Na" | "Nb" | "Nd" | "Ne" | "Ni" | "No" | "Np"
| "O" | "Os"
| "P" | "Pa" | "Pb" | "Pd" | "Pm" | "Po" | "Pr" | "Pt" | "Pu"
| "Ra" | "Rb" | "Re" | "Rf" | "Rh" | "Rn" | "Ru"
| "S" | "Sb" | "Sc" | "Se" | "Si" | "Sm" | "Sn" | "Sr"
| "Ta" | "Tb" | "Tc" | "Te" | "Th" | "Ti" | "Tl" | "Tm"
| "U" | "V" | "W" | "Xe" | "Y" | "Yb"
| "Zn" | "Zr"
bond: "-" | "=" | "#" | "/" | "\\" | ":"
digit: "0" | nonzero
nonzero: "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
```

The way Grammar VAE uses this grammar is by taking applications of grammar rules to be tokens themselves. In this way the parse tree can be produced node-by-node, and at every step invalid next nodes can be masked out by simply masking the corresponding tokens. An advantage of this approach is that no probability mass goes to producing syntactically invalid strings. However, in practice I have found this approach does not provide better ELBO bounds than attempting to generate arbitrary character strings. This is because the Grammar VAE sequences are significantly longer, so while they only output "useful" strings, they have many more places where things can "go wrong". That is probably still a good tradeoff to make. There is, however, an approach that gets the best of both worlds, but that is for the next post in this series.
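As a toy sketch of the masking mechanism (with a made-up five-production grammar, not the SMILES one above): keep a stack of non-terminals, and at each step zero out the probability of any production whose left-hand side is not the top of the stack.

```python
import numpy as np

# Toy grammar: production id -> (LHS non-terminal, RHS non-terminals).
# Terminals are omitted; only the masking logic is being illustrated.
PRODUCTIONS = {
    0: ("smiles", ["sequence"]),
    1: ("sequence", ["atom"]),
    2: ("sequence", ["atom", "sequence"]),
    3: ("atom", []),   # e.g. emits "C"
    4: ("atom", []),   # e.g. emits "N"
}

def mask_for(stack):
    """1.0 for productions whose LHS matches the stack top, else 0.0."""
    mask = np.zeros(len(PRODUCTIONS))
    for pid, (lhs, _) in PRODUCTIONS.items():
        if stack and lhs == stack[-1]:
            mask[pid] = 1.0
    return mask

def constrained_sample(logits_seq, seed=0):
    rng = np.random.default_rng(seed)
    stack, chosen = ["smiles"], []
    for logits in logits_seq:
        if not stack:
            break                          # parse tree is complete
        p = np.exp(logits) * mask_for(stack)
        p /= p.sum()                       # no mass on invalid productions
        pid = int(rng.choice(len(p), p=p))
        chosen.append(pid)
        stack.pop()
        stack.extend(reversed(PRODUCTIONS[pid][1]))  # expand the RHS
    return chosen

rules = constrained_sample([np.zeros(5)] * 12)
print(rules)  # always starts with production 0, then only legal expansions
```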

[1] M. Kusner et al., Grammar Variational Autoencoder

[2] W. Jin et al., Junction Tree Variational Autoencoder for Molecular Graph Generation

[3] R. L. Apodaca, Balsa: A Compact Line Notation Based on SMILES

[4] C. James et al., OpenSMILES

[5] BindingDB

[6] RDKit

[7] M. Krenn et al., Self-Referencing Embedded Strings (SELFIES): A 100% Robust Molecular String Representation

The Polyak stepsize (not to be confused with Polyak's acceleration) is a principled method for updating the step size in convex problems. In its default form it does not require a line search, though it can serve as an initial guess in one. The method requires that the user know the optimal value $f^* = f(x^*)$, but not $x^*$, for the convex minimization problem a priori. Using this information, a stepsize $\eta_t$ is chosen at each step $t$ from the objective $f_t$ evaluated at its vector argument $x_t$.

\begin{equation} \eta_t = \frac{f_t-f^*}{\lVert \nabla f_t \rVert^2} \end{equation}

This is a very old method [1]; however, it has gained recent attention in machine learning, particularly because it is often known a priori that the minimum possible loss will be $0$. This is called the interpolation regime, and why/when modern machine learning models, particularly neural networks, are able to perform well in it is an active subject of research. Given this knowledge, techniques like ALI-G [2] take this very simple stepsize and modify SGD to use a stepsize of $\min \left[ \alpha \eta_t, \gamma \right]$ for constants $\alpha < 1, \gamma$ that control how large the stepsize can get. It has been observed that this usually causes the stepsize to quickly rise and then gradually fall as the model is fit. The reason $\alpha, \gamma$ are needed is not only that we are sampling subsets of the data to compute gradients, but also that these problems are inherently nonconvex, which technically means these techniques no longer rigorously "work".

In this note, I will propose that there is more information available in these interpolation-regime cases, allowing an even better first-order stepsize method to be constructed. Specifically, I will show there is a method that can use the optimal value attained at the optimum for each datum separately (assuming it is available). The method follows from adding this information to the proof that the Polyak stepsize minimizes an inequality that comes directly from convexity. Assume the function being optimized is a sum of convex functions $\sum_{i=1}^{n} f_i$. We write $f^*_i$ to denote $f_i(x^*)$; note that there is only one $x^*$ here, the one that optimizes the overall sum, and this technique works whenever we know these values a priori, so it may apply in settings outside the interpolation regime. The easiest source to work from is probably [3], which has nice proofs of the stepsize as well as rates and a method to adaptively find $f^*$ when it *isn't* known a priori. Reproducing (3) from that paper, but with a new subscript $i$ where relevant, we have: (pardon the formatting, MathJax is acting up…)

\begin{equation}
d_{t+1}^2 = \lVert x_{t+1} - x^* \rVert^2 \nonumber
\end{equation}
\begin{equation}
= \lVert x_t - \sum_i \eta_{t,i} \nabla_{t,i} - x^* \rVert^2 \nonumber
\end{equation}
\begin{equation}
= d_t^2 - 2 \sum_i \eta_{t,i} \nabla_{t,i}^\top (x_t - x^*) + \sum_j \Big( \sum_i \eta_{t,i} \nabla_{t,i}^{(j)} \Big)^2 \nonumber
\end{equation}
\begin{equation}
\leq d_t^2 - 2 \sum_i \eta_{t,i} h_{t,i} + \sum_j \Big( \sum_i \eta_{t,i} \nabla_{t,i}^{(j)} \Big)^2 \label{basic}
\end{equation}

Here $h_{t,i} = f_i(x_t) - f^*_i$, and the last step follows from convexity of each $f_i$: $\nabla_{t,i}^\top (x_t - x^*) \geq f_i(x_t) - f_i(x^*)$. This results in a quadratic minimization problem in the stepsizes, whose solution is given by a system of linear equations when we set the gradient to zero. Putting those equations together, the solution has a familiar form:

\begin{equation} \eta_{t,*} = (\nabla_{t,*}\nabla_{t,*}^\top)^{-1}h_{t,*} \end{equation}
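Both updates can be sketched in NumPy (my own naming; the light regularization of the Gram matrix is an assumption I add for numerical safety near the optimum, not part of the derivation):

```python
import numpy as np

def polyak_step(x, f_val, f_star, grad, eps=1e-12):
    # Classical Polyak stepsize: eta = (f - f*) / ||grad||^2.
    eta = (f_val - f_star) / (grad @ grad + eps)
    return x - eta * grad

def multi_polyak_step(x, f_vals, f_stars, grads, eps=1e-12):
    # grads: (n, d) matrix whose rows are the gradients of the n summands.
    h = f_vals - f_stars                 # h_{t,i} = f_i(x_t) - f_i^*
    gram = grads @ grads.T               # (grad grad^T)_{ik} = <g_i, g_k>
    etas = np.linalg.solve(gram + eps * np.eye(len(h)), h)
    return x - etas @ grads              # x_{t+1} = x_t - sum_i eta_i g_i

# Sanity check on f(x) = ||x||^2 treated as a single summand (f* = 0).
x = np.array([4.0, 2.0])
for _ in range(60):
    x = multi_polyak_step(x, np.array([x @ x]), np.zeros(1), np.array([2 * x]))
print(np.linalg.norm(x))  # converges toward 0
```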

So, how well does it perform compared to the standard Polyak stepsize? I'll lead with the bad news: for problems that naturally compose into larger problems of the same class when summed (convex quadratic minimization, for example), it doesn't seem to work better than the Polyak stepsize. Trying it on a small problem in this class with $5$ functions and $40$ variables, both methods take 27 steps to reach an accuracy one might call converged ($10^{-6}$).

This method shines when we choose a funkier problem class (which is good, because neural networks are quite funky). I minimized the sum of $1$ quadratic and $4$ exponentiated quadratics (exp preserves convexity) and found that neither method converged. In the first $10$ steps, however, this stepsize achieves an objective value roughly 75% above the target (8.8k vs 5k), which it reduces slightly over the next 50 (7.7k); unfortunately, after that a numerics problem appears and the method wildly diverges, reaching a value >100k by 100 iterations. This is a problem, but compare it to how Polyak stepsizes do: in $10$ steps, Polyak has a loss of 44.8k; in $60$, 42.0k; and in 400, 17.6k. Not too shabby for our method.

Interestingly, the Polyak stepsize results in a distance in $x$-space that is closer to $x^*$ by a proportion of roughly 1/3 when compared to Multiple-Polyak (ours). That could be because Multiple-Polyak is better able to find curvature in the loss, which allows it to minimize the loss faster at the expense of slower descent in parameter space. If that is a problem for your use case, it's probably not worth using this method. It also means that either the bound is very loose in this case, so optimizing it is not actually producing gains, or numerics cause problems even earlier than the point at which the algorithm fails outright.

In conclusion, I've introduced an interesting new method for automatically determining the learning rate when the aim is to find a solution in the interpolation regime. In the next update I will show how this can be plugged into an adaptive method (a variant of Adam) and used to train neural networks.

[1] B. Polyak, Introduction to Optimization. Translations Series in Mathematics and Engineering.

[2] L. Berrada et al., Training Neural Networks for and by Interpolation

[3] E. Hazan and S. Kakade, Revisiting the Polyak Step Size

This image will make sense after you've read the post, but I'm putting it up top as a fun thing to look at in an otherwise quite dry post.

To set the stage, imagine you are using a VAE, GAN, or other latent variable model. It’s working great, but there’s a thought gnawing at the back of your head.

“What if this distribution and all of the interpolation assumptions I am using to train my models are forcing a kind of unimodality?”

So you might reach for a trainable mixture prior, randomly initialized, or maybe you have read some of the literature and go for a VampPrior. But now you have a new worry.

“My new prior is kind of a black box. I just wanted to express the fact that I expect a potentially multimodal distribution and I ended up with something so general that I’m worried the prior might be overfitting to the data now.”

A lot of papers refer to the amortized approximate posterior as the "optimal prior" for a VAE. This is a misnomer, except in a very limited sense. It is true that, given the set of approximate posteriors, the prior that is optimal for ELBO performance on the training data is the amortized posterior. However, this prior will overfit the training data whenever it is allowed to be dynamic, so long as the posterior is expressive enough and the training procedure optimal enough. The scenario is as follows: the approximate posterior converges to a Dirac delta for each data point such that no two overlap. The amortized approximate posterior is then a distribution that assigns mass $1/n$ to each data point, incurring an optimal cost of $\log(n)$ in the ELBO. We are now training a pure autoencoder with no regularization, and our generative model is just to sample a point from our training set and pass it through said autoencoder, which likely achieves a perfect fit (for any good decoder this should be true). With a more "suboptimal" prior, one that is fixed or not allowed maximum flexibility to track the approximate posterior, we cannot end up in this situation. In this way, a "suboptimal" prior is actually a requirement: the prior represents the assumed generative process, which should include data not in the training set.

Now that the above elaboration is over, let's frame the problem we're trying to solve. We want a family of distributions that is essentially the same as a basic Gaussian (or other location-scale family), but not necessarily unimodal. A natural choice is a mixture distribution where each component is equally likely, because there is no prior knowledge about the distribution of frequencies for the modes. We also probably want the underlying distributions we mix over to be the same distribution as in the unimodal case (for the sake of exposition, let's say Gaussian). The last condition is the trickiest to model. From the perspective of each mode, we do not want to assume any kind of structure in the relationship over the other modes. This is important because we want something that is a truly blank slate and does not bias us to consider certain modes as more central in the data and others as more exotic. To do this, we will use the Fisher-Rao distance from Information Geometry and require the distance from each mixture component to each other mixture component to be equal. The significance of this choice of distance is that it is the natural metric in Information Geometry, particularly where the exponential connection between distributions is concerned. To illustrate this naturalness, Chentsov's theorem states that it is the unique metric on a statistical manifold, up to rescaling, that is invariant under sufficient statistics for the distributions in question. Lastly, because the scale and location of the mixture will not impact the ultimate interpretation, we'd like to define the family up to simultaneous translation or scaling of the components therein.

What does this family of distributions look like? Is there a natural parametrization? Luckily, we can easily answer both questions. For the first, consider the case in $d$ dimensions, and let's try to "fit" the maximum number of mixture components into this space such that our requirements are met. It turns out that when you do this, you end up with a regular simplex (think equilateral triangle) in the $(d+1)$-dimensional statistical manifold of the base location-scale distribution. Wait, back up. First we have to define, informally, what a statistical manifold is: the continuous space of valid parameters for distributions in a family, each point corresponding to a distribution. In the case of isotropic Gaussians (or any location-scale distribution in $d$ dimensions), the statistical manifold has $d$ dimensions for the location and $1$ dimension for the scale parameter. A good resource for understanding Information Geometry and studying this in greater depth is Frank Nielsen's introduction to Information Geometry [1]. An important thing to note is that under the Fisher-Rao metric, the statistical manifold of a location-scale distribution is equivalent to the half-plane model of hyperbolic geometry. That means the simplex in standard Euclidean space has an analogue with the same number of points, albeit in a space with negative curvature. As a result, the maximum number of points such that each is equidistant from the others is $d+2$, and the parameter of interest is simply how large the distance is. In the limit case where all points coincide, it degenerates into no longer being a mixture distribution.

As a result, the answer to the first question is: the family of distributions looks like a (rotated, scaled) regular simplex in the hyperbolic half-plane model over a specified number of dimensions $d$. A natural parameterization is therefore to specify $d$ alongside a member of the group defining rotations/scalings in the statistical manifold. Scalings remain parameterized by a scalar, whereas rotations are members of the rotation group in hyperbolic space of dimension $d$. If we want a quick-and-dirty parameterization of a subset that lacks the flexibility of the general one but is easy to intuit, imagine placing the location of one mixture component at the point $\left[ c, -c/d, -c/d, …, -c/d \right]$. Then do the same for each dimension, giving each dimension exactly $1$ location with a positive coordinate and $d$ with coordinate $-c/d$. These components will all be equidistant from each other regardless of the scale parameter you give them, so long as it is the same for all. Then place a distribution centered at zero with either a higher or lower scale than the others, such that its distance to the other distributions equals the distance between all pairs of them. This construction can then be scaled so that the property is maintained. The overall mean of the mixture is zero, and the variance depends on the scale chosen. This distribution can be thought of as a canonical orientation, so any distribution in the family can be obtained by rotating it in hyperbolic space.
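To make this concrete for $d=1$: the Fisher-Rao distance between univariate Gaussians has a closed form via the hyperbolic half-plane (with $\mu$ rescaled by $1/\sqrt{2}$), and we can check numerically that two components at $\pm 1$ with unit scale, plus a third component centered at zero with a suitably chosen scale, are pairwise equidistant. The specific numbers below are my own worked example, not from the post:

```python
import numpy as np

def fisher_rao_gaussian(mu1, s1, mu2, s2):
    """Fisher-Rao distance between N(mu1, s1^2) and N(mu2, s2^2).

    The Fisher metric ds^2 = (dmu^2 + 2 dsigma^2) / sigma^2 becomes the
    hyperbolic half-plane metric (times sqrt(2)) after mu -> mu/sqrt(2).
    """
    du2 = (mu1 - mu2) ** 2 / 2.0
    return np.sqrt(2) * np.arccosh(1 + (du2 + (s1 - s2) ** 2) / (2 * s1 * s2))

# Components at (+1, 1) and (-1, 1); for the centered third component the
# equidistance condition reduces to s0^2 - 4 s0 + 3/2 = 0. Taking the
# "wider" root (the "narrower" root also works, matching the higher-or-
# lower-scale choice in the construction above):
s0 = 2 + np.sqrt(10) / 2
d12 = fisher_rao_gaussian(1.0, 1.0, -1.0, 1.0)
d13 = fisher_rao_gaussian(1.0, 1.0, 0.0, s0)
d23 = fisher_rao_gaussian(-1.0, 1.0, 0.0, s0)
print(d12, d13, d23)  # an "equilateral triangle" of Gaussians
```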

We can take a moment to refer back to the visualization at the top of the page. The wavelike motion of the three Gaussian mixture components is generated by considering $d=1$. The univariate Gaussian then leads to an equilateral triangle in hyperbolic space. We rotate that triangle and keep the distance between components the same, plotting the resulting mixture distribution in the 1d domain.

[1] F. Nielsen, An elementary introduction to information geometry

Here is a link to the page describing the project from HIAS: link, ctrl+f for "RUTH" to see how my work fits into the bigger picture of Welcome Circles.

For now I will direct you to the ArXiv entry for the paper, in the future this page will have a more detailed explanation: link.

These thoughts are slanted towards VAEs, as those are the models with which I have the most experience. The first post is best viewed as a highly-opinionated refresher on what we are trying to do, why, and common misconceptions. Feel free to skip to the misconceptions, as the rest of the post treads old ground and there are likely a hundred thousand summaries just like it.

What are generative models? They are parametric statistical models mapping from an input space (with points sampled from an assumed "prior" distribution) to an output space, sometimes stochastically, sometimes implicitly. To give an example, this might look like drawing a sample from a $p$-dimensional isotropic Normal distribution and passing it through a neural network that maps to a $k$-dimensional image in pixel space. Or it could be that we sample from a mixture distribution with learned parameters, map to logits defining a categorical distribution for each token in a sequence, and then sample from those to arrive at the final samples from our model.
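The first example can be sketched in a few lines (an untrained, randomly initialized network, purely to fix ideas; the dimensions are arbitrary):

```python
import numpy as np

# Minimal "generative model" in the sense described: sample from an
# isotropic Gaussian prior and push it through a (random, untrained)
# two-layer network to get points in a k-dimensional "pixel" space.
rng = np.random.default_rng(0)
p, hidden, k = 8, 32, 64          # latent dim, hidden width, output dim
W1 = rng.standard_normal((p, hidden)) / np.sqrt(p)
W2 = rng.standard_normal((hidden, k)) / np.sqrt(hidden)

def generate(n):
    z = rng.standard_normal((n, p))   # z ~ N(0, I_p), the prior
    return np.tanh(z @ W1) @ W2       # deterministic decoder

samples = generate(5)
print(samples.shape)  # (5, 64)
```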

There are multiple explanations of why this is a significant/important approach when modeling data. From one perspective, generative models are just compression algorithms. This perspective used to not be so popular, but since the advent of LLMs it is gaining traction. From another, generative models are defined by a functional use case - if successful, they allow you to quickly gather a massive number of typical (for some definition of typical) candidates from the set that your training data comes from. Lastly, generative models can be thought of as a way to arrive at some lower-dimensional manifold that is easier to interpret.

Why might we choose one generative modeling framework or lens over another? I will briefly discuss some pros and cons. Directly optimizing negative loglikelihood (NLL) is not usually available outside of very simple models that are more in the domain of classical statistics.

For VAEs, which maximize a loglikelihood lower bound using a stochastic encoder/decoder, multiple problems arise naturally. One is posterior collapse, where the encoder converges to the same distribution, equal to the prior, for all samples. Another, not discussed in the literature, is mixture posterior collapse, which is the same concept except that the collapse occurs to a mode of a mixture prior. A third, also not discussed as much as it should be, is underfitting caused by high-variance gradients from stochasticity.

GANs, another type of generative model, have ridden a wave of popularity and are right now comparatively unpopular. Optimizing GANs involves solving a bi-level optimization problem where generator and discriminator functions are optimized to beat each other at the "game" of discerning training data from generated data. They have a problem known as mode collapse (or, more generally, dropping modes). This is because they optimize symmetric divergences, which essentially treat the training data as another side of the same coin as generated data. So, in the process of optimization, entire modes of the training data may end up not being generated if the discriminator is more successful at discerning them.

EBMs define an implicit distribution by mapping inputs to real numbers that correspond to log probabilities up to a constant (and knowing that constant is in general intractable outside the simple models where direct NLL optimization is possible). EBMs have no easy sampler the way the more direct models do. Additionally, because they are so flexible, EBMs do not really define a model class in the same way: some literature trains EBMs similarly to GANs, some extracts conditional EBMs from classifiers, and yet more trains EBMs by using whatever sampling method is at hand and "pushing" down/up the energy of real/fake samples. I will not focus much on EBMs, as they are the model type I have the least experience with.

There are many other kinds of models, and I have left out issues with Diffusion Models and LLMs (autoregressive transformers) because those are probably deserving of their own post. I hope you have a high level idea of the kinds of topics that might be covered in this series.

- Common Misconception 1: VAEs generate “blurry” samples.

You may already know this, but VAEs generate the exact opposite of blurry samples. VAEs generate highly noisy samples for image data when you train them the standard way with the standard decoder parameterization $ p_\theta(x|z) \sim \mathcal{N}(\mu(z), \mathrm{diag}(\sigma^2(z))) $. This is because the model leaves a large amount of the loss in the reconstruction term; as a result, the variance is high, and a high variance leads to a lot of noise in the resulting image. Images with a lot of uncorrelated Gaussian noise in pixel space have an uncharacteristically large high-frequency component, because the noise applies evenly across all frequencies whereas most image information is low frequency. So why do people think VAEs generate blurry samples (which have uncharacteristically small high-frequency components)? Because people treat $\mu(z)$ as a sample from a VAE. VAEs are not trained to optimize the distribution of $\mu(z)$; they are trained to optimize the full reconstruction distribution, and that includes the noise term. Please stop discarding the noise and pretending the problem is blur. And if you're not generating image data, please do not evaluate the quality of the most likely sample (mode) from $p(x|z)$. Please draw a sample from $p(x|z)$ when evaluating samples.
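A quick numeric illustration of the point (synthetic "image" and a made-up variance level, just to show the frequency effect): the decoder mean is smooth, while an actual draw from $p(x|z)$ carries far more high-frequency energy.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.linspace(0.0, 1.0, 64 * 64).reshape(64, 64)  # smooth stand-in "image"
sigma = 0.2 * np.ones_like(mu)   # residual variance left in the decoder

mean_only = mu                                            # often shown as "the sample"
true_sample = mu + sigma * rng.standard_normal(mu.shape)  # a draw from p(x|z)

def high_freq_energy(img):
    """Mean absolute horizontal difference: a crude high-frequency proxy."""
    return np.abs(np.diff(img, axis=1)).mean()

print(high_freq_energy(mean_only), high_freq_energy(true_sample))
# the noisy draw dominates: noise is flat across all frequencies
```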

- Common Misconception 2: Posterior Collapse Occurs because X.

Right now there is no single explanation for why posterior collapse occurs that satisfactorily explains all observed cases. That is because there are actually a lot of things that cause posterior collapse. And as I will describe later, there are ways to reliably avoid posterior collapse and train VAEs to much higher fidelity than what most people are getting. Because I will devote an entire post to this, I will leave it at that for now.

- Common Misconception 3: Frechet X Distance is a perfect metric and we don’t need to report marginal likelihood.

If your generative model supports obtaining an ELBO or marginal likelihood estimate, you should report that at a minimum. It is the gold standard for models that support it and, unlike the Frechet distance between your dataset and your samples after both are fed through some arbitrary architecture, it objectively measures something (compression performance). There is no objective use for Frechet neural network distances; they don't even guarantee that the generated data will obtain similar classification accuracies, since Frechet distance on activations does not bound output distances. Report FID/FCD/FXD as it is a standard metric, but know that it is not as useful as more objective metrics that actually tell you something about what your model guarantees. It's also certainly less useful than contextualizing your model with a downstream task performance metric, although that can be tricky.

- Common Misconception 4: KL divergence is just another divergence and slotting in something else without justifying the decision to do so is valid.

KL divergence holds a unique place among divergences. I may later go over my top 4 divergences and why they're above the rest, but KL is truly the most important. To keep things concise, I'll just give a skeleton answer as to why you should not treat it like just any divergence. KL divergence is the only f-divergence that is also a Bregman divergence; that's true, but it doesn't mean much if you don't need qualities of both. Instead, the best justifications for KL divergence come from Sanov's theorem and likelihoodist thought. Sanov's theorem says, roughly, that KL divergence is the function of two distributions that governs how much an observed frequency distribution (sample) can differ, in the sense of large deviations, from the underlying distribution (population). Minimizing KL divergence also corresponds to maximizing likelihood, and therefore there are tons of proofs of nice properties, one of which is parameterization invariance. Again, this will be covered in more depth in a later post, but hopefully readers of this post will at least look up Sanov's theorem.

As to the need for justification, it’s generally bad practice to make a decision to modify something without an explanation as to how said decision might be helpful. This is common sense, but it is not always obvious and sometimes goes overlooked.

- Common Misconception 5: Flows, GANs, and VAEs are dead, we should all move on to only studying Diffusion Models and Autoregressive Transformers (LMs).

This is not a technical misconception, but I would just like to point out that: 1) Diffusion Models have a lot of overlap with VAEs and Flows. 2) At peak GAN hype, people were telling me that GANs were the be-all and end-all of generative modeling and that we should all abandon likelihood-based models. Basically, my point is to study what you want to study, ignore the sentiment of beginners, and don't be surprised if conventional wisdom is wrong about the limits of effectiveness of things. As long as you try to be objective about evaluating methods, you will learn things that are not common knowledge.

- Common Misconception 6: ReLU is a good activation for generative models.

None of the best generative models use default ReLU. The closest is leaky ReLU, which can be quite good. Many use GeLU, some use tanh, some ELU, etc. You can try ReLU, just don’t be surprised if it is inferior to other activations. This is more on the art side of things but usually if a paper is e.g. a VAE model using ReLU you can be sure that it isn’t very tuned.

- Common Misconception 7: If your generative model is biased, the problem is in the data or randomness and not your modeling.

Ok, I just can’t resist this one. GANs are known to drop modes. If you use GANs and find that you’ve dropped a mode for a minority group, you can’t claim that there was no way you could expect that to happen, or that all generative models would have the same problem, or even that more/cleaner data from the same source would necessarily fix it. This is a known problem with GANs so don’t use them if you’re doing something with social ramifications.

This isn't to say that GANs are always bad; they're just not always fit for purpose, and their weaknesses are glaring and well documented. It's harder to pick on people for blaming data/randomness for the failures of other model categories, because those failures are less well known.

- Common Misconception Last: This list is not going to be updated as I encounter new misconceptions.

Please check back or subscribe to my RSS feed for updates to this list. I hope you enjoyed it, and see you in the next part of the series wherein I get deep into the weeds on generalization in generative modeling.

]]>My Master’s Thesis project at WPI was the construction of a novel method for solving a particular class of Integer Polynomial Program, one that supports exact solutions for a range of inequality right-hand-side constraints (a range of budgets). This contrasts with prior approaches, which solve for only a fixed right hand side or only support Integer Linear Programs. I will take a step back shortly and explain what all of that means (or you can read the full thesis for a more in-depth look), but first the results. The method is enabled by the intelligent use of an R* tree data structure, in a construction I called a “Minimal R* Tree”. Using it, I was able to solve for thousands of right hand sides on Quadratic Knapsack Problems generated by a standard construction, in the time it would usually take to solve for, conservatively, 30-40. These problems are NP-hard and can take minutes or hours to solve exactly for even modest problem sizes.
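For concreteness, a representative member of this problem class is the 0-1 Quadratic Knapsack Problem (I am using the standard textbook formulation here, not necessarily the exact generator from the thesis). For a budget $b$ — the right hand side being swept — it reads:

```latex
\begin{aligned}
\max_{x \in \{0,1\}^n} \quad & \sum_{i=1}^{n} \sum_{j=1}^{n} q_{ij}\, x_i x_j \\
\text{s.t.} \quad & \sum_{i=1}^{n} w_i x_i \le b
\end{aligned}
```

Solving “for a range of right hand sides” means solving this program exactly for many values of $b$, which is the part the method accelerates.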

This post will be updated when I get the time to fill in the details and, ideally, include additional improvements I have made post-thesis. Unfortunately it is quite an in-depth topic (my thesis is over 60 pages, and somewhat information-dense), so my initial attempt at describing it was short-lived.

]]>In 2017, I completed an undergraduate thesis-equivalent project focused on a problem in medical imaging. At the time, automated diagnosis of diseases directly from imaging data using deep learning was a relatively new innovation. In particular, there was a significant confidence gap among doctors about the effectiveness of these methods, because there was no clear understanding of how the deep learning classifiers arrived at their conclusions. This gap is why deep learning classifiers are typically referred to as “black boxes”. My thesis addressed this gap on a specific task by both diagnosing brain diseases and at the same time providing explanatory visualizations. The specific task was classifying fMRI brain images of youths, from a UMass Medical dataset, into brain-disease categories. After developing the neural network architectures, we built the visualization component using what were at the time state-of-the-art techniques in neural network visualization.

My team (Miya Gaskell, Ezra Davis, and myself) worked under the supervision of Prof. Xiangnan Kong. The project was the first experience with neural networks, convolutional layers, and visualization for all three members of the team. As part of our deliverables, we produced a thesis paper and a poster, which we presented to the WPI faculty on project presentation day.

We were fortunate to be able to work with UMass Medical on this project through an ongoing collaboration Prof. Kong had with them. All of the data was provided to us by UMass Medical’s Center for Comparative Neuroimaging; the subjects were youths with brain diseases. The dataset of 88 patients was train/test split patient-wise, i.e., all of the fMRI frames from an individual patient were contained entirely in either the training or the testing set, in order to prevent data leakage. This matters because one of our architectures classified individual frames, which dramatically increased the effective size of the dataset; had we not prevented leakage this way, we would not have been able to properly evaluate our classifier’s effectiveness. Because this was a small but high-quality dataset, we acknowledged the need for more data for the study to truly take advantage of the power of a deep-learned classifier. Additionally, the lack of data makes it hard to extrapolate our results to the clinical setting. Nevertheless, our approach, including the visualization techniques, could be directly applied to any large fMRI classification dataset.
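As a minimal sketch (not our original code; names are hypothetical), a patient-wise split keeps every frame from a given patient entirely in train or entirely in test:

```python
import random

def patient_wise_split(frames, test_fraction=0.2, seed=0):
    """frames: list of (patient_id, frame_data) pairs.
    Splits by patient, never by individual frame, to avoid leakage."""
    patients = sorted({pid for pid, _ in frames})
    rng = random.Random(seed)
    rng.shuffle(patients)
    n_test = max(1, int(len(patients) * test_fraction))
    test_patients = set(patients[:n_test])
    train = [f for f in frames if f[0] not in test_patients]
    test = [f for f in frames if f[0] in test_patients]
    return train, test

# Toy usage: 4 patients, 3 frames each.
frames = [(pid, f"frame{t}")
          for pid in ("p1", "p2", "p3", "p4") for t in range(3)]
train, test = patient_wise_split(frames, test_fraction=0.25)
train_ids = {pid for pid, _ in train}
test_ids = {pid for pid, _ in test}
print(sorted(test_ids), train_ids & test_ids)  # patient sets are disjoint
```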

Although at this point neural networks are somewhat common knowledge, it is important to describe what we learned over the course of this project. To do that I will define the terms informally, using language from the report. Credit goes to Ezra for this specific explanation.

*Artificial neural networks* are one of many machine learning techniques primarily used for classifying data. Originally, the technique was inspired by biological neural networks. As a common example: given a dataset of many images of cats and dogs, you could use an artificial neural network to recognize and correctly identify whether a new photo is of a cat or a dog. For our purposes, we use artificial neural networks to help diagnose patients based on their brain scans. Neural networks form their predictions based on a training process – you show examples of both the input and output (e.g. patient A has disorder 1, patient B has disorder 2, etc.) again and again, and the network gradually gets better at predicting the output given the inputs. The training process is rarely perfect, but it can create a neural network that is quite good at predicting the output for a new input that’s not in the training set.

*Convolutional neural networks* are a specific type of neural network usually used for classifying images. Traditional neural networks, when given an image, consider each pixel’s location as a separate input, so if the image is shifted one pixel over, it becomes an entirely different image. Convolutional neural networks (CNNs), however, function differently. CNNs (like many modern traditional neural networks) are typically composed of several different layers, each of which does some small piece of the analysis, with the first layer performing simple analysis and feeding its results into the next layer. This is called deep learning. Each convolutional layer consists of a number of filters (typically 16 or 32) that look at a part of the image at a time (commonly 5×5 or 3×3 pixels; smaller areas compute faster). Each filter lights up under different conditions – for instance, one filter could detect edges in an image, whereas its neighboring filter detects circles. The outputs of these filters (each of which is roughly the size of the original image) are fed into the next convolutional layer (typically with simpler non-convolutional layers in between), which has filters that can detect more complex features (e.g. faces of cats versus faces of dogs).
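As a toy sketch of what a single filter computes – a dot product between a small kernel and each patch of the image as it slides across – in plain Python (this is illustrative only, not the networks from the project):

```python
def conv2d_valid(image, kernel):
    """'Valid' 2D convolution (really cross-correlation, as in most
    deep learning libraries) of a 2D list-of-lists image with a kernel."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for r in range(ih - kh + 1):
        row = []
        for c in range(iw - kw + 1):
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    acc += image[r + i][c + j] * kernel[i][j]
            row.append(acc)
        out.append(row)
    return out

# A vertical-edge detector on a tiny image whose left half is dark and
# right half is bright: the filter "lights up" only at the boundary.
image = [[0, 0, 1, 1]] * 4
edge_kernel = [[-1, 1], [-1, 1]]
response = conv2d_valid(image, edge_kernel)
print(response[0])  # peak activation in the middle column
```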

In our cats and dogs example the algorithm is ultimately trying to predict whether the photo contains a cat or a dog. In more traditional image analysis techniques, the programmer might explicitly set up a set of rules for differentiating pictures of cats from dogs, such as the shape of the ears or whether the animal is wearing a collar. In an artificial neural network, the computer decides on its own what features are important, without communicating these features back to the programmer. Artificial neural networks act like a black box – we can see the inputs and outputs, but the internal workings are hidden from view.

Convolutional neural networks are great at classifying images and other spatial data (such as fMRI brain scans), but it can be hard to understand how they really work. Because the training process is intentionally randomized and the “learning” process is automated, it can be hard to figure out exactly what each filter is trying to detect. There are a number of visualization techniques for exactly that purpose (such as those in Zeiler and Fergus, 2013, Visualizing and Understanding Convolutional Networks), but before our project, we did not find any for 3D data.

We ultimately settled on two architectures for our deep neural networks.

The first was a 3D CNN that we ran on each frame of the video (no notion of time, but excellent spatial perception); the final prediction was the mean of the per-frame probabilities across the whole time series. We developed a max-activation map and a guided saliency map video to enhance confidence in the accuracy of the method. Because the guided saliency map was based on a per-frame classification with no constraint that the classes be temporally consistent, we were surprised to find that the video did not have much flickering or other temporal artifacts; it usually resembled small regions of salient activation moving smoothly around the brain volume.
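The per-frame aggregation can be sketched as follows (a simplified stand-in for the pipeline, with made-up logits; softmax each frame independently, average the probabilities over time, take the argmax):

```python
import math

def softmax(logits):
    # Numerically stable softmax over one frame's class logits.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def classify_series(per_frame_logits):
    """Mean of per-frame class probabilities, then argmax."""
    probs = [softmax(l) for l in per_frame_logits]
    n_classes = len(probs[0])
    mean_probs = [sum(p[c] for p in probs) / len(probs)
                  for c in range(n_classes)]
    return mean_probs.index(max(mean_probs)), mean_probs

# Three frames, three disease classes: two frames favor class 0.
logits = [[2.0, 0.5, 0.1], [1.5, 1.0, 0.2], [0.1, 2.0, 0.3]]
pred, mean_probs = classify_series(logits)
print(pred)
```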

The second was a 1D CNN that operated on mean activations over expert-chosen brain regions. This had the advantage of explicitly incorporating spatial interactions between regions as well as temporal patterns, but the disadvantage of no within-region resolution. Ultimately it performed worse than the 3D CNN in pure accuracy. The visualization for this architecture took advantage of the region mapping and produced a graph of key activation-interactions, obtained by thresholding a saliency map, overlaid on the brain itself.

Expanding upon the brief summaries of what we delivered above, I’d like to show some concrete examples.

First, our max-patch activation map. Each filter in our convolutional layers computes a linear function of its input. For each image patch (across all input channels) to which it is applied, the filter produces a single output activation, whose strength is determined by how aligned the filter is with its input. Given a specific layer in our neural network, our max-patch visualization shows the location in the image that produces the highest activation for each feature map at that layer. Applied across multiple samples, as early work on neural network visualization such as Chris Olah’s has shown, this can give a feel for what a unit finds qualitatively important, in the sense of e.g. “this is a neuron that looks for sharp edges” or “this is a neuron that looks for thin squiggles”. While it may be overly reductionist now, a common viewpoint is that earlier layers in the network detect features that are strictly less abstract (lower level) than the later layers that build upon their representations.

Shown above is an example of max-patch on image data from a digit classifier (MNIST).

Because a single patch is not sufficient to reason about a single sample, we chose instead to look at the top 10 patches, to aid confidence that the network is zeroing in on salient parts of an fMRI frame. The idea is that a doctor would compare these patches against the regions they personally consider potentially salient, to determine whether the network is focusing on a reasonable set of locations in the brain.
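A minimal sketch of the top-k max-patch lookup, assuming a single feature map’s activation grid and plain Python lists (hypothetical shapes; our real code worked on 3D volumes):

```python
def top_k_patches(activations, image, patch_size, k):
    """Return the k highest activations with their grid locations and the
    corresponding patch_size x patch_size windows of the input image."""
    scored = []
    for r, row in enumerate(activations):
        for c, a in enumerate(row):
            scored.append((a, r, c))
    scored.sort(reverse=True)  # highest activation first
    patches = []
    for a, r, c in scored[:k]:
        patch = [img_row[c:c + patch_size]
                 for img_row in image[r:r + patch_size]]
        patches.append((a, (r, c), patch))
    return patches

# Toy 3x3 activation grid over a 4x4 image.
activations = [[0.1, 0.9, 0.2],
               [0.8, 0.3, 0.0],
               [0.4, 0.5, 0.7]]
image = [[r * 4 + c for c in range(4)] for r in range(4)]
top = top_k_patches(activations, image, patch_size=2, k=3)
print([loc for _, loc, _ in top])
```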

Second, I will outline our guided saliency map. Saliency maps use the same machinery by which neural networks are typically trained to show which features in the input space would most quickly change the class assigned to an image. We back-propagate a unit gradient in only the class of interest to arrive at a map telling us which features we should change to most quickly increase the log-probability our network assigns to that class. We also back-propagate a leave-one-out map of all classes but that one and visualize it using the same technique. Because neural networks are very nonlinear, these maps can be nonsensical by the time they reach the image and are unlikely to be sparse. A quick trick is to threshold the gradient to be positive as we back-propagate toward the image. The idea exploits the fact that each ReLU thresholds its input to be positive: positive gradients do not have the same propagation issues that negative ones do, in that there is no upper bound on how much more positive an input representation could be, in contrast to negative gradients. Additionally, by “choosing a direction”, positive-only gradients sparsify the saliency maps.
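The positive-only thresholding at a single ReLU can be sketched like this (a simplified element-wise version, not our original implementation): standard ReLU backprop zeroes the gradient where the forward input was negative, and the guided variant additionally zeroes it where the gradient itself is negative.

```python
def relu_forward(xs):
    # Standard ReLU applied element-wise.
    return [max(0.0, x) for x in xs]

def guided_relu_backward(grads, forward_inputs):
    """Guided-backprop rule: keep a gradient entry only if both the
    forward input and the incoming gradient are positive."""
    return [g if (g > 0 and x > 0) else 0.0
            for g, x in zip(grads, forward_inputs)]

xs = [1.0, -2.0, 3.0, 0.5]
grads = [0.7, 0.4, -0.2, 0.9]
print(relu_forward(xs))
print(guided_relu_backward(grads, xs))
```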

After creating the saliency maps and guided-back-propagating them all the way to the input frame for each input in a sequence, the resulting sequence of maps can be turned into a video by choosing a view (in our case a standard cross-sectional diagnostic view) and rendering it.

Shown above is a single frame from one such video, with the saliency map highlighting the most salient region of the frame from our classifier’s perspective.

Lastly, for the time-series representations, our activation-interaction maps, displayed below, are useful for ascertaining which correlations between regions our classifier looks at. Because each region is a 1D time series, and our network is run on all time series at the same time, we can determine which correlations between the series are important by looking for large activations at a specific point in our architecture. Specifically, the architecture begins with isolated convolutional layers that apply to only one region at a time. After a certain point, we consider all pairs and produce n-choose-2 channels by combining the input channels using a weighted sum (with weights learned for every pair). By looking at the strength of the resulting activations and keeping only those outside a region around zero, important connections can be separated from connections with comparatively low-strength activations.
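A toy sketch of the all-pairs combination and thresholding steps (hypothetical region names and fixed weights standing in for learned ones):

```python
from itertools import combinations

def interaction_channels(region_series, weights):
    """region_series: {region: [values over time]};
    weights: {(a, b): (wa, wb)} per ordered pair.
    Produces one weighted-sum channel per region pair (n-choose-2)."""
    out = {}
    for a, b in combinations(sorted(region_series), 2):
        wa, wb = weights[(a, b)]
        out[(a, b)] = [wa * x + wb * y
                       for x, y in zip(region_series[a], region_series[b])]
    return out

def strong_pairs(channels, threshold):
    # Keep only pairs whose activation magnitude clears the threshold.
    return {pair for pair, series in channels.items()
            if max(abs(v) for v in series) > threshold}

series = {"r1": [1.0, 2.0], "r2": [0.1, 0.1], "r3": [-1.5, 1.5]}
weights = {("r1", "r2"): (0.2, 0.2),
           ("r1", "r3"): (1.0, 1.0),
           ("r2", "r3"): (0.1, 0.1)}
channels = interaction_channels(series, weights)
print(strong_pairs(channels, threshold=1.0))
```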

Ultimately, we were able to show via our held-out test set that our 3D approach classified the 4D fMRI data better than what was reported as the median accuracy obtained by doctors. This comes with a somewhat large caveat: we are comparing our brain-disease classification task against diagnosing all brain diseases, and our results are highly limited by the size of our dataset. Because all patients were from the same study, with the same setup and people involved, setup particularities were shared across classes, making it unlikely that we fit to the setup itself.

A more interesting result is that our guided backpropagation focused on a particular brain region, the parietal lobe, and on motifs within that region, which matches research from 2013 on predicting bipolar depression. This was done by three undergraduate students without prior medical training, and it made us excited about the potential of neural network visualization techniques for hypothesis generation in the medical field.

I want to thank my teammates Miya Gaskell and Ezra Davis, and our advisor Professor Kong, for all the hard work they put in to make the project work. Additionally, I want to thank Constance Moore, UMass Medical School’s Center for Comparative Neuroimaging, and the patients and medical staff for providing the brain scans that made this research possible.

]]>