Grammar VAE is an important paper in applying deep learning to de novo drug design. However, extending its approach to more varied datasets requires a new grammar.

Introduction

Grammar VAE [1] is an approach to generating strings that leverages grammatical rules to produce only syntactically valid output. Such rules can be written down whenever the underlying language is context-free. The most interesting example from the paper, and the one that has been built upon [2], is de novo drug design, where the result is syntactically valid SMILES strings representing molecules. However, if you try to use the authors' grammar to parse SMILES strings that aren't in their dataset, you will find it is not fit for purpose: it was designed to handle the relatively small subset of ZINC that they trained on. Starting from the Balsa [3] grammar and the OpenSMILES spec [4], I have produced a new grammar that is able to parse all SMILES strings in BindingDB [5] after conversion to canonical form by rdkit [6]. Note that some of these strings have features that rdkit discards, and those features may be missing from this grammar (e.g. “ |r|” appeared at the end of some of them for some reason). Here is a link to a tiny github repo that has the most up-to-date version of the file.

Philosophy

Philosophically, I did not follow the approach that others have taken in constructing this grammar. Other SMILES grammars online cannot parse everything this one can because they try to model molecules directly, with explicit constraints on valence and other chemistry baked in. The grammar is the wrong layer of abstraction for these constraints. The grammar's job is to reject strings that do not represent something recognizable in the language; constraints on what counts as a sensible input to the pipeline, in this case the laws of chemistry, should be imposed after the parse has completed. The grammar is about hierarchy and syntactic correctness in expressing the discrete structure being represented, so it is certain to be overly constrained if its role is broadened. This can be seen in the fact that hypervalent molecules appear in more than a couple of binding interaction experiments in BindingDB. A natural counter would be to broaden what the grammar accepts as counterexamples occur in the dataset, but that is not the approach I have taken.
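
To make the layering concrete, here is a minimal sketch of the two-stage pipeline this implies. It assumes the grammar from the Code section below is saved as smiles.lark; the filename and the accept function are mine, not part of any published pipeline.

from lark import Lark, UnexpectedInput
from rdkit import Chem

parser = Lark.open("smiles.lark", start="smiles")

def accept(smi: str) -> bool:
    # Stage 1: syntax. The grammar only decides membership in the language.
    try:
        parser.parse(smi)
    except UnexpectedInput:
        return False
    # Stage 2: chemistry. Valence and friends are checked after the parse;
    # a hypervalent string passes stage 1 but may fail here.
    return Chem.MolFromSmiles(smi) is not None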

A brief note on preprocessing. When retrieving canonical SMILES strings, I optimistically try to convert the original SMILES to molecule objects using rdkit with sanitize=True, and if that fails I try again with sanitize=False. This way I get sanitized canonical SMILES when available, but fall back to canonical SMILES that have, e.g., not been kekulized. This maximizes niceness for downstream tasks while minimizing the number of molecules I have to exclude. The only reason this step is mentioned is that it happens to eliminate some non-standard SMILES from the language.
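
In code, the fallback looks roughly like this (a sketch; the function name is mine):

from rdkit import Chem

def canonicalize(smi: str):
    # Optimistic path: full sanitization (kekulization, valence checks, ...).
    mol = Chem.MolFromSmiles(smi, sanitize=True)
    if mol is None:
        # Fallback: keep molecules that fail sanitization, e.g. hypervalent
        # or non-kekulizable ones, rather than excluding them outright.
        mol = Chem.MolFromSmiles(smi, sanitize=False)
    if mol is None:
        return None  # exclude strings rdkit cannot parse at all
    return Chem.MolToSmiles(mol)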

Why not just use SELFIES?

SELFIES [7] is regarded by many as the current SOTA for representing molecules as strings for the purpose of downstream processing. The reason is that it is impossible to generate a syntactically invalid SELFIES string from its alphabet, and every syntactically valid SELFIES string represents a molecule. This is because SELFIES reduces the description of self-reference to a wrapped offset, making all self-references valid descriptions of a graph. This dramatically expands the number of tokens in the language, but buys the aforementioned properties.
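
You can see the robustness property directly with the selfies package (the mangled token string below is illustrative):

import selfies as sf

# A round trip through SELFIES.
tokens = sf.encoder("c1ccccc1")   # benzene
print(tokens)                     # e.g. [C][=C][C][=C][C][=C][Ring1][=Branch1]

# Even a mangled token string decodes to *some* molecule rather than
# raising: out-of-range self-references are wrapped instead of rejected.
print(sf.decoder("[C][Ring1][C][Branch1][O]"))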

The reason SMILES is useful is as a branching-off point. Yes, it suffers from the binding problem, known more informally as the problem where you have to give things names in order to reference them. This leads to syntactically valid SMILES strings that do not describe real graphs. No, this is not actually the biggest problem when generating SMILES strings, because it is very simple to always produce correct self-references by just dropping unreferenced ring closures, as in the sketch below. The biggest advantage of SMILES over SELFIES is deferred branching. SELFIES does not use parentheses or other tokens that create constraints later in the string; it instead uses lengths for rings, and bundles the attributes of atoms with the atoms themselves. SMILES, on the other hand, can lean on a context-free grammar to supply sane defaults. This reduces the branching factor, relative to SELFIES, at every token. In the future I may write more about the pros and cons of these representations, but this should be a big enough reason not to discard SMILES immediately.
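
For instance, here is a naive repair pass of the kind I mean. This is my own helper, under simplifying assumptions: single-digit ring-closure labels outside brackets only, no %NN labels.

def drop_dangling_ring_closures(smiles: str) -> str:
    # A ring-closure digit that is opened but never closed is a dangling
    # self-reference; deleting it leaves a string that describes a real graph.
    keep = [True] * len(smiles)
    open_at = {}                       # label -> index of its unmatched opener
    in_bracket = False
    for i, ch in enumerate(smiles):
        if ch == "[":
            in_bracket = True
        elif ch == "]":
            in_bracket = False
        elif ch.isdigit() and not in_bracket:
            if ch in open_at:
                open_at.pop(ch)        # label closed, the pair stands
            else:
                open_at[ch] = i        # tentatively open
    for i in open_at.values():
        keep[i] = False                # drop openers that never closed
    return "".join(ch for ch, k in zip(smiles, keep) if k)

print(drop_dangling_ring_closures("C1CC2CC1"))  # -> C1CCCC1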

Code

To make this post complete, I will also render the grammar here so you don't have to jump off this page just to read it. The grammar is written in lark format (essentially EBNF).

smiles: sequence+
sequence: atom ( union | branch | gap )*
union: bond? ( bridge | sequence )
branch: "(" ( dot | bond )? sequence ")" ( union | branch )
gap: dot sequence
atom: star | shortcut | selection | bracket
bracket: "[" isotope? symbol parity? virtual_hydrogen? charge? "]"
isotope: nonzero digit? digit?
symbol: star | element | selection
virtual_hydrogen: "H" nonzero?       // bracket-atom hydrogen count
charge: ( "+" | "-" ) nonzero?       // formal charge
bridge: nonzero | "%" nonzero digit  // ring-closure label: 1-9 or %10-%99
parity: "@" "@"?                     // tetrahedral parity: @ or @@
star: "*"                            // wildcard atom
dot: "."
shortcut: "B" | "Br" | "C" | "Cl" | "N" | "O" | "P" | "S" | "F" | "I" | "Sc" | "Sn"  // atoms written without brackets
selection: "b" | "c" | "n" | "o" | "p" | "s" | "se" | "as" | "te"  // aromatic atoms
element: "Ac" | "Ag" | "Al" | "Am" | "Ar" | "As" | "At" | "Au"
                    | "B" | "Ba" | "Be" | "Bi" | "Bk" | "Br"
                    | "C" | "Ca" | "Cd" | "Ce" | "Cf" | "Cl" | "Cm" | "Co"
                    | "Cr" | "Cs" | "Cu"
                    | "Dy"
                    | "Er" | "Es" | "Eu"
                    | "F" | "Fe" | "Fm" | "Fr"
                    | "Ga" | "Gd" | "Ge"
                    | "H" | "He" | "Hf" | "Hg" | "Ho"
                    | "I" | "In" | "Ir"
                    | "K" | "Kr"
                    | "La" | "Li" | "Lr" | "Lu"
                    | "Mg" | "Mn" | "Mo"
                    | "N" | "Na" | "Nb" | "Nd" | "Ne" | "Ni" | "No" | "Np"
                    | "O" | "Os"
                    | "P" | "Pa" | "Pb" | "Pd" | "Pm" | "Po" | "Pr" | "Pt" | "Pu"
                    | "Ra" | "Rb" | "Re" | "Rf" | "Rh" | "Rn" | "Ru"
                    | "S" | "Sb" | "Sc" | "Se" | "Si" | "Sm" | "Sn" | "Sr"
                    | "Ta" | "Tb" | "Tc" | "Te" | "Th" | "Ti" | "Tl" | "Tm"
                    | "U" | "V" | "W" | "Xe" | "Y" | "Yb"
                    | "Zn" | "Zr"
bond: "-" | "=" | "#" | "/" | "\\" | ":"
digit: "0" | nonzero
nonzero: "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"

Parting note

The way that Grammar VAE uses this grammar is by treating applications of grammar rules as the tokens themselves. The parse tree can then be produced node by node, and at every step invalid next nodes can be excluded by simply masking the corresponding tokens. An advantage of this approach is that no probability mass goes to producing syntactically invalid strings. However, in practice I have found that it does not provide better ELBO bounds than attempting to generate arbitrary character strings. This is because the rule sequences are significantly longer than the character strings they derive, so while the model only outputs "useful" strings, it has a much higher number of places where things can "go wrong". That is probably a good tradeoff to make. Still, there is an approach that gets the best of both worlds, but that is for the next post in this series.
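
For reference, a minimal sketch of that masked decoding loop. The names are illustrative: rules is assumed to be a list of (lhs, rhs) productions extracted from the grammar above, and logits_per_step comes from some decoder.

import numpy as np

def decode_rule_sequence(logits_per_step, rules, start="smiles"):
    # Leftmost derivation: a stack of nonterminals awaiting expansion.
    nonterminals = {lhs for lhs, _ in rules}
    stack, chosen = [start], []
    for logits in logits_per_step:
        if not stack:
            break                               # derivation finished
        top = stack.pop()
        # Mask rules whose left-hand side is not the nonterminal on top of
        # the stack, so no probability mass reaches invalid next nodes.
        mask = np.array([lhs == top for lhs, _ in rules])
        idx = int(np.argmax(np.where(mask, logits, -np.inf)))  # greedy; could sample
        chosen.append(idx)
        for sym in reversed(rules[idx][1]):     # push RHS right-to-left
            if sym in nonterminals:
                stack.append(sym)
    return chosen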

Citations

[1] M. Kusner et al., Grammar Variational Autoencoder

[2] W. Jin et al., Junction Tree Variational Autoencoder for Molecular Graph Generation

[3] R. L. Apodaca, Balsa: A Compact Line Notation Based on SMILES

[4] C. James et al., OpenSMILES

[5] BindingDB

[6] RDKit

[7] M. Krenn et al., Self-Referencing Embedded Strings (SELFIES): A 100% Robust Molecular String Representation