

The average character length was 134 ± 60 for the InChI identifiers and 103 ± 43 for the IUPAC names. For practical applications, this means that a SMILES-based algorithm needs to be able to cope with the numerous equivalent SMILES representations of each molecule, but there is very little discussion of this point in the aforementioned studies.

A dataset of 100 million SMILES-IUPAC pairs was obtained from PubChem, and the SMILES were converted to InChI with OpenBabel.
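The length statistics quoted above are a mean and a spread of character counts. A minimal sketch of that calculation follows. The helper name `length_summary` and the three example InChI strings are illustrative assumptions, not the PubChem data, and the use of the population standard deviation is also an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, pstdev


def length_summary(strings):
    """Return (mean, standard deviation) of the character lengths of `strings`.

    Assumption: population standard deviation (pstdev); the source does not
    state whether a population or sample estimator was used.
    """
    lengths = [len(s) for s in strings]
    return mean(lengths), pstdev(lengths)


# Illustrative standard InChI strings for small molecules (not the paper's data).
example_inchis = [
    "InChI=1S/CH4/h1H4",                   # methane
    "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",   # ethanol
    "InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H",  # benzene
]

if __name__ == "__main__":
    avg, spread = length_summary(example_inchis)
    print(f"InChI length: {avg:.0f} ± {spread:.0f} characters")
```

Applying the same function to the list of IUPAC names would give the second pair of numbers.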
All standard InChI representations are generated with the official software from the InChI Trust, and although a normalized SMILES representation exists, it is not common in online databases. However, we believe that our approach is more appropriate for deploying as a service, due to the ubiquitous use of InChI in online chemical databases. To our knowledge, there are two published machine learning models that predict IUPAC names from a SMILES string, confirming our general methodology. We present a seq2seq neural network that predicts the IUPAC name of a chemical from its unique InChI identifier. Two variants of SMILES have been proposed for use in machine learning.

Schwaller used a seq2seq recurrent neural network to predict the outcomes of chemical reactions, and other studies have presented generative models for automatic chemical design. This is done with a sequence-to-sequence (seq2seq) neural network, made up of an encoder, which projects the input sentence into a latent state, and a decoder, which predicts the correct translation from the latent state.

A number of previous studies have applied sequence-based neural networks to cheminformatics. Compared to earlier efforts that needed human-designed linguistic features, modern machine translation learns these features directly from matched sentence pairs in the source and target language. They have shown great success in natural language processing, and have been deployed by Google on their online translation service. Neural networks excel at making general predictions from a large set of training data. Although canonical SMILES and InChI can be used as identifiers, they are not designed to be human-readable, so the IUPAC name can be a more informative identifier. Correctly generating IUPAC names is therefore an open problem, and in particular is an issue faced by synthetic chemists who want to give a standard name to a new compound. Although there are numerous commercial software packages that can generate IUPAC names from a chemical structure, these are all closed source and their methodology is unknown to the general public. Their rules are comprehensive, but are difficult to apply to complicated molecules. The International Union of Pure and Applied Chemistry (IUPAC) defines nomenclature for both organic chemistry and inorganic chemistry. This can be explained by inherent limitations of standard InChI for representing inorganics, as well as low coverage in the training data. The predictions were less accurate for inorganic and organometallic compounds.
The model performed particularly well on organics, with the exception of macrocycles, and was comparable to commercial IUPAC name generation software. Training took seven days on a Tesla K80 GPU, and the model achieved a test set accuracy of 91%. Unlike neural machine translation, which usually tokenizes input and output into words or sub-words, our model processes the InChI and predicts the IUPAC name character by character. The model was trained on a dataset of 10 million InChI/IUPAC name pairs freely downloaded from the National Library of Medicine’s online PubChem service. The model uses two stacks of transformers in an encoder-decoder architecture, a setup similar to the neural networks used in state-of-the-art machine translation.
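Character-by-character processing, as opposed to word or sub-word tokenization, can be illustrated with a short sketch. Every name here (`build_vocab`, `encode`, `decode`, and the special tokens `<pad>`, `<s>`, `</s>`) is a hypothetical choice made for illustration; the text does not describe the model's actual vocabulary or special tokens.

```python
# Hypothetical special tokens; multi-character strings cannot collide
# with any single-character entry in the vocabulary.
PAD, START, END = "<pad>", "<s>", "</s>"


def build_vocab(strings):
    """Map each special token and each distinct character to an integer id."""
    chars = sorted({ch for s in strings for ch in s})
    return {tok: i for i, tok in enumerate([PAD, START, END] + chars)}


def encode(text, vocab):
    """Turn a string into a list of ids, one per character, with start/end markers."""
    return [vocab[START]] + [vocab[ch] for ch in text] + [vocab[END]]


def decode(ids, vocab):
    """Invert `encode`, dropping the special tokens."""
    inverse = {i: tok for tok, i in vocab.items()}
    return "".join(inverse[i] for i in ids if inverse[i] not in (PAD, START, END))


if __name__ == "__main__":
    inchi = "InChI=1S/CH4/h1H4"  # methane
    name = "methane"
    vocab = build_vocab([inchi, name])
    print(encode(inchi, vocab))
    print(decode(encode(name, vocab), vocab))
```

With this scheme the vocabulary stays small, because it is bounded by the number of distinct characters, at the cost of longer input and output sequences than a word-level or sub-word tokenizer would produce.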

We present a sequence-to-sequence machine learning model for predicting the IUPAC name of a chemical from its standard International Chemical Identifier (InChI).
