Improving Chemical Autoencoder Latent Space and Molecular De novo Generation Diversity with Heteroencoders
Chemical autoencoders are attractive models as they combine chemical space navigation with possibilities for de novo molecule generation in areas of interest. This enables them to produce focused chemical libraries around a single lead compound so they can be employed early in a drug discovery project. Here it is shown that the choice of chemical representation, such as SMILES strings, has a large influence on the properties of the latent space. It is further explored to what extent translating between different chemical representations influences the latent space similarity to the SMILES strings or circular fingerprints. By employing SMILES enumeration for both the encoder and decoder, it is found that the decoder has the largest influence on the properties of the latent space. Training a sequence to sequence heteroencoder based on recurrent neural networks(RNNs) with long short-term memory cells (LSTM) to predicts different enumerated SMILES strings from the same canonical SMILES string gives the largest similarity between latent space distance and molecular similarity measured as circular fingerprints similarity.
READ FULL TEXT