Site2Vec: a reference frame invariant algorithm for vector embedding of protein-ligand binding sites

by Arnab Bhadra, et al.

Protein-ligand interactions are one of the fundamental types of molecular interactions in living systems. Ligands are small molecules that interact with proteins at specific regions on their surfaces called binding sites. Tasks such as assessing protein functional similarity and detecting drug side effects require identifying similar binding sites in disparate proteins across diverse pathways. Machine learning methods for similarity assessment require feature descriptors of binding sites. Traditional methods based on hand-engineered motifs and atomic configurations do not scale to several thousands of sites. Deep neural networks, which can capture very complex input feature spaces, are therefore now being deployed. However, a fundamental challenge in applying deep learning to binding site structures is the choice of input representation and reference frame. We report a novel algorithm, Site2Vec, that derives a reference frame invariant vector embedding of a protein-ligand binding site. The method is based on pairwise distances between representative points and on the chemical composition of a site in terms of its constituent amino acids. The vector embedding serves as a locality sensitive hash function for proximity queries and for determining similar sites. The method has been the top performer, with more than 95% quality scores, in extensive benchmarking studies carried out over 10 datasets and against 23 other site comparison methods. The algorithm supports high-throughput processing and has been evaluated for stability with respect to reference frame shifts, coordinate perturbations and residue mutations. We provide Site2Vec as a standalone executable and a web service hosted at <>.
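To illustrate why a descriptor built from pairwise distances and amino acid composition does not depend on the reference frame, here is a minimal toy sketch. It is not the Site2Vec algorithm itself: the function names (`embed_site`, `rigid_transform`), the histogram binning, and the example coordinates are all assumptions made for illustration. The only point is that rotating and translating a site leaves every inter-point distance, and hence the resulting vector, unchanged.

```python
import math
from itertools import combinations

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes of the 20 standard residues


def embed_site(points, residues, bin_width=2.0, n_bins=10):
    """Toy descriptor (hypothetical, not Site2Vec): a histogram of pairwise
    distances between representative points, followed by residue counts."""
    hist = [0] * n_bins
    for p, q in combinations(points, 2):
        d = math.dist(p, q)
        # Distances beyond the last bin edge are clamped into the final bin.
        hist[min(int(d // bin_width), n_bins - 1)] += 1
    composition = [residues.count(a) for a in AMINO_ACIDS]
    return hist + composition


def rigid_transform(points, angle, shift):
    """Rotate points about the z axis by `angle` radians, then translate."""
    c, s = math.cos(angle), math.sin(angle)
    return [
        (c * x - s * y + shift[0], s * x + c * y + shift[1], z + shift[2])
        for x, y, z in points
    ]


# Illustrative site: four representative points labelled H, D, S, G.
site = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 5.0, 0.0), (1.0, 1.0, 7.0)]
labels = "HDSG"

v_original = embed_site(site, labels)
v_moved = embed_site(rigid_transform(site, 0.7, (10.0, -4.0, 2.5)), labels)
print(v_original == v_moved)
```

A hard-binned histogram like this can change if a distance sits exactly on a bin edge, so a practical descriptor would need a more robust encoding; the sketch only demonstrates the invariance argument.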


