Reprogramming Pretrained Language Models for Protein Sequence Representation Learning

by   Ria Vinod, et al.

Machine Learning-guided solutions for protein learning tasks have made significant headway in recent years. However, success in scientific discovery tasks is limited by the accessibility of well-defined and labeled in-domain data. To tackle the low-data constraint, recent adaptions of deep learning models pretrained on millions of protein sequences have shown promise; however, the construction of such domain-specific large-scale model is computationally expensive. Here, we propose Representation Learning via Dictionary Learning (R2DL), an end-to-end representation learning framework in which we reprogram deep models for alternate-domain tasks that can perform well on protein property prediction with significantly fewer training samples. R2DL reprograms a pretrained English language model to learn the embeddings of protein sequences, by learning a sparse linear mapping between English and protein sequence vocabulary embeddings. Our model can attain better accuracy and significantly improve the data efficiency by up to 10^5 times over the baselines set by pretrained and standard supervised methods. To this end, we reprogram an off-the-shelf pre-trained English language transformer and benchmark it on a set of protein physicochemical prediction tasks (secondary structure, stability, homology, stability) as well as on a biomedically relevant set of protein function prediction tasks (antimicrobial, toxicity, antibody affinity).


page 6

page 8


Retrieved Sequence Augmentation for Protein Representation Learning

Protein language models have excelled in a variety of tasks, ranging fro...

xTrimoABFold: Improving Antibody Structure Prediction without Multiple Sequence Alignments

In the field of antibody engineering, an essential task is to design a n...

Protein sequence-to-structure learning: Is this the end(-to-end revolution)?

The potential of deep learning has been recognized in the protein struct...

Reprogramming Language Models for Molecular Representation Learning

Recent advancements in transfer learning have made it a promising approa...

Multi-Scale Representation Learning on Proteins

Proteins are fundamental biological entities mediating key roles in cell...

When Geometric Deep Learning Meets Pretrained Protein Language Models

Geometric deep learning has recently achieved great success in non-Eucli...

Generative Pretrained Autoregressive Transformer Graph Neural Network applied to the Analysis and Discovery of Novel Proteins

We report a flexible language-model based deep learning strategy, applie...

Please sign up or login with your details

Forgot password? Click here to reset