
Reprogramming Pretrained Language Models for Protein Sequence Representation Learning

01/05/2023
by   Ria Vinod, et al.

Machine learning-guided solutions for protein learning tasks have made significant headway in recent years. However, success in scientific discovery tasks is limited by the accessibility of well-defined and labeled in-domain data. To tackle the low-data constraint, recent adaptations of deep learning models pretrained on millions of protein sequences have shown promise; however, the construction of such domain-specific large-scale models is computationally expensive. Here, we propose Representation Reprogramming via Dictionary Learning (R2DL), an end-to-end representation learning framework in which we reprogram deep models for alternate-domain tasks that can perform well on protein property prediction with significantly fewer training samples. R2DL reprograms a pretrained English language model to learn the embeddings of protein sequences, by learning a sparse linear mapping between English and protein sequence vocabulary embeddings. Our model can attain better accuracy and significantly improve the data efficiency by up to 10^5 times over the baselines set by pretrained and standard supervised methods. To this end, we reprogram an off-the-shelf pretrained English language transformer and benchmark it on a set of protein physicochemical prediction tasks (secondary structure, stability, homology, solubility) as well as on a biomedically relevant set of protein function prediction tasks (antimicrobial, toxicity, antibody affinity).


Introduction

Recent advances in artificial intelligence (AI), particularly in deep learning, have led to major innovations and advances in many scientific domains, including biology. These deep learning models aim to learn a highly accurate and compressed representation of the biological system, which then can be employed for a range of tasks. There has been notable success across a range of tasks, from high-quality protein structure prediction from protein sequences [1; 2], accurate prediction of protein properties, to enabling novel and functional peptide discoveries [3; 4]. Many of these advances rely on developing deep learning models [1; 5; 6] which are trained from scratch on massive amounts (on the order of billions of tokens) of data. However, labeled data in biology is scarce and sparse, which is also the case for many other real-world scenarios in the scientific domain. In the biological domain, label annotation can involve biological assays, high resolution imaging and spectroscopy, which are all costly and time consuming processes.

The technique of pretraining deep learning models was proposed to address this issue. Pretraining methods leverage large amounts of sequence data and can learn to encode features that can explain the variance seen in sequences across biological task-specific training samples. In the context of protein sequences, pretraining has enabled meaningful density modelling across protein functions, structures, and families [7]. In this work, we reference two types of pretraining methods: (i) unsupervised pretraining, where all data is unlabeled, and (ii) self-supervised pretraining, where a model learns to assign labels to its unlabeled data. Large models then pretrain on massive amounts of unlabeled data, specifically biological sequences, which are available at scale. Once pretrained, these foundation models (FMs) [8] are finetuned on smaller amounts of labeled data, which correspond to a specific downstream task. Interestingly, for the large-scale models pretrained on protein sequences, biological structure and function seem to emerge in the learned protein representation, even though such information was not included in model training [5].

Though highly powerful, the training of those domain-specific foundation models from scratch is highly resource-intensive [9]. For example, one training run of BERT (the language model considered in this work) learns 110 million parameters, costs up to $13,000 USD, takes 64 days (without parallelized computing), and results in 0.7 tons of carbon emissions [10]. A single training run of another popular language model, the T5 transformer, learns 11 billion parameters, costs up to $1.3 million USD, takes 20 days, and results in 47 tons of carbon emissions [11; 12]. Such pretrained language models and size variants are abundantly available with the advent of model libraries (e.g., Hugging Face [13]) which host pretrained models and datasets. The scale of data, compute, and financial resources required to train these models is available only to a limited number of researchers, and is also infeasible for applications with limited labeled data. However, in the scientific domain, we still aim to train models with similar representational capacity and predictive performance. To this end, we propose a lightweight and more accurate alternative to large-scale pretraining. Specifically, we introduce a method to reprogram an existing foundation model of high capacity that is trained on data from a different domain. This situation calls for innovations in cross-domain transfer learning, which is largely unexplored, particularly in scientific domains.

Figure 1: Left: Descriptions of considered predictive tasks. We select the set of physicochemical property prediction tasks from the well-studied domains in [6], and the biomedical function prediction tasks from works with biomedically relevant small-sized labeled datasets [3; 14]. Center: We compare R2DL to pretraining and standard supervised training methods. We refer to supervised methods as standard supervised classifiers that are trained from scratch from labeled data alone. Depending on how labeled and unlabeled data are used in pretraining, we consider pretraining to constitute unsupervised/supervised pretraining. Right: The comparative table shows the broad adaptability of the R2DL framework. In comparison to existing gold standard methods, R2DL has a broader utility across different domains, sizes of training datasets, and data efficiency. We categorize supervised methods as cross-domain adaptable, through various domain adaptation and transfer learning techniques [15].

Biological sequences are known to resemble natural language: they also contain long-range dependencies and follow Zipf’s law [16]. These sequences and their associated dependencies are crucial for determining their structural and functional properties. Such similarity has motivated the use of deep learning architectures and mechanisms that are widely used in natural language processing (NLP) to build protein sequence models from scratch. In this work, we explore an alternative warm-start paradigm, i.e., how to effectively and efficiently reprogram an existing, fully-trained large English language model to learn a meaningful (i.e., biomedically relevant) representation of protein sequences. The goal is to create a more carbon-friendly, resource-efficient, and broadly accessible framework to motivate different scientific domains toward democratizing the representation power of large AI models. This warm-start paradigm is defined by the framework’s ability to achieve the performance of transformers that are pretrained on billions of tokens, with a lighter-weight training procedure that is similar to that of a standard supervised classifier trained from scratch. In particular, we consider highly specific biological and biomedical protein sequence datasets (illustrated in Figure 1) which have far fewer samples than standard supervised language task datasets. Reprogramming thus provides a more data- and resource-efficient approach to developing models that achieve deep representational capacity and performance for downstream protein tasks. Reprogramming has been previously explored in the language domain as a sub-problem of transfer learning [17]. [18] explored reprogramming language models for alternate text classification tasks, [19] reprogrammed acoustic models for time series classification, and [20] reprogrammed ImageNet classification models for alternate image classification tasks. However, none of these methods investigate mappings between domains that require a very high representational capacity (from natural language to biological sequence), which is the setting we require in the protein sequence domain.

Toward this goal, we introduce R2DL (Representation Reprogramming via Dictionary Learning), a novel cross-domain transfer learning framework to reprogram an existing pretrained large-scale deep-learning model of the English language, namely an English BERT model [10], to learn and predict physicochemical and biomedical properties of protein sequences. To the best of our knowledge, ours is the first work to address reprogramming in any biological, and more broadly, scientific domain. In Figure 1, we illustrate the set of protein physicochemical and functional property prediction tasks we consider, the baseline methods against which we compare R2DL performance, and a brief description of R2DL’s advantages compared to these existing methods. We test the reprogrammed model on a range of biomedically relevant downstream physicochemical property, structure, and function prediction tasks, which include prediction of secondary structure, homology, mutational stability, solubility, as well as antimicrobial nature, toxicity, and antibody affinity of proteins. Each of these tasks involves learning on datasets which are limited to a few thousand labeled samples, at least an order of magnitude fewer than needed to train a foundation model or a large language model [21]. R2DL uses dictionary learning, a machine learning framework that finds the optimal sparse linear mapping between the English vocabulary embeddings and the amino acid embeddings. To do so, a protein property prediction task-specific loss is used to learn the optimal parameters of the reprogrammed model. We train R2DL in a supervised setting with the downstream protein prediction task datasets that are labeled and small in size (illustrated in Figure 1). R2DL demonstrates consistent performance improvement over existing baselines across seven different physicochemical (e.g., up to 11% in stability), structural, and functional property prediction (e.g., up to 3% in toxicity) tasks of proteins. We estimate R2DL to be over 10^5 times more data-efficient than existing pretraining methods. We further demonstrate the performance robustness of R2DL when trained on a reduced-size version of the supervised protein datasets. In addition, we show that R2DL learns to encode physicochemical and biomedical properties in the learned representations, even in limited-data scenarios. This work thus blazes a path toward efficient and large-scale adaptation of existing foundation models toward different real-world learning tasks and accelerates scientific discovery, which naturally involves learning from limited real-world data.

Results

Figure 2 illustrates the proposed Representation Reprogramming via Dictionary Learning (R2DL) framework, which learns to embed a protein sequence dataset of interest by training on the representations of a transformer that is pretrained on an English text corpus. A one-to-one label mapping function is assigned for each downstream protein prediction task for cross-domain machine learning, and a class label or a regression value is predicted using R2DL for each protein sequence during testing. Below we discuss details of the general framework (tasks described in Figure 1).

Figure 2: System illustration of the Representation Reprogramming via Dictionary Learning (R2DL) framework. In Step 1, R2DL loads a pretrained language model (source), obtains the source vocabulary embeddings, and specifies the protein tokens (target). In Step 2, R2DL learns a sparse linear mapping between the source and target embeddings via dictionary learning, to represent a target token embedding as a sparse linear combination of source token embeddings. In Step 3, the system maps the source task labels (e.g., positive/negative sentiments) to target task labels (e.g., toxic/non-toxic proteins) and optimizes the embedding mapping parameters based on the task-specific loss evaluation on a given protein sequence dataset. Finally, in Step 4 the reprogrammed model is deployed for the test-time evaluation.

R2DL Framework Formulation

The R2DL objective is to reprogram a source model (pretrained language model) to be able to correctly classify, or predict the regression values of, protein sequences (for a target prediction task). We use pretrained instances of BERT, a bidirectional transformer (termed the source model), which has been finetuned separately for different language tasks (e.g., sentiment classification, named entity recognition) [10; 22]. For a protein sequence classification task, we use the source model trained on a language task with sentence-level output classes (e.g., positive and negative for sentiment classification) and pair these with the protein sequence classes (e.g., toxic and non-toxic). The output-label mapping is then a simple 1-1 correspondence between the source task labels and the target task labels (e.g., positive → toxic and negative → non-toxic). For a regression task, R2DL uses a mapping between the regression values in protein sequence feature space and the classification probability values in the source model embedding space. It does so by learning optimal thresholds of regression values that map to the source model class labels.
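The two output mappings described above can be sketched in a few lines. This is only an illustration under stated assumptions: the label names, the helper names `map_label` and `regression_to_class`, and the fixed threshold list are hypothetical, and in R2DL the thresholds are learned rather than supplied by hand.

```python
from bisect import bisect_right

# Hypothetical 1-1 correspondence between source (sentiment) and target (toxicity) labels.
LABEL_MAP = {"positive": "toxic", "negative": "non-toxic"}

def map_label(source_label):
    """Translate a source-model class label into the target-task label."""
    return LABEL_MAP[source_label]

def regression_to_class(value, thresholds, source_labels):
    """Bin a regression value into a source class using sorted thresholds.

    A value below thresholds[0] maps to source_labels[0], a value in
    [thresholds[0], thresholds[1]) maps to source_labels[1], and so on.
    """
    if len(source_labels) != len(thresholds) + 1:
        raise ValueError("need exactly one more label than thresholds")
    return source_labels[bisect_right(thresholds, value)]
```

With a single threshold the regression target collapses to a binary source label; more thresholds give a finer binning at the cost of needing a source task with more classes.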

The input data of the source English language model is tokenized at the word level. These tokens form the atoms for our dictionary representation of $V_S$, a matrix with its rows corresponding to embedding vectors of source tokens. The input data to the target task, protein sequences, are tokenized on a character level with only 20 distinct tokens (corresponding to the set of 20 discrete natural amino acid characters). R2DL obtains $V_S$ from the learned embeddings of the source model and learns to represent $V_T$, the matrix of the target token embeddings, as a weighted combination of the English token embeddings. We propose token reprogramming by approximating a linear mapping between $V_S$ and $V_T$. That is, we aim to find a transformation of the latent representation of the protein sequences, such that it can be embedded in the pretrained language model’s latent space and enable R2DL to leverage these re-embedded tokens for learning. Specifically, we learn the linear map $\Theta$ by approximating a dictionary using a k-SVD solver [23]. That is, we want to approximate $V_T = \Theta V_S$. The k-SVD solver guarantees a task-specific level of sparsity in the coefficients when linearly combining English token embeddings to represent a protein sequence token embedding. In other words, it helps select English tokens and use their linearly combined embeddings as the embedding of a target token. Additionally, with a one-to-one label mapping function of the protein sequence label to the English text label, we are able to use the pretrained language model for inference on the embedded protein dataset, $V_T$. We thus design an end-to-end reprogramming framework for any arbitrary protein sequence classification or regression task.
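To make the token reprogramming step concrete, the sketch below builds protein token embeddings as sparse linear combinations of source token embeddings. The notation is an assumption made for illustration: `v_source` stands for the source vocabulary embedding matrix, `theta` for the sparse coefficient matrix with one row per amino acid, and their product for the target embedding matrix. The random matrices are stand-ins for a real model's embeddings, and the top-k row truncation in `sparsify_rows` is a simple substitute for the k-SVD solver used in the paper, not a reimplementation of it.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 natural amino acid tokens
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def sparsify_rows(theta, k):
    """Keep the k largest-magnitude coefficients in each row and zero the rest."""
    out = np.zeros_like(theta)
    top = np.argsort(-np.abs(theta), axis=1)[:, :k]
    rows = np.arange(theta.shape[0])[:, None]
    out[rows, top] = theta[rows, top]
    return out

def embed_protein(sequence, theta, v_source):
    """Return per-residue embeddings taken from the rows of theta @ v_source."""
    v_target = theta @ v_source              # shape (20, d)
    idx = [AA_INDEX[aa] for aa in sequence]
    return v_target[idx]                     # shape (len(sequence), d)

rng = np.random.default_rng(0)
v_source = rng.normal(size=(1000, 16))       # stand-in for source vocabulary embeddings
theta = sparsify_rows(rng.normal(size=(20, 1000)), k=5)
emb = embed_protein("MKTAYIAK", theta, v_source)
```

Each amino acid here is represented by exactly five source tokens; the resulting per-residue embeddings live in the source model's embedding space and could be passed to its frozen layers.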

R2DL Training and Optimization Procedure

We are given a pretrained classifier, $C$ (which has been pretrained on a source-task dataset with source tokens denoted by $v_{S_i}$) and a target-task dataset with target tokens denoted by $v_{T_j}$. The embedding matrices are $V_S$ and $V_T$ respectively. We can encode an output label mapping function $h$ translating between source and target labels. In Figure 2, we show how R2DL aims to find a linear mapping function $\Theta$ that learns the optimal coefficients for our atoms in $V_T$ to be represented as a sparse encoding of the dictionary $V_S$ such that $V_T = \Theta V_S$. The map $\Theta$ is used to reprogram $C$ to be able to correctly classify the protein sequences through the transformation $h(C(\theta_t, t))$, where $t$ is a protein sequence from a protein task $T$ and $\theta_t$ is the linear weights associated with the protein sequence $t$ in $\Theta$. We note that for each of the downstream protein property prediction tasks, R2DL only trains a corresponding token mapping function $\Theta$ while keeping the pretrained classifier $C$ intact. Therefore, the number of trainable parameters in R2DL is simply the size of the matrix $\Theta$, which is usually much smaller than the number of parameters in the pretrained deep neural network classifier $C$. To approximate the dictionary, we use a k-SVD solver to optimize over the cross-entropy loss for updates to $\Theta$. We then apply the assigned label mapping for protein classification tasks, or thresholding for regression tasks, and train the mapping function using gradient-based optimization evaluated on the task-specific cross-entropy loss. Details of the R2DL training procedure are given in the Method section.
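The key property of this procedure, that only the token mapping is trained while the pretrained classifier stays frozen, can be illustrated with a toy model. Everything below is an assumption made for the sake of a runnable sketch: the frozen "classifier" is a single linear head over a mean-pooled embedding rather than BERT, sequences are summarized by their normalized amino acid composition, and sparsity is encouraged with a proximal L1 (soft-thresholding) step instead of the k-SVD solver used in the paper.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def train_theta(counts, labels, v_source, w_frozen, steps=200, lr=0.5, l1=1e-3):
    """Learn the mapping theta only; v_source and w_frozen are never updated.

    counts: (n, 20) normalized amino acid composition of each sequence
    labels: (n,) integer class ids, already mapped onto source labels
    """
    theta = np.zeros((counts.shape[1], v_source.shape[0]))
    for _ in range(steps):
        grad = np.zeros_like(theta)
        for c, y in zip(counts, labels):
            pooled = c @ theta @ v_source        # mean-pooled sequence embedding
            p = softmax(pooled @ w_frozen)       # frozen classifier head
            p[y] -= 1.0                          # d(cross-entropy)/d(logits)
            grad += np.outer(c, v_source @ (w_frozen @ p))
        theta -= lr * grad / len(labels)
        # Proximal L1 step: shrink coefficients toward zero to encourage sparsity.
        theta = np.sign(theta) * np.maximum(np.abs(theta) - lr * l1, 0.0)
    return theta

rng = np.random.default_rng(0)
v_source = rng.normal(size=(50, 8))              # stand-in source embeddings
w_frozen = rng.normal(size=(8, 2))               # stand-in frozen classifier
counts = np.zeros((2, 20))
counts[0, 0] = 1.0                               # a sequence made only of token 0
counts[1, 1] = 1.0                               # a sequence made only of token 1
labels = np.array([0, 1])
theta = train_theta(counts, labels, v_source, w_frozen)
preds = [int(np.argmax(c @ theta @ v_source @ w_frozen)) for c in counts]
```

The trainable parameter count is the size of `theta` (20 rows by the source vocabulary size), independent of how large the frozen model is.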

Benchmark Tasks and Evaluation

We consider four physicochemical structure and property prediction tasks from a well-established protein benchmark from [6] (represented in Figure 1). Secondary structure prediction involves predicting secondary structure for each amino acid in a given protein sequence. Solubility prediction maps an input protein sequence $x$ to a solubility label $y$. Homology detection is a sequence classification task, where each input protein $x$ is mapped to a label $y$ representing one of the possible protein folds. Stability prediction is a regression task. We further consider three biomedically relevant function prediction tasks, which are sequence classification tasks (represented in Figure 1). Using R2DL, we predict, for a given sequence $x$, its binary class label $y \in \{\text{antimicrobial}, \text{non-antimicrobial}\}$ for antimicrobial-nature prediction [3] or $y \in \{\text{toxic}, \text{non-toxic}\}$ for toxicity prediction [3]. Finally, we predict antigen and non-specific binding of antibody variant sequences from [14]: given a sequence $x$, the task is to predict $y \in \{\text{on-target}, \text{off-target}\}$. Further details on the protein tasks and datasets are in the Method section. The sizes of the individual datasets vary between 4,000 and 50,000 (see supplementary for details on data sizes and train-test splits). Data efficiency is defined as the ratio of the R2DL prediction accuracy to the number of biological sequences used during pretraining and finetuning. We use data efficiency as a metric to compare the performance of R2DL to established benchmarks for the protein tasks in [6; 3; 14]. For classification tasks, we evaluate prediction accuracy with a top-n accuracy, where $n$ is the number of classes in the protein sequence classification task. For regression tasks, we evaluate prediction accuracy with Spearman’s correlation.
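Two of the evaluation quantities above are simple enough to write out. The sketch below gives a dependency-free Spearman correlation (Pearson correlation of average ranks) and the data efficiency ratio as defined in the text; the function names are illustrative, and no values from the paper are reproduced here.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rank correlation between two equal-length sequences."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

def data_efficiency(accuracy, n_sequences):
    """Prediction accuracy per biological sequence used in pretraining and finetuning."""
    return accuracy / n_sequences
```

Because the denominator of `data_efficiency` counts every biological sequence seen during both pretraining and finetuning, a method that skips protein pretraining has a much smaller denominator at comparable accuracy.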

Model Baselines and Data

The baseline models we consider in this work are of two types. Firstly, we consider models trained in a supervised manner, by training standard sequence Long Short-Term Memory (LSTM) models from scratch. For each downstream peptide or protein classification task, we have labeled (supervised) datasets. The results of these models are reported in Figure 3(a). Secondly, we consider models that are pretrained in an unsupervised manner on protein sequence data and are finetuned for a particular downstream task. Pretraining methods that do not use labeled data have an advantage, as those models can then learn from a significantly larger number of data samples. In the cases of the toxicity and antimicrobial prediction tasks, the baseline model we compare to has been pretrained on a subset of the UniProt database where sequences are limited to being 50 residues long [24]. The pretraining corpus size is then 1.7 million peptide sequences. Using unlabeled data for pretraining is thus much more advantageous than pretraining in a supervised scheme. Of these 1.7 million sequences, only 9,000 are labeled (about 0.5% of sequences). The model is a Wasserstein Autoencoder, which is a generative model that undergoes unsupervised pretraining on the subset of UniProt data. The WAE embeddings of the labeled sequences are then used to train a logistic regressor model on the labeled dataset to obtain a binary classifier for antimicrobial/non-antimicrobial (6,489 labeled samples) or for toxic/non-toxic (8,153 labeled samples) label prediction. For the physicochemical property prediction tasks, the baseline model we consider is pretrained on the Pfam corpus [25]. This corpus consists of 31 million protein domains and is widely used in bioinformatics pipelines. Sequences are grouped into protein families, which comprise evolutionarily related sequences. In contrast, the downstream physicochemical tasks of structure, homology, stability and solubility prediction have labeled datasets that range from 5,000 to 50,000 samples on which the model can be finetuned. Pretraining thus offers the advantage of modeling the density over a range of protein families and structures, but stipulates that there must be sequence datasets that contain structural and functional information about the downstream task datasets, typically of a size on the order of millions of sequences. R2DL eliminates this requirement by repurposing existing pretrained English language models, and leveraging transferable information from models that are not conditioned on protein sequence information.

Data Efficiency and Accuracy of Reprogramming

(a) Downstream supervised protein task dataset sizes and test accuracy of the 3 comparable methods introduced in Figure 1.
(b) Data efficiency of R2DL vs. pretrained methods as illustrated in Figure 1.
(c) Confusion matrix of the baseline model trained in [14] for the antibody affinity prediction task.
(d) Confusion matrix of the R2DL model for the antibody affinity prediction task.
Figure 3: Task-specific evaluation of R2DL performance compared to the performance of the baseline models. In Figure 3(a), results for the pretrained baseline models are from unsupervised pretrained transformers for secondary structure, stability, homology, and solubility prediction tasks [6]. The baseline models for the antimicrobial and toxicity prediction tasks are logistic regressors trained using sequence embeddings from the pretrained peptide Wasserstein autoencoder [3]. Results for the supervised classifiers are from sequence-level LSTMs trained from scratch on the downstream protein prediction data. For classification tasks, we evaluate prediction accuracy with a top-n accuracy, where $n$ is the number of classes in the protein sequence classification task. For regression tasks, we evaluate prediction accuracy with Spearman’s correlation coefficient. Results of the pretrained models on the antibody task dataset have not been previously reported in any work and are hence left for future work. In Figure 3(b), data efficiency is defined as the ratio of the R2DL prediction accuracy to the number of protein sequences used during training. In Figure 3(c)-(d), we show a comparison between the performance of a linear discriminant analysis (LDA) model in [14] and R2DL on the antibody affinity dataset. The LDA model is a binary classifier which finds the optimal classification boundary by projecting the data onto a one-dimensional feature space and finding a threshold. The antibody affinity dataset consists of 4,000 labeled protein sequences, with labels {1 (on-target binding), 0 (off-target binding)}. R2DL achieves a predictive accuracy of 95.5% compared to the LDA model performance of 92.8%.

We report the performance of R2DL for the set of 7 protein prediction tasks and their corresponding baselines in Figure 3. Baselines for the physicochemical prediction tasks are established by a transformer from [6] that has been pretrained in an unsupervised setting on the Pfam pretraining corpus [26]. Baselines for the antimicrobial and toxicity prediction tasks are established in [3], where Das et al. pretrained a Wasserstein Autoencoder on the peptides from the UniProt corpus [24] using unsupervised training, and then used the latent encodings from the autoencoder to train the property classifiers. Baselines for the antibody affinity task are established in [14], where a linear discriminant analysis model is trained in a supervised setting. Each physicochemical and biomedical function prediction task then has a relatively small, supervised dataset which we split into training and testing sets to train the R2DL framework and evaluate its performance on the test set. Henceforth, we refer to these baselines as task-specific baselines, where the baseline model we compare R2DL to varies with the downstream protein prediction task and the best performing model available (see Supplementary for details on task-specific baselines).

We show that, for every prediction task, we achieve a higher test accuracy with R2DL than with the corresponding task-specific baseline model when both models are trained on the full labeled dataset. R2DL shows a performance improvement of up to 11.2% when compared to the pretrained models, and up to 29.3% when compared to a standard, supervised LSTM that is trained from scratch on the same dataset. Moreover, R2DL needs only a pretrained source model and a small labeled protein sequence dataset as input; the size of the R2DL training set is therefore limited to the number of samples in the downstream protein prediction dataset. In contrast, pretrained models require a large amount of protein sequence data for pretraining, on the order of millions of samples, in addition to the downstream supervised protein task sequence data on which the pretrained model is finetuned. In Figure 3(a), we show the number of training samples and corresponding accuracy metric (see Method section for details) of the R2DL, pretrained, and supervised models. In Figure 3(b), we show the data efficiency, i.e., the ratio of the accuracy of the model to the number of training samples (including the pretraining corpus, counting only biological sequences, for pretrained models), for R2DL and baseline models. We show that R2DL is up to 10^5 times more data efficient, as in the case of the toxicity prediction task. This is due to the very large number of pretraining data samples required relative to the downstream protein task dataset.

Figures 3(c) and 3(d) show the R2DL performance on the antigen affinity prediction task for antibody variant sequences and its comparison with the baseline LDA model reported in [14]. R2DL achieves a predictive accuracy about 3 percentage points higher than the baseline LDA model (95.5% versus 92.8%), and does so on a class-imbalanced dataset. The antibody affinity task dataset has the following class distribution: on-target, 1,516; off-target, 2,484. For this roughly 38% to 62% class-imbalance ratio of labels, we show that the R2DL model has a better classification accuracy than the LDA model. The learned representations can therefore be inferred to be more accurate in our model than in the baseline model. This is important, as in many real-world prediction tasks, the dataset is found to be class-imbalanced.
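When classes are imbalanced, a confusion matrix and per-class recall show whether overall accuracy hides poor performance on the minority class. The following is a minimal sketch with illustrative helper names; only the class counts (1,516 on-target and 2,484 off-target) come from the text, and the balanced accuracy summary is an addition for illustration, not a metric reported in the paper.

```python
def confusion_matrix(y_true, y_pred):
    """2x2 counts indexed as m[true][pred], with 0 = off-target and 1 = on-target."""
    m = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        m[t][p] += 1
    return m

def per_class_recall(m):
    """Fraction of each true class that was predicted correctly."""
    return [m[c][c] / sum(m[c]) if sum(m[c]) else 0.0 for c in (0, 1)]

def balanced_accuracy(m):
    """Mean of the per-class recalls, insensitive to class proportions."""
    recalls = per_class_recall(m)
    return sum(recalls) / len(recalls)

# Class proportions of the antibody affinity dataset described in the text.
on_target, off_target = 1516, 2484
on_fraction = on_target / (on_target + off_target)
```

Comparing `per_class_recall` for two models on the same test split makes it clear whether an accuracy gain comes from the minority (on-target) class or only from the majority class.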

R2DL Performance vs. Pretraining Performance in Low Data Settings

(a) Secondary structure prediction.
(b) Mutational stability prediction.
(c) Remote homology prediction.
(d) Membrane solubility prediction.
(e) Antimicrobial-nature prediction.
(f) Toxicity prediction.
Figure 4: Results of the R2DL model and baseline model for each downstream task in reduced training data settings.

Motivated by the data efficiency of R2DL as a framework, we tested the task-specific predictive performance of R2DL in reduced-data training settings. We compared these results to the performance of task-specific baseline models, when trained and tested in the same restricted data setting. In Figure 4, we show the performance of the R2DL model and the baseline model when trained on 100%, 80%, 60%, and 40% of a specific task dataset. We show results for the antimicrobial, toxicity, secondary structure, stability, homology, and solubility prediction tasks in Figure 4 and compare the performance of R2DL and pretrained models against the performance of a random guess. We observe that, for the downstream tasks of toxicity, secondary structure, homology, and solubility, R2DL always performs better than a pretrained protein language model across the size range of the limited datasets. Furthermore, we observe that, except in the stability task, the rate of failure to perform better than a random guess is higher for the pretrained models than for R2DL. In both cases, R2DL outperforms pretraining until the cutoff point that is the intersection of the random guess curve with the accuracy curves (the point at which the model is not learning any meaningful representation).
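A reduced-data experiment of this kind needs training subsets at each fraction. The sketch below draws such subsets while keeping class proportions fixed; stratified sampling and the helper name `stratified_subsample` are assumptions, since the text does not say how the 80%, 60%, and 40% subsets were drawn.

```python
import random
from collections import defaultdict

def stratified_subsample(labels, fraction, seed=0):
    """Return sorted indices covering `fraction` of each class (at least one per class)."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    rng = random.Random(seed)
    keep = []
    for idx in by_class.values():
        k = max(1, round(fraction * len(idx)))
        keep.extend(rng.sample(idx, k))
    return sorted(keep)

# Toy labels: 10 examples of class 0 and 20 of class 1.
toy_labels = [0] * 10 + [1] * 20
subsets = {f: stratified_subsample(toy_labels, f) for f in (1.0, 0.8, 0.6, 0.4)}
```

Both R2DL and the baseline would then be trained on the same index sets, so that differences in accuracy at each fraction are not caused by different samples.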

Correlation Between Learned Embeddings and Evolutionary Distances

Beyond comparing the R2DL model against the individual protein task benchmarks, we demonstrate that the R2DL dictionary learning framework shows interpretable correspondences between the learned embeddings in the latent space and the specific protein property. We show this result for the antibody affinity, secondary structure, and toxicity prediction tasks. Figures 5(a-c) show the t-SNE projection of task-specific R2DL embeddings of protein sequences for secondary structure, toxicity, and antibody affinity prediction tasks. Clear separation between different protein classes is evident. We further compute the pairwise Euclidean distances between the last-layer latent representations of the amino acid embeddings and compare them to the pairwise evolutionary distances computed with the BioPython module. In Figure 5(d), we show the Euclidean distances between the latent embeddings learned in the R2DL model and the pairwise evolutionary distances between protein sequences, as estimated using the BLOSUM62 matrix implemented in the pairwise function of the BioPython module.
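The comparison between embedding distances and evolutionary distances can be sketched as follows. This is an illustration under assumptions: the evolutionary distance matrix is taken as a precomputed input (the BioPython step is not reproduced here), the toy embeddings are made up, and Pearson correlation over the upper triangle is one reasonable way to summarize agreement rather than the paper's exact procedure.

```python
import numpy as np

def pairwise_euclidean(emb):
    """Matrix of Euclidean distances between all rows of emb."""
    diff = emb[:, None, :] - emb[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def upper_triangle_correlation(d_a, d_b):
    """Pearson correlation between the off-diagonal upper triangles of two distance matrices."""
    iu = np.triu_indices_from(d_a, k=1)
    return float(np.corrcoef(d_a[iu], d_b[iu])[0, 1])

# Toy embeddings and a hypothetical precomputed evolutionary distance matrix.
toy_emb = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
d_latent = pairwise_euclidean(toy_emb)
d_evolutionary = 2.0 * d_latent          # placeholder input, perfectly related by construction
score = upper_triangle_correlation(d_latent, d_evolutionary)
```

Only the upper triangle is used because distance matrices are symmetric with a zero diagonal, so including every entry would double-count pairs.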

(a) t-SNE clustering plot for secondary structure prediction.
(b) t-SNE clustering plot for toxicity prediction.
(c) t-SNE plot for antibody affinity prediction.

(d) Correlation plot for pairwise evolutionary distances vs. pairwise Euclidean distances in R2DL embedding space for antibody affinity prediction.
Figure 5: (a-c) Clustering of R2DL learned embeddings for secondary structure prediction, toxicity prediction, and antibody affinity prediction tasks. When tagged by protein property classification, we see very high correspondence between the clusters and protein sequences with the same physicochemical or biomedical property classification. (d) For the antibody affinity prediction task, we observe a high correlation coefficient along the diagonal. This shows that the representation learned by R2DL is highly similar to empirical observations of pairwise residue correlations.

The matrix shows a correlation close to 1.0 along the diagonal, indicating a near-perfect correspondence between the learned representation and the empirical observations of amino acid relatedness. R2DL thus captures the underlying structure of the linear sequence of amino acid residues in protein sequences in the context of the reprogrammed protein task.

Discussion

We propose a new framework, R2DL, to reprogram large language models for various protein tasks. R2DL demonstrates powerful predictive performance across tasks that involve evolutionary understanding, structure prediction, property prediction, and protein engineering. We thus provide a strong alternative to pretraining large language models on millions of protein sequences. With only a pretrained natural language model (many of which are available at the time of writing), a small labeled protein dataset of interest, and a small amount of cross-domain finetuning, we can achieve better performance for each protein prediction task, with interpretable correspondences between features. Beyond improvements in predictive performance, we show that the ratio of performance improvement to the number of pretraining and training samples involved makes R2DL up to 10^5 times more data-efficient than current methods. This work opens many doors for biological prediction tasks in which only very few labeled, high-quality data samples can be acquired. We emphasize the data efficiency of R2DL when applied to biomedically relevant protein prediction tasks, which are critical to advancing scientific understanding and discovery but have been unsuccessful until now.

While R2DL does make gradient updates in the framework, the data and resource requirements of the R2DL method are much lower than those of any unsupervised or self-supervised pretraining approach for protein sequence modeling. Though R2DL has the same data and resource requirements as any standard supervised training approach, it demonstrates much higher task accuracy across a broad and diverse range of property prediction tasks. We claim that R2DL is able to do this because it can leverage the deep representational capacity induced by reprogramming, which standard supervised models cannot achieve without an unjustifiably large number of parameters. R2DL is thus more efficient than existing baseline models in the following aspects: (i) R2DL only requires a pretrained transformer (trained on English language data) and a small labeled protein sequence dataset of interest. We do not make any updates to the pretrained model itself, unlike traditional transfer learning methods. Rather, we make updates to the R2DL model during a supervised training process that optimizes over class-mapped labels. (ii) R2DL does not require large-scale un/self-supervised pretraining on millions of unlabeled protein sequences, as in [6; 3; 5]. (iii) Further, R2DL does not require any large-scale supervised pretraining, which has been found beneficial in protein-specific tasks [6] as well as in computer vision [27]. Labeling protein sequences at scale, particularly for biomedical function, is almost infeasible for the size of dataset that is required for supervised pretraining.

With these three considerations in mind, we propose R2DL as a data-efficient alternative to pretraining methods for protein prediction tasks of biological and biomedical relevance. To the best of our knowledge, R2DL is the first framework without explicit pretraining that facilitates accurate predictions across a general suite of protein prediction tasks and provides interpretable correspondences between amino acid features that are very closely aligned with domain knowledge (evolutionary distances). The success of R2DL can be attributed to its representational power to encode a sparse representation by leveraging the natural language modeling entailed in large language models for efficient learning on protein structure and function prediction tasks, as both English and protein sequences follow Zipf’s law [16].

We first demonstrate the effectiveness of R2DL on a set of physicochemical structure and property prediction tasks, and then on a set of biomedically relevant function prediction tasks, for protein sequences. We show predictive performance improvements against pretrained methods (up to 11% in stability) and standard supervised methods (up to 3.2% in antibody affinity). Similarly, on the remaining tasks, we show performance improvements over the best reported baseline in structure prediction (4.1%), homology (2.3%), solubility (7.1%), antibody affinity (3.2%), and toxicity (2.4%). R2DL thus shows the capability to learn a general representation of protein sequences that can be efficiently adapted to different downstream protein tasks. These powerful representation capabilities are evidenced by its ability to achieve high performance across protein datasets with a highly varied number of task-specific training samples. The performance of R2DL across protein tasks shows the potential to repurpose and develop powerful models that can learn from small, curated, and function-specific datasets. This mitigates the need to train large pretrained models for peptide learning tasks. We thus provide an alternative method to pretraining that is cheaper to run and more accurate, and therefore accessible to broader research communities who may not have access to large-scale compute. This potential is critical for many applications, such as the discovery of new materials, catalysts, and drugs. Although we establish the efficacy and efficiency of R2DL in a domain where pretrained large language models already exist, we hope that our work will pave the path to extending this approach to other domains where pretrained LLMs do not exist, such as polymers.

Method

Representation of Tokens

In the R2DL framework, we use two input datasets: an English language text dataset (the source dataset) and a protein sequence dataset (the target dataset). The vocabulary size of a protein sequence dataset at the unigram level is 20, as proteins are composed of 20 different natural amino acids. We obtain a latent representation of the English text vocabulary, the source vocabulary, by extracting the learned embeddings of the data from a pretrained language model (the source model). The protein sequence data is embedded in the same latent space, and is termed the target vocabulary. For each task, the token embedding matrix has one row per token and one column per embedding dimension. We use the same encoding scheme for the source and target vocabularies across all downstream tasks.

Procedure Description of the R2DL Framework for a Protein Task

  • Procedure Inputs: Pretrained English sentence classifier , target model training data for task , class mapping label function, (if classification) where
    .

  • Procedure Hyperparameters

    : Maximum number of iterations for updates to , number of iterations for k-SVD, step size

  • Procedure Initialization: Random initialization of , obtain the source token embedding matrix

  • Define Objective Function: Objective function for k-SVD:

  • k-SVD Approximation of : If , while use approximate k-SVD to solve ,  

  • Calculate the Loss and Perform Gradient Descent:  ,  and return to the previous K-SVD step

  • Output Protein Sequence Labels for Protein Sequence of Task :

We are given a pretrained English classifier, , and a protein sequence target-task dataset . We denote the task with , such that . We also encode an output label mapping function specifying the one-to-one correspondence between source and target labels. As shown in Figure 2, the source vocabulary embedding, , is extracted from the pretrained model, . The next objective is to learn that approximates the embedding of tokens in (denoted by ) in the representation space of the source model.
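The output label mapping can be sketched as a lookup table from source-classifier labels to protein-task labels. The label names follow Table 1 in the Supplementary Information; pairing them in the order listed there (for example, Positive with Helix) is an assumption, and the names `LABEL_MAPS`, `map_target_label`, and `is_one_to_one` are hypothetical.

```python
# Source-label -> target-label maps for the classification tasks in Table 1.
# The pairing order is assumed to follow the order in which labels are listed.
LABEL_MAPS = {
    "antimicrobial": {"Positive": "AMP", "Negative": "non-AMP"},
    "toxicity": {"Positive": "Toxic", "Negative": "non-Toxic"},
    "secondary_structure": {
        "Positive": "Helix",
        "Neutral": "Strand",
        "Negative": "Other",
    },
    "solubility": {"Positive": "Soluble", "Negative": "non-Soluble"},
    "binding": {"Positive": "On-target", "Negative": "Off-target"},
}


def is_one_to_one(mapping):
    """True when no two source labels share the same target label."""
    return len(set(mapping.values())) == len(mapping)


def map_target_label(task, source_label):
    """Translate a source-model prediction into the protein-task label."""
    mapping = LABEL_MAPS[task]
    if not is_one_to_one(mapping):
        raise ValueError(f"label map for {task!r} is not one-to-one")
    return mapping[source_label]


print(map_target_label("toxicity", "Positive"))
print(map_target_label("secondary_structure", "Neutral"))
```

Regression tasks (stability and homology in Table 1) have no label map, so they are omitted from the table above.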

We aim to learn that finds the optimal coefficients for each of the target tokens in to be represented as a sparse encoding of the dictionary, , such that . For a given target protein sequence from the -th task, is used to perform the target task through the transformation . While we do not make any modification to the parameters or architecture of , we assume access to the gradient for loss evaluation and parameter updates during training.

A target token embedding can be represented as a sparse linear combination of the source token embeddings (rows) in , . is the representation of the protein token in the dictionary space and satisfies , where is an norm and is made to be sparse by satisfying for all . An exact solution is computationally expensive to find, and is subject to various convergence traps, so for the purpose of our efficient fine-tuning approach we approximate using k-SVD. We first fix the dictionary , as extracted from , and then find the optimal according to the optimization problem, by minimizing the alternative objective subject to as explored in [23]. While algorithms exist to choose an optimal dictionary (an exact solution to k-SVD) that can be continually updated [23], we penalize computational expense over performance for the purpose of maintaining an efficient solution (at the cost of statistically insignificant improvements in accuracy) by using a predetermined number of iterations for k-SVD convergence, which is then used to evaluate the cross entropy loss on and update the mapping function .
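With the dictionary held fixed, what remains is a sparse-coding problem: express one target embedding using only a few rows of the source embedding matrix. The sketch below uses plain matching pursuit as a stand-in for that step. It is not the authors' k-SVD routine, and the function name `matching_pursuit`, the toy dictionary, and the `max_atoms` sparsity budget are all illustrative assumptions.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def matching_pursuit(target, dictionary, max_atoms):
    """Greedy sparse code so that target ~= sum_j coeffs[j] * dictionary[j].

    `dictionary` is a list of row vectors (source token embeddings) that is
    never modified; at most `max_atoms` greedy steps are taken.
    """
    norms = [math.sqrt(dot(row, row)) for row in dictionary]
    coeffs = [0.0] * len(dictionary)
    residual = list(target)
    for _ in range(max_atoms):
        # Correlation of the residual with each unit-normalized row.
        scores = [
            dot(residual, row) / norm if norm > 0 else 0.0
            for row, norm in zip(dictionary, norms)
        ]
        best = max(range(len(scores)), key=lambda i: abs(scores[i]))
        if abs(scores[best]) < 1e-12:
            break  # residual is orthogonal to every row: nothing left to fit
        # Projection coefficient expressed on the unnormalized row.
        step = scores[best] / norms[best]
        coeffs[best] += step
        residual = [r - step * a for r, a in zip(residual, dictionary[best])]
    return coeffs, residual


# Toy source dictionary: four "English token" embeddings in three dimensions.
source_embeddings = [
    [1.0, 0.0, 0.0],
    [0.0, 2.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
]
# Toy target embedding built from rows 0 and 2 only.
target_embedding = [0.5, 0.0, 1.5]

coeffs, residual = matching_pursuit(target_embedding, source_embeddings, max_atoms=2)
print("coefficients:", coeffs)
print("residual norm:", math.sqrt(dot(residual, residual)))
```

Capping the number of greedy steps plays the same role as the fixed iteration budget described above: it trades a possibly better fit for a bounded, predictable cost. In the full framework the resulting coefficients would then be refined by gradient descent on the task loss, which this sketch does not attempt.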

Data

Classification

We provide five biologically relevant downstream physicochemical property prediction tasks, adapted from [6], to serve as benchmarks. We categorize these into property prediction, structure prediction, evolutionary understanding, and protein engineering tasks. The sizes of the individual datasets vary between 4,000 and 50,000 (see supplementary for details on data sizes and train-test splits).

Secondary Structure Prediction (Structure Task): Secondary structure (SS) is critical to understanding the function and stability of a protein, and SS prediction is an important intermediate step in designing protein complexes. This dataset, obtained from [28], has 8,678 data samples. It is derived from the CB513 dataset, and each amino acid in a protein sequence is mapped to one of three labels (helix, strand, or other). The benchmark for this task is a transformer that reports a best performance of 80% accuracy.

Solubility: This task takes an input protein and maps it to a label of soluble or non-soluble. Determining the solubility of proteins is useful when designing proteins or evaluating their function for particular cellular tasks. This dataset, obtained from [29], has 16,253 data samples. The benchmark is a pretrained transformer that achieves a best performance of 91% on a binary classification task.

Antigen Affinity (Protein Engineering): Therapeutic antibody development requires the selection and engineering of molecules with high affinity and other drug-like biophysical properties. This dataset, obtained from [14], has 4,000 data samples. The task is to map an input protein to an on-target or off-target label, which corresponds to predicting antigen and non-specific binding. The benchmark for this task is a Linear Discriminant Analysis model with Spearman’s ρ values for antigen binding (0.87) and for non-specific binding (0.67).

Antimicrobial Prediction (AMP) (Property Task): Determining the antimicrobial nature of a peptide is a critical step in developing antimicrobials to fight against resistant pathogens. The dataset, obtained from [3], consists of 6,489 labeled protein sequences, each mapped to a label of AMP or non-AMP. The original model trained on this data provides a de novo approach for discovering new, broad-spectrum, and low-toxicity antimicrobials. The benchmark for this task is a transformer that reports a best performance of 88% accuracy with a pretrained classifier.

Toxicity (Property Task): Improving the functional profile of molecules, especially in the context of drug discovery, requires optimizing for toxicity and other physicochemical properties. To that end, toxicity is an important property to predict in AMP development. This dataset, obtained from [3], consists of 8,153 antimicrobial peptide sequences which are either toxic (positive class) or non-toxic (negative class). The benchmark for this task is a transformer that reports a best performance of 93.78% accuracy with a pretrained classifier.

Regression

Stability (Protein Engineering Task): This is a regression task in which each protein is mapped to a real-valued label based on maintaining its fold beyond a threshold of concentration. This dataset, obtained from [30], has 21,446 data samples. Stability is an important protein engineering task, as we can use this fold concentration to test protein inputs such that design candidates are stable in the settings of different tasks. The benchmark for this task is a transformer that reports a best performance of 0.73 Spearman’s ρ.
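Because the stability benchmark is reported as Spearman's rank correlation, a short sketch of that metric may help. It ranks both lists (averaging ranks over ties) and then takes the Pearson correlation of the ranks. The function names and the toy measured and predicted values are illustrative assumptions, not data from the stability dataset.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    rank_x, rank_y = average_ranks(xs), average_ranks(ys)
    n = len(rank_x)
    mean_x, mean_y = sum(rank_x) / n, sum(rank_y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rank_x, rank_y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in rank_x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in rank_y))
    return cov / (std_x * std_y)


# Toy example: predictions that swap two neighbouring pairs of proteins.
measured = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [2.0, 1.0, 4.0, 3.0, 5.0]
rho = spearman_rho(measured, predicted)
print(f"Spearman's rho: {rho:.2f}")
```

Since only the ordering matters, a model can score well on this metric even when its predicted values are on a different scale from the measured ones.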

Homology (Evolutionary Understanding Task): This is a sequence classification task where each input protein is mapped to a label representing its protein fold. This dataset, obtained from [31], has 12,312 data samples. Detecting homologs is particularly important in a biomedical context as they inform structural similarity across a set of sequences, and can indicate emerging resistance of antibiotic genes [cite]. The original model removes entire homologous groups during model training, thereby enforcing that models generalize well when large evolutionary gaps are introduced. The benchmark for this task is an LSTM that reports a best performance of 26% Top-1 Accuracy.

R2DL Settings and Hyperparameter Details

AMP

The full AMP dataset size is 8112, we use a training set size of 6489 and a test set size of 812. We use the norm in our objective function, 10,000 k-SVD iterations and .

Toxicity

The full Toxicity dataset size is 10,192, we use a training set size of 8153 and a test set size of 1020. We use the norm in our objective function, 10,000 k-SVD iterations and .

Secondary Structure

The full Secondary Structure dataset size is 9270, we use a training set size of 7416 and a test set size of 1854. We use the norm in our objective function, 9,000 k-SVD iterations and .

Stability

The full Stability dataset size is 56,126, we use a training set size of 44,900 and a test set size of 11,226. We use the norm in our objective function, 6,000 k-SVD iterations and .

Homology

The full Homology dataset size is 13,048, we use a training set size of 10,438 and a test set size of 2,610. We use the norm in our objective function, 4,000 k-SVD iterations and .

Solubility

The full Solubility dataset size is 43,876, we use a training set size of 35,100 and a test set size of 8,775. We use the norm in our objective function, 9,000 k-SVD iterations and .

Data and Code Availability

Links to protein sequence data and code are available on GitHub (github.com/riavinod/r2dl).

References

Supplementary Information

Protein Task | Source Model | Source Task | Regression or Classification | Source Labels | Target Labels
Antimicrobial | Transformer | Sentiment Classification | Classification | Positive, Negative | AMP, non-AMP
Toxicity | Transformer | Sentiment Classification | Classification | Positive, Negative | Toxic, non-Toxic
Secondary Structure | Transformer | Sentiment Classification | Classification | Positive, Neutral, Negative | Helix, Strand, Other
Stability | Transformer | Sentiment Classification | Regression | - | -
Homology | Transformer | Sentiment Classification | Regression | - | -
Solubility | Transformer | Named Entity Recognition | Classification | Positive, Negative | Soluble, non-Soluble
Binding | Transformer | Sentiment Classification | Classification | Positive, Negative | On-target, Off-target
Table 1: Summary of the source and target tasks for reprogramming
Figure 6: Summary of protein prediction tasks and evaluation metrics with model performance.

Model Baselines

Attribute | Train | Valid | Test | Majority Class Accuracy | Test Accuracy
{Toxic, non-Toxic} | 8153 | 1019 | 1020 | 0.82 | 0.93
{AMP, non-AMP} | 6489 | 811 | 812 | 0.82 | 0.88
Table 2: Toxicity and Antimicrobial-nature reported in [3].
Task | Model | Accuracy Metric | Test Accuracy
Secondary Structure Prediction | One Hot + Alignment | Accuracy (3-class) | 0.80
Remote Homology Detection | LSTM | Top 1 Accuracy | 0.26
Stability | Transformer | Spearman’s Rho | 0.73
Table 3: Structure prediction, Remote Homology, Stability reported in [6].
Task | Model | Test Accuracy
Solubility | ProtT5-XL-UniRef50 | 0.91
Table 4: Solubility reported in [45]
Task | Model | Test Accuracy
Antibody Affinity | Linear Discriminant Analysis | 0.92
Table 5: Antibody Affinity Binding reported in [14].

R2DL Results

Source Model AMP Sequence Samples k-SVD Iterations Training Accuracy Test Accuracy
BERT (Bidirectional Transformer) 6489 100 87.12 85.64
BERT (Bidirectional Transformer) 6489 250 85.67 82.33
Bi-LSTM 6489 100 79.40 81.90
Table 6: R2DL: AMP Classification
Source Model AMP Sequence Samples k-SVD Iterations Test Accuracy
BERT (Bidirectional Transformer) 8153 100 87.23
BERT (Bidirectional Transformer) 8153 250 86.93
Bi-LSTM 8153 100 81.25
Table 7: R2DL: Toxicity Prediction
Source Model Training Samples k-SVD Iterations Training Accuracy Test Accuracy
BERT 8,678 10000 71.47 63.65
BERT 8,678 15000 74.34 69.91
BERT 8,678 20000 76.32 74.92
Table 8: R2DL: Secondary Structure Prediction
Source Model Training Samples k-SVD Iterations Training Accuracy Test Accuracy
BERT 12,312 10000 11.34 10.76
BERT 12,312 15000 16.45 15.67
BERT 12,312 20000 26.23 24.50
Table 9: R2DL: Remote Homology Detection (Top-1 Accuracy)
Source Model Training Samples k-SVD Iterations Training Accuracy Test Accuracy
BERT 53,679 10000 60.23 61.89
BERT 53,679 15000 68.62 67.20
BERT 53,679 20000 70.78 69.73
Table 10: R2DL: Stability (Spearman’s Rho)
Source Model Training Samples k-SVD Iterations Training Accuracy Test Accuracy
BERT 21,446 10000 61.29 52.82
BERT 21,446 15000 61.02 59.46
BERT 21,446 20000 70.90 62.34
Table 11: R2DL: Fluorescence (Spearman’s Rho)
Source Model Training Samples k-SVD Iterations Training Accuracy Test Accuracy
TinyBERT 6623 10000 68.93 69.82
TinyBERT 6623 15000 87.22 89.3
TinyBERT 6623 20000 92.85 93.21
Table 12: R2DL: Solubility

R2DL Results from the Reduced Training Data Setting

Restricted Training Data Setting

To further investigate the efficacy of the transfer learning approach, we compare the performance of R2DL with that of the model trained from scratch with AMP data, using a restricted training data set. The test accuracy across tasks indicates that R2DL generally performs better when fewer labeled training data samples are available. Below 25% of the training data samples, both methods do approximately worse than random prediction, so we do not reduce the training data below this threshold.
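One way to build the reduced training sets is to shuffle the training data once and take prefixes of increasing length, so that each smaller set is contained in the next larger one. Whether the original experiments used nested subsets is not stated, so this is an assumption. The sizes below are the toxicity sizes from Table 13, and the helper name `nested_subsets` and the integer sample IDs are placeholders.

```python
import random


def nested_subsets(samples, sizes, seed=0):
    """Shuffle once with a fixed seed, then return one prefix per size."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    return {size: shuffled[:size] for size in sorted(sizes)}


# Placeholder sample IDs standing in for labeled peptide sequences.
train_ids = list(range(8153))
toxicity_sizes = [5000, 6000, 7000, 8153]

subsets = nested_subsets(train_ids, toxicity_sizes)
for size, subset in subsets.items():
    fraction = size / len(train_ids)
    print(f"{size} samples ({fraction:.0%} of the training set)")
```

Fixing the seed keeps the subsets identical across the R2DL and Bi-LSTM runs, so differences in test accuracy reflect the models rather than the sampled data.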

Task Training Samples R2DL Test Accuracy Bi-LSTM Test Accuracy
Toxicity Prediction 5000 42.12 37.34
Toxicity Prediction 6000 62.98 49.62
Toxicity Prediction 7000 86.23 82.78
Toxicity Prediction 8153 89.34 93.7
Table 13: Restricted Data Setting: Toxicity Prediction
Task Training Samples R2DL Test Accuracy Bi-LSTM Test Accuracy
AMP Prediction 3500 59.82 64.52
AMP Prediction 4500 72.76 68.41
AMP Prediction 5500 84.17 74.34
AMP Prediction 6489 90.01 88.0
Table 14: Restricted Data Setting: AMP Prediction
Task Training Samples R2DL Test Accuracy Bi-LSTM Test Accuracy
Structure Prediction 3378 12.09 06.23
Structure Prediction 4478 34.26 37.93
Structure Prediction 6678 69.28 66.34
Structure Prediction 8678 84.14 78.0
Table 15: Restricted Data Setting: Secondary Structure Prediction (SSP)
Task Training Samples R2DL Test Accuracy Bi-LSTM Test Accuracy
Homology 4312 09.35 03.69
Homology 8312 17.26 15.93
Homology 10312 23.23 22.34
Homology 12312 24.14 26.0
Table 16: Restricted Data Setting: Remote Homology Detection
Task Training Samples R2DL Test Accuracy Bi-LSTM Test Accuracy
Fluorescence 10769 12.09 06.23
Fluorescence 25769 34.26 37.93
Fluorescence 45769 69.28 66.34
Fluorescence 53769 66.34 68.0
Table 17: Restricted Data Setting: Fluorescence
Task Training Samples R2DL Test Accuracy Bi-LSTM Test Accuracy
Solubility 2500 011.0 07.23
Solubility 4000 47.26 39.93
Solubility 5200 85.23 87.34
Solubility 6623 94.0 93.1
Table 18: Restricted Data Setting: Solubility Prediction