Pre-training Co-evolutionary Protein Representation via A Pairwise Masked Language Model

10/29/2021
by Liang He, et al.

Understanding protein sequences is vital and urgent for biology, healthcare, and medicine. Labeling approaches are expensive and time-consuming, while the amount of unlabeled data is growing much faster than that of labeled data thanks to low-cost, high-throughput sequencing methods. To extract knowledge from these unlabeled data, representation learning is of significant value for protein-related tasks and has great potential for helping us learn more about protein functions and structures. The key problem in protein sequence representation learning is to capture the co-evolutionary information reflected by inter-residue co-variation in the sequences. Instead of leveraging multiple sequence alignment (MSA) as is usually done, we propose a novel method to capture this information directly by pre-training via a dedicated language model, i.e., the Pairwise Masked Language Model (PMLM). In a conventional masked language model, the masked tokens are modeled by conditioning on the unmasked tokens only, but are predicted independently of each other. In contrast, the proposed PMLM takes the dependency among masked tokens into consideration, i.e., the probability of a token pair is not equal to the product of the probabilities of the two individual tokens. With this model, the pre-trained encoder generates a better representation for protein sequences. Our results show that the proposed method effectively captures inter-residue correlations and improves contact prediction performance by up to 9% compared to the MLM baseline under the same setting. The proposed model also significantly outperforms the MSA baseline by more than 7% on the contact prediction benchmark when pre-trained on a subset of the sequence database from which the MSA is generated, revealing the potential of sequence pre-training to surpass MSA-based methods in general.
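To make the pairwise idea concrete, the sketch below shows one way a pairwise masked-LM head could score the joint distribution over a pair of masked residues instead of two independent marginals. This is an illustrative PyTorch sketch, not the authors' released implementation: the class name PairwiseMLMHead, the concatenation-based projection, and all dimensions are assumptions made for the example.

```python
import torch
import torch.nn as nn

class PairwiseMLMHead(nn.Module):
    """Illustrative pairwise masked-LM head (not the authors' code).

    For a pair of masked positions (i, j), the head produces logits over
    all |V| x |V| token pairs, so p(x_i, x_j | context) does not have to
    factorize into p(x_i | context) * p(x_j | context).
    """

    def __init__(self, hidden_dim: int, vocab_size: int):
        super().__init__()
        self.vocab_size = vocab_size
        # Map the concatenated pair of hidden states to joint pair logits.
        self.pair_proj = nn.Linear(2 * hidden_dim, vocab_size * vocab_size)

    def forward(self, hidden, idx_i, idx_j):
        # hidden: (batch, seq_len, hidden_dim) encoder outputs
        # idx_i, idx_j: (batch,) positions of the two masked residues
        batch = torch.arange(hidden.size(0))
        h_i = hidden[batch, idx_i]                 # (batch, hidden_dim)
        h_j = hidden[batch, idx_j]                 # (batch, hidden_dim)
        pair = torch.cat([h_i, h_j], dim=-1)       # (batch, 2*hidden_dim)
        logits = self.pair_proj(pair)              # (batch, |V|*|V|)
        return logits.view(-1, self.vocab_size, self.vocab_size)


# Usage sketch: cross-entropy over the flattened joint label a_i*|V| + a_j,
# where a_i and a_j are the true amino acids at the two masked positions.
head = PairwiseMLMHead(hidden_dim=768, vocab_size=25)
hidden = torch.randn(4, 128, 768)                  # dummy encoder output
idx_i = torch.tensor([3, 10, 7, 50])
idx_j = torch.tensor([20, 40, 90, 100])
joint_logits = head(hidden, idx_i, idx_j)          # (4, 25, 25)
labels_i = torch.randint(0, 25, (4,))
labels_j = torch.randint(0, 25, (4,))
loss = nn.functional.cross_entropy(
    joint_logits.view(4, -1), labels_i * 25 + labels_j
)
```

Training such a head jointly with a standard Transformer encoder is what lets the encoder absorb inter-residue dependencies directly from raw sequences, which is the property the contact-prediction results above probe.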

