PTransIPs: Identification of phosphorylation sites based on protein pretrained language model and Transformer
Phosphorylation is central to numerous fundamental cellular processes and influences the onset and progression of a variety of diseases. Identifying phosphorylation sites is therefore an important step toward understanding the molecular mechanisms of cells and viral infection, which may in turn reveal new therapeutic targets. In this study, we present PTransIPs, a novel deep learning model for the identification of phosphorylation sites. PTransIPs treats amino acids in protein sequences as words in natural language, extracting unique encodings based on the type and position of each amino acid in the sequence. It also incorporates embeddings from large pretrained protein language models as additional inputs. PTransIPs is then trained with a combination of a convolutional neural network with residual connections and a Transformer equipped with multi-head attention, and outputs classification results through a fully connected layer. Independent testing shows that PTransIPs outperforms existing state-of-the-art methods, achieving AUROCs of 0.9232 and 0.9660 for identifying phosphorylated S/T and Y sites, respectively. In addition, ablation studies show that the pretrained model embeddings contribute to the performance of PTransIPs. Furthermore, PTransIPs offers interpretable amino acid preferences, a visualizable training process, and generalizability to other bioactivity classification tasks. To facilitate usage, our code and data are publicly accessible at <https://github.com/StatXzy7/PTransIPs>.
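To make the described architecture concrete, the sketch below shows one possible PyTorch layout of the components named in the abstract: token and position embeddings, projected pretrained protein language model embeddings, a residual convolutional block, a Transformer encoder with multi-head attention, and a fully connected classifier. This is a minimal illustrative sketch, not the authors' implementation; all layer sizes, dimensions, and class names (`ResidualConvBlock`, `PhosphositeClassifier`, `pretrained_dim`, etc.) are assumptions. See the repository linked above for the actual code.

```python
# Minimal sketch of a PTransIPs-style model (illustrative only; hyperparameters
# and module names are assumptions, not the authors' exact implementation).
import torch
import torch.nn as nn

class ResidualConvBlock(nn.Module):
    """1D convolution with a residual (skip) connection over the sequence."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size, padding=kernel_size // 2)
        self.norm = nn.BatchNorm1d(channels)
        self.act = nn.ReLU()

    def forward(self, x):                 # x: (batch, channels, seq_len)
        return self.act(x + self.norm(self.conv(x)))

class PhosphositeClassifier(nn.Module):
    """Token/position embeddings + pretrained PLM embeddings -> residual CNN ->
    Transformer encoder (multi-head attention) -> fully connected classifier."""
    def __init__(self, vocab_size=25, seq_len=33, d_model=128,
                 pretrained_dim=1024, n_heads=8, n_layers=2, n_classes=2):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)            # amino acid "word" embedding
        self.pos_emb = nn.Embedding(seq_len, d_model)               # position embedding
        self.pretrained_proj = nn.Linear(pretrained_dim, d_model)   # project pretrained embeddings
        self.conv = ResidualConvBlock(d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                               batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=n_layers)
        self.fc = nn.Linear(d_model, n_classes)                     # final classification layer

    def forward(self, tokens, pretrained_emb):
        # tokens: (batch, seq_len) integer-encoded residues
        # pretrained_emb: (batch, seq_len, pretrained_dim) from a protein language model
        pos = torch.arange(tokens.size(1), device=tokens.device)
        x = self.tok_emb(tokens) + self.pos_emb(pos) + self.pretrained_proj(pretrained_emb)
        x = self.conv(x.transpose(1, 2)).transpose(1, 2)            # residual CNN over the sequence
        x = self.encoder(x)                                         # multi-head self-attention
        return self.fc(x.mean(dim=1))                               # pool over positions and classify

# Usage with random tensors of matching shapes:
model = PhosphositeClassifier()
tokens = torch.randint(0, 25, (4, 33))
plm_emb = torch.randn(4, 33, 1024)
logits = model(tokens, plm_emb)            # shape: (4, 2)
```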