OpenProteinSet: Training data for structural biology at scale

08/10/2023
by Gustaf Ahdritz, et al.

Multiple sequence alignments (MSAs) of proteins encode rich biological information and have been workhorses in bioinformatic methods for tasks like protein design and protein structure prediction for decades. Recent breakthroughs like AlphaFold2 that use transformers to attend directly over large quantities of raw MSAs have reaffirmed their importance. Generation of MSAs is highly computationally intensive, however, and no datasets comparable to those used to train AlphaFold2 have been made available to the research community, hindering progress in machine learning for proteins. To remedy this problem, we introduce OpenProteinSet, an open-source corpus of more than 16 million MSAs, associated structural homologs from the Protein Data Bank, and AlphaFold2 protein structure predictions. We have previously demonstrated the utility of OpenProteinSet by successfully retraining AlphaFold2 on it. We expect OpenProteinSet to be broadly useful as training and validation data for 1) diverse tasks focused on protein structure, function, and design and 2) large-scale multimodal machine learning research.

research
06/24/2022

PSP: Million-level Protein Sequence Dataset for Protein Structure Prediction

Proteins are essential component of human life and their structures are ...
research
02/01/2019

ProteinNet: a standardized data set for machine learning of protein structure

Rapid progress in deep learning has spurred its application to bioinform...
research
11/19/2019

PDBMine: A Reformulation of the Protein Data Bank to Facilitate Structural Data Mining

Large scale initiatives such as the Human Genome Project, Structural Gen...
research
07/13/2020

ProteiNN: Intrinsic-Extrinsic Convolution and Pooling for Scalable Deep Protein Analysis

Proteins perform a large variety of functions in living organisms, thus ...
research
10/03/2020

Decoy Selection for Protein Structure Prediction Via Extreme Gradient Boosting and Ranking

Identifying one or more biologically-active/native decoys from millions ...
research
01/15/2019

Comparing two deep learning sequence-based models for protein-protein interaction prediction

Biological data are extremely diverse, complex but also quite sparse. Th...
research
12/23/2019

BioConceptVec: creating and evaluating literature-based biomedical concept embeddings on a large scale

Capturing the semantics of related biological concepts, such as genes an...
