Taming Large-Scale Genomic Analyses via Sparsified Genomics
Searching for similar genomic sequences is an essential and fundamental step in biomedical research and an overwhelming majority of genomic analyses. State-of-the-art computational methods performing such comparisons fail to cope with the exponential growth of genomic sequencing data. We introduce the concept of sparsified genomics where we systematically exclude a large number of bases from genomic sequences and enable much faster and more memory-efficient processing of the sparsified, shorter genomic sequences, while providing similar or even higher accuracy compared to processing non-sparsified sequences. Sparsified genomics provides significant benefits to many genomic analyses and has broad applicability. We show that sparsifying genomic sequences greatly accelerates the state-of-the-art read mapper (minimap2) by 1.54-8.8x using real Illumina, HiFi, and ONT reads, while providing a higher number of mapped reads and more detected small and structural variations. Sparsifying genomic sequences makes containment search through very large genomes and very large databases 72.7-75.88x faster and 723.3x more storage-efficient than searching through non-sparsified genomic sequences (with CMash and KMC3). Sparsifying genomic sequences enables robust microbiome discovery by providing 54.15-61.88x faster and 720x more storage-efficient taxonomic profiling of metagenomic samples over the state-of-art tool (Metalign). We design and open-source a framework called Genome-on-Diet as an example tool for sparsified genomics, which can be freely downloaded from https://github.com/CMU-SAFARI/Genome-on-Diet.
READ FULL TEXT