An Extensible, Scalable Spark Platform for Alignment-free Genomic Analysis – Version 2

by Umberto Ferraro Petrillo, et al.

Motivation: Alignment-free distance and similarity functions (AF functions, for short) are a computationally convenient alternative to pairwise and multiple sequence alignment for many genomic, metagenomic and epigenomic tasks. Yet their use is still at the proof-of-principle stage: only recently has a benchmarking study coherently evaluated a handful of the functions proposed over the years, identifying a pool of well-performing ones. More is needed, however, to make this pool usable on a day-to-day basis. In particular, quantifying the statistical significance of the output of a given function would greatly help when no reference point is available. For most functions, such an analysis must rely on Monte Carlo Hypothesis Test simulations, which increase computational time so dramatically that the task becomes a Big Data problem. Surprisingly, this problem has hardly been considered, despite the increasing popularity of Big Data technologies in Computational Biology.

Results: We fill this gap by providing the first user-friendly, extensible, efficient Spark platform for alignment-free genomic analysis. Thanks to its scalability, Monte Carlo Hypothesis Test simulations on the output of AF functions become affordable for both small and huge collections of sequences. We are thus able to study comparatively, for the first time, AF functions in relation to the statistical significance of their output. This novel analysis allows us to reduce the pool of well-performing functions identified by the benchmarking study to a handful.
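To make the idea of a Monte Carlo Hypothesis Test on the output of an AF function concrete, the following is a minimal single-machine sketch in Python. It is not the platform's implementation and does not use Spark. The choice of AF function (squared Euclidean distance between k-mer count vectors), the null model (random shuffling of one sequence, which preserves its base composition), and all function names are assumptions made for illustration only.

```python
import random
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def squared_euclidean(seq_a, seq_b, k=3):
    """Illustrative AF distance: squared Euclidean distance between
    the k-mer count vectors of the two sequences (an assumed example)."""
    ca, cb = kmer_counts(seq_a, k), kmer_counts(seq_b, k)
    # A Counter returns 0 for k-mers it has not seen.
    return sum((ca[w] - cb[w]) ** 2 for w in ca.keys() | cb.keys())


def shuffled(seq, rng):
    """Assumed null model: a random permutation of the sequence."""
    chars = list(seq)
    rng.shuffle(chars)
    return "".join(chars)


def monte_carlo_p_value(seq_a, seq_b, distance, n_sim=999, seed=0):
    """Estimate how often a null sequence is at least as close to seq_a
    as seq_b is. Small values suggest the observed distance is unusually low."""
    rng = random.Random(seed)
    observed = distance(seq_a, seq_b)
    as_extreme = sum(
        distance(seq_a, shuffled(seq_b, rng)) <= observed
        for _ in range(n_sim)
    )
    # The +1 terms count the observed value itself, so the estimate is never 0.
    return (1 + as_extreme) / (n_sim + 1)
```

Each call evaluates the AF function once per simulated sequence, so a collection of many sequence pairs with hundreds or thousands of simulations per pair multiplies the cost of a plain AF comparison by the same factor. This is the growth in computation that motivates a distributed platform.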


