An Extensible, Scalable Spark Platform for Alignment-free Genomic Analysis – Version 2

05/02/2020
by Umberto Ferraro Petrillo, et al.

Motivation: Alignment-free distance and similarity functions (AF functions, for short) are a computationally convenient alternative to pairwise and multiple sequence alignments for many genomic, metagenomic and epigenomic tasks. Yet their use is still at the proof-of-principle stage: only recently has a benchmarking study coherently evaluated a handful of the functions proposed over the years, identifying a pool of well-performing ones. However, more is needed to make this pool usable on a day-to-day basis. In particular, a quantification of the statistical significance of the output of a given function would greatly help when no reference point is available. For most functions, such an analysis is bound to be based on Monte Carlo Hypothesis Test simulations, yielding a dramatic increase in computational time that turns this into a Big Data problem. Surprisingly, this problem has hardly been considered, despite the increasing popularity of Big Data technologies in Computational Biology.

Results: We fill this important gap by providing the first user-friendly, extensible, efficient Spark platform for alignment-free genomic analysis. Thanks to its scalability, Monte Carlo Hypothesis Test simulations on the output of AF functions can be afforded seamlessly for both small and huge collections of sequences. Thus, we are able to study AF functions comparatively, for the first time, in relation to the statistical significance of their output. This novel analysis allows us to reduce the pool of well-performing functions coming from the benchmarking study to a handful of them.
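To illustrate why Monte Carlo significance testing multiplies the cost of an AF analysis, the following is a minimal single-machine sketch in plain Python, not the platform's Spark implementation. Everything in it is an assumption made for illustration: the function names, the choice of squared Euclidean distance over k-mer counts as the AF function, and the null model (independently shuffling each sequence, which preserves its base composition). Each trial recomputes the AF function on a fresh pair of synthetic sequences, so one p-value costs roughly `trials` times a single comparison.

```python
import random
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def squared_euclidean(a, b, k=3):
    """Illustrative AF distance: squared Euclidean distance of k-mer counts."""
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    # A Counter returns 0 for k-mers it has not seen.
    return sum((ca[w] - cb[w]) ** 2 for w in set(ca) | set(cb))


def shuffled(seq, rng):
    """Assumed null model: permute the characters of seq."""
    chars = list(seq)
    rng.shuffle(chars)
    return "".join(chars)


def monte_carlo_p_value(a, b, k=3, trials=200, seed=0):
    """Estimate how often null pairs look at least as similar as (a, b)."""
    rng = random.Random(seed)
    observed = squared_euclidean(a, b, k)
    hits = 0
    for _ in range(trials):
        # One full AF computation per trial: this loop is the expensive part.
        null_distance = squared_euclidean(shuffled(a, rng), shuffled(b, rng), k)
        if null_distance <= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (trials + 1)
```

For a collection of n sequences this loop would run for each of the n(n-1)/2 pairs, which gives a sense of why the abstract treats the task as a Big Data problem. Because the trials are independent, they are a natural fit for distribution across a cluster.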
