Portability of Scientific Workflows in NGS Data Analysis: A Case Study

06/04/2020
by Christopher Schiefer, et al.

The analysis of next-generation sequencing (NGS) data requires complex computational workflows consisting of dozens of autonomously developed yet interdependent processing steps. Whenever large amounts of data need to be processed, these workflows must be executed on parallel and/or distributed systems to ensure reasonable runtimes. Porting a workflow developed for a particular system on a particular hardware infrastructure to another system or to another infrastructure is non-trivial, which poses a major impediment to the scientific necessities of workflow reproducibility and workflow reusability. In this work, we describe our efforts to port a state-of-the-art workflow for the detection of specific variants in whole-exome sequencing of mice. The workflow was originally developed in the scientific workflow system Snakemake for execution on a high-performance cluster controlled by Sun Grid Engine. In the project, we ported it to the scientific workflow system SaasFee, which can execute workflows on (multi-core) stand-alone servers or on clusters of arbitrary size using Hadoop. The purpose of this port was to enable owners of low-cost hardware infrastructures, for which Hadoop was designed, to use the workflow as well. Although both the source and the target system are called scientific workflow systems, they differ in numerous aspects, ranging from the workflow languages to the scheduling mechanisms and the file access interfaces. These differences resulted in various problems, some expected and more of them unexpected, that had to be resolved before the workflow could be run with equal semantics. As a side effect, we also report cost/runtime ratios for a state-of-the-art NGS workflow on very different hardware platforms: a comparatively cheap stand-alone server (80 threads), a mid-cost, mid-sized cluster (552 threads), and a high-end HPC system (3784 threads).
