Third-Generation Sequencing (TGS) technologies have enabled the extraction of longer nucleotide sequences than Next-Generation Sequencing (NGS) technologies, allowing for a deeper understanding of genome structure. Although long reads have become pivotal for untangling genetic complexity, they are prone to high sequencing error rates (ranging from 10% to 30%), making their correction a practical challenge in many computational genomics pipelines. This paper proposes a workflow, referred to as HyperC, designed to perform self-correction of long reads on a distributed-memory system. Leveraging the Message Passing Interface (MPI), our workflow introduces an efficient controller-worker communication pattern to coordinate multiple processes scattered across different computing nodes, which speeds up the polishing process on large input collections of long reads. We also present the results of an experimental analysis, conducted on an HPC infrastructure, assessing the ability of our workflow to exploit the resources of a distributed system while achieving much shorter execution times. These results suggest that HyperC is a promising solution for effectively scaling up the analysis of the growing volume of sequencing data, contributing significantly to the field of eScience.
A Distributed Workflow for Long Reads Self-correction / Ceccaroni, Riccardo; Di Rocco, Lorenzo; Ferraro Petrillo, Umberto; Brutti, Pierpaolo. - (2025), pp. 105-116. (Euro-Par 2024 International Workshops, Madrid, Spain) [10.1007/978-3-031-90203-1_10].
A Distributed Workflow for Long Reads Self-correction
Ceccaroni, Riccardo; Di Rocco, Lorenzo; Ferraro Petrillo, Umberto; Brutti, Pierpaolo
2025


