A Distributed Workflow for Long Reads Self-correction / Ceccaroni, Riccardo; Di Rocco, Lorenzo; Ferraro Petrillo, Umberto; Brutti, Pierpaolo. - (2025), pp. 105-116. (Euro-Par 2024 International Workshops, Madrid, Spain) [10.1007/978-3-031-90203-1_10].

A Distributed Workflow for Long Reads Self-correction

Ceccaroni, Riccardo; Di Rocco, Lorenzo; Ferraro Petrillo, Umberto; Brutti, Pierpaolo
2025

Abstract

Third-Generation Sequencing (TGS) technologies yield longer nucleotide sequences than Next-Generation Sequencing (NGS) technologies, allowing for a deeper understanding of genome structure. Although these long reads have become pivotal for untangling genetic complexity, they are prone to high sequencing error rates (ranging from 10% to 30%), which makes their correction a practical challenge in many computational genomics pipelines. This paper proposes a workflow, referred to as HyperC, for performing long-read self-correction on a distributed-memory system. Built on the Message Passing Interface (MPI), the workflow introduces an efficient controller-worker communication pattern that coordinates multiple processes spread across different computing nodes, speeding up the polishing process on large input collections of long reads. We also present the results of an experimental analysis, conducted on an HPC infrastructure, that assesses the ability of the workflow to exploit the resources of a distributed system and achieve much shorter execution times. These results suggest that HyperC is a promising solution for scaling up the analysis of the growing volume of sequencing data, contributing significantly to the field of eScience.
Year: 2025
Conference: Euro-Par 2024 International Workshops
Keywords: Third-Generation Sequencing; Long reads; Error correction; HyperC workflow; Message Passing Interface; High-Performance Computing
Type: 04 Publication in conference proceedings::04b Conference paper in volume
Use this identifier to cite or link to this document: https://hdl.handle.net/11573/1748806