A Distributed Workflow for Long Reads Self-correction

Ceccaroni, Riccardo; Di Rocco, Lorenzo; Ferraro Petrillo, Umberto; Brutti, Pierpaolo

doi:10.1007/978-3-031-90203-1_10

Third-Generation Sequencing (TGS) technologies have enabled the extraction of longer nucleotide sequences than Next-Generation Sequencing (NGS) technologies, allowing for a deeper understanding of genome structure. Although such long reads have become pivotal for untangling genetic complexity, they are prone to high sequencing error rates (ranging from 10% to 30%), making their correction a practical challenge in many computational genomics pipelines. This paper proposes a workflow, referred to as HyperC, designed to perform long-read self-correction on a distributed-memory system. Leveraging the Message Passing Interface (MPI), our workflow introduces an efficient controller-worker communication pattern to coordinate multiple processes scattered across different computing nodes, speeding up the polishing process on large input collections of long reads. We also present the results of an experimental analysis, conducted on an HPC infrastructure, assessing the ability of our workflow to exploit the resources of a distributed system and achieve much shorter execution times. These results suggest that HyperC is a promising solution for effectively scaling up the analysis of the growing volume of sequencing data, contributing significantly to the field of eScience.
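The controller-worker pattern mentioned in the abstract can be illustrated with a small sketch. This is not HyperC's implementation, which the record does not describe: it is a generic version of the pattern in which Python threads and standard-library queues stand in for MPI processes and messages. The names `polish`, `worker`, and `run_controller` are hypothetical, and `polish` is a placeholder rather than a real correction algorithm.

```python
import queue
import threading

STOP = None  # sentinel message: tells a worker that no work remains


def polish(read: str) -> str:
    """Hypothetical placeholder for the per-read correction step."""
    return read.upper()


def worker(tasks: queue.Queue, results: queue.Queue) -> None:
    # Each worker keeps receiving tasks until it gets the STOP sentinel.
    while True:
        item = tasks.get()
        if item is STOP:
            break
        index, read = item
        results.put((index, polish(read)))


def run_controller(reads: list[str], n_workers: int = 3) -> list[str]:
    tasks: queue.Queue = queue.Queue()
    results: queue.Queue = queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(tasks, results))
        for _ in range(n_workers)
    ]
    for t in threads:
        t.start()
    # The controller sends one read per message, then one STOP per worker.
    for index, read in enumerate(reads):
        tasks.put((index, read))
    for _ in threads:
        tasks.put(STOP)
    for t in threads:
        t.join()
    # Workers finish in arbitrary order, so results are placed by index.
    corrected = [""] * len(reads)
    while not results.empty():
        index, read = results.get()
        corrected[index] = read
    return corrected
```

Because workers pull the next read as soon as they are free, faster workers naturally process more reads. In an MPI setting the queues would be replaced by point-to-point messages between a controller rank and worker ranks.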

A Distributed Workflow for Long Reads Self-correction / Ceccaroni, Riccardo; Di Rocco, Lorenzo; Ferraro Petrillo, Umberto; Brutti, Pierpaolo. - (2025), pp. 105-116. (Euro-Par 2024 International Workshops, Madrid, Spain) [10.1007/978-3-031-90203-1_10].

Year of publication: 2025
Conference name: Euro-Par 2024 International Workshops
Keywords: Third-Generation Sequencing; Long reads; Error correction; HyperC workflow; Message Passing Interface; High-Performance Computing
Type: 04 Publication in conference proceedings::04b Conference paper in volume

Use this identifier to cite or link to this document: https://hdl.handle.net/11573/1748806
