RNA is a vital cellular molecule that mediates gene expression, allowing an organism to adapt to different environments and developmental stages. RNA-seq, a massively parallel sequencing technique, enables the identification and quantification of the genes that are active in a specific tissue or condition. Transcriptomic research projects now often reach hundreds of gigabytes, which places their analysis in the realm of Big Data. However, the RNA-seq protocol is highly sensitive to contamination, which can degrade data quality and distort analysis outcomes. Contaminants are generally classified as either exogenous (bacteria, viruses, or fungi from external sources, for example) or endogenous, such as ribosomal RNA (rRNA) sequences that are not part of the target sample. Although several tools exist for cleaning transcriptomic data, most handle large datasets inefficiently, requiring extensive computational resources and long processing times. This paper introduces HPC-CleanSeq, a bioinformatics pipeline designed to automate contaminant removal from RNA-seq data on High-Performance Computing (HPC) systems. At the core of HPC-CleanSeq is Centrifuge, a well-established tool for identifying and classifying DNA or RNA sequences from complex samples. HPC-CleanSeq is especially suited to large-scale metagenomic and RNA-seq studies, enabling researchers to quickly detect the organisms (e.g., bacteria, viruses, fungi) present in a biological sample. The pipeline offers an intuitive interface that lets users configure settings, manage HPC scripts, and visualize results locally. With HPC-CleanSeq, researchers can upload FASTQ files, start contaminant removal, and obtain clean data without specialized computational skills, making advanced RNA-seq analysis accessible to a broader scientific community.
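
The read-filtering step that such a pipeline relies on can be illustrated with a minimal Python sketch. It is not the HPC-CleanSeq implementation. It assumes that a Centrifuge run has already produced its tab-separated classification table (with `readID` and `taxID` columns) and that the set of contaminant taxonomy IDs is chosen by the user. The helper names `contaminant_read_ids` and `filter_fastq` and all toy data values are hypothetical.

```python
import csv
import io


def contaminant_read_ids(classification_tsv, contaminant_taxids):
    """Collect IDs of reads that the classifier assigned to a contaminant taxon."""
    reader = csv.DictReader(classification_tsv, delimiter="\t")
    return {row["readID"] for row in reader if int(row["taxID"]) in contaminant_taxids}


def filter_fastq(fastq_lines, drop_ids):
    """Yield 4-line FASTQ records whose read ID is not in drop_ids."""
    record = []
    for line in fastq_lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            # The header line is "@<read ID> <optional description>".
            read_id = record[0][1:].split()[0]
            if read_id not in drop_ids:
                yield record
            record = []


# Toy classification table in the Centrifuge column layout (values are made up).
classification = io.StringIO(
    "readID\tseqID\ttaxID\tscore\t2ndBestScore\thitLength\tqueryLength\tnumMatches\n"
    "read1\tNC_000913.3\t562\t4225\t0\t80\t80\t1\n"
    "read2\tunclassified\t0\t0\t0\t0\t80\t1\n"
)

# Toy FASTQ input with two reads.
fastq = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n")

# Assumption for this example: Escherichia coli (NCBI taxonomy ID 562) is a contaminant.
CONTAMINANT_TAXIDS = {562}

drop = contaminant_read_ids(classification, CONTAMINANT_TAXIDS)
kept = list(filter_fastq(fastq, drop))
kept_ids = [rec[0] for rec in kept]
print(kept_ids)
```

On this toy input only `read2` is kept, because `read1` is assigned to the taxon listed as a contaminant. A real dataset would additionally need paired-end handling and streaming from compressed files, which the sketch leaves out.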

HPC-CleanSeq: A Tool for Contamination Removal in Big RNA-Seq Datasets / Liberati, Franco; Giannelli, Federico; Bottoni, Paolo; Castrignanò, Tiziana. - 15546 (2025), pp. 282-293. (12th International Conference on Big Data Analytics, BDA 2024, Japan) [10.1007/978-3-031-86193-2_18].

HPC-CleanSeq: A Tool for Contamination Removal in Big RNA-Seq Datasets

Liberati, Franco; Giannelli, Federico; Bottoni, Paolo; Castrignanò, Tiziana
2025

12th International Conference on Big Data Analytics, BDA 2024
Keywords: bioinformatics; decontamination; HPC; RNA-seq
Publication type: 04 Publication in conference proceedings (04b Conference paper in volume)

Use this identifier to cite or link to this document: https://hdl.handle.net/11573/1752178