HPC-CleanSeq: A Tool for Contamination Removal in Big RNA-Seq Datasets / Liberati, Franco; Giannelli, Federico; Bottoni, Paolo; Castrignanò, Tiziana. - 15546 (2025), pp. 282-293. (12th International Conference on Big Data Analytics, BDA 2024, Japan) [10.1007/978-3-031-86193-2_18].
HPC-CleanSeq: A Tool for Contamination Removal in Big RNA-Seq Datasets
Liberati, Franco; Bottoni, Paolo
2025
Abstract
RNA is a key cellular molecule in gene expression, the process that lets an organism adapt to different environments and developmental stages. RNA-seq, a massively parallel sequencing technique, enables the identification and quantification of the genes that are active in specific tissues or conditions. Transcriptomic research projects now often reach hundreds of gigabytes, placing this analysis within the realm of Big Data. However, the RNA-seq protocol is highly sensitive to contamination, which can degrade data quality and distort analysis outcomes. Contaminants are generally classified as either exogenous, such as bacteria, viruses, or fungi from external sources, or endogenous, such as ribosomal RNA (rRNA) sequences that are not the target of the analysis. Although several tools exist for cleaning transcriptomic data, most handle large datasets inefficiently, requiring extensive computational resources and long processing times. This paper introduces HPC-CleanSeq, a bioinformatics pipeline designed to automate contaminant removal from RNA-seq data on High-Performance Computing (HPC) systems. At the core of HPC-CleanSeq is Centrifuge, a well-established tool for identifying and classifying DNA or RNA sequences from complex samples. HPC-CleanSeq is especially suited to large-scale metagenomic and RNA-seq studies, enabling researchers to quickly detect the organisms (e.g., bacteria, viruses, fungi) present in a biological sample. The pipeline offers an intuitive interface that lets users configure settings, manage HPC scripts, and visualize results locally. With HPC-CleanSeq, researchers can upload FASTQ files, start contaminant removal, and obtain clean data without specialized computational skills, making advanced RNA-seq analysis accessible to a broader scientific community.
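As a rough illustration of the kind of contaminant removal the abstract describes, the Python sketch below filters a FASTQ file using a Centrifuge-style classification table. It is not the HPC-CleanSeq implementation, which this record does not detail. The function names `reads_to_drop` and `filter_fastq`, the `keep_taxids` parameter, and the toy rows are all assumptions made for illustration. The sketch also assumes that read IDs in the table match the first whitespace-delimited token of each FASTQ header, and it compares taxonomy IDs directly instead of resolving taxonomic lineages as a real pipeline would likely need to do.

```python
import csv
import io
from itertools import islice


def reads_to_drop(classification, keep_taxids):
    """Collect IDs of reads classified only to taxa outside keep_taxids.

    `classification` is a file-like object holding a tab-separated table with
    at least the columns readID and taxID (taxID 0 means unclassified).
    """
    taxids_by_read = {}
    for row in csv.DictReader(classification, delimiter="\t"):
        taxids_by_read.setdefault(row["readID"], set()).add(int(row["taxID"]))
    return {
        read_id
        for read_id, taxids in taxids_by_read.items()
        # Drop a read if it has a real assignment and none is a kept taxon.
        if (taxids - {0}) and taxids.isdisjoint(keep_taxids)
    }


def filter_fastq(fastq, drop_ids):
    """Yield four-line FASTQ records whose read ID is not in drop_ids."""
    while True:
        record = list(islice(fastq, 4))
        if len(record) < 4:
            break
        read_id = record[0][1:].split()[0]  # text after '@', up to whitespace
        if read_id not in drop_ids:
            yield "".join(record)


# Toy data (hypothetical): one human read, one E. coli read, one unclassified.
CLASSIFICATION = (
    "readID\tseqID\ttaxID\tscore\t2ndBestScore\thitLength\tqueryLength\tnumMatches\n"
    "r1\tNC_000001.11\t9606\t4225\t0\t80\t80\t1\n"
    "r2\tNC_000913.3\t562\t4225\t0\t80\t80\t1\n"
    "r3\tunclassified\t0\t0\t0\t0\t80\t1\n"
)
FASTQ = "@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nIIII\n@r3\nGGGG\n+\nIIII\n"

drop = reads_to_drop(io.StringIO(CLASSIFICATION), keep_taxids={9606})
clean = "".join(filter_fastq(io.StringIO(FASTQ), drop))
```

With human (taxonomy ID 9606) as the only kept taxon, the E. coli read `r2` is removed, while the human read and the unclassified read are retained. Whether unclassified reads should be kept is a policy choice that depends on the study.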


