A Distributed Alignment-free Pipeline for Human SNPs Genotyping

Di Rocco L.; Ferraro Petrillo U.
2023

Abstract

Identifying known genetic traits and disease-related variants in an individual relies on a fundamental task: genotyping a set of variants from a database. The efficiency of this process, however, is challenged by the growing volume of sequencing data and variant databases. At such scale, even the fastest available genotyping tool may take an unacceptably long time to deliver a result. To address this issue, we present SparkGeno, the first known distributed alignment-free pipeline for genotyping Single Nucleotide Polymorphisms (SNPs). Building upon a distributed reformulation of traditional alignment-free genotyping pipelines, and using the Apache Spark framework, we introduce several optimizations that further improve the performance of our code in a distributed environment. Our pipeline comes in two versions that employ different data structures, making them suitable for datasets with different numbers of SNPs. We also present an experimental analysis on widely studied datasets to assess how distributed computing enables a fast, accurate, and scalable solution for large-scale genotyping. Finally, we report an additional experiment that validates the effectiveness of the signature-based approach we use to perform genotyping. Our results show that SparkGeno, when run on a distributed system, genotypes variants from whole-genome sequencing data orders of magnitude faster than existing tools, and that its performance scales with the number of available computational units. This makes SparkGeno a promising solution for large-scale genotyping applications, such as precision medicine and population-scale studies.
14th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics
snp detection; distributed computing; genotyping
04 Publication in conference proceedings::04b Conference paper in volume
A Distributed Alignment-free Pipeline for Human SNPs Genotyping / Di Rocco, L.; Ferraro Petrillo, U. - (2023), pp. 1-8. (Paper presented at the 14th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, held in Houston) [10.1145/3584371.3612990].
Files attached to this item

File: Ferraro Petrillo_distibuted-alignment-free_2024.pdf
Access: open access
Type: Publisher's version (published version with the publisher's layout)
License: Creative Commons
Size: 747.79 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11573/1692306