A Distributed Alignment-free Pipeline for Human SNPs Genotyping / Di Rocco, L.; Ferraro Petrillo, U. - (2023), pp. 1-8. (Paper presented at the 14th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, held in Houston) [10.1145/3584371.3612990].
A Distributed Alignment-free Pipeline for Human SNPs Genotyping
Di Rocco L.;Ferraro Petrillo U.
2023
Abstract
Identifying known genetic traits and disease-related variants in an individual depends on a fundamental task: genotyping a set of variants from a database. The efficiency of this process is challenged by the growing volume of sequencing data and of variant databases. At such scale, even the fastest available genotyping tool may take an unacceptably long time to deliver a result. To address this issue, we present SparkGeno, the first known distributed alignment-free pipeline for genotyping Single Nucleotide Polymorphisms (SNPs). Building upon a distributed reformulation of traditional alignment-free genotyping pipelines and using the Apache Spark framework, we introduce several optimizations that further improve the performance of our code in a distributed environment. The pipeline comes in two versions that employ different data structures, each suited to datasets with a different number of SNPs. We present an experimental analysis on widely studied datasets to assess how distributed computing enables fast, accurate and scalable large-scale genotyping. We also report an additional experiment validating the effectiveness of the signature-based approach used to perform genotyping. Our results show that SparkGeno, when run on a distributed system, genotypes variants from whole-genome sequencing data orders of magnitude faster than existing tools, and scales with the number of available computational units. This makes SparkGeno a promising solution for large-scale genotyping applications such as precision medicine and population-scale studies.

File | Size | Format
---|---|---
Ferraro Petrillo_distibuted-alignment-free_2024.pdf (open access; type: publisher's version, i.e. the published version with the publisher's layout; license: Creative Commons) | 747.79 kB | Adobe PDF
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.