Scalable solutions for large-scale bioinformatics analysis: a critical study of Apache Spark Application in high-performance computational genomics / Di Rocco, Lorenzo. - (2025 Jan 27).

Scalable solutions for large-scale bioinformatics analysis: a critical study of Apache Spark Application in high-performance computational genomics

DI ROCCO, LORENZO
27/01/2025

Abstract

The evolution of sequencing platforms has transformed our ability to unravel the complexity of DNA, steadily deepening our understanding of the genetic structure of organisms. These technological advances, however, generate vast amounts of data that must be processed, stored, and interpreted. The growing volume of sequencing output has motivated the integration of computational genomics with supercomputing and artificial intelligence techniques, which helps meet computational challenges efficiently and extract meaningful insights from raw data, ultimately improving the speed and accuracy of genomic analysis. Even so, the potential of distributed computing in genomics has yet to be fully unlocked. While theoretically advantageous, distributing complex bioinformatics tasks is challenging, as it requires a deep understanding of distributed systems and advanced programming skills. This thesis leverages Apache Spark to propose distributed pipelines that address critical challenges in computational genomics involving large datasets. Apache Spark is a high-level framework that simplifies and accelerates the development of distributed solutions by managing the technical details of distribution internally. This makes it particularly appealing for developing user-friendly, cloud-compatible libraries that promote the adoption of distributed computing in computational genomics. Although this abstraction has proven successful in various real-world applications, it may face limitations on the highly complex, sequentially structured problems often encountered in computational genomics, where processing across non-shared memory systems poses unique challenges.
For this reason, this work also briefly examines the Message Passing Interface (MPI) as a low-level distributed computing model, highlighting its role in highly sequential genomics tasks where fine-grained control over task distribution is beneficial. Through extensive experimental evaluations, this thesis assesses the strengths and limitations of applying Apache Spark to large-scale problems in computational genomics. Each chapter focuses on a specific genomics application that is known to be data-intensive and where distributed computing could serve as a strategic resource. For each case, a pipeline is proposed and analyzed through experiments designed to evaluate scalability and identify potential bottlenecks.
27 Jan 2025
Files attached to this record
File: Tesi_dottorato_DiRocco.pdf
Access: open access
Note: complete thesis
Type: Doctoral thesis
License: All rights reserved
Size: 3.2 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11573/1745835