Alignment-free algorithms are used in bioinformatics to efficiently evaluate the similarity between pairs of genomic sequences. They work by extracting and aggregating features from the sequences under study and, then, by comparing them using alignment-free functions. When working on large collections of huge sequences, it is possible to improve the performance of these algorithms by executing the extraction and the aggregation steps, in parallel, across the computing nodes of a distributed system. In this work, we address the problem of finding the optimal schedule to use for assigning to computing nodes the features to be aggregated, in order to minimize the maximum aggregation time. For this purpose, we consider one exact mathematical programming approach and two approximated ones, based on the Longest Processing Time heuristic. These have been implemented using the Gurobi solver, and compared to the algorithm used by the Spark distributed computing framework for assigning tasks to computing nodes. The experiments have been performed on some large collections of genomic sequences well-known in literature. The results show that the proposed approaches perform favourably with respect to the scheduling strategy used by Spark, in terms of quality of the solution. In particular the exact approach, run up to the time limit, allows to reduce the makespan up to 77.11%, when considering the largest instance tested. However, the required computational time of the exact approach is not compatible with its online application. The approximated approaches appear more promising in terms of computational time, while providing good quality solutions.

Scheduling K-mers Counting in a Distributed Environment / Amorosi, L.; Di Rocco, L.; Ferraro Petrillo, U.. - (2022), pp. 73-83. (Intervento presentato al convegno International Conference on Optimization and Decision Sciences, ODS 2021 tenutosi a Rome, Italy) [10.1007/978-3-030-95380-5_7].

Scheduling K-mers Counting in a Distributed Environment

Amorosi L.;Di Rocco L.;Ferraro Petrillo U.
2022

Abstract

Alignment-free algorithms are used in bioinformatics to efficiently evaluate the similarity between pairs of genomic sequences. They work by extracting and aggregating features from the sequences under study and, then, by comparing them using alignment-free functions. When working on large collections of huge sequences, it is possible to improve the performance of these algorithms by executing the extraction and the aggregation steps, in parallel, across the computing nodes of a distributed system. In this work, we address the problem of finding the optimal schedule to use for assigning to computing nodes the features to be aggregated, in order to minimize the maximum aggregation time. For this purpose, we consider one exact mathematical programming approach and two approximated ones, based on the Longest Processing Time heuristic. These have been implemented using the Gurobi solver, and compared to the algorithm used by the Spark distributed computing framework for assigning tasks to computing nodes. The experiments have been performed on some large collections of genomic sequences well-known in literature. The results show that the proposed approaches perform favourably with respect to the scheduling strategy used by Spark, in terms of quality of the solution. In particular the exact approach, run up to the time limit, allows to reduce the makespan up to 77.11%, when considering the largest instance tested. However, the required computational time of the exact approach is not compatible with its online application. The approximated approaches appear more promising in terms of computational time, while providing good quality solutions.
2022
International Conference on Optimization and Decision Sciences, ODS 2021
Identical parallel machines; Job scheduling; Genomic analysis
04 Pubblicazione in atti di convegno::04b Atto di convegno in volume
Scheduling K-mers Counting in a Distributed Environment / Amorosi, L.; Di Rocco, L.; Ferraro Petrillo, U.. - (2022), pp. 73-83. (Intervento presentato al convegno International Conference on Optimization and Decision Sciences, ODS 2021 tenutosi a Rome, Italy) [10.1007/978-3-030-95380-5_7].
File allegati a questo prodotto
File Dimensione Formato  
Amorosi_Scheduling-K-mers_2022.pdf

solo gestori archivio

Tipologia: Versione editoriale (versione pubblicata con il layout dell'editore)
Licenza: Tutti i diritti riservati (All rights reserved)
Dimensione 4.82 MB
Formato Adobe PDF
4.82 MB Adobe PDF   Contatta l'autore

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11573/1644217
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 1
  • ???jsp.display-item.citation.isi??? 1
social impact