Catalogue of research products


A new distributed alignment-free approach to compare whole proteomes / Ferraro Petrillo, Umberto; Guerra, Concettina; Pizzi, Cinzia. - In: Theoretical Computer Science. - ISSN 0304-3975. - Print. - 698:(2017), pp. 100-112. [10.1016/j.tcs.2017.06.017]

A new distributed alignment-free approach to compare whole proteomes

Ferraro Petrillo, Umberto; Guerra, Concettina; Pizzi, Cinzia

2017

Abstract

Phylogeny inference has moved in recent years from the analysis of a single protein or a few proteins to that of whole proteomes. However, reconstructing evolutionary trees for a large number of species from complete proteomes poses a significant computational challenge, even when relatively fast pairwise sequence comparison algorithms are used. We present a distributed approach that relies on the computation of distance measures based on maximal shared substrings within a bounded Hamming distance. The distributed system we built to implement this approach is flexible in that it supports a variety of design choices. It is based on the Spark framework and covers all the steps required by our approach, from the initial indexing of a set of FASTA sequences to the production of a report detailing the distances among these sequences, ranked according to a user-defined measure. Here we apply it to compare all proteins of selected organisms: we divide the proteins into groups and perform the comparisons within each group separately. The groups are the functionally characterized proteins, the ribosomal proteins, and the unannotated proteins. We compute the average distances within the groups and evaluate their relationship and their ability to capture the evolutionary closeness of organisms. We run experiments on selected species using a Hadoop computing cluster running Spark. The results show that the system implementing our approach is scalable and accurate.
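To make the idea of a distance based on maximal shared substrings within a bounded Hamming distance more concrete, the following is a minimal brute-force sketch in Python. It is not the system described in the abstract: the function names, the quadratic scan, and the normalization (borrowed from the general form used in the average common substring literature) are all assumptions made for illustration, and the measure used in the paper may differ.

```python
import math


def longest_match_with_mismatches(a, i, b, j, k):
    """Length of the longest common extension of a[i:] and b[j:]
    that contains at most k mismatching positions."""
    length = 0
    mismatches = 0
    while i + length < len(a) and j + length < len(b):
        if a[i + length] != b[j + length]:
            mismatches += 1
            if mismatches > k:
                break
        length += 1
    return length


def average_shared_length(a, b, k):
    """For every start position in a, take the longest substring that
    occurs somewhere in b with at most k mismatches, then average."""
    total = 0
    for i in range(len(a)):
        total += max(
            longest_match_with_mismatches(a, i, b, j, k)
            for j in range(len(b))
        )
    return total / len(a)


def directed_distance(a, b, k):
    """Assumed normalization: log|b| / L(a, b) - log|a| / L(a, a)."""
    avg_ab = average_shared_length(a, b, k)
    if avg_ab == 0:
        # No shared character at all (only possible when k == 0).
        return math.inf
    avg_aa = average_shared_length(a, a, k)
    return math.log(len(b)) / avg_ab - math.log(len(a)) / avg_aa


def shared_substring_distance(a, b, k):
    """Symmetric distance: mean of the two directed distances.
    Both sequences are assumed to be non-empty."""
    return (directed_distance(a, b, k) + directed_distance(b, a, k)) / 2
```

For example, `average_shared_length("ABCD", "ABXD", 0)` considers only exact matches, while passing `k=1` lets each match extend across one substitution. Each pairwise computation is independent of the others, which is the property a distributed implementation can exploit; an efficient version would also replace the quadratic scan with an index over the sequences.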


Year of publication: 2017
Keywords: alignment free distances; average common substring; bioinformatics; distributed systems; mismatches; theoretical computer science; computer science (all)
Type: 01 Journal publication::01a Journal article

Files attached to this product

File	Access	Type	License	Size	Format
Ferraro Petrillo_new-distributed_2017.pdf	Open access	Post-print (version following peer review and accepted for publication)	All rights reserved	1.8 MB	Adobe PDF
Ferraro Petrillo_new-distributed_2017.pdf	Archive managers only	Publisher's version (published version with the publisher's layout)	All rights reserved	1.42 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11573/991994
