Neural Distance Embeddings for Biological Sequences / Corso, G.; Ying, R.; Pandy, M.; Velickovic, P.; Leskovec, J.; Lio, P.. - 22:(2021), pp. 18539-18551. ( Advances in Neural Information Processing Systems (was NIPS) virtual ).

Neural Distance Embeddings for Biological Sequences

Lio P.
2021

Abstract

The development of data-dependent heuristics and representations for biological sequences that reflect their evolutionary distance is critical for large-scale biological research. However, popular machine learning approaches, based on continuous Euclidean spaces, have struggled with the discrete combinatorial formulation of the edit distance that models evolution and with the hierarchical relationships that characterise real-world datasets. We present Neural Distance Embeddings (NeuroSEED), a general framework for embedding sequences in geometric vector spaces, and illustrate the effectiveness of hyperbolic space, which captures the hierarchical structure and provides an average 22% reduction in embedding RMSE against the best competing geometry. The capacity of the framework and the significance of these improvements are then demonstrated by devising supervised and unsupervised NeuroSEED approaches to multiple core tasks in bioinformatics. Benchmarked against common baselines, the proposed approaches display significant accuracy and/or runtime improvements on real-world datasets. For example, in hierarchical clustering the proposed pretrained and from-scratch methods match the quality of competing baselines with 30x and 15x runtime reductions, respectively.
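The quantities named in the abstract can be illustrated with a minimal sketch: the edit distance between two sequences, the distance between two points in the Poincaré ball model of hyperbolic space, and the RMSE between the two over all pairs. This is an illustration under stated assumptions, not the paper's implementation: the function names, the `scale` parameter and the hand-picked embeddings are hypothetical, and no encoder network is trained here.

```python
import math


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def poincare_distance(u, v) -> float:
    """Geodesic distance between two points strictly inside the unit Poincare ball."""
    sq_norm_u = sum(x * x for x in u)
    sq_norm_v = sum(x * x for x in v)
    sq_diff = sum((x - y) ** 2 for x, y in zip(u, v))
    arg = 1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v))
    return math.acosh(arg)


def embedding_rmse(seqs, embeddings, scale=1.0) -> float:
    """RMSE between scaled embedding distances and edit distances over all pairs.

    `scale` is a hypothetical constant mapping embedding distances to edit
    distances; it is an assumption of this sketch.
    """
    sq_errors = []
    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            true_d = edit_distance(seqs[i], seqs[j])
            pred_d = scale * poincare_distance(embeddings[i], embeddings[j])
            sq_errors.append((pred_d - true_d) ** 2)
    return math.sqrt(sum(sq_errors) / len(sq_errors))


if __name__ == "__main__":
    seqs = ["ACGT", "ACGA", "TTTT"]
    # Hand-picked points inside the unit disc, purely for illustration.
    embeddings = [[0.10, 0.00], [0.15, 0.05], [-0.60, 0.30]]
    print(embedding_rmse(seqs, embeddings, scale=2.0))
```

In a learned setting the embeddings would come from an encoder trained to minimise an error of this kind; here they are fixed by hand so that the three functions can be checked in isolation.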
Advances in Neural Information Processing Systems (was NIPS)
Biology; Reduction; Vector spaces
04 Publication in conference proceedings::04b Conference paper in volume
Files attached to this item
File: Corso_Neural_2021.pdf
Access: open access
Note: https://proceedings.neurips.cc/paper/2021/file/9a1de01f893e0d2551ecbb7ce4dc963e-Paper.pdf
Type: Pre-print document (manuscript submitted to the publisher, prior to peer review)
License: Creative Commons
Size: 983.12 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11573/1720269
Citations
  • PMC n/a
  • Scopus 30
  • Web of Science (ISI) 0