
Evaluating Representation Learning on the Protein Structure Universe / Jamasb, A. R.; Morehead, A.; Joshi, C. K.; Zhang, Z.; Didi, K.; Mathis, S.; Harris, C.; Tang, J.; Cheng, J.; Lio, P.; Blundell, T. L. (2024). Paper presented at the International Conference on Learning Representations, held in Vienna (hybrid).

Evaluating Representation Learning on the Protein Structure Universe

Lio, P. (2024)

Abstract

We introduce ProteinWorkshop, a comprehensive benchmark suite for representation learning on protein structures with Geometric Graph Neural Networks. We consider large-scale pre-training and downstream tasks on both experimental and predicted structures to enable the systematic evaluation of the quality of the learned structural representations and their usefulness in capturing functional relationships for downstream tasks. We find that: (1) large-scale pretraining on AlphaFold structures and auxiliary tasks consistently improves the performance of both rotation-invariant and equivariant GNNs, and (2) more expressive equivariant GNNs benefit from pretraining to a greater extent than invariant models. We aim to establish a common ground for the machine learning and computational biology communities to rigorously compare and advance protein structure representation learning. Our open-source codebase reduces the barrier to entry for working with large protein structure datasets by providing: (1) storage-efficient dataloaders for large-scale structural databases, including AlphaFoldDB and ESM Atlas, as well as (2) utilities for constructing new tasks from the entire PDB. ProteinWorkshop is available at: github.com/a-r-j/ProteinWorkshop.
Year: 2024
Venue: International Conference on Learning Representations
Keywords: Digital storage; Graph neural networks; Open systems; Proteins
Publication type: Conference proceedings paper (04b Conference paper in volume)
Files attached to this item
  • File: Jamasb_Evaluating-Representation_2024.pdf (Adobe PDF, 2.56 MB)
  • Access: open access
  • Note: https://openreview.net/forum?id=sTYuRVrdK3
  • Type: Publisher's version (published with the publisher's layout)
  • License: Creative Commons

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this item: https://hdl.handle.net/11573/1728974
Citations
  • Scopus: 4