Unsupervised construction of computational graphs for gene expression data with explicit structural inductive biases / Scherer, P.; Trebacz, M.; Simidjievski, N.; Vinas, R.; Shams, Z.; Terre, H. A.; Jamnik, M.; Lio, P.. - In: BIOINFORMATICS. - ISSN 1367-4803. - 38:5(2022), pp. 1320-1327. [10.1093/bioinformatics/btab830]

Unsupervised construction of computational graphs for gene expression data with explicit structural inductive biases

Lio P.
2022

Abstract

Motivation: Gene expression data are commonly used at the intersection of cancer research and machine learning to better understand the molecular status of tumour tissue. Deep learning predictive models have been applied to gene expression data because they scale well and remove the need for manual feature engineering. However, gene expression data are often very high dimensional and noisy, and come with a low number of samples. This poses significant problems for learning algorithms: models often overfit, learn noise and struggle to capture biologically relevant information. In this article, we utilize external biological knowledge embedded within the structure of gene interaction graphs, such as protein-protein interaction (PPI) networks, to guide the construction of predictive models.

Results: We present Gene Interaction Network Constrained Construction (GINCCo), an unsupervised method for automated construction of computational graph models for gene expression data that are structurally constrained by prior knowledge of gene interaction networks. We employ this methodology in a case study on incorporating a PPI network in cancer phenotype prediction tasks. Our computational graphs are constructed using topological clustering algorithms on the PPI networks, which incorporate inductive biases stemming from network biology research on protein complex discovery. Each node in the GINCCo computational graph represents a biological entity such as a gene, a candidate protein complex or a phenotype, instead of an arbitrary hidden node of a neural network. This provides a biologically relevant mechanism for model regularization, yielding strong predictive performance while drastically reducing the number of model parameters and enabling guided post-hoc enrichment analyses of influential gene sets with respect to target phenotypes. Our experiments analysing a variety of cancer phenotypes show that GINCCo often outperforms support vector machines, fully connected multi-layer perceptrons (MLPs) and randomly connected MLPs despite greatly reduced model complexity.
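
The structural constraint described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the gene list, the two clusters, and the helper names `build_mask` and `cluster_activations` are hypothetical, and the clusters are written by hand here rather than produced by a topological clustering algorithm on a PPI network. The sketch only shows the general idea of a layer whose gene-to-cluster connections exist solely where a gene is a member of a candidate complex.

```python
import math
import random


def build_mask(genes, clusters):
    """Return a genes x clusters 0/1 matrix; 1 marks cluster membership."""
    index = {gene: i for i, gene in enumerate(genes)}
    mask = [[0.0] * len(clusters) for _ in genes]
    for j, members in enumerate(clusters):
        for gene in members:
            mask[index[gene]][j] = 1.0
    return mask


def cluster_activations(x, weights, mask):
    """One activation per cluster node, using only its member genes."""
    n_clusters = len(mask[0])
    return [
        math.tanh(sum(x[i] * weights[i][j] * mask[i][j] for i in range(len(x))))
        for j in range(n_clusters)
    ]


# Illustrative inputs (assumptions, not data from the paper).
genes = ["TP53", "MDM2", "BRCA1", "BARD1", "EGFR"]
clusters = [{"TP53", "MDM2"}, {"BRCA1", "BARD1"}]  # hypothetical candidate complexes

mask = build_mask(genes, clusters)

random.seed(0)
weights = [[random.gauss(0, 1) for _ in clusters] for _ in genes]
sample = [random.gauss(0, 1) for _ in genes]  # one synthetic expression profile

hidden = cluster_activations(sample, weights, mask)

# Only masked-in connections carry trainable weights.
n_masked = int(sum(map(sum, mask)))
n_dense = len(genes) * len(clusters)
print(n_masked, n_dense)  # 4 10
```

In this toy setting the constrained layer keeps 4 of the 10 weights a fully connected layer would have, and a gene outside every cluster (here `EGFR`) has no path to any cluster node. A phenotype output layer on top of `hidden` would complete the gene, candidate complex, phenotype structure the abstract describes.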
Algorithms; Bias; Computational Biology; Gene Expression; Humans; Neoplasms; Neural Networks, Computer; Software
01 Journal publication::01a Journal article
Files attached to this item

File: Scherer_Unsupervised_2022.pdf
Access: open access
Note: DOI 10.1093/bioinformatics/btab830
Type: Publisher's version (published version with the publisher's layout)
License: Creative Commons
Size: 733.81 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11573/1723940