
Machine learning methods applied to classify complex diseases using genomic data / ARNAL SEGURA, Magdalena. - (2024 Mar 22).

Machine learning methods applied to classify complex diseases using genomic data

ARNAL SEGURA, MAGDALENA
22/03/2024

Abstract

Complex diseases are difficult to predict because of their multifactorial nature. Unlike single-gene disorders, they result from the interplay of multiple genetic, environmental, and lifestyle factors. In parallel, machine learning (ML) and deep learning (DL) techniques have gained popularity for predicting phenotypic traits and disease conditions from different types of clinical data, including genomic data. In this work I sometimes refer to ML and DL as separate methods, but DL, although more specialized and sophisticated, is a branch of ML and falls under the same umbrella. These methods have proven powerful in detecting complex patterns in the data, including epistasis. Separately, one of the most common methods used in population genomics to estimate the genomic predisposition to develop a disease is the polygenic risk score (PRS).

I hypothesised that ML methods could be useful for classifying individuals with complex diseases, owing to their ability to capture complex patterns and synergisms in the data. Consequently, I explored the prediction of four complex diseases, multiple sclerosis (MS), Alzheimer’s disease (AD), schizophrenia (SC), and Parkinson’s disease (PD), using ML models trained on genomic data.

The primary goal of this research was to investigate the robustness and variability of the ML methods. Different models were tested to classify affected and healthy individuals, and their performance was compared. The main results of this part are summarized below:

• Logistic regression appeared to be the most robust method across folds and diseases. In contrast, DL methods exhibited high variability across folds. These results may be partially attributable to the limited sample size available in this study, which could have favored simpler methods.
• Regarding the impact of biases present in the data, for diseases with imbalanced sex representation the models tended to reproduce this imbalance in their predictions on the testing set, highlighting a common limitation associated with biases in the application of ML methods.
• When the performance of PRS was compared with that of the ML methods, PRS consistently performed at an average level relative to the ML models. Therefore, I concluded that, with the available sample size, both approaches are comparable in stratifying individuals by disease risk. However, PRS still offers several practical advantages over ML methods.
• After feature selection techniques were applied to exclude non-informative predictors, the performance of the ML models did not improve. This underscores the capacity of ML methods to achieve optimal performance even in the presence of features that are correlated due to linkage disequilibrium.

Understanding which genomic variants are considered informative for disease discrimination during training could provide significant insights into the underlying genetic basis of the diseases and identify potential targets for further investigation. Accordingly, the secondary goal of this study was to apply explainability tools to extract the features that the models considered most informative. The main results of this part are summarized below:

• The results confirmed the polygenicity of MS, as evidenced by the prioritized genomic features being distributed across different chromosomes.
• The prevalence of HLA gene annotations among the top genomic features on chromosome 6 aligns with the significance of these genes in MS.
• The highest-prioritized genomic variants were identified as expression or splicing quantitative trait loci (eQTL or sQTL) located in non-coding regions within or near genes associated with the immune response and MS.

Overall, given that ML methods are self-learning and increasingly popular for clinical applications, this research provides a deeper understanding of how these methods learn to classify complex diseases.
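As an illustration of two ideas mentioned in the abstract, the sketch below computes a polygenic risk score in its usual textbook form (a weighted sum of risk-allele counts) and summarizes a per-fold metric by its mean and standard deviation, which is one simple way to look at robustness across folds. It is only a minimal sketch: the function names, effect sizes, genotypes, and fold values are hypothetical placeholders, not data, code, or results from the thesis.

```python
from statistics import mean, stdev


def polygenic_risk_score(dosages, weights):
    """Weighted sum of risk-allele dosages (0, 1 or 2) for one individual."""
    if len(dosages) != len(weights):
        raise ValueError("one weight per variant is required")
    return sum(w * g for w, g in zip(weights, dosages))


def fold_summary(metric_per_fold):
    """Mean and sample standard deviation of a per-fold metric (e.g. AUC)."""
    return mean(metric_per_fold), stdev(metric_per_fold)


# Hypothetical effect sizes (e.g. log odds ratios) for three variants.
weights = [0.20, -0.10, 0.35]

# Hypothetical genotypes: risk-allele counts per variant for two individuals.
individuals = {"case_1": [2, 0, 1], "control_1": [0, 1, 0]}

prs = {name: polygenic_risk_score(g, weights) for name, g in individuals.items()}

# Made-up per-fold values for two unnamed models, purely to show the summary.
folds = {"model_a": [0.70, 0.72, 0.68], "model_b": [0.60, 0.80, 0.70]}
summary = {name: fold_summary(values) for name, values in folds.items()}

for name, (avg, spread) in summary.items():
    print(f"{name}: mean={avg:.2f} sd={spread:.2f}")
```

In this toy setup both models have the same mean, but the second has a larger spread across folds, which is the kind of difference a robustness comparison is meant to expose.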
Files attached to this item

File: Tesi_dottorato_ArnalSegura.pdf
Access: open access
Note: Doctoral thesis: Machine learning methods applied to classify complex diseases using genomic data
Type: Doctoral thesis
License: All rights reserved
Size: 6.43 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11573/1706863