SAPPHIRE: A stacking-based ensemble learning framework for accurate prediction of thermophilic proteins / Charoenkwan, P.; Schaduangrat, N.; Moni, M. A.; Lio, P.; Manavalan, B.; Shoombuatong, W.. - In: COMPUTERS IN BIOLOGY AND MEDICINE. - ISSN 0010-4825. - 146:(2022). [10.1016/j.compbiomed.2022.105704]

SAPPHIRE: A stacking-based ensemble learning framework for accurate prediction of thermophilic proteins

Lio P.;
2022

Abstract

Thermophilic proteins (TPPs) are important in protein biochemistry and in the development of new enzymes, so computational methods that identify TPPs accurately and rapidly are urgently needed. Several computational methods have been developed for TPP identification to date; however, some limitations in performance and utility remain. In this study, we present a novel computational method, SAPPHIRE, that identifies TPPs more accurately using only sequence information, without any need for structural information. We combined twelve feature encodings, each representing a different perspective, with six popular machine learning algorithms to train 72 baseline models (12 encodings × 6 algorithms) and extract the key information of TPPs. Subsequently, the informative predicted probabilities from the baseline models were mined and selected using a genetic algorithm in conjunction with a self-assessment-report approach. Finally, the meta-predictor, SAPPHIRE, was built and optimized using the optimal feature set. In the 10-fold cross-validation test, SAPPHIRE achieved superior predictive performance compared with several baseline models. Moreover, in the independent test, SAPPHIRE yielded an accuracy of 0.942 and a Matthews correlation coefficient of 0.884, which were 7.68% and 5.12% higher, respectively, than those of existing methods. The proposed approach is anticipated to facilitate large-scale identification of TPPs and accelerate their applications in the food industry. The code and datasets are available at https://github.com/plenoi/SAPPHIRE.
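The stacking idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation (that is in the linked repository), and it rests on several assumptions: scikit-learn is used, the data are synthetic, two column subsets stand in for the twelve sequence encodings, three learners stand in for the six algorithms, and the genetic-algorithm selection step is omitted, so every baseline probability is passed to the meta-predictor.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, matthews_corrcoef
from sklearn.model_selection import StratifiedKFold, cross_val_predict, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a labelled protein dataset (1 = thermophilic, 0 = not).
X, y = make_classification(n_samples=400, n_features=20, n_informative=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Hypothetical "encodings": two column subsets standing in for different
# sequence-derived feature encodings of the same proteins.
encodings = [slice(0, 10), slice(10, 20)]

# Learner factories (an assumption; the abstract does not name the six algorithms).
learners = [
    lambda: LogisticRegression(max_iter=1000),
    lambda: RandomForestClassifier(n_estimators=50, random_state=0),
    lambda: DecisionTreeClassifier(max_depth=5, random_state=0),
]

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
train_cols, test_cols = [], []
for cols in encodings:
    for make_learner in learners:
        # Out-of-fold probabilities: each training sample is scored by a
        # model that was fitted without it, which avoids label leakage.
        oof = cross_val_predict(
            make_learner(), X_train[:, cols], y_train, cv=cv, method="predict_proba"
        )
        train_cols.append(oof[:, 1])
        # Refit on the whole training set to score the independent test set.
        fitted = make_learner().fit(X_train[:, cols], y_train)
        test_cols.append(fitted.predict_proba(X_test[:, cols])[:, 1])

# One probability column per (encoding, learner) pair: 2 x 3 = 6 here,
# 12 x 6 = 72 in the setting the abstract describes.
P_train = np.column_stack(train_cols)
P_test = np.column_stack(test_cols)

# Meta-predictor trained on the baseline probabilities.
meta = LogisticRegression(max_iter=1000).fit(P_train, y_train)
y_pred = meta.predict(P_test)

acc = accuracy_score(y_test, y_pred)
mcc = matthews_corrcoef(y_test, y_pred)
print(f"baseline models: {P_train.shape[1]}, accuracy: {acc:.3f}, MCC: {mcc:.3f}")
```

The printed scores describe the synthetic data only and are not comparable to the values reported in the abstract. In a fuller version, a feature-selection step would choose a subset of the columns of `P_train` before the meta-predictor is fitted.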
Bioinformatics; Feature selection; Machine learning; Sequence analysis; Stacking strategy; Thermophilic protein
01 Journal publication::01a Journal article
Files attached to this item
File Size Format
Charoenkwan_SAPPHIRE_2022.pdf

archive managers only

Type: Publisher's version (published version with the publisher's layout)
License: All rights reserved
Size: 7.32 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11573/1721276
Citations
  • PubMed Central 11
  • Scopus 41
  • Web of Science 36