Deep-Learning Based Image Classification of Oral Mucosal Lesions. A Multicenter Pilot Study Evaluating Real-World Applicability / Fantozzi, Paolo Junior. - (2026 Jan 21).
Deep-Learning Based Image Classification of Oral Mucosal Lesions. A Multicenter Pilot Study Evaluating Real-World Applicability.
FANTOZZI, Paolo Junior
21/01/2026
Abstract
Objectives: Deep learning (DL)-driven artificial intelligence (AI) image classification models hold potential as important referral adjuncts for healthcare providers not trained in oral mucosal diseases. The aim of this multicenter pilot study was to compare the performance of different AI models in the classification of oral mucosal lesions. Methods: Retrospective photographic images from 4 oral medicine centers (Sapienza University of Rome, University of Maryland Baltimore, UT Health San Antonio, Johns Hopkins University) were used to train/validate a DL model (BEiT, BERT Pre-Training of Image Transformers) to classify clinical images into 5 categories: normal mucosa (NM), reactive/benign (RB), immune-mediated (IM), potentially malignant (PM), and oral cancer (OC). The model was based on a pre-trained 87M-parameter architecture trained on the ImageNet-21k dataset and fine-tuned on the ImageNet-2012 dataset at 224x224 resolution. The BEiT model was then compared to 6 machine learning benchmark models (kNN 1.0, kNN 2.0, Logistic Regression 1.0, Logistic Regression 2.0, Random Forest 1.0, Random Forest 2.0) using Google Inception V3 or SqueezeNet visual embedding DL methods. Sensitivity/recall, specificity, precision, and F1-score over the classes were evaluated as performance metrics. Results: The dataset included 1608 clinical images from 824 patients, divided into two sub-datasets for training (80%; n=1157 training, n=129 validation) and testing (20%; n=322). Compared to the 6 benchmark models, the BEiT model achieved the best scores for overall accuracy, demonstrating an overall accuracy of 79.2% over the five classes and a mean per-class accuracy of 76.5%. The highest per-class accuracy was achieved by the IM group (84%), followed by RB (82%) and PM (78%), whereas the highest specificity was reached by the OC group (97%), followed by NM (96%) and PM (94%). The class with the highest sensitivity was IM (84%), whereas the class with the highest precision was PM (89%). Overall, the BEiT model outperformed the other six off-the-shelf models in most of the performance metrics. Conclusions: In this pilot study, the custom-developed DL model (BEiT) performed well on the IM, PM, and OC classes, outperformed six different off-the-shelf tools, and may hold promise for future real-world AI applicability to assist referring providers not specifically trained in oral mucosal diseases.
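The per-class metrics reported in the abstract (sensitivity/recall, specificity, precision, F1-score) can all be derived one-vs-rest from a multiclass confusion matrix. A minimal sketch of that computation follows; the confusion matrix values are illustrative placeholders, not the study's data, and the function name is ours:

```python
# One-vs-rest sensitivity, specificity, precision, and F1 for each class
# of a square confusion matrix (rows = true class, columns = predicted class).

def per_class_metrics(cm, classes):
    """Return a dict of one-vs-rest metrics for each class label."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    metrics = {}
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                       # true k, predicted otherwise
        fp = sum(cm[i][k] for i in range(n)) - tp  # other class, predicted k
        tn = total - tp - fn - fp
        sens = tp / (tp + fn) if tp + fn else 0.0  # sensitivity == recall
        spec = tn / (tn + fp) if tn + fp else 0.0
        prec = tp / (tp + fp) if tp + fp else 0.0
        f1 = 2 * prec * sens / (prec + sens) if prec + sens else 0.0
        metrics[classes[k]] = {"sensitivity": sens, "specificity": spec,
                               "precision": prec, "f1": f1}
    return metrics

if __name__ == "__main__":
    # Illustrative 5-class matrix only (NOT the study's results).
    cm = [[8, 1, 1, 0, 0],
          [1, 7, 1, 1, 0],
          [0, 1, 9, 0, 0],
          [1, 0, 0, 8, 1],
          [0, 0, 0, 1, 9]]
    print(per_class_metrics(cm, ["NM", "RB", "IM", "PM", "OC"]))
```

Overall accuracy is simply the matrix trace divided by the total count; mean per-class accuracy averages the one-vs-rest accuracies instead, which is why the abstract reports the two figures separately.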


