Basal cell carcinoma (BCC) is the most common skin cancer. Off-the-shelf multimodal large language models are widely accessible, yet their performance for BCC remains unclear. The aim of this study was to assess BCC detection (BCC vs non-BCC) and BCC subtype classification from clinical and dermoscopic images using 3 web-based large language models (ChatGPT-5, Gemini 2.5 Flash, Claude Sonnet 4). We evaluated 772 images: 402 from 290 histopathology-confirmed BCCs (290 clinical, 112 dermoscopic) and 370 from an independent BCC-mimicker cohort (250 clinical, 120 dermoscopic). Standardized prompts were used. Primary outcome was BCC detection accuracy; secondary outcomes were subtype-classification accuracy and performance by lesion features. For clinical images, ChatGPT-5 achieved the highest detection accuracy (75%), followed by Claude (64.3%) and Gemini (50.7%). For dermoscopy, Claude performed best (69.8%), compared with ChatGPT-5 (55.2%) and Gemini (50.9%). Accuracy was lower in crusted and flat lesions and higher in exophytic lesions; pigmentation effects were model dependent. Subtype-classification accuracy was modest across models. Images were primarily from European centers with limited skin-type diversity; several subgroups were small. Current web-based large language models are not clinically suitable for BCC detection or subtyping. Dermatology-specific training, transparent reporting, and rigorous prospective validation are required before any clinical use.
ChatGPT, Gemini, and Claude in clinical and dermoscopic image analysis of basal cell carcinoma and its common mimickers: A comparative performance analysis / Boostani, M; Zouboulis, CC; Pellacani, G; Navarrete-Dechent, C; Boussingault, L; Kiss, T; Goldfarb, N; Cantisani, C; Nádudvari, N; Bánvölgyi, A; Wikonkál, NM; Suppa, M; Paragh, G; Kiss, N. - In: JID INNOVATIONS. - ISSN 2667-0267. - (2026).
ChatGPT, Gemini, and Claude in clinical and dermoscopic image analysis of basal cell carcinoma and its common mimickers: A comparative performance analysis
Pellacani G; Cantisani C
2026


