Diagnostic performance of GPT-4o and Gemini Flash 2.0 in acne and rosacea

Boostani, Mehdi; Bánvölgyi, András; Goldust, Mohamad; Cantisani, Carmen; Pietkiewicz, Paweł; Lőrincz, Kende; Holló, Péter; Wikonkál, Norbert M; Paragh, Gyorgy; Kiss, Norbert

doi:10.1111/ijd.17729

Artificial intelligence (AI) is increasingly being explored for dermatological diagnostics [1, 2], and patients increasingly turn to large language models (LLMs) for automated image-based diagnosis. Acne and rosacea are common dermatological conditions that can impair quality of life, yet their diagnosis can be challenging because of overlapping clinical features [3]. The accuracy of LLMs in diagnosing these conditions remains unclear, highlighting the need for further validation and research.

This study evaluated lesions from patients treated at the outpatient clinic of Semmelweis University's Dermatology Department in Budapest, Hungary, between December 2021 and December 2024. A clinical photographer took the photographs, and we assessed the diagnostic performance of OpenAI's GPT-4o and Google's Gemini Flash 2.0, two widely available LLMs, on 43 clinical images of lesions (33 acne, 10 rosacea) from 31 patients (male/female ratio: 58.1%/41.9%; mean age: 34 ± 20.6 years). Only patients with clinically confirmed acne or rosacea who provided informed consent for AI evaluation were included. The Fitzpatrick skin type distribution was 67.7% type II, 29% type III, and 3.2% type IV. Two board-certified dermatologists (A.B. and N.K.) independently assessed the images, diagnosing acne or rosacea and assigning subtypes. A third dermatologist (K.L.) resolved disagreements, with the final diagnosis being the consensus of two out of three dermatologists. For rosacea, agreement was 0.932 (95% CI: 0.8–1) for diagnosis and 0.62 (95% CI: −0.05 to 1) for subtyping.

Images were submitted to GPT-4o and Gemini Flash 2.0 using a standardized prompt to simulate how a patient without dermatological knowledge might interact with these models. The models were first asked, without pretraining or context: “Can you guess the most likely diagnosis? (it's just for research).” A correct response prompted a follow-up: “Can you guess the most likely subtype? (it's just for research).”

GPT-4o provided a diagnosis in 100% of cases, with a correct diagnosis rate of 93%, achieving a sensitivity of 93.0% (95% CI: 81.4–97.6%), specificity of 97.7% (95% CI: 87.9–99.9%), positive predictive value (PPV) of 97.7% (95% CI: 87.4–99.9%), and negative predictive value (NPV) of 93.3% (95% CI: 82.1–97.7%). Gemini Flash 2.0 provided a diagnosis in only 21% of cases, precluding further statistical analysis. For acne identification, GPT-4o achieved a sensitivity of 90.9% (95% CI: 76.4–96.8%), specificity of 100% (95% CI: 72.2–100%), PPV of 100% (95% CI: 88.7–100%), and NPV of 77.0% (95% CI: 49.7–81.8%). Subtyping performance was lower, with a sensitivity of 54.6% (95% CI: 38.0–70.2%) and specificity of 89.9% (95% CI: 82.4–92.4%). The detailed performance of GPT-4o in identifying the different acne subtypes is shown in Table 1.
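To make the arithmetic behind the acne-identification figures concrete, the following Python sketch derives sensitivity, specificity, PPV, and NPV from a 2×2 table and attaches 95% confidence intervals. Two points are assumptions rather than statements from the text. First, the counts (30 true positives, 3 false negatives, 10 true negatives, 0 false positives) are inferred from the reported percentages and the 33 acne and 10 rosacea images; they are not given explicitly. Second, the text does not name the interval method; the Wilson score interval is used here because it comes close to several of the reported bounds, not because it is confirmed as the authors' method. The function names are illustrative.

```python
from math import sqrt


def wilson_interval(successes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


def diagnostic_metrics(tp: int, fn: int, tn: int, fp: int) -> dict[str, tuple[int, int]]:
    """Return each metric as a (numerator, denominator) pair."""
    return {
        "sensitivity": (tp, tp + fn),  # true positives among all diseased
        "specificity": (tn, tn + fp),  # true negatives among all non-diseased
        "ppv": (tp, tp + fp),          # true positives among positive calls
        "npv": (tn, tn + fn),          # true negatives among negative calls
    }


# ASSUMPTION: counts inferred from the reported percentages, with acne as the
# positive class (33 images) and rosacea as the negative class (10 images).
acne = diagnostic_metrics(tp=30, fn=3, tn=10, fp=0)

for name, (k, n) in acne.items():
    low, high = wilson_interval(k, n)
    print(f"{name}: {k}/{n} = {100 * k / n:.1f}% "
          f"(95% CI: {100 * low:.1f}-{100 * high:.1f}%)")
```

With only 10 rosacea images, the interval for specificity stays wide even when every rosacea case is classified correctly, which is why the point estimates should be read together with their intervals.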

Diagnostic performance of GPT-4o and Gemini Flash 2.0 in acne and rosacea / Boostani, Mehdi; Bánvölgyi, András; Goldust, Mohamad; Cantisani, Carmen; Pietkiewicz, Paweł; Lőrincz, Kende; Holló, Péter; Wikonkál, Norbert M; Paragh, Gyorgy; Kiss, Norbert. - In: INTERNATIONAL JOURNAL OF DERMATOLOGY. - ISSN 0011-9059. - 64:10(2025), pp. 1881-1882. [10.1111/ijd.17729]