Using large language models in the diagnosis of acute cholecystitis: assessing accuracy and guidelines compliance / Goglia, Marta; Cicolani, Arianna; Carrano, Francesco Maria; Petrucciani, Niccolo; D'Angelo, Francesco; Pace, Marco; Chiarini, Lucio; Silecchia, Gianfranco; Aurello, Paolo. - In: THE AMERICAN SURGEON. - ISSN 0003-1348. - 91:6(2025), pp. 967-977. [10.1177/00031348251323719]
Using large language models in the diagnosis of acute cholecystitis: assessing accuracy and guidelines compliance
Marta Goglia; Arianna Cicolani; Francesco Maria Carrano; Niccolo Petrucciani; Francesco D'Angelo; Marco Pace; Lucio Chiarini; Gianfranco Silecchia; Paolo Aurello
2025
Abstract
Background: Large language models (LLMs) are advanced tools capable of understanding and generating human-like text. This study evaluated the accuracy of several commercial LLMs in addressing clinical questions related to the diagnosis and management of acute cholecystitis, as outlined in the Tokyo Guidelines 2018 (TG18), and assessed their congruence with the expert panel discussions presented in the guidelines.

Methods: We evaluated ChatGPT4.0, Gemini Advanced, and GPTo1-preview on ten clinical questions: eight were derived from TG18 and two were formulated by the authors. Two authors independently rated the accuracy of each LLM's responses on a four-point scale: (1) accurate and comprehensive, (2) accurate but not comprehensive, (3) partially accurate, partially inaccurate, and (4) entirely inaccurate. A third author resolved any scoring discrepancies. We then compared the performance of ChatGPT4.0 against the newer models, Gemini Advanced and GPTo1-preview, on the same set of questions to delineate their respective strengths and limitations.

Results: ChatGPT4.0 provided consistent responses for 90% of the questions. It delivered "accurate and comprehensive" answers for 4/10 questions (40%) and "accurate but not comprehensive" answers for 5/10 (50%); one response (10%) was rated "partially accurate, partially inaccurate." Gemini Advanced demonstrated higher accuracy on some questions but yielded a similar proportion of "partially accurate, partially inaccurate" responses. Notably, neither model produced "entirely inaccurate" answers.

Discussion: LLMs such as ChatGPT and Gemini Advanced show potential for accurately addressing clinical questions regarding acute cholecystitis. With awareness of their limitations, careful implementation, and ongoing refinement, LLMs could serve as valuable resources for physician education and patient information, potentially improving clinical decision-making in the future.
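The Methods describe a two-rater, four-point scoring protocol with third-author adjudication of disagreements. The following Python sketch shows how such ratings could be tabulated into the per-category percentages reported in the Results; the individual scores and the adjudicated value are illustrative assumptions, not data from the study.

```python
# Minimal sketch (not from the paper) of the rating workflow described in the Methods:
# two raters independently score ten questions on a four-point scale, disagreements
# are flagged for third-author adjudication, and the category percentages are tallied.
from collections import Counter

SCALE = {
    1: "accurate and comprehensive",
    2: "accurate but not comprehensive",
    3: "partially accurate, partially inaccurate",
    4: "entirely inaccurate",
}

# Hypothetical ratings for one model on ten questions (illustrative values only).
rater_a = [1, 2, 2, 1, 2, 3, 1, 2, 2, 1]
rater_b = [1, 2, 2, 1, 2, 3, 1, 2, 3, 1]

def consensus(a, b):
    """Keep agreed scores; mark disagreements for third-author adjudication."""
    final = []
    for i, (x, y) in enumerate(zip(a, b), start=1):
        if x == y:
            final.append(x)
        else:
            print(f"Question {i}: raters disagree ({x} vs {y}) -> adjudicate")
            final.append(None)  # placeholder until the third author decides
    return final

scores = consensus(rater_a, rater_b)
# Assume the third author resolved the single disagreement as a 2.
scores = [2 if s is None else s for s in scores]

counts = Counter(scores)
for code, label in SCALE.items():
    n = counts.get(code, 0)
    print(f"{label}: {n}/{len(scores)} ({100 * n / len(scores):.0f}%)")
```

Flagging disagreements for adjudication, rather than averaging the two ratings, mirrors the consensus procedure the Methods describe.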
| File | Note | Type | License | Size | Format | Access |
|---|---|---|---|---|---|---|
| Goglia_Using_2025.pdf | Online ahead of print | Publisher's version (published with the publisher's layout) | All rights reserved | 618.53 kB | Adobe PDF | Archive administrators only; contact the author |
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.


