Compact Large Language Models for Title and Abstract Screening in Systematic Reviews: An Assessment of Feasibility, Accuracy, and Workload Reduction / Sciurti, Antonio; Migliara, Giuseppe; Siena, Leonardo Maria; Isonne, Claudia; De Blasiis, Maria Roberta; Sinopoli, Alessandra; Iera, Jessica; Marzuillo, Carolina; De Vito, Corrado; Villari, Paolo; Baccolini, Valentina. - In: RESEARCH SYNTHESIS METHODS. - ISSN 1759-2887. - (2025).
Compact Large Language Models for Title and Abstract Screening in Systematic Reviews: An Assessment of Feasibility, Accuracy, and Workload Reduction
Antonio Sciurti; Giuseppe Migliara; Leonardo Maria Siena; Claudia Isonne; Maria Roberta De Blasiis; Alessandra Sinopoli; Jessica Iera; Carolina Marzuillo; Corrado De Vito; Paolo Villari; Valentina Baccolini
2025
Abstract
Systematic reviews play a critical role in evidence-based research but are labor-intensive, especially during title and abstract screening. Compact large language models (LLMs) offer the potential to automate this process, balancing time and cost requirements against accuracy. The aim of this study was to assess the feasibility, accuracy, and workload reduction achieved by three compact LLMs (GPT-4o mini, Llama 3.1 8B, and Gemma 2 9B) in screening titles and abstracts. Records were sourced from three previously published systematic reviews, and the LLMs were asked to rate each record from 0 to 100 for inclusion using a structured prompt. Predefined rating thresholds (25, 50, and 75) were used to compute performance metrics (balanced accuracy, sensitivity, specificity, positive and negative predictive value, and workload saving). Processing time and costs were recorded. Across the systematic reviews, the LLMs achieved high sensitivity (up to 100%) but low precision (below 10%) for records included at full-text screening. Specificity and workload savings improved at higher thresholds, with the 50- and 75-rating thresholds offering the best trade-offs. GPT-4o mini, accessed via an application programming interface, was the fastest model (~40 minutes at most) and incurred usage costs of $0.14–$1.93 per review. Llama 3.1 8B and Gemma 2 9B were run locally, took longer (~4 hours at most), and were free to use. The LLMs proved highly sensitive tools for the title/abstract screening process. High specificity values were reached, allowing substantial workload savings at reasonable cost and processing time. Conversely, we found them to be imprecise. However, high sensitivity and workload reduction are the key factors for their use in the title/abstract screening phase of systematic reviews.
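The threshold-based evaluation described in the abstract can be illustrated with a short sketch. This is not the authors' code: the function name, the ≥-threshold inclusion convention, and the toy data are assumptions; it only shows how sensitivity, specificity, predictive values, balanced accuracy, and workload saving follow from a rating cutoff.

```python
def screening_metrics(ratings, labels, threshold):
    """Hypothetical sketch of the metrics in the abstract.

    ratings: LLM inclusion scores in [0, 100] (one per record).
    labels:  True if the record was included at full-text screening.
    A record is flagged for full-text review when rating >= threshold
    (assumed convention; the paper may use a strict inequality).
    """
    tp = sum(1 for r, y in zip(ratings, labels) if r >= threshold and y)
    fp = sum(1 for r, y in zip(ratings, labels) if r >= threshold and not y)
    fn = sum(1 for r, y in zip(ratings, labels) if r < threshold and y)
    tn = sum(1 for r, y in zip(ratings, labels) if r < threshold and not y)

    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    return {
        "sensitivity": sens,
        "specificity": spec,
        "ppv": tp / (tp + fp) if tp + fp else 0.0,  # precision
        "npv": tn / (tn + fn) if tn + fn else 0.0,
        "balanced_accuracy": (sens + spec) / 2,
        # fraction of records excluded by the model, i.e. records a
        # human screener would no longer need to read
        "workload_saving": (tn + fn) / len(ratings),
    }

# Toy example: 8 records, 2 truly included, evaluated at threshold 50.
ratings = [90, 60, 40, 10, 80, 30, 20, 55]
labels = [True, False, False, False, True, False, False, False]
m = screening_metrics(ratings, labels, threshold=50)
```

Raising the threshold trades sensitivity for specificity and workload saving, which is the trade-off the abstract reports at the 50- and 75-rating cutoffs.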


