Sustainable Italian LLM Evaluation: Community Perspectives and Methodological Guidelines / Moroni, Luca; Pappacoda, Gianmarco; Barba, Edoardo; Conia, Simone; Galassi, Andrea; Magnini, Bernardo; Navigli, Roberto; Torroni, Paolo; Zanoli, Roberto. (2025), pp. 747-759. (Eleventh Italian Conference on Computational Linguistics (CLiC-it 2025), Cagliari, Italy).
Sustainable Italian LLM Evaluation: Community Perspectives and Methodological Guidelines
Luca Moroni; Edoardo Barba; Simone Conia; Andrea Galassi; Roberto Navigli
2025
Abstract
The evaluation of large language models for Italian faces unique challenges due to morphosyntactic complexity, dialectal variation, culture-specific knowledge, and the limited availability of computational resources. This position paper presents a comprehensive framework for Italian LLM benchmarking. We identify key dimensions for LLM evaluation, including linguistic capabilities, knowledge domains, task types, and prompt variations, and propose high-level methodological guidelines for current and future initiatives. We advocate a community-driven, sustainable benchmarking initiative that incorporates dynamic dataset management, open model prioritization, and collaborative infrastructure utilization. Our framework aims to establish a coordinated effort within the Italian NLP community to ensure rigorous, scientifically sound evaluation practices that can adapt to the evolving landscape of Italian LLMs.


