From PEC to PEC24: a new reference corpus for Italian / Spina, S; Zanda, F; Fioravanti, I. - In: ITALIANO LINGUADUE. - ISSN 2037-3597. - 17:1(2025), pp. 745-768. [10.54103/2037-3597/29101]
From PEC to PEC24: a new reference corpus for Italian
Spina S; Zanda F; Fioravanti I
2025
Abstract
This article introduces the PEC24, an extension of the Perugia corpus, as a new reference corpus for Italian. The update mainly concerned the size of the corpus, which now consists of approximately 47 million tokens, with an addition of over 100,000 texts. The PEC24 maintains the same structure as its predecessor: it is divided into 10 sections, representing ten different written and spoken genres. After reviewing the spoken, written, and web corpora available for Italian, the article describes the internal composition of each section of the corpus and explains how the corpus was annotated. Because the PEC24 is available and searchable online, the article also illustrates how it can be queried. The PEC24 represents a significant advancement in the panorama of Italian corpora, offering a representative and more comprehensive resource for linguistic research and corpus-based studies.


