MOSAICo: a Multilingual Open-text Semantically Annotated Interlinked Corpus / Conia, Simone; Barba, Edoardo; Martinez Lorenzo, Abelardo Carlos; Huguet Cabot, Pere Lluis; Orlando, Riccardo; Procopio, Luigi; Navigli, Roberto. - (2024), pp. 7990-8004. (Paper presented at the North American Association for Computational Linguistics conference, held in Mexico City, Mexico) [10.18653/v1/2024.naacl-long.442].

MOSAICo: a Multilingual Open-text Semantically Annotated Interlinked Corpus

Conia, Simone (co-first author)
Barba, Edoardo (co-first author)
Martinez Lorenzo, Abelardo Carlos (co-first author)
Huguet Cabot, Pere Lluis (co-first author)
Orlando, Riccardo (co-first author)
Procopio, Luigi (co-first author)
Navigli, Roberto (co-first author)
2024

Abstract

Several Natural Language Understanding (NLU) tasks focus on linking text to explicit knowledge, including Word Sense Disambiguation, Semantic Role Labeling, Semantic Parsing, and Relation Extraction. In addition to the importance of connecting raw text with explicit knowledge bases, the integration of such carefully curated knowledge into deep learning models has been shown to be beneficial across a diverse range of applications, including Language Modeling and Machine Translation. Nevertheless, the scarcity of semantically-annotated corpora across various tasks and languages limits the potential advantages significantly. To address this issue, we put forward MOSAICo, the first endeavor aimed at equipping the research community with the key ingredients to model explicit semantic knowledge at a large scale, providing hundreds of millions of silver yet high-quality annotations for four NLU tasks across five languages. We describe the creation process of MOSAICo, demonstrate its quality and variety, and analyze the interplay between different types of semantic information. MOSAICo, available at https://github.com/SapienzaNLP/mosaico, aims to drop the requirement of closed, licensed datasets and represents a step towards a level playing field across languages and tasks in NLU.
2024
North American Association for Computational Linguistics
semantic parsing; relation extraction; word sense disambiguation; semantic role labeling; multilingual NLP; natural language processing
04 Publication in conference proceedings::04b Conference paper in volume
Files attached to this item

File: Conia_MOSAICo_2024.pdf
Access: open access
Note: https://aclanthology.org/2024.naacl-long.442.pdf
Type: Publisher's version (published version with the publisher's layout)
License: All rights reserved
Size: 439.26 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11573/1726493