Despite being one of the most popular tasks in lexical semantics, word similarity has often been limited to the English language. Other languages, even those that are widely spoken such as Spanish, do not have a reliable word similarity evaluation framework. We put forward robust methodologies for the extension of existing English datasets to other languages, both at monolingual and cross-lingual levels. We propose an automatic standardization for the construction of cross-lingual similarity datasets, and provide an evaluation, demonstrating its reliability and robustness. Based on our procedure and taking the RG-65 word similarity dataset as a reference, we release two high-quality Spanish and Farsi (Persian) monolingual datasets, and fifteen cross-lingual datasets for six languages: English, Spanish, French, German, Portuguese, and Farsi.

A framework for the construction of monolingual and Cross-lingual Semantic Similarity Datasets / CAMACHO COLLADOS, José; Pilehvar, Mohammed Taher; Navigli, Roberto. - ELECTRONIC. - (2015), pp. 1-7. (Paper presented at the ACL 2015 conference, held in Beijing, China).

A framework for the construction of monolingual and Cross-lingual Semantic Similarity Datasets

CAMACHO COLLADOS, JOSÉ; NAVIGLI, ROBERTO
2015

ACL 2015
Semantics; Knowledge based systems; disambiguation WSD
04 Publication in conference proceedings::04b Conference paper in volume
Files attached to this item
File: Navigli_Framework_2015pdf
Access: open access
Type: Post-print document (version after peer review, accepted for publication)
License: All rights reserved
Size: 748.69 kB
Format: Unknown

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11573/845336