EViLBERT: Learning Task-Agnostic Multimodal Sense Embeddings / Calabrese, Agostina; Bevilacqua, Michele; Navigli, Roberto. - In: IJCAI. - ISSN 1045-0823. - (2020), pp. 481-487. (Paper presented at the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-PRICAI 2020, held in Yokohama) [10.24963/ijcai.2020/67].

EViLBERT: Learning Task-Agnostic Multimodal Sense Embeddings

Calabrese, Agostina; Bevilacqua, Michele; Navigli, Roberto
2020

Abstract

The problem of grounding language in vision is increasingly attracting scholarly efforts. As of now, however, most of the approaches have been limited to word embeddings, which are not capable of handling polysemous words. This is mainly due to the limited coverage of the available semantically-annotated datasets, hence forcing research to rely on alternative technologies (i.e., image search engines). To address this issue, we introduce EViLBERT, an approach which is able to perform image classification over an open set of concepts, both concrete and non-concrete. Our approach is based on the recently introduced Vision-Language Pretraining (VLP) model, and builds upon a manually-annotated dataset of concept-image pairs. We use our technique to clean up the image-to-concept mapping that is provided within a multilingual knowledge base, resulting in over 258,000 images associated with 42,500 concepts. We show that our VLP-based model can be used to create multimodal sense embeddings starting from our automatically-created dataset. In turn, we also show that these multimodal embeddings improve the performance of a Word Sense Disambiguation architecture over a strong unimodal baseline. We release code, dataset and embeddings at http://babelpic.org.
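As a purely illustrative aside, the sketch below shows one way multimodal sense embeddings of this kind could be plugged into a simple nearest-sense disambiguation baseline: candidate senses of a target word are scored by cosine similarity between their embeddings and a contextual vector for the word. This is not the authors' released code; the file format, the helper names, and the scoring scheme are all assumptions made for the example.

import numpy as np

def load_sense_embeddings(path):
    # Hypothetical format: one sense per line, "sense_id v1 v2 ... vN".
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split()
            embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return embeddings

def cosine(a, b):
    # Cosine similarity with a small constant to avoid division by zero.
    return float(a @ b) / (float(np.linalg.norm(a) * np.linalg.norm(b)) + 1e-8)

def disambiguate(context_vector, candidate_senses, sense_embeddings):
    # Return the candidate sense whose (multimodal) embedding is closest
    # to the contextual representation of the target word.
    scored = [(s, cosine(context_vector, sense_embeddings[s]))
              for s in candidate_senses if s in sense_embeddings]
    return max(scored, key=lambda x: x[1])[0] if scored else None

A full system would use a trained contextual encoder and a supervised scoring layer, as in the paper's WSD architecture; the sketch only illustrates where sense-level multimodal vectors would enter such a pipeline.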
2020
Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-PRICAI 2020
multimodality; natural language processing; computer vision
04 Publication in conference proceedings::04c Conference paper published in a journal
Files attached to this item

File: Calabrese_EViLBERT_2020.pdf (open access)
Note: https://www.ijcai.org/Proceedings/2020/67
Type: Publisher's version (published with the publisher's layout)
License: All rights reserved
Size: 520.14 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this item: https://hdl.handle.net/11573/1431898
Citations
  • PMC: ND
  • Scopus: 7
  • Web of Science (ISI): 0