The problem of grounding language in vision is increasingly attracting scholarly efforts. As of now, however, most of the approaches have been limited to word embeddings, which are not capable of handling polysemous words. This is mainly due to the limited coverage of the available semantically-annotated datasets, hence forcing research to rely on alternative technologies (i.e., image search engines). To address this issue, we introduce EViLBERT, an approach which is able to perform image classification over an open set of concepts, both concrete and non-concrete. Our approach is based on the recently introduced Vision-Language Pretraining (VLP) model, and builds upon a manually-annotated dataset of concept-image pairs. We use our technique to clean up the image-to-concept mapping that is provided within a multilingual knowledge base, resulting in over 258,000 images associated with 42,500 concepts. We show that our VLP-based model can be used to create multimodal sense embeddings starting from our automatically-created dataset. In turn, we also show that these multimodal embeddings improve the performance of a Word Sense Disambiguation architecture over a strong unimodal baseline. We release code, dataset and embeddings at http://babelpic.org.
EViLBERT: Learning Task-Agnostic Multimodal Sense Embeddings / Calabrese, Agostina; Bevilacqua, Michele; Navigli, Roberto. - In: IJCAI. - ISSN 1045-0823. - (2020), pp. 481-487. (Intervento presentato al convegno Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-PRICAI 2020 tenutosi a Yokohama) [10.24963/ijcai.2020/67].
EViLBERT: Learning Task-Agnostic Multimodal Sense Embeddings
Calabrese, Agostina
;Bevilacqua, Michele
;Navigli, Roberto
2020
Abstract
The problem of grounding language in vision is increasingly attracting scholarly efforts. As of now, however, most of the approaches have been limited to word embeddings, which are not capable of handling polysemous words. This is mainly due to the limited coverage of the available semantically-annotated datasets, hence forcing research to rely on alternative technologies (i.e., image search engines). To address this issue, we introduce EViLBERT, an approach which is able to perform image classification over an open set of concepts, both concrete and non-concrete. Our approach is based on the recently introduced Vision-Language Pretraining (VLP) model, and builds upon a manually-annotated dataset of concept-image pairs. We use our technique to clean up the image-to-concept mapping that is provided within a multilingual knowledge base, resulting in over 258,000 images associated with 42,500 concepts. We show that our VLP-based model can be used to create multimodal sense embeddings starting from our automatically-created dataset. In turn, we also show that these multimodal embeddings improve the performance of a Word Sense Disambiguation architecture over a strong unimodal baseline. We release code, dataset and embeddings at http://babelpic.org.File | Dimensione | Formato | |
---|---|---|---|
Calabrese_EViLBERT_2020.pdf
accesso aperto
Note: https://www.ijcai.org/Proceedings/2020/67
Tipologia:
Versione editoriale (versione pubblicata con il layout dell'editore)
Licenza:
Tutti i diritti riservati (All rights reserved)
Dimensione
520.14 kB
Formato
Adobe PDF
|
520.14 kB | Adobe PDF |
I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.