
Visual Definition Modeling: Challenging Vision & Language Models to Define Words and Objects

Scarlini, Bianca; Pasini, Tommaso; Navigli, Roberto
2022

Abstract

Architectures that model language and vision together have received much attention in recent years. Nonetheless, most tasks in this field focus on end-to-end applications without providing insights on whether it is the underlying semantics of visual objects or words that is captured. In this paper we draw on the established Definition Modeling paradigm and enhance it by grounding, for the first time, textual definitions in visual representations. We name this new task Visual Definition Modeling and put forward DEMETER and DIONYSUS, two benchmarks where, given an image as context, models have to generate a textual definition for a target that is either i) a word that describes the image, or ii) an object patch therein. To measure the difficulty of our tasks we fine-tuned six different baselines and analyzed their performance, which shows that a text-only encoder-decoder model is more effective than models pretrained to handle inputs of both modalities concurrently. This demonstrates the complexity of our benchmarks and encourages more research on text generation conditioned on multimodal inputs. The datasets for both benchmarks are available at https://github.com/SapienzaNLP/visual-definition-modeling, as well as the code to reproduce our models.
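For readers who want to experiment before downloading the released code, the following is a minimal, hypothetical sketch of how a text-only encoder-decoder baseline of the kind the abstract mentions could be queried: the target word and a caption standing in for the image context are serialized into a single input string, and a pretrained seq2seq model generates the definition. The checkpoint, prompt format, and example below are illustrative assumptions, not the authors' actual setup (which is available at the GitHub link above).

# Minimal, hypothetical sketch of a text-only encoder-decoder baseline for
# definition generation. The prompt format, checkpoint, and example caption
# are assumptions for illustration only.
from transformers import BartForConditionalGeneration, BartTokenizer

model_name = "facebook/bart-base"  # assumption: any seq2seq checkpoint could be used
tokenizer = BartTokenizer.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name)

# Hypothetical instance: define the target word given a caption of the image.
caption = "A man withdrawing money at a bank counter."
target = "bank"
prompt = f"define {target}: {caption}"

inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, num_beams=4, max_length=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

Note that an off-the-shelf checkpoint like this would need fine-tuning on the DEMETER or DIONYSUS training data before producing meaningful definitions; the sketch only shows the input/output framing.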
Year: 2022
Venue: National Conference of the American Association for Artificial Intelligence
Keywords: Visual Definition Modeling; Artificial intelligence; Modeling languages; Visual languages
Type: 04 Conference proceedings publication::04b Conference paper in volume
Citation: Visual Definition Modeling: Challenging Vision & Language Models to Define Words and Objects / Scarlini, Bianca; Pasini, Tommaso; Navigli, Roberto. - 36:10(2022), pp. 11267-11275. (Paper presented at the National Conference of the American Association for Artificial Intelligence, held online) [10.1609/aaai.v36i10.21377].
Files attached to this record

Scarlini_Visual_2022.pdf
  Access: open access
  Type: Publisher's version (published version with the publisher's layout)
  License: Creative Commons
  Size: 11.77 MB
  Format: Adobe PDF

Scarlini_Visual_2022.pdf
  Access: open access
  Note: compressed publisher's version
  Type: Publisher's version (published version with the publisher's layout)
  License: Creative Commons
  Size: 388.58 kB
  Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11573/1685063