Concept-pedia: a Wide-coverage Semantically-annotated Multimodal Dataset

Ghonim, Karim; Bejgu, Andrei Stefan; Fernández-Castro, Alberte; Navigli, Roberto

doi:10.18653/v1/2025.emnlp-main.1745

Vision-language Models (VLMs), such as CLIP and SigLIP, have become the de facto standard for multimodal tasks, serving as essential building blocks for recent Multimodal Large Language Models, including LLaVA and PaliGemma. However, current evaluations for VLMs remain heavily anchored to ImageNet. In this paper, we question whether ImageNet’s coverage is still sufficiently challenging for modern VLMs, and investigate the impact of adding novel and varied concept categories, i.e., semantically grouped fine-grained synsets. To this end, we introduce Concept-pedia, a novel, large-scale, semantically-annotated multimodal resource covering more than 165,000 concepts. Leveraging a language-agnostic, automatic annotation pipeline grounded in Wikipedia, Concept-pedia expands the range of visual concepts, including diverse abstract categories. Building on Concept-pedia, we also present a manually-curated Visual Concept Recognition evaluation benchmark, Concept-10k, that spans thousands of concepts across a wide range of categories. Our experiments show that current models, although excelling on ImageNet, struggle with Concept-10k. Not only do these findings highlight a persistent bias toward ImageNet-centric concepts, but they also underscore the urgent need for more representative benchmarks. By offering a broader and semantically richer testbed, Concept-10k aims to support the development of multimodal systems that better generalize to the complexities of real-world visual concepts.

Concept-pedia: a Wide-coverage Semantically-annotated Multimodal Dataset / Ghonim, Karim; Bejgu, Andrei Stefan; Fernández-Castro, Alberte; Navigli, Roberto. - (2025), pp. 34405-34426. (Paper presented at the 30th Annual Conference on Empirical Methods in Natural Language Processing, EMNLP, held in Suzhou, China) [10.18653/v1/2025.emnlp-main.1745].

Publication year: 2025
Conference name: 30th Annual Conference on Empirical Methods in Natural Language Processing, EMNLP
Keywords: Multimodality, VLM Evaluation, Large Resource Creation
Type: 04 Publication in conference proceedings::04b Conference paper in volume

Use this identifier to cite or link to this document: https://hdl.handle.net/11573/1755854

Note: the data shown have not been validated by the university.
