Multi-agent planning using visual language models

Brienza, Michele; Argenziano, Francesco; Suriani, Vincenzo; Bloisi, Domenico D.; Nardi, Daniele

doi:10.3233/FAIA240916

Large Language Models (LLMs) and Visual Language Models (VLMs) are attracting increasing interest due to their improving performance and applications across various domains and tasks. However, LLMs and VLMs can produce erroneous results, especially when a deep understanding of the problem domain is required. For instance, when planning and perception are needed simultaneously, these models often struggle because of difficulties in merging multi-modal information. To address this issue, fine-tuned models are typically employed and trained on specialized data structures representing the environment. This approach has limited effectiveness, as it can overly complicate the context for processing. In this paper, we propose a multi-agent architecture for embodied task planning that operates without the need for specific data structures as input. Instead, it uses a single image of the environment, handling free-form domains by leveraging commonsense knowledge. We also introduce a novel, fully automatic evaluation procedure, PG2S, designed to better assess the quality of a plan. We validated our approach using the widely recognized ALFRED dataset, comparing PG2S to the existing KAS metric to further evaluate the quality of the generated plans.

Multi-agent planning using visual language models / Brienza, Michele; Argenziano, Francesco; Suriani, Vincenzo; Bloisi, Domenico D.; Nardi, Daniele. - 392:(2024), pp. 3605-3611. (Paper presented at the European Conference on Artificial Intelligence, held in Santiago de Compostela, Spain) [10.3233/FAIA240916].



Year of publication	2024
Conference name	European Conference on Artificial Intelligence
Keywords	multi-agent planning; visual language models; performance metric
Type	04 Publication in conference proceedings::04b Conference paper in volume

Files attached to this record

File	Size	Format
Brienza_Multi-agent-planning_2024.pdf (open access; note: https://ebooks.iospress.nl/pdf/doi/10.3233/FAIA240916; type: publisher's version, i.e. the published version with the publisher's layout; license: Creative Commons)	1.09 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11573/1725810
