Multi-agent planning using visual language models / Brienza, Michele; Argenziano, Francesco; Suriani, Vincenzo; Bloisi, Domenico D.; Nardi, Daniele. - (2024), pp. 3605-3611. (Paper presented at the European Conference on Artificial Intelligence, held in Santiago de Compostela, Spain) [10.3233/FAIA240916].

Multi-agent planning using visual language models

Michele Brienza (first author); Francesco Argenziano (second author); Daniele Nardi (last author)
2024

Abstract

Large Language Models (LLMs) and Visual Language Models (VLMs) are attracting increasing interest due to their improving performance and applications across various domains and tasks. However, LLMs and VLMs can produce erroneous results, especially when a deep understanding of the problem domain is required. For instance, when planning and perception are needed simultaneously, these models often struggle because of difficulties in merging multi-modal information. To address this issue, fine-tuned models are typically employed and trained on specialized data structures representing the environment. This approach has limited effectiveness, as it can overly complicate the context for processing. In this paper, we propose a multi-agent architecture for embodied task planning that operates without the need for specific data structures as input. Instead, it uses a single image of the environment, handling free-form domains by leveraging commonsense knowledge. We also introduce a novel, fully automatic evaluation procedure, PG2S, designed to better assess the quality of a plan. We validated our approach using the widely recognized ALFRED dataset, comparing PG2S to the existing KAS metric to further evaluate the quality of the generated plans.
2024
European Conference on Artificial Intelligence
multi-agent planning; visual language models; performance metric
04 Publication in conference proceedings::04b Conference paper in volume
Files attached to this item
There are no files associated with this item.


Use this identifier to cite or link to this item: https://hdl.handle.net/11573/1725810
