What Does It Say? A comparative study of Large Language Models and Vision-Language Models in Retrieval Augmented Generation for the analysis of manufacturing technical documentation / Proietti, S.; Sabetta, N.; Rosi, M.; Fiocco, E.; Colabianchi, S.; Cesarotti, V. - In: ...SUMMER SCHOOL FRANCESCO TURCO. PROCEEDINGS. - ISSN 2283-8996. - (2025). (30th Summer School Francesco Turco, 2025, Lecce, Italy).
What Does It Say? A comparative study of Large Language Models and Vision-Language Models in Retrieval Augmented Generation for the analysis of manufacturing technical documentation
Sabetta N.: Writing – Original Draft Preparation; Colabianchi S.
2025
Abstract
The industrial sector relies on extensive technical documentation to support manufacturing processes, maintenance, and troubleshooting. These manuals often include diverse input formats such as text, images, tables, graphs, technical drawings, and complex tabular data, making their interpretation a challenge for both human operators and AI systems. Efficiently extracting and processing information from these heterogeneous sources is crucial for improving decision-making and operational efficiency in industrial environments. This study evaluates the ability of AI models to process multimodal industrial documentation through two distinct Retrieval Augmented Generation (RAG) approaches. The first approach employs Large Language Models (LLMs) to generate textual descriptions of full manual pages containing various input types. We assess whether these descriptions preserve all relevant details or introduce information loss. The second approach leverages a Vision-Language Model (VLM) to directly interpret and answer questions based on multimodal content. To benchmark these systems, we compiled a dataset of 90 frequently asked questions (FAQs) derived from a real-world industrial manual, specifically the set-up manual of an industrial 3D printer. The evaluation is based on two key criteria: (1) for the LLM-based system, whether the descriptions retain complete information; and (2) for both systems, whether the generated responses are factually correct, as assessed by expert evaluation. Our findings provide insights into the applicability of LLMs and VLMs in industrial settings, highlighting their strengths and limitations in processing complex technical documentation. While LLMs rely on textual reconstruction, potentially causing information loss, VLMs can process visual content directly but may struggle with highly technical details. This study contributes to the development of AI-powered documentation analysis, emphasizing the need for multimodal AI approaches tailored to industrial applications.
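The abstract describes the two RAG configurations only at a high level. As a minimal, hypothetical sketch of where they diverge, the Python below contrasts them: the first pipeline retrieves over LLM-generated page descriptions (so any detail dropped at the description step is unavailable when answering), while the second hands the retrieved page images directly to a VLM. All function names, the keyword-overlap retriever, and the stub outputs are illustrative assumptions, not the authors' implementation.

```python
# Sketch of the two RAG variants compared in the paper.
# Every function body is a placeholder; real systems would call an LLM/VLM API
# and an embedding-based retriever instead.

from dataclasses import dataclass


@dataclass
class ManualPage:
    page_number: int
    image_path: str  # rendered page containing text, figures, tables, drawings


def describe_page_with_llm(page: ManualPage) -> str:
    """Hypothetical step: turn a whole manual page into a textual description
    that can be indexed. Visual details missed here are lost downstream."""
    return f"Textual description of page {page.page_number}"


def retrieve(query: str, corpus: dict[int, str], k: int = 3) -> list[int]:
    """Placeholder retriever: rank pages by naive keyword overlap.
    In practice this would be a vector index over embeddings."""
    scores = {
        pid: sum(tok in text.lower() for tok in query.lower().split())
        for pid, text in corpus.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


def answer_with_llm(query: str, context: list[str]) -> str:
    """Approach 1: answer from retrieved *textual descriptions* only."""
    return f"LLM answer to '{query}' using {len(context)} description(s)"


def answer_with_vlm(query: str, pages: list[ManualPage]) -> str:
    """Approach 2: a VLM answers from the retrieved page images themselves."""
    return f"VLM answer to '{query}' using {len(pages)} page image(s)"


if __name__ == "__main__":
    manual = [ManualPage(i, f"page_{i}.png") for i in range(1, 6)]
    question = "Which nozzle temperature is required during printer set-up?"

    # Approach 1: LLM-based RAG over page descriptions.
    descriptions = {p.page_number: describe_page_with_llm(p) for p in manual}
    hits = retrieve(question, descriptions)
    print(answer_with_llm(question, [descriptions[i] for i in hits]))

    # Approach 2: VLM-based RAG over the page images at the same retrieved positions.
    by_number = {p.page_number: p for p in manual}
    print(answer_with_vlm(question, [by_number[i] for i in hits]))
```

The sketch makes the trade-off discussed in the abstract concrete: in the first pipeline the answering model never sees the page, only its description, whereas in the second the VLM sees the raw multimodal content but must interpret technical drawings and dense tables on its own.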


