What Does It Say? A comparative study of Large Language Models and Vision-Language Models in Retrieval Augmented Generation for the analysis of manufacturing technical documentation / Proietti, S.; Sabetta, N.; Rosi, M.; Fiocco, E.; Colabianchi, S.; Cesarotti, V. - In: ...SUMMER SCHOOL FRANCESCO TURCO. PROCEEDINGS. - ISSN 2283-8996. - (2025). (30th Summer School Francesco Turco, 2025, Lecce, Italy).
What Does It Say? A comparative study of Large Language Models and Vision-Language Models in Retrieval Augmented Generation for the analysis of manufacturing technical documentation
Sabetta N.: Writing – Original Draft Preparation; Colabianchi S.
2025
Abstract
The industrial sector relies on extensive technical documentation to support manufacturing processes, maintenance, and troubleshooting. These manuals often include diverse input formats such as text, images, tables, graphs, technical drawings, and complex tabular data, making their interpretation a challenge for both human operators and AI systems. Efficiently extracting and processing information from these heterogeneous sources is crucial for improving decision-making and operational efficiency in industrial environments. This study evaluates the ability of AI models to process multimodal industrial documentation through two distinct Retrieval Augmented Generation (RAG) approaches. The first approach employs Large Language Models (LLMs) to generate textual descriptions of full manual pages containing various input types. We assess whether these descriptions preserve all relevant details or introduce information loss. The second approach leverages a Vision-Language Model (VLM) to directly interpret and answer questions based on multimodal content. To benchmark these systems, we compiled a dataset of 90 frequently asked questions (FAQs) derived from a real-world industrial manual, specifically the set-up manual of an industrial 3D printer. The evaluation is based on two key criteria: (1) for the LLM-based system, whether the descriptions retain complete information; and (2) for both systems, whether the generated responses are factually correct, as assessed by expert evaluation. Our findings provide insights into the applicability of LLMs and VLMs in industrial settings, highlighting their strengths and limitations in processing complex technical documentation. While LLMs rely on textual reconstruction, potentially causing information loss, VLMs can process visual content directly but may struggle with highly technical details. This study contributes to the development of AI-powered documentation analysis, emphasizing the need for multimodal AI approaches tailored to industrial applications.
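The abstract describes the two RAG configurations only at a high level. As a minimal, hypothetical sketch of where they diverge, the Python below contrasts them: the first pipeline retrieves over LLM-generated page descriptions (so any detail dropped at the description step is unavailable when answering), while the second hands the retrieved page images directly to a VLM. All function names, the keyword-overlap retriever, and the stub outputs are illustrative assumptions, not the authors' implementation.

```python
# Sketch of the two RAG variants compared in the paper.
# Every function body is a placeholder; real systems would call an LLM/VLM API
# and an embedding-based retriever instead.

from dataclasses import dataclass


@dataclass
class ManualPage:
    page_number: int
    image_path: str  # rendered page containing text, figures, tables, drawings


def describe_page_with_llm(page: ManualPage) -> str:
    """Hypothetical step: turn a whole manual page into a textual description
    that can be indexed. Visual details missed here are lost downstream."""
    return f"Textual description of page {page.page_number}"


def retrieve(query: str, corpus: dict[int, str], k: int = 3) -> list[int]:
    """Placeholder retriever: rank pages by naive keyword overlap.
    In practice this would be a vector index over embeddings."""
    scores = {
        pid: sum(tok in text.lower() for tok in query.lower().split())
        for pid, text in corpus.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


def answer_with_llm(query: str, context: list[str]) -> str:
    """Approach 1: answer from retrieved *textual descriptions* only."""
    return f"LLM answer to '{query}' using {len(context)} description(s)"


def answer_with_vlm(query: str, pages: list[ManualPage]) -> str:
    """Approach 2: a VLM answers from the retrieved page images themselves."""
    return f"VLM answer to '{query}' using {len(pages)} page image(s)"


if __name__ == "__main__":
    manual = [ManualPage(i, f"page_{i}.png") for i in range(1, 6)]
    question = "Which nozzle temperature is required during printer set-up?"

    # Approach 1: LLM-based RAG over page descriptions.
    descriptions = {p.page_number: describe_page_with_llm(p) for p in manual}
    hits = retrieve(question, descriptions)
    print(answer_with_llm(question, [descriptions[i] for i in hits]))

    # Approach 2: VLM-based RAG over the page images at the same retrieved positions.
    by_number = {p.page_number: p for p in manual}
    print(answer_with_vlm(question, [by_number[i] for i in hits]))
```

The sketch makes the trade-off discussed in the abstract concrete: in the first pipeline the answering model never sees the page, only its description, whereas in the second the VLM sees the raw multimodal content but must interpret technical drawings and dense tables on its own.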


