
Predictive perception for detecting human motion anomalies and procedural mistakes / D'AMELY DI MELENDUGNO, GUIDO MARIA. - (2024 May 28).

Predictive perception for detecting human motion anomalies and procedural mistakes

D'AMELY DI MELENDUGNO, GUIDO MARIA
28/05/2024

Abstract

Computer Vision emerges as a cornerstone field within Artificial Intelligence, enabling digital systems to sense the world through images, mirroring the human ability to see and interpret their surroundings. This ability is paramount, as it allows autonomous systems to interact with humans, promising to reliably extend the applications of AI to productive systems. For example, in Human-Robot Collaboration (HRC), accurate vision-based techniques can prevent accidents by providing the cobot with the ability to interpret and swiftly respond to human workers' actions. Similarly, in smart manufacturing, Computer Vision methods allow for the timely detection of errors and anomalies in production lines, enhancing quality control and safety; in video surveillance, they monitor environments for security threats, promptly identifying unusual behaviors or hazardous situations before they escalate. However, the deployment of Computer Vision technologies in real-world scenarios is hampered by significant challenges. These include the requirement for real-time responsiveness, the ability to function reliably in diverse and unpredictable environments, and the development of comprehensive metrics for assessing detection accuracy and system reliability. This thesis explores machine perception's role in enhancing safety and productive integrity across several domains. By leveraging cutting-edge methodologies such as Denoising Diffusion Probabilistic Models and Large Language Models in novel domains, we propose innovative solutions for applications that require a fine-grained understanding of human behaviors and environments to promote effectiveness, safety, and efficiency. First, we delve into the HRC domain.
Aiming to improve the efficiency of current methods, we devise a lightweight Separable-Sparse Graph Convolutional model that we dub SeS-GCN. SeS-GCN bottlenecks the interaction of the GCN's spatial, temporal, and channel-wise dimensions and further learns sparse adjacency matrices through a teacher-student framework. These modeling choices lower the model's memory footprint, providing a practical solution that proves effective in both Human Pose Forecasting and Collision Avoidance. Moreover, we propose the Cobots and Humans in Industrial COllaboration (CHICO) dataset to foster research in this field. For the first time, CHICO encompasses synchronized 3D views and recorded poses of humans and cobots collaborating in a real industrial scenario, representing a precious resource for advancing safe human-robot collaboration. Safety often coincides with promptly detecting and responding to mistakes or anomalies, which otherwise risk escalating into dangerous collisions or production inefficiencies. Thus, following a review of the latest advancements in Video Anomaly Detection methodologies, this thesis builds on the established one-class classification framework, proposing two techniques for human-related Anomaly Detection. The first study investigates adopting non-Euclidean latent spaces to set the one-class classification's metric objective, leveraging the unique properties of hyperbolic and spherical manifolds to improve human-related anomaly detection. The second introduces a Motion Conditioned Diffusion-based approach for Anomaly Detection (MoCoDAD), the first method for video anomaly detection that exploits cutting-edge diffusion models to spot anomalies in motion sequences.
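The separable design described above can be illustrated with a minimal, hypothetical sketch: a single graph convolution step over skeleton-joint features, factorized so that spatial mixing over a sparse adjacency and channel mixing happen in two independent, cheaper steps. The shapes, variable names, and the toy adjacency are illustrative assumptions, not the actual SeS-GCN architecture.

```python
import numpy as np

# Hypothetical sketch of one separable graph convolution step, assuming:
#   X: (J, C) features for J skeleton joints with C channels,
#   A: (J, J) sparse adjacency (mostly zeros, as learned via sparsification),
#   W_channel: (C, C) channel-mixing weights.
# Factorizing the usual dense A @ X @ W into a spatial step followed by a
# channel step keeps the two dimensions from interacting in one big tensor.

def separable_graph_conv(X, A, W_channel):
    spatial = A @ X                              # spatial mixing over the skeleton graph
    return np.maximum(spatial @ W_channel, 0.0)  # channel mixing + ReLU

rng = np.random.default_rng(0)
J, C = 5, 8
X = rng.standard_normal((J, C))
A = np.eye(J)                        # self-loops...
A[0, 1] = A[1, 0] = 0.5              # ...plus a couple of (toy) sparse edges
W_channel = rng.standard_normal((C, C))

Y = separable_graph_conv(X, A, W_channel)
print(Y.shape)  # (5, 8)
```

Because `A` is sparse and the channel weights are shared across joints, the parameter and memory cost grows with `J + C * C` rather than with a fully coupled joint-channel tensor, which is the intuition behind the lightweight design.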
We revisit the common reconstruction-based technique, coupling it with the generative ability of diffusion probabilistic models, extending the state of the art in human-related Video Anomaly Detection and providing insights that serve as the foundation for online mistake detection. Next, this thesis deals with error anticipation in procedural activities. Acknowledging the absence of a proper benchmark for this task, we apply the insights from the one-class classification paradigm and Video Anomaly Detection and propose two novel datasets, together with metrics and baseline methods, for detecting errors in industrial procedural videos. Moreover, we present an innovative technique that exploits the emerging reasoning capabilities of Large Language Models to detect mistakes in procedural video sequences. The result is a novel multimodal approach that leverages an action recognition module to classify the steps of egocentric procedural videos and couples it with a language model that analyzes the obtained procedural transcripts to detect mistakes. This work offers empirical validation through extensive testing on established and newly introduced datasets; by bridging the gap between Video Anomaly Detection and Procedural Mistake Detection, it presents a robust foundation for future research and practical applications. We advance the understanding of procedural mistakes as open-set phenomena and emphasize the crucial need for online detection mechanisms, thus enhancing safety and operational efficiency in these environments. These findings lay the groundwork for future research, shaping the development of safer, more adaptive industrial automation systems.
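The transcript-checking idea above can be sketched in miniature. In the thesis a Large Language Model reasons over the procedural transcript; here a simple prerequisite-ordering check stands in for that reasoning, and the procedure graph and step names are purely hypothetical. The sketch assumes an action-recognition module has already produced the step transcript, and flags a mistake as soon as a step appears before its prerequisites, which is what makes the detection online.

```python
# Toy stand-in for language-model reasoning over a procedural transcript:
# flag the first step whose prerequisites have not yet been completed.
# The procedure graph and step names below are illustrative assumptions.

PREREQUISITES = {
    "attach wheel": {"align axle"},
    "tighten bolts": {"attach wheel"},
}

def first_mistake(transcript):
    done = set()
    for i, step in enumerate(transcript):
        missing = PREREQUISITES.get(step, set()) - done
        if missing:
            return i, step, missing   # online: report at the offending step
        done.add(step)
    return None                       # no ordering mistake found

print(first_mistake(["align axle", "attach wheel", "tighten bolts"]))  # None
print(first_mistake(["attach wheel", "align axle"]))  # (0, 'attach wheel', {'align axle'})
```

Treating mistakes as open-set phenomena means the checker cannot enumerate every possible error in advance; a language model generalizes this rule-based sketch by judging unseen step orderings against its knowledge of the procedure.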
Files attached to this product

File: Tesi_dottorato_DAmelyDiMeledugno.pdf
Access: open access
Note: Predictive Perception for Detecting Human Motion Anomalies and Procedural Mistakes
Type: Doctoral thesis
License: Creative Commons
Size: 24.16 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11573/1711101