Towards an understanding of human activities: from the skeleton to the space / Sanzari, Marta. - (2019 Sep 09).

Towards an understanding of human activities: from the skeleton to the space

SANZARI, MARTA
09/09/2019

Abstract

This thesis describes the research undertaken for the Ph.D. project in Computer Vision, whose main objective is to tackle human activity recognition from RGB videos. Human activity recognition from videos aims to recognize which human activities are taking place in a video, considering only cues directly extracted from the video frames. The related applications are manifold: healthcare monitoring, such as rehabilitation or stress monitoring; monitoring and surveillance of indoor and outdoor activities; human-machine interaction; entertainment; and so on. An important distinction has to be drawn before proceeding further: the one between action and activity. Actions are generally described in the literature as single-person movements that may be composed of multiple simple gestures organized temporally, such as walking, waving, or punching. Gestures are instead elementary movements of a body part. Activities, on the other hand, are described as involving two or more persons and/or objects, or a single person performing complex actions, i.e. a sequence of actions. Human activity recognition has long been one of the main subjects of study in the computer vision and machine learning communities, and it is still a hot topic due to its complexity. Developing a system for human activity recognition is a challenging task because of well-known computer vision problems. Body part occlusions, lighting conditions, and image resolution are only a subset of these problems. Furthermore, similarities between activity classes make the problem even harder. Activities in the same class may be performed by different persons with different body movements, and activities in different classes may be hard to discriminate because they may consist of similar information. The way in which humans execute an activity depends on their habits, and this makes detecting activities quite difficult.
The main conclusion that emerges from an in-depth analysis of the available literature on activity recognition is that a robust activity recognition system has to be context-aware. Namely, not only is human motion important for achieving good performance, but other relevant cues that can be extracted from videos also have to be considered. State-of-the-art research in computer vision still lacks a complete framework for context-based human activity recognition that takes into account the scene where activities take place, object analysis, 3D human motion analysis, and the interdependence between activity classes. This thesis describes computer vision frameworks that will enable the robust recognition of human activities by explicitly considering the scene context. It presents the main contributions to context-aware activity recognition: 3D modeling of articulated and complex objects, 3D human pose estimation from single images, and a method for activity recognition based on human motion primitives. Four major publications are presented, together with an extensive literature review covering computer vision areas such as 3D object modeling, 3D human pose estimation, human action recognition, human action recognition based on action and motion primitives, and human activity recognition based on context. Future work will be to build a complete context-based activity recognition system, exploiting the frameworks introduced so far.
9 Sep 2019
Files attached to this item

File: Tesi_dottorato_Sanzari.pdf
Access: open access
Type: Doctoral thesis
License: All rights reserved
Size: 33.65 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11573/1425638