Crowd counting is a challenging and relevant computer vision task. Most existing methods are image-based, i.e., they exploit only the spatial information of a single image to estimate the corresponding people count. Recently, video-based methods have been proposed to improve counting accuracy by also exploiting the temporal information that comes from the correlation between adjacent frames. In this work, we point out the need to properly evaluate the specific contribution of temporal information over that of spatial information. This issue has not been discussed in existing work, and in some cases the evaluation has been carried out in a way that may overestimate the contribution of temporal information. To address this issue, we propose a categorisation of existing video-based models, discuss how existing work has evaluated the contribution of temporal information, and propose an evaluation approach that aims to provide a more complete assessment for two different categories of video-based methods. Finally, we illustrate our approach for one specific category through experiments on several benchmark video data sets.
On the Evaluation of Video-Based Crowd Counting Models / Ledda, E.; Putzu, L.; Delussu, R.; Fumera, G.; Roli, F. - 13233 LNCS:(2022), pp. 301-311. (Paper presented at the 21st International Conference on Image Analysis and Processing, held in Lecce, Italy) [10.1007/978-3-031-06433-3_26].