Temporal Representation Learning for Robust Perception / Emam, Emad Ashraf Mohamed Kamel. - (2026 May 11).

Temporal Representation Learning for Robust Perception

EMAM, EMAD ASHRAF MOHAMED KAMEL
11/05/2026

Abstract

Perception systems are increasingly deployed in human-centered, real-world settings where behavior must remain stable over time despite noise, variable acquisition conditions, and limited supervision. This thesis studies temporal representation learning for robust perception, focusing on how multivariate sequences are transformed into compact representations that preserve task-relevant temporal structure while remaining resilient to nuisance variability introduced by sensing conditions. Rather than treating temporal modeling as merely an architectural choice, the thesis adopts a pipeline view in which segmentation, preprocessing, augmentation, temporal encoding, and task-aligned evaluation jointly determine which evidence becomes learnable and how reliably reported performance reflects generalization. The thesis develops this perspective through three complementary case studies spanning different sensing modalities and inference regimes. First, a hand gesture recognition pipeline represents hand motion as a sequence of compact, geometry-aware descriptors extracted from skeletal joint dynamics and models that sequence with recurrent encoders to classify gestures. Second, an EEG-based deception detection pipeline classifies truthful versus deceptive responses from noisy, non-stationary brain signals, combining band-pass filtering, overlapping window segmentation, class balancing, and noise-based augmentation with bidirectional temporal modeling. Third, the thesis presents WhoFi, a person re-identification approach based on Wi-Fi channel state information (CSI) that learns identity-discriminative embeddings from radio-frequency signal sequences and performs similarity-based retrieval under environmental and recording variability.

Across these settings, the thesis highlights recurring design tensions in temporal perception. In particular, it studies the trade-off between short segments that preserve fine temporal detail and longer contexts that capture slower trends; the role of preprocessing choices that can either stabilize learning or suppress informative signal variations; and the practical value of augmentation strategies that improve robustness by exposing models to plausible perturbations during training. The work therefore reports not only final performance but also the effects of key design decisions, such as window construction, normalization and filtering strategies, augmentation policies, and temporal encoder selection, using metrics that match the downstream decision rule. Overall, the thesis contributes a coherent, mechanism-driven framing of temporal representation learning that remains applicable across heterogeneous modalities, together with end-to-end pipelines for gesture recognition, EEG deception detection, and Wi-Fi CSI re-identification. By grounding the discussion in these three modalities, the thesis clarifies how reliable temporal inference depends on the combined design of input construction, processing assumptions, temporal encoding, and evaluation aligned with intended use. This integrated view supports both stronger methodological choices and more trustworthy deployment of temporal models in practical sensing scenarios.
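
As an illustration of the window-construction and augmentation decisions mentioned in the abstract, the following is a minimal sketch of overlapping window segmentation followed by additive Gaussian noise. It is not taken from the thesis: the function names, the window length, the stride, and the noise level are all assumptions chosen for a toy example.

```python
import random


def sliding_windows(sequence, window, stride):
    """Split a list of samples into fixed-length windows.

    A stride smaller than the window length yields overlapping windows.
    Trailing samples that do not fill a whole window are dropped.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    last_start = len(sequence) - window
    return [sequence[start:start + window]
            for start in range(0, last_start + 1, stride)]


def add_gaussian_noise(segment, sigma, rng):
    """Return a copy of a multichannel segment with zero-mean Gaussian noise added."""
    return [[value + rng.gauss(0.0, sigma) for value in sample]
            for sample in segment]


# Toy recording (hypothetical data): 10 time steps, 2 channels.
recording = [[float(t), float(-t)] for t in range(10)]

# Window of 4 samples with stride 2, i.e. 50% overlap (assumed values).
windows = sliding_windows(recording, window=4, stride=2)

# Noise-based augmentation with a fixed seed for reproducibility.
rng = random.Random(0)
augmented = [add_gaussian_noise(w, sigma=0.01, rng=rng) for w in windows]
```

In this sketch a shorter window keeps fine temporal detail but sees less context, while a longer one captures slower trends at the cost of fewer training segments; the stride controls how much consecutive windows overlap.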

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11573/1768357
