
Top-Down Attention Modelling in a Cocktail Party Scenario (26 March 2012)


Abstract

Computational auditory scene analysis (CASA) addresses the problem of building machines able to understand and interpret complex acoustic scenes and react appropriately within a short time. A complex acoustic scene may contain several sounds of varying origin and nature coming from different sources, so one of the main challenges is processing all of this information simultaneously with limited computational resources. In 1953, Colin Cherry investigated human behavior under the same circumstances, which he called the "cocktail party problem". His experiments showed that people are very efficient cocktail party solvers thanks to attentive mechanisms, which allow the brain to focus on what must be followed and to ignore what can be discarded. This selection procedure is driven by many factors; depending on their nature, one can distinguish between a bottom-up and a top-down perspective. In the bottom-up case, the sounds of interest are those that stand out from the scene, involving only pre-attentive rather than truly attentive processing. In the top-down case, the goal, a particular task, previous decisions, and acquired models guide the subject's attention. The fusion of these two modalities tells the brain what is salient and what can be attenuated or discarded. In this thesis, we propose a top-down attention model and carry out behavioral experiments, inspired by Cherry's, to investigate the role of top-down attention at the cocktail party. In particular, we model top-down attention as a sequential decision-making process driven by a task (modeled as a classification problem) in an environment where random subsets of features are missing, but where additional features can be gathered from among the missing ones.
The top-down attention problem is thus reduced to answering the question: what to measure next? Attention is based on the top-down saliency of each missing feature, defined as the estimated difference in classification confusion (entropy) with and without that feature, computed conditioned on the set of available features. We also investigate the missing-data problem, comparing the efficiency of several missing-data techniques, and use the results to make the attention model more realistic by allowing the initial training phase to take place with incomplete data. Moreover, we simulate the cocktail party problem in the model and make predictions about sensitivity to confounders under different levels of attention. Finally, we examine the role of temporal and spectral overlaps in human speech intelligibility, and how the presence of a task influences it. We also investigate multimodal human-robot interaction and propose a multimodal speaker identification system that combines acoustic and visual features to identify and track people taking part in a conversation.
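The selection rule described above — attend next to the missing feature whose acquisition is expected to reduce classification entropy the most, conditioned on the features already observed — can be sketched with a small naive Bayes toy model. This is an illustrative sketch only: the class names, feature names, and probability tables below are hypothetical, not taken from the thesis.

```python
import math

# Hypothetical toy model: two classes, three binary features assumed
# conditionally independent given the class (naive Bayes).
CLASSES = ("A", "B")
PRIOR = {"A": 0.5, "B": 0.5}
# P(feature = 1 | class); f1 is discriminative, f2 carries no information.
P1 = {
    "f0": {"A": 0.7, "B": 0.4},
    "f1": {"A": 0.9, "B": 0.1},
    "f2": {"A": 0.5, "B": 0.5},
}

def posterior(observed):
    """P(class | observed feature values) under the naive Bayes model."""
    scores = {}
    for c in CLASSES:
        p = PRIOR[c]
        for f, v in observed.items():
            p *= P1[f][c] if v == 1 else 1.0 - P1[f][c]
        scores[c] = p
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()}

def entropy(dist):
    """Shannon entropy (bits) of a class distribution: the 'confusion'."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def saliency(observed, feature):
    """Expected entropy reduction from measuring `feature` next,
    conditioned on the currently available features."""
    post = posterior(observed)
    h_now = entropy(post)
    h_expected = 0.0
    for v in (0, 1):
        # Predictive probability of this outcome, marginalising over classes.
        pv = sum(post[c] * (P1[feature][c] if v == 1 else 1.0 - P1[feature][c])
                 for c in CLASSES)
        h_expected += pv * entropy(posterior({**observed, feature: v}))
    return h_now - h_expected

observed = {"f0": 1}                 # features already attended to
missing = ["f1", "f2"]               # features that could still be gathered
sal = {f: saliency(observed, f) for f in missing}
best = max(sal, key=sal.get)         # the answer to "what to measure next?"
```

In this toy setting the uninformative feature f2 gets zero saliency (observing it leaves the posterior unchanged), so the model attends to f1, mirroring how a task-driven listener would ignore sources that cannot reduce uncertainty about the task.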


Use this identifier to cite or link to this record: https://hdl.handle.net/11573/917987
