Cross-Attentive CNNs for Joint Spectral and Pitch Feature Learning in Predominant Instrument Recognition from Polyphonic Music / Reghunath, L. C.; Rajan, R.; Napoli, C.; Randieri, C. - In: TECHNOLOGIES. - ISSN 2227-7080. - 14:1(2026). [10.3390/technologies14010003]

Cross-Attentive CNNs for Joint Spectral and Pitch Feature Learning in Predominant Instrument Recognition from Polyphonic Music

Napoli C.; Randieri C.
2026

Abstract

Identifying instruments in polyphonic audio is challenging due to overlapping spectra and variations in timbre and playing styles. This task is central to music information retrieval, with applications in transcription, recommendation, and indexing. We propose a dual-branch Convolutional Neural Network (CNN) that processes Mel-spectrograms and binary pitch masks, fused through a cross-attention mechanism to emphasize pitch-salient regions. On the IRMAS dataset, the model achieves competitive performance with state-of-the-art methods, reaching a micro F1 of 0.64 and a macro F1 of 0.57 with only 0.878M parameters. Ablation studies and t-SNE analyses further highlight the benefits of cross-modal attention for robust predominant instrument recognition.
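
The abstract reports both a micro F1 and a macro F1. As a rough illustration of how the two averages differ for multi-label instrument tags, here is a minimal sketch in plain Python. The function names (`f1_from_counts`, `micro_macro_f1`) and the toy labels are hypothetical and chosen only for this example; they are not the IRMAS data, the reported results, or the authors' evaluation code, whose exact protocol is not described in this record.

```python
from typing import List, Sequence, Tuple


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(
    y_true: Sequence[Sequence[int]],
    y_pred: Sequence[Sequence[int]],
) -> Tuple[float, float]:
    """Return (micro F1, macro F1) for binary multi-label indicator rows."""
    n_classes = len(y_true[0])
    tp: List[int] = [0] * n_classes
    fp: List[int] = [0] * n_classes
    fn: List[int] = [0] * n_classes

    # Accumulate per-class true positives, false positives, false negatives.
    for t_row, p_row in zip(y_true, y_pred):
        for c, (t, p) in enumerate(zip(t_row, p_row)):
            if t and p:
                tp[c] += 1
            elif p:
                fp[c] += 1
            elif t:
                fn[c] += 1

    # Micro: pool the counts over all classes, then compute a single F1.
    micro = f1_from_counts(sum(tp), sum(fp), sum(fn))
    # Macro: compute F1 per class, then take the unweighted mean.
    macro = sum(f1_from_counts(tp[c], fp[c], fn[c]) for c in range(n_classes)) / n_classes
    return micro, macro


# Toy example (invented labels): 4 clips, 3 instrument classes.
toy_true = [[1, 0, 0], [1, 0, 0], [1, 1, 0], [0, 0, 1]]
toy_pred = [[1, 0, 0], [1, 0, 0], [1, 0, 0], [0, 1, 0]]
toy_micro, toy_macro = micro_macro_f1(toy_true, toy_pred)
print(f"micro F1 = {toy_micro:.3f}, macro F1 = {toy_macro:.3f}")
```

In this toy case the frequent class is predicted perfectly while the two rare classes are missed, so the micro average comes out higher than the macro average. A gap in the same direction, as in the reported 0.64 versus 0.57, is at least consistent with weaker performance on less frequent instruments, although this record alone cannot confirm that reading.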
cross-attention mechanism; cross-modal learning; dual-branch convolutional neural network (CNN); mel-spectrogram representation; pitch-aware audio processing; polyphonic music analysis; predominant instrument recognition; spectral–pitch feature fusion
01 Journal publication::01a Journal article
Use this identifier to cite or link to this document: https://hdl.handle.net/11573/1765417