
Max Pooling with Vision Transformers Reconciles Class and Shape in Weakly Supervised Semantic Segmentation / Rossetti, S.; Zappia, D.; Sanzari, M.; Schaerf, M.; Pirri, F. - LNCS 13690 (2022), pp. 446-463. (Paper presented at the European Conference on Computer Vision, held in Tel Aviv, Israel) [10.1007/978-3-031-20056-4_26].

Max Pooling with Vision Transformers Reconciles Class and Shape in Weakly Supervised Semantic Segmentation

Rossetti S.; Zappia D.; Sanzari M.; Schaerf M.; Pirri F.
2022

Abstract

Weakly Supervised Semantic Segmentation (WSSS) research has explored many directions to improve the typical pipeline of CNN, class activation maps (CAM), and refinements, given the image-class label as the only supervision. Though the gap with fully supervised methods has narrowed, closing it further seems unlikely within this framework. On the other hand, WSSS methods based on Vision Transformers (ViT) have not yet explored valid alternatives to CAM. ViT features have been shown to retain scene layout and object boundaries under self-supervised learning. To confirm these findings, we prove that the advantages of transformers in self-supervised methods are further strengthened by Global Max Pooling (GMP), which can leverage patch features to negotiate pixel-label probability with class probability. This work proposes a new WSSS method, dubbed ViT-PCM (ViT Patch-Class Mapping), that is not based on CAM. The presented end-to-end network learns, within a single optimization process, refined shapes and proper localization for segmentation masks. Our model outperforms the state-of-the-art on baseline pseudo-masks (BPM), achieving 69.3% mIoU on the PascalVOC 2012 val set. We show that our approach has the fewest parameters while obtaining higher accuracy than all other approaches. In a sentence, quantitative and qualitative results of our method reveal that ViT-PCM is an excellent alternative to CNN-CAM based architectures.
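The core idea sketched in the abstract can be illustrated in a few lines: map each ViT patch feature to per-class probabilities, then apply Global Max Pooling over patches so that image-level class probabilities are driven by the most responsive patches. The snippet below is a minimal NumPy sketch under assumed shapes (a 14x14 patch grid, 768-dim features, 21 Pascal VOC classes); the weight matrix `W` and all dimensions are illustrative, not the authors' exact architecture or training setup.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def patch_class_mapping(patch_features, W):
    # Linear patch-to-class projection followed by a per-patch softmax:
    # each patch gets a probability distribution over the classes.
    return softmax(patch_features @ W, axis=-1)  # (num_patches, num_classes)

def global_max_pool(patch_probs):
    # GMP over the patch dimension: the image-level probability of a class
    # is the maximum probability any patch assigns to it. During training,
    # these pooled scores would be compared against the image-class labels,
    # so the gradient flows back to the individual patches.
    return patch_probs.max(axis=0)  # (num_classes,)

rng = np.random.default_rng(0)
F = rng.normal(size=(196, 768))   # hypothetical ViT patch features (14x14 grid)
W = rng.normal(size=(768, 21))    # hypothetical projection, 21 VOC classes
P = patch_class_mapping(F, W)     # per-patch class probabilities
y = global_max_pool(P)            # image-level class probabilities
```

At inference, the per-patch distributions `P` themselves can be read as a coarse segmentation map, which is what makes this pooling scheme a candidate replacement for CAM.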
2022
European Conference on Computer Vision
weakly-supervised semantic segmentation; vision transformers; global max pooling; image class-labels supervision
04 Conference proceedings publication::04b Conference paper in volume
Files in this item
No files are associated with this item.

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11573/1664986
Warning: the data displayed here have not been validated by the university.

Citations
  • PMC: not available
  • Scopus: 7
  • Web of Science: 8