Implicit Inversion turns CLIP into a Decoder / D'Orazio, Antonio; Briglia, Maria Rosaria; Crisostomi, Donato; Loi, Dario; Rodolà, Emanuele; Masi, Iacopo. (2026). International Conference on Learning Representations (ICLR), Rio de Janeiro, Brazil.
Implicit Inversion turns CLIP into a Decoder
Antonio D'Orazio; Maria Rosaria Briglia; Donato Crisostomi; Dario Loi; Emanuele Rodolà; Iacopo Masi
2026
Abstract
CLIP is a discriminative model trained to align images and text in a shared embedding space. Due to its multimodal structure, it serves as the backbone of many generative pipelines, where a decoder is trained to map from the shared space back to images. We show that image synthesis is nevertheless possible using CLIP alone—without a pre-trained generative decoder or CLIP tuning. Our approach optimizes a frequency-aware implicit neural representation that encourages coarse-to-fine generation by stratifying frequencies across network layers. To stabilize this inverse mapping, we introduce adversarially robust initialization, a lightweight Orthogonal Procrustes projection to align local text and image embeddings, and a blending loss that anchors outputs to natural image statistics. With CLIP frozen, this framework unlocks capabilities such as text-to-image generation, style transfer, and image reconstruction. Our findings suggest that discriminative models may hold untapped generative potential, hidden in plain sight. Code: https://github.com/OmnAI-Lab/implicit-inversion
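The Orthogonal Procrustes projection mentioned in the abstract has a standard closed-form solution via SVD (Schönemann, 1966). The sketch below is not the paper's implementation — the function name, array shapes, and toy data are illustrative assumptions — it only shows the generic technique: finding the orthogonal map that best aligns one set of embeddings with another.

```python
import numpy as np

def procrustes_align(text_emb, image_emb):
    """Hypothetical helper: closed-form Orthogonal Procrustes.

    Returns the orthogonal matrix W minimizing ||text_emb @ W - image_emb||_F,
    i.e. the best rotation/reflection aligning the two embedding sets.
    """
    # SVD of the cross-covariance between the two embedding sets
    u, _, vt = np.linalg.svd(text_emb.T @ image_emb)
    return u @ vt  # orthogonal map from text space toward image space

# Toy check: if the "image" embeddings are an exact orthogonal transform
# of the "text" embeddings, the solver recovers that transform.
rng = np.random.default_rng(0)
a = rng.standard_normal((32, 8))               # 32 fake text embeddings, dim 8
q, _ = np.linalg.qr(rng.standard_normal((8, 8)))  # a random orthogonal matrix
w = procrustes_align(a, a @ q)
print(np.allclose(w, q, atol=1e-6))
```

Because the solution is a single SVD of a small d×d cross-covariance matrix, such a projection is cheap enough to recompute on the fly, which is consistent with the abstract's description of it as "lightweight".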


