
Implicit Inversion turns CLIP into a Decoder

Antonio D'Orazio; Maria Rosaria Briglia; Donato Crisostomi; Dario Loi; Emanuele Rodolà; Iacopo Masi
2026

Abstract

CLIP is a discriminative model trained to align images and text in a shared embedding space. Due to its multimodal structure, it serves as the backbone of many generative pipelines, where a decoder is trained to map from the shared space back to images. We show that image synthesis is nevertheless possible using CLIP alone—without a pre-trained generative decoder or CLIP tuning. Our approach optimizes a frequency-aware implicit neural representation that encourages coarse-to-fine generation by stratifying frequencies across network layers. To stabilize this inverse mapping, we introduce adversarially robust initialization, a lightweight Orthogonal Procrustes projection to align local text and image embeddings, and a blending loss that anchors outputs to natural image statistics. With CLIP frozen, this framework unlocks capabilities such as text-to-image generation, style transfer, and image reconstruction. Our findings suggest that discriminative models may hold untapped generative potential, hidden in plain sight. Code: https://github.com/OmnAI-Lab/implicit-inversion
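The core mechanism is simple enough to sketch. Below is a minimal, illustrative implementation of CLIP-guided optimization of a frequency-aware implicit neural representation: a SIREN-style network whose sinusoidal frequencies grow with depth (encouraging coarse-to-fine synthesis), rendered to an image and optimized so that the frozen CLIP image embedding matches a target text embedding. All names, omega values, and hyperparameters are illustrative assumptions, not the paper's configuration, and the paper's additional stabilizers (robust initialization, Procrustes projection, blending loss) are omitted. It assumes a CLIP model exposing encode_image / encode_text, e.g. from the open_clip package.

```python
# Minimal sketch: CLIP-guided optimization of a frequency-stratified INR.
# Hyperparameters and architecture choices here are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SirenLayer(nn.Module):
    """Sinusoidal layer; omega_0 sets the frequency band this layer carries."""
    def __init__(self, in_dim, out_dim, omega_0):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)
        self.omega_0 = omega_0

    def forward(self, x):
        return torch.sin(self.omega_0 * self.linear(x))

class FrequencyStratifiedINR(nn.Module):
    """Maps (x, y) coordinates to RGB; earlier layers use lower frequencies,
    later layers higher ones, stratifying frequencies across depth."""
    def __init__(self, hidden=256, omegas=(5.0, 15.0, 30.0, 60.0)):
        super().__init__()
        dims = [2] + [hidden] * len(omegas)
        self.layers = nn.ModuleList(
            SirenLayer(dims[i], dims[i + 1], w) for i, w in enumerate(omegas)
        )
        self.head = nn.Linear(hidden, 3)

    def forward(self, coords):
        h = coords
        for layer in self.layers:
            h = layer(h)
        return torch.sigmoid(self.head(h))  # RGB in [0, 1]

def render(inr, size=224, device="cpu"):
    """Evaluate the INR on a dense pixel grid, returning a 1x3xHxW image."""
    lin = torch.linspace(-1, 1, size, device=device)
    yy, xx = torch.meshgrid(lin, lin, indexing="ij")
    coords = torch.stack([xx, yy], dim=-1).reshape(-1, 2)
    rgb = inr(coords).reshape(size, size, 3)
    return rgb.permute(2, 0, 1).unsqueeze(0)

def text_to_image(clip_model, text_tokens, steps=500, lr=1e-4, device="cpu"):
    clip_model.eval()
    for p in clip_model.parameters():  # CLIP stays frozen throughout
        p.requires_grad_(False)
    inr = FrequencyStratifiedINR().to(device)
    opt = torch.optim.Adam(inr.parameters(), lr=lr)
    with torch.no_grad():
        txt = F.normalize(clip_model.encode_text(text_tokens), dim=-1)
    for _ in range(steps):
        # Differentiable rendering; note that real CLIP expects mean/std-
        # normalized pixel inputs, omitted here for brevity.
        img = render(inr, device=device)
        emb = F.normalize(clip_model.encode_image(img), dim=-1)
        loss = 1.0 - (emb * txt).sum()  # maximize cosine similarity
        opt.zero_grad()
        loss.backward()
        opt.step()
    return render(inr, device=device).detach()
```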
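The Orthogonal Procrustes projection mentioned in the abstract has a classical closed-form solution via SVD; the sketch below shows that generic solver on synthetic paired embeddings. How the paper extracts and pairs the "local" text and image embeddings is not specified here, so the data in this example is purely illustrative.

```python
# Minimal sketch of the closed-form Orthogonal Procrustes solution for
# aligning one embedding space with another.
import torch

def orthogonal_procrustes(X, Y):
    """Return the orthogonal matrix R minimizing ||X @ R - Y||_F.

    X, Y: (n, d) matrices of paired embeddings (rows correspond).
    Classical solution: R = U @ Vt, where U, S, Vt = svd(X^T @ Y).
    """
    U, _, Vt = torch.linalg.svd(X.T @ Y)
    return U @ Vt

# Toy usage: recover a known orthogonal map from noisy paired embeddings.
torch.manual_seed(0)
d, n = 64, 512
Q, _ = torch.linalg.qr(torch.randn(d, d))   # ground-truth orthogonal map
X = torch.randn(n, d)
Y = X @ Q + 0.01 * torch.randn(n, d)        # paired, slightly noisy targets
R = orthogonal_procrustes(X, Y)
print(torch.dist(X @ R, Y))                 # small residual if alignment works
```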
Year: 2026
Venue: International Conference on Learning Representations (ICLR)
Keywords: optimization, CLIP, deep learning, neural inversion
Publication type: 04 Publication in conference proceedings::04b Conference paper in volume
D'Orazio, Antonio; Briglia, Maria Rosaria; Crisostomi, Donato; Loi, Dario; Rodolà, Emanuele; Masi, Iacopo. Implicit Inversion turns CLIP into a Decoder. International Conference on Learning Representations (ICLR), Rio de Janeiro, Brazil, 2026.
Files attached to this item
There are no files associated with this item.

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11573/1763254
Warning: the displayed data have not been validated by the university.
