GramSR: Visual Conditioning for Diffusion-Based Super-Resolution

Super-resolution aims to obtain a high-resolution image from a low-resolution one. This is an extremely difficult task because the same low-resolution image admits several high-resolution reconstructions that are all equally acceptable. Diffusion models represent a major step forward on this task thanks to their strong generative priors. Moreover, because they generate an image by removing noise, they are well suited to removing the degradations that characterize low-resolution images. However, diffusion-based super-resolution models have two major disadvantages: they require a large number of denoising steps, which increases computational cost, and they rely on text conditioning derived from semantic captioning to guide denoising. Such a textual description provides only high-level semantics and lacks spatially aligned visual information. To address these problems, this thesis proposes GramSR, a one-step diffusion-based super-resolution framework that replaces text conditioning with dense visual features extracted from a visual backbone. In the proposed architecture, only the denoising U-Net is trained, using a three-stage LoRA approach that allows significant memory savings during training. The LoRA modules are trained sequentially, each with a different training objective: the first handles degradation removal via a simple pixel loss, the second is responsible for semantic enhancement, and the third enforces texture alignment via feature correlation through a Gram matrix loss computed from DINOv3 features.
Through extensive experiments on standard super-resolution benchmarks, this work shows that conditioning on dense visual features effectively guides the denoising process, and that disentangling degradation removal, semantic enhancement, and texture alignment yields superior structural fidelity and texture realism.
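
As an illustration of the Gram matrix loss mentioned in the abstract, the following is a minimal NumPy sketch, not the thesis implementation. It assumes the classic channel-correlation form of the Gram matrix (as used in texture and style losses) and a mean squared difference between the two matrices; the exact formulation, normalisation, and DINOv3 layer used in GramSR are not stated in the abstract. The names `gram_matrix` and `gram_loss` are hypothetical, and random arrays stand in for real DINOv3 patch features.

```python
import numpy as np


def gram_matrix(features: np.ndarray) -> np.ndarray:
    """Channel-by-channel Gram matrix of a set of patch features.

    features: array of shape (N, D), one D-dimensional descriptor per patch.
    Returns a (D, D) matrix of feature correlations averaged over the patches.
    """
    n_patches = features.shape[0]
    return features.T @ features / n_patches


def gram_loss(pred_features: np.ndarray, target_features: np.ndarray) -> float:
    """Mean squared difference between the two Gram matrices (assumed form)."""
    diff = gram_matrix(pred_features) - gram_matrix(target_features)
    return float(np.mean(diff ** 2))


# Stand-in data: 16 patches with 8 channels each (real features would come
# from a visual backbone applied to the predicted and ground-truth images).
rng = np.random.default_rng(0)
target = rng.normal(size=(16, 8))
shuffled = target[rng.permutation(16)]  # same patches, different positions
other = rng.normal(size=(16, 8))        # unrelated feature statistics

loss_same = gram_loss(target, target)
loss_shuffled = gram_loss(shuffled, target)
loss_other = gram_loss(other, target)
print(loss_same, loss_shuffled, loss_other)
```

In this form the loss ignores where each patch sits and compares only feature co-occurrence statistics, so the shuffled input scores essentially zero while unrelated features do not. This is why a loss of this kind is usually paired with a spatially aware term such as a pixel loss.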

D'ORONZIO, FABIO

2024/2025

Faculty/Department: Dipartimento di Ingegneria "Enzo Ferrari"
Degree programme: Artificial intelligence engineering
Academic year: 2024
Keywords: Deep Learning; Super-Resolution; Diffusion Model; LoRA; Gram Matrix
Supervisor: BARALDI, LORENZO
Co-examiner: ZINI, LEONARDO
Appears in collections: Master's theses (Lauree Magistrali)

Files in this item:

File	Size	Format
D'Oronzio.Fabio.pdf (restricted access)	51.49 MB	Adobe PDF

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14251/5712