GramSR: Visual Conditioning for Diffusion-Based Super-Resolution
D'ORONZIO, FABIO
2024/2025
Abstract
Super-resolution aims to reconstruct a high-resolution image from a low-resolution one. This is an extremely difficult task because the same low-resolution image admits several high-resolution reconstructions that are all equally acceptable. Diffusion models represent a major step forward on this task thanks to their strong generative priors. Moreover, because they generate an image by removing noise, they are natural candidates for removing the degradations that characterize low-resolution images. However, diffusion-based super-resolution models have two major disadvantages: they require a large number of denoising steps, which increases computational cost, and they rely on text conditioning derived from semantic captioning to guide denoising. Such a textual description provides only high-level semantics and lacks spatially aligned visual information. To address these problems, this thesis proposes GramSR, a one-step diffusion-based super-resolution framework that replaces text conditioning with dense visual features extracted from a visual backbone. In the proposed architecture, only the denoising U-Net is trained, using a three-stage LoRA approach that yields significant memory savings during training. The LoRA modules are trained sequentially, each with a different training objective: the first handles degradation removal via a simple pixel loss, the second is responsible for semantic enhancement, and the third enforces texture alignment via feature correlation through a Gram matrix loss computed from DINOv3 features.
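To make the texture-alignment objective concrete, the following is a minimal sketch of a Gram matrix loss on patch features. It is not the thesis implementation: the abstract does not specify the exact form of the loss, so the channel-wise Gram matrix, the mean squared difference, and the names `gram_matrix` and `gram_loss` are all assumptions chosen for illustration. The random lists stand in for DINOv3 patch features. A real training setup would compute the same quantities on tensors in an automatic-differentiation framework.

```python
import random


def gram_matrix(features):
    """Return the C x C Gram matrix of N patch features.

    `features` is a list of N patches, each a list of C floats.
    Entry (i, j) is the average product of channels i and j over all patches.
    """
    n_patches = len(features)
    n_channels = len(features[0])
    return [
        [sum(f[i] * f[j] for f in features) / n_patches for j in range(n_channels)]
        for i in range(n_channels)
    ]


def gram_loss(sr_features, hr_features):
    """Mean squared difference between the Gram matrices of two feature sets."""
    g_sr = gram_matrix(sr_features)
    g_hr = gram_matrix(hr_features)
    n_channels = len(g_hr)
    total = sum(
        (g_sr[i][j] - g_hr[i][j]) ** 2
        for i in range(n_channels)
        for j in range(n_channels)
    )
    return total / (n_channels * n_channels)


# Hypothetical stand-ins for backbone features: 16 patches with 4 channels each.
random.seed(0)
hr = [[random.gauss(0.0, 1.0) for _ in range(4)] for _ in range(16)]
# A "super-resolved" version whose features deviate slightly from the target.
sr = [[value + random.gauss(0.0, 0.1) for value in patch] for patch in hr]

print(gram_loss(hr, hr))  # identical features give zero loss
print(gram_loss(sr, hr))  # perturbed features give a positive loss
```

Because this Gram matrix sums over patches, it depends only on how channels co-occur and not on where each patch sits in the image, which is why such statistics are commonly used as texture descriptors. A patch-by-patch similarity matrix is another possible reading of "Gram matrix" here; which variant GramSR uses is not stated in the abstract.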
Through extensive experiments on standard super-resolution benchmarks, this work shows that conditioning on dense visual features effectively guides the denoising process and that disentangling degradation removal, semantic enhancement, and texture alignment yields superior structural fidelity and texture realism.

| File | Size | Format |
|---|---|---|
| D'Oronzio.Fabio.pdf (restricted access) | 51.49 MB | Adobe PDF |
Documents in UNITESI are protected by copyright, and all rights are reserved unless otherwise indicated.
https://hdl.handle.net/20.500.14251/5712