ControlNet for Unpaired RGB-to-Thermal Image Translation Using Edge-Based Guidance
CORRADI, LORENZO
2024/2025
Abstract
The generation of realistic synthetic thermal images is a critical challenge for vision systems operating in data-scarce environments. Thermal imaging is increasingly used in various applications, with particular importance in autonomous driving, as well as in surveillance, industrial inspection, search and rescue, and medical diagnostics. Unlike visible-spectrum imaging, thermal cameras capture infrared radiation emitted by objects, making them especially valuable in scenarios with poor illumination, adverse weather conditions, or environments with smoke, fog, or dust. These properties make thermal imagery a crucial modality for developing robust perception systems in real-world scenarios where conventional RGB imaging may fail.

Despite these advantages, the adoption of thermal imaging in machine learning pipelines is hindered by the scarcity of large, annotated datasets. This limitation motivates the exploration of synthetic data generation to compensate for the lack of data. This work explores generative methods conditioned on spatial priors to synthesize realistic thermal images without requiring pixel-level alignment between thermal and RGB images, enhancing their applicability in real-world scenarios where such alignment is often difficult to obtain.

The main contribution of this work is a novel pipeline for synthetic thermal image generation based on ControlNet, which outperforms existing methods in both visual fidelity and downstream object detection performance. Built on diffusion models, ControlNet enables precise and controllable image synthesis through explicit spatial conditioning. In this pipeline, edge maps extracted from segmentation masks produced by the Segment Anything Model (SAM) are used to guide generation, allowing the output images to maintain structural coherence and semantic consistency with the source content.
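The abstract does not specify how edge maps are derived from SAM's segmentation masks. A minimal illustrative sketch, assuming the simplest approach (the one-pixel boundary of each binary mask, i.e. the mask minus its erosion), is shown below; the function name `mask_to_edge_map` is hypothetical, and the thesis's actual extractor may instead use a standard edge detector such as Canny.

```python
import numpy as np

def mask_to_edge_map(mask: np.ndarray) -> np.ndarray:
    """Approximate boundary of a binary segmentation mask.

    A pixel is an edge pixel if it belongs to the mask but at least one
    of its 4-connected neighbours does not (a one-pixel erosion residue).
    Returns a uint8 image with edges at 255, background at 0.
    """
    mask = mask.astype(bool)
    # Pad with False so objects touching the image border still produce edges.
    padded = np.pad(mask, 1, constant_values=False)
    eroded = (
        padded[1:-1, 1:-1]
        & padded[:-2, 1:-1] & padded[2:, 1:-1]   # up / down neighbours
        & padded[1:-1, :-2] & padded[1:-1, 2:]   # left / right neighbours
    )
    return (mask & ~eroded).astype(np.uint8) * 255

# Toy example: a 5x5 mask containing a filled 3x3 square.
m = np.zeros((5, 5), dtype=np.uint8)
m[1:4, 1:4] = 1
edges = mask_to_edge_map(m)  # outline of the square; its centre pixel is 0
```

In a ControlNet pipeline the resulting edge image would serve as the spatial conditioning input, analogous to the Canny-conditioned variants commonly used with diffusion models.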
The quality of the generated images was assessed both qualitatively and quantitatively, comparing the proposed ControlNet pipeline with two state-of-the-art baselines: an edge-guided GAN and a two-stage diffusion approach (ECDM). ControlNet demonstrated clear improvements in perceptual quality and distributional metrics compared to baseline methods. Notably, it achieved a KID score of 0.0106, approximately 74% lower than the best baseline score, indicating a significantly closer statistical alignment with the distribution of real thermal images. Furthermore, the synthetic images generated with ControlNet were used to construct thermal datasets for training object detectors. Results show that a detector trained on ControlNet-generated data achieves a mean mAP@50 improvement of approximately 14% and a mean mAP@50:95 improvement of about 9% compared to detectors trained on datasets generated by the best baseline method, confirming the superior effectiveness of the proposed approach in real downstream perception tasks.
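For context on the reported 0.0106, KID (Kernel Inception Distance) is the unbiased squared MMD between Inception feature sets of real and generated images, computed with a cubic polynomial kernel (Binkowski et al., 2018). The sketch below implements that estimator in plain NumPy; the random vectors stand in for Inception embeddings purely for illustration and are not part of the thesis's evaluation.

```python
import numpy as np

def polynomial_kernel(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    # Cubic polynomial kernel used by KID: k(a, b) = (a . b / d + 1)^3
    d = x.shape[1]
    return (x @ y.T / d + 1.0) ** 3

def kid_score(real: np.ndarray, fake: np.ndarray) -> float:
    """Unbiased squared-MMD estimate between two feature sets."""
    m, n = len(real), len(fake)
    k_rr = polynomial_kernel(real, real)
    k_ff = polynomial_kernel(fake, fake)
    k_rf = polynomial_kernel(real, fake)
    # Exclude self-similarity (diagonal) terms for the unbiased estimator.
    term_rr = (k_rr.sum() - np.trace(k_rr)) / (m * (m - 1))
    term_ff = (k_ff.sum() - np.trace(k_ff)) / (n * (n - 1))
    term_rf = 2.0 * k_rf.mean()
    return float(term_rr + term_ff - term_rf)

rng = np.random.default_rng(0)
feats_real = rng.normal(size=(100, 64))        # stand-in for real-image features
feats_close = rng.normal(size=(100, 64))       # same distribution -> small KID
feats_far = rng.normal(size=(100, 64)) + 1.0   # shifted distribution -> larger KID

score_close = kid_score(feats_real, feats_close)
score_far = kid_score(feats_real, feats_far)
```

Lower values indicate that the generated features are statistically closer to the real ones, which is why the 74% reduction over the best baseline is meaningful.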
| File | Size | Format |
|---|---|---|
| Corradi.Lorenzo.pdf (open access) | 13.22 MB | Adobe PDF |
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/20.500.14251/3416