LLM-based SVG image captioning with conceptual embeddings

DI LUZIO, EMANUELE
2024/2025

Abstract

Scalable Vector Graphics (SVG) are ubiquitous in modern web design, yet current Multimodal Large Language Models (MLLMs) struggle to interpret their raw XML structure. Most existing approaches rely on rasterizing SVGs into pixel grids, a process that discards the semantic richness of the vector definition and introduces resolution artifacts. In this work, I present a novel architecture that enables a decoder-only LLM to directly ingest and interpret SVG code without rasterization. I integrate an SVG Path Embedder (SPE) – originally developed by Zini et al. – that maps continuous geometric coordinates into the LLM’s embedding space using sinusoidal functions. By combining this encoder with Qwen2-7B and Gemma-9B via Low-Rank Adaptation (LoRA), I demonstrate that the model can learn to "read" vector geometries effectively. My experiments, conducted on a dataset of 90,000 SVG-caption pairs and evaluated on a stratified benchmark of 400 samples, show that this vector-native approach outperforms zero-shot raster baselines. Specifically, the SPE + Qwen2-7B configuration achieves a CLIPScore of 29.3 and a BLEU-1 score of 0.42, offering a parameter-efficient alternative to vision-encoder-based methods. I provide a detailed analysis of the model’s capabilities and limitations, offering a new perspective on vector-language understanding.
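The abstract states that the SPE maps continuous geometric coordinates into the LLM's embedding space using sinusoidal functions, but gives no implementation details. As an illustration only, here is a minimal sketch of one common way to do this, following the standard Transformer positional-encoding scheme applied to real-valued coordinates; the function name, embedding dimension, and frequency schedule are my assumptions, not the thesis's actual design.

```python
import math

def sinusoidal_coord_embedding(coord: float, dim: int, max_freq: float = 10000.0) -> list[float]:
    """Map one continuous SVG coordinate to a dim-dimensional vector.

    Interleaves sin/cos pairs at geometrically spaced frequencies, so that
    nearby coordinates receive nearby embeddings at every scale.
    """
    emb = []
    for i in range(dim // 2):
        # Frequencies decay geometrically, as in Transformer positional encoding.
        freq = 1.0 / (max_freq ** (2 * i / dim))
        emb.append(math.sin(coord * freq))
        emb.append(math.cos(coord * freq))
    return emb

# Embed the x-coordinate of a hypothetical path command, e.g. "M 10 20".
vec = sinusoidal_coord_embedding(10.0, dim=8)
print(len(vec))  # 8
```

In a full pipeline, one such vector per coordinate would be projected (e.g. by a learned linear layer) into the LLM's token-embedding space; that projection step is omitted here.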
SPE–LLM pipeline
SVG captioning
Transformer
LoRA fine-tuning
SVG tokenization
Files in this item:

File: DiLuzio.Emanuele.pdf (open access)
Size: 3.33 MB
Format: Adobe PDF

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this item: https://hdl.handle.net/20.500.14251/4722