LLM-based SVG image captioning with conceptual embeddings

DI LUZIO, EMANUELE
2024/2025

Abstract

Scalable Vector Graphics (SVG) are ubiquitous in modern web design, yet current Multimodal Large Language Models (MLLMs) struggle to interpret their raw XML structure. Most existing approaches rely on rasterizing SVGs into pixel grids, a process that discards the semantic richness of the vector definition and introduces resolution artifacts. In this work, I present a novel architecture that enables a decoder-only LLM to directly ingest and interpret SVG code without rasterization. I integrate an SVG Path Embedder (SPE) – originally developed by Zini et al. – that maps continuous geometric coordinates into the LLM’s embedding space using sinusoidal functions. By combining this encoder with Qwen2-7B and Gemma-9B via Low-Rank Adaptation (LoRA), I demonstrate that the model can learn to "read" vector geometries effectively. My experiments, conducted on a dataset of 90,000 SVG-caption pairs and evaluated on a stratified benchmark of 400 samples, show that this vector-native approach outperforms zero-shot raster baselines. Specifically, the SPE + Qwen2-7B configuration achieves a CLIPScore of 29.3 and a BLEU-1 score of 0.42, offering a parameter-efficient alternative to vision-encoder-based methods. I provide a detailed analysis of the model’s capabilities and limitations, offering a new perspective on vector-language understanding.
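The abstract states that the SPE maps continuous geometric coordinates into the LLM's embedding space using sinusoidal functions, but gives no implementation details. As an illustration only, here is a minimal sketch of one common way to do this, following the standard Transformer positional-encoding scheme applied to real-valued coordinates; the function name, embedding dimension, and frequency schedule are my assumptions, not the thesis's actual design.

```python
import math

def sinusoidal_coord_embedding(coord: float, dim: int, max_freq: float = 10000.0) -> list[float]:
    """Map one continuous SVG coordinate to a dim-dimensional vector.

    Interleaves sin/cos pairs at geometrically spaced frequencies, so that
    nearby coordinates receive nearby embeddings at every scale.
    """
    emb = []
    for i in range(dim // 2):
        # Frequencies decay geometrically, as in Transformer positional encoding.
        freq = 1.0 / (max_freq ** (2 * i / dim))
        emb.append(math.sin(coord * freq))
        emb.append(math.cos(coord * freq))
    return emb

# Embed the x-coordinate of a hypothetical path command, e.g. "M 10 20".
vec = sinusoidal_coord_embedding(10.0, dim=8)
print(len(vec))  # 8
```

In a full pipeline, one such vector per coordinate would be projected (e.g. by a learned linear layer) into the LLM's token-embedding space; that projection step is omitted here.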
SPE–LLM pipeline
SVG captioning
Transformer
LoRA fine-tuning
SVG tokenization
Files in this item:

File: DiLuzio.Emanuele.pdf (open access)
Size: 3.33 MB
Format: Adobe PDF

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this item: https://hdl.handle.net/20.500.14251/4722