Hierarchical Knowledge Retrieval for Visual Question Answering with Lightweight Language Models
DI BIASE, FABIO
2024/2025
Abstract
This thesis investigates methods to enhance the performance of small-scale language models on knowledge-intensive Visual Question Answering (VQA) tasks. While large models demonstrate strong capabilities in handling factual queries, smaller architectures often struggle when external knowledge is required. To address this challenge, we propose a VQA pipeline that integrates a retrieval mechanism inspired by Retrieval-Augmented Generation (RAG). The system enriches each question with external context retrieved from high-quality encyclopedic sources, such as Wikipedia, which are organized into semantic clusters to facilitate efficient access. By dynamically supplementing the input prompt with relevant evidence instead of relying solely on the model’s parametric memory, the approach improves answer accuracy while maintaining low computational overhead. The framework combines lightweight retrieval, context selection, and generation into a unified architecture, demonstrating that small language models can effectively benefit from structured external knowledge in multimodal reasoning scenarios.
| File | Size | Format |
|---|---|---|
| DiBiase.Fabio.pdf (open access) | 1.57 MB | Adobe PDF |
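The retrieve-then-augment pipeline summarized in the abstract — grouping encyclopedic passages into semantic clusters, retrieving the most relevant evidence for a question, and prepending it to the model's prompt — can be sketched as follows. This is a minimal illustration, not the thesis implementation: the bag-of-words "embedding", the sample passages, the cluster names, and the prompt template are all hypothetical stand-ins (the actual system works over Wikipedia-scale sources with learned encoders).

```python
# Minimal sketch of cluster-based retrieval + prompt augmentation for VQA.
# All passages, cluster names, and the toy embedding are illustrative only.
from collections import Counter
import math

def embed(text):
    """Toy bag-of-words vector; a stand-in for a neural sentence encoder."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical knowledge base, pre-organized into semantic clusters.
clusters = {
    "animals": ["The cheetah is the fastest land animal.",
                "Penguins are flightless birds of the Southern Hemisphere."],
    "landmarks": ["The Eiffel Tower is located in Paris and was completed in 1889.",
                  "The Colosseum in Rome could hold tens of thousands of spectators."],
}
# One centroid per cluster, so retrieval only scans the best-matching cluster.
centroids = {name: embed(" ".join(ps)) for name, ps in clusters.items()}

def retrieve(question, top_k=1):
    """Two-stage lookup: pick the closest cluster, then rank its passages."""
    q = embed(question)
    best = max(centroids, key=lambda n: cosine(q, centroids[n]))
    ranked = sorted(clusters[best], key=lambda p: cosine(q, embed(p)), reverse=True)
    return ranked[:top_k]

def build_prompt(question, image_caption):
    """Supplement the prompt with retrieved evidence rather than relying
    solely on the language model's parametric memory."""
    context = " ".join(retrieve(question))
    return f"Context: {context}\nImage: {image_caption}\nQuestion: {question}\nAnswer:"

prompt = build_prompt("When was the Eiffel Tower completed?",
                      "a photo of a tall iron tower")
print(prompt)
```

The two-stage search (cluster centroid first, then passages within the winning cluster) is what keeps retrieval cheap: only one cluster's passages are scored per query, which mirrors the low-overhead goal stated in the abstract.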
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/20.500.14251/3808