Hierarchical Knowledge Retrieval for Visual Question Answering with Lightweight Language Models
DI BIASE, FABIO
2024/2025
Abstract
This thesis investigates methods to enhance the performance of small-scale language models on knowledge-intensive Visual Question Answering (VQA) tasks. While large models demonstrate strong capabilities in handling factual queries, smaller architectures often struggle when external knowledge is required. To address this challenge, we propose a VQA pipeline that integrates a retrieval mechanism inspired by Retrieval-Augmented Generation (RAG). The system enriches each question with external context retrieved from high-quality encyclopedic sources, such as Wikipedia, which are organized into semantic clusters to facilitate efficient access. By dynamically supplementing the input prompt with relevant evidence instead of relying solely on the model’s parametric memory, the approach improves answer accuracy while maintaining low computational overhead. The framework combines lightweight retrieval, context selection, and generation into a unified architecture, demonstrating that small language models can effectively benefit from structured external knowledge in multimodal reasoning scenarios.
| File | Size | Format |
|---|---|---|
| DiBiase.Fabio.pdf (open access) | 1.57 MB | Adobe PDF |
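The retrieve-then-augment pipeline summarized in the abstract — grouping encyclopedic passages into semantic clusters, retrieving the most relevant evidence for a question, and prepending it to the model's prompt — can be sketched as follows. This is a minimal illustration, not the thesis implementation: the bag-of-words "embedding", the sample passages, the cluster names, and the prompt template are all hypothetical stand-ins (the actual system works over Wikipedia-scale sources with learned encoders).

```python
# Minimal sketch of cluster-based retrieval + prompt augmentation for VQA.
# All passages, cluster names, and the toy embedding are illustrative only.
from collections import Counter
import math

def embed(text):
    """Toy bag-of-words vector; a stand-in for a neural sentence encoder."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical knowledge base, pre-organized into semantic clusters.
clusters = {
    "animals": ["The cheetah is the fastest land animal.",
                "Penguins are flightless birds of the Southern Hemisphere."],
    "landmarks": ["The Eiffel Tower is located in Paris and was completed in 1889.",
                  "The Colosseum in Rome could hold tens of thousands of spectators."],
}
# One centroid per cluster, so retrieval only scans the best-matching cluster.
centroids = {name: embed(" ".join(ps)) for name, ps in clusters.items()}

def retrieve(question, top_k=1):
    """Two-stage lookup: pick the closest cluster, then rank its passages."""
    q = embed(question)
    best = max(centroids, key=lambda n: cosine(q, centroids[n]))
    ranked = sorted(clusters[best], key=lambda p: cosine(q, embed(p)), reverse=True)
    return ranked[:top_k]

def build_prompt(question, image_caption):
    """Supplement the prompt with retrieved evidence rather than relying
    solely on the language model's parametric memory."""
    context = " ".join(retrieve(question))
    return f"Context: {context}\nImage: {image_caption}\nQuestion: {question}\nAnswer:"

prompt = build_prompt("When was the Eiffel Tower completed?",
                      "a photo of a tall iron tower")
print(prompt)
```

The two-stage search (cluster centroid first, then passages within the winning cluster) is what keeps retrieval cheap: only one cluster's passages are scored per query, which mirrors the low-overhead goal stated in the abstract.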
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/20.500.14251/3808