Self-Elicited Multimodal LLMs for Knowledge-based Vision Question Answering
BONACORSI, LUCA
2024/2025
Abstract
This thesis proposes a unified framework for applying and evaluating self-elicitation strategies in dedicated inference steps of Multimodal Large Language Models (MLLMs) that rely on external knowledge for visual question answering. The self-elicitation techniques exploit the internal attention distributions of the MLLM to automatically identify and mark salient evidence sentences within long textual contexts. This approach enables unsupervised evidence selection without additional annotations. Evaluations on visual question answering benchmarks demonstrate that attention-guided self-elicitation improves over the baseline MLLM. To mitigate the quadratic scaling of the attention matrices with context length during the self-elicitation phase, several strategies are tested to optimize and filter the data before it is passed to the model. All experiments are based on the Encyclopedic-VQA dataset, which uses Wikipedia pages as its external knowledge source. In addition to the gold evidence passages, used as an upper-bound evaluation, external textual knowledge is retrieved from the knowledge base with two alternative retriever modules, Google Lens and EVA-CLIP, both of which return a set of candidate Wikipedia pages for each visual question. From the retrieved pages, textual content is considered at different levels of granularity, ranging from full-page contexts to smaller passage-level segments. To reduce context length and discard irrelevant information, an optional critic model is applied as a post-retrieval filtering step, selecting only passages deemed relevant to the image-question pair before the self-elicitation inference step. This design enables controlled comparisons between page-level and passage-level contexts, with and without critic-based filtering.
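The core idea of attention-guided evidence selection described above can be illustrated with a minimal, hypothetical sketch: per-token attention mass (assumed to be extracted from the MLLM at inference time) is aggregated into per-sentence scores, and the top-scoring sentences are marked as evidence. The function name, the length normalization, and the toy attention values are illustrative assumptions, not the thesis's actual implementation.

```python
import numpy as np

def select_evidence_sentences(sentences, token_attn, token_to_sentence, top_k=2):
    """Aggregate per-token attention mass into per-sentence scores,
    then return the indices of the top_k highest-scoring sentences.

    token_attn: attention weight per context token (assumed already
    extracted from the model and averaged over heads/layers).
    token_to_sentence: sentence index each token belongs to.
    """
    scores = np.zeros(len(sentences))
    for attn, sent_idx in zip(token_attn, token_to_sentence):
        scores[sent_idx] += attn
    # Normalize by sentence length (token count) to avoid a bias
    # toward long sentences accumulating more attention mass.
    counts = np.bincount(token_to_sentence, minlength=len(sentences))
    scores = scores / np.maximum(counts, 1)
    top = np.argsort(scores)[::-1][:top_k]
    return sorted(top.tolist())

# Toy example: 3 context sentences, 6 tokens with synthetic attention weights.
sentences = ["The tower was built in 1889.",
             "It is located in Paris.",
             "Tickets are sold online."]
token_attn = np.array([0.30, 0.25, 0.05, 0.20, 0.10, 0.10])
token_to_sentence = np.array([0, 0, 1, 1, 2, 2])
print(select_evidence_sentences(sentences, token_attn, token_to_sentence, top_k=1))
# → [0]
```

In a real pipeline the selected sentences would be marked (e.g. wrapped in delimiter tokens) and fed back to the MLLM in the dedicated self-elicitation inference step; this sketch only covers the unsupervised scoring-and-selection part.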
| File | Size | Format |
|---|---|---|
| Bonacorsi.Luca.pdf (embargo until 13/08/2027) | 6.65 MB | Adobe PDF |
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/20.500.14251/4721