Self-Elicited Multimodal LLMs for Knowledge-based Vision Question Answering

BONACORSI, LUCA
2024/2025

Abstract

This thesis proposes a unified framework for applying and evaluating self-elicitation strategies in dedicated inference steps of Multimodal Large Language Models (MLLMs) for knowledge-based visual question answering over external textual sources. The self-elicitation techniques exploit the internal attention distributions of the MLLM to automatically identify and mark salient evidence sentences within long textual contexts, enabling unsupervised evidence selection without additional annotations. Evaluations on visual question answering benchmarks demonstrate that attention-guided self-elicitation improves over the baseline MLLM. To mitigate the quadratic scaling of the attention matrices with context length during the self-elicitation phase, different strategies were tested to optimize and filter the data before passing it to the model.

All experiments are based on the Encyclopedic-VQA dataset, which uses Wikipedia pages as its external data source. In addition to the gold evidence passages, which serve as an upper-bound evaluation, external textual knowledge is retrieved from the knowledge base with two alternative retriever modules, Google Lens and EVA-CLIP, both of which return a set of candidate Wikipedia pages for each visual question. From the retrieved pages, textual content is considered at different levels of granularity, ranging from full-page contexts to smaller passage-level segments. To reduce context length and discard irrelevant information, an optional critic model is applied as a post-retrieval filtering step, selecting only the passages deemed relevant to the image-question pair before the self-elicitation inference step. This design enables controlled comparisons between page-level and passage-level contexts, with and without critic-based filtering.
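The attention-guided selection step can be pictured with a small Python sketch. It is illustrative only: the function names, the top-k selection rule, and the <evidence> markers are assumptions made for exposition, not the implementation described in the thesis, which works on the MLLM's actual attention matrices (e.g. from Qwen) rather than the random matrix used here.

    # Minimal sketch of attention-guided evidence selection, assuming the MLLM
    # exposes attention weights from the generated answer tokens over the
    # context tokens. All names are hypothetical.
    import numpy as np

    def select_evidence_sentences(attn, sentence_spans, top_k=3):
        """Score each context sentence by the attention mass the answer
        tokens place on its tokens, then keep the top_k sentences.

        attn:           (num_answer_tokens, num_context_tokens) matrix,
                        e.g. averaged over heads and layers.
        sentence_spans: list of (start, end) token index pairs per sentence.
        """
        token_scores = attn.mean(axis=0)            # mean attention per context token
        sent_scores = [token_scores[s:e].mean()     # length-normalised sentence score
                       for s, e in sentence_spans]
        ranked = np.argsort(sent_scores)[::-1]      # sentences, best first
        return sorted(ranked[:top_k])               # restore original order

    def mark_evidence(sentences, keep):
        """Wrap selected sentences in markers so a later inference pass
        can be pointed at the elicited evidence."""
        return " ".join(f"<evidence>{s}</evidence>" if i in keep else s
                        for i, s in enumerate(sentences))

    # Toy example: random attention over 3 sentences of 4 tokens each.
    rng = np.random.default_rng(0)
    attn = rng.random((5, 12))
    spans = [(0, 4), (4, 8), (8, 12)]
    sentences = ["The tower was built in 1889.",
                 "It is located in Paris.",
                 "Tickets are sold on site."]
    keep = select_evidence_sentences(attn, spans, top_k=2)
    print(mark_evidence(sentences, keep))

In the thesis's pipeline this selection happens inside a dedicated inference step and the marked context is fed back to the model for answer generation; the sketch only captures the scoring-and-marking idea, not the retrieval or critic-filtering stages.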
Keywords: RAG, Self-Elicitation, Self-Attention, VQA, Qwen
Files in this record:
Bonacorsi.Luca.pdf (Adobe PDF, 6.65 MB), under embargo until 13/08/2027
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14251/4721