Self-Elicited Multimodal LLMs for Knowledge-based Vision Question Answering
BONACORSI, LUCA
2024/2025
Abstract
This thesis proposes a unified framework for applying and evaluating self-elicitation strategies in dedicated inference steps of Multimodal Large Language Models (MLLMs) that rely on external knowledge for visual question answering. The self-elicitation techniques exploit the internal attention distributions of the MLLM to automatically identify and mark salient evidence sentences within long textual contexts. This approach enables unsupervised evidence selection without additional annotations. Evaluations on visual question answering benchmarks demonstrate that attention-guided self-elicitation improves over the baseline MLLM. To mitigate the quadratic scaling of the attention matrices with context length during the self-elicitation phase, several strategies are tested to optimize and filter the data before it is passed to the model. All experiments are based on the Encyclopedic-VQA dataset, which uses Wikipedia pages as its external knowledge source. In addition to the gold evidence passages, used as an upper-bound evaluation, external textual knowledge is retrieved from the knowledge base with two alternative retriever modules, Google Lens and EVA-CLIP, both of which return a set of candidate Wikipedia pages for each visual question. From the retrieved pages, textual content is considered at different levels of granularity, ranging from full-page contexts to smaller passage-level segments. To reduce context length and discard irrelevant information, an optional critic model is applied as a post-retrieval filtering step, selecting only passages deemed relevant to the image-question pair before the self-elicitation inference step. This design enables controlled comparisons between page-level and passage-level contexts, with and without critic-based filtering.
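The core idea of attention-guided evidence selection described above can be illustrated with a minimal, hypothetical sketch: per-token attention mass (assumed to be extracted from the MLLM at inference time) is aggregated into per-sentence scores, and the top-scoring sentences are marked as evidence. The function name, the length normalization, and the toy attention values are illustrative assumptions, not the thesis's actual implementation.

```python
import numpy as np

def select_evidence_sentences(sentences, token_attn, token_to_sentence, top_k=2):
    """Aggregate per-token attention mass into per-sentence scores,
    then return the indices of the top_k highest-scoring sentences.

    token_attn: attention weight per context token (assumed already
    extracted from the model and averaged over heads/layers).
    token_to_sentence: sentence index each token belongs to.
    """
    scores = np.zeros(len(sentences))
    for attn, sent_idx in zip(token_attn, token_to_sentence):
        scores[sent_idx] += attn
    # Normalize by sentence length (token count) to avoid a bias
    # toward long sentences accumulating more attention mass.
    counts = np.bincount(token_to_sentence, minlength=len(sentences))
    scores = scores / np.maximum(counts, 1)
    top = np.argsort(scores)[::-1][:top_k]
    return sorted(top.tolist())

# Toy example: 3 context sentences, 6 tokens with synthetic attention weights.
sentences = ["The tower was built in 1889.",
             "It is located in Paris.",
             "Tickets are sold online."]
token_attn = np.array([0.30, 0.25, 0.05, 0.20, 0.10, 0.10])
token_to_sentence = np.array([0, 0, 1, 1, 2, 2])
print(select_evidence_sentences(sentences, token_attn, token_to_sentence, top_k=1))
# → [0]
```

In a real pipeline the selected sentences would be marked (e.g. wrapped in delimiter tokens) and fed back to the MLLM in the dedicated self-elicitation inference step; this sketch only covers the unsupervised scoring-and-selection part.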
| File | Size | Format |
|---|---|---|
| Bonacorsi.Luca.pdf (embargo until 13/08/2027) | 6.65 MB | Adobe PDF |
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/20.500.14251/4721