Hybrid Claim Verification with Large Language Models: A Benchmark on Corporate Reports
BRUNELLI, SIMONE
2024/2025
Abstract
Corporate non-financial reports are a key resource for evaluating companies’ sustainability performance and their adherence to Environmental, Social, and Governance (ESG) principles. These reports are widely consulted by investors, regulators, and stakeholders, yet their automated analysis remains highly challenging due to heterogeneous structures, specialized terminology, and the coexistence of text with complex tables. To address these issues, the thesis introduces two benchmark datasets designed for hybrid text-and-table reasoning. The first focuses on a monotable setting, where claims are verified against a single table and its accompanying text. The second extends this framework to a multitable scenario, involving up to five interdependent tables whose values and relationships must be jointly considered. To validate the relevance of the proposed benchmarks, an evaluation was conducted using state-of-the-art Large Language Models (LLMs), including GPT-o4 mini, Qwen, and LLaMA. This evaluation highlighted that performance is modest even in the monotable setting and drops substantially when reasoning across multiple linked tables. These results emphasize both the complexity of claim verification in non-financial reports and the importance of the proposed datasets as a foundation for advancing research in hybrid reasoning.
| File | Size | Format |
|---|---|---|
| Brunelli.Simone.pdf (under embargo until 02/12/2026) | 14.5 MB | Adobe PDF |
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/20.500.14251/3934