Hybrid Claim Verification with Large Language Models: A Benchmark on Corporate Reports

Simone Brunelli
Academic year 2024/2025

Abstract

Corporate non-financial reports are a key resource for evaluating companies’ sustainability performance and their adherence to Environmental, Social, and Governance (ESG) principles. These reports are widely consulted by investors, regulators, and stakeholders, yet their automated analysis remains highly challenging due to heterogeneous structures, specialized terminology, and the coexistence of text with complex tables. To address these issues, the thesis introduces two benchmark datasets designed for hybrid text-and-table reasoning. The first focuses on a monotable setting, where claims are verified against a single table and its accompanying text. The second extends this framework to a multitable scenario, involving up to five interdependent tables whose values and relationships must be jointly considered. To validate the relevance of the proposed benchmarks, an evaluation was conducted using state-of-the-art Large Language Models (LLMs), including GPT-o4 mini, Qwen, and LLaMA. This evaluation highlighted that performance is modest even in the monotable setting and drops substantially when reasoning across multiple linked tables. These results emphasize both the complexity of claim verification in non-financial reports and the importance of the proposed datasets as a foundation for advancing research in hybrid reasoning.
Keywords: Claim verification, Fact checking, Large language model, ESG, Tabular reasoning
File: Brunelli.Simone.pdf (Adobe PDF, 14.5 MB), under embargo until 02/12/2026
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14251/3934