Document Understanding for Automated ESG Reporting: System Design and Experimental Validation

BERTACCHINI, GIORGIA
2024/2025

Abstract

This thesis addresses the growing demand from financial institutions and ESG-oriented (Environmental, Social, and Governance) organizations for automated solutions that improve the retrieval, interpretation, and analysis of information in corporate reports, with particular emphasis on structured financial data embedded in complex tables. Although modern Artificial Intelligence (AI) techniques have significantly advanced document analysis, extracting reliable, structured information from complex and heterogeneous corporate documents remains challenging, particularly in ESG reporting, where numerical accuracy and semantic consistency are critical to transparent sustainability reporting and data-driven decision-making. Particular attention is given to Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) architectures, which offer a promising framework for improving information access and analytical capabilities. The effectiveness of these systems, however, depends strongly on the accuracy and reliability of the underlying document data extraction, especially for heterogeneous content such as numerical data, text, and graphical indicators, including embedded symbols and icons. To address this challenge, the thesis proposes a structured pipeline for generating and processing PDF-based datasets specifically designed to evaluate document extraction tasks involving complex tabular structures. The pipeline enables the controlled generation of evaluation documents and the systematic benchmarking of document understanding tools and AI models. An experimental evaluation assesses the performance of different extraction approaches in terms of accuracy, robustness, and suitability for ESG-related analytical workflows. The results help identify the strengths and limitations of current AI-based document extraction techniques and provide insights into their integration within intelligent ESG reporting systems.
2024
ESG
Document Extraction
Tabular structures
AI
RAG
Files in this item:
Bertacchini.Giorgia.pdf (open access, 677.49 kB, Adobe PDF)

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14251/5711