From Tables to Questions: Automatic Benchmark Construction for Multi-Table Question Answering

DE BELLIS, ELENA MARIA
2024/2025

Abstract

Current benchmarks for Question Answering (QA) on tabular data mainly focus on reasoning within single tables, while multi-table reasoning remains largely underexplored. Existing multi-table benchmarks suffer from significant limitations: they are often domain-specific, support only a limited range of queries, and involve a relatively small set of tables. Furthermore, the manual annotation required to construct such datasets is highly time-consuming. These limitations highlight the need for methods that can automatically generate multi-table benchmarks without human annotation, support more diverse question types, and enable reasoning across broader and more complex collections of tables.

This thesis proposes a pipeline for the automatic construction of multi-table question answering datasets starting from existing single-table benchmarks such as FinQA, TAT-QA, and GRI-QA. Tables are first converted into a standardized textual representation and clustered by the similarity of their headers using sentence transformers. Within each cluster, rows are compared using sentence transformers, while columns are compared using Jaccard similarity. Once correspondences are identified, values are sampled and combined through reasoning operations such as extraction, comparison, and aggregation. To ensure consistent reasoning across tables, the pipeline also detects and standardizes units of measurement, addressing a common source of error in numerical reasoning over heterogeneous sources. For each generated instance, the computed value serves as the answer and is used, together with the tables, to guide a Large Language Model in generating a natural-language question. This workflow enables the scalable creation of multi-table examples covering diverse reasoning patterns.

Finally, the thesis evaluates the quality of the generated dataset through both automated analysis with Large Language Models and human evaluation. The results indicate that automatic benchmark construction is a viable and scalable approach for expanding the evaluation of multi-table question answering systems. At the same time, the experiments show that current Large Language Models achieve strong performance on simpler, more structurally regular instances, while they remain less reliable when questions require complex reasoning across multiple tables, particularly in the presence of aggregation, larger sets of tables, and heterogeneous units.
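As a rough illustration of the table-matching steps described in the abstract, the following Python sketch pairs tables whose headers have similar sentence-transformer embeddings and then matches columns by Jaccard similarity over their value sets. The sample tables, the model name, and the similarity thresholds are assumptions made for this example, not the configuration used in the thesis.

from itertools import combinations

from sentence_transformers import SentenceTransformer, util

# Toy input: two tables given as header + rows. The names and contents
# are invented for this sketch and do not come from the thesis.
tables = {
    "finqa_0001": {
        "header": ["Year", "Revenue", "Net income"],
        "rows": [["2018", "1,200", "300"], ["2019", "1,450", "360"]],
    },
    "tatqa_0042": {
        "header": ["Fiscal year", "Total revenue", "Profit"],
        "rows": [["2018", "2,100", "500"], ["2019", "2,300", "540"]],
    },
}

# Assumed embedding model; any sentence-transformers checkpoint works here.
model = SentenceTransformer("all-MiniLM-L6-v2")

# Step 1: embed each header as a single string and pair up tables whose
# header embeddings are close (a threshold stands in for full clustering).
names = list(tables)
header_texts = [" | ".join(tables[n]["header"]) for n in names]
embeddings = model.encode(header_texts, convert_to_tensor=True)

candidate_pairs = [
    (names[a], names[b])
    for a, b in combinations(range(len(names)), 2)
    if util.cos_sim(embeddings[a], embeddings[b]).item() > 0.7  # assumed cutoff
]

# Step 2: within a candidate pair, match columns by Jaccard similarity
# over their sets of cell values.
def jaccard(col_a, col_b):
    sa, sb = set(col_a), set(col_b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

for name_a, name_b in candidate_pairs:
    ta, tb = tables[name_a], tables[name_b]
    cols_a = list(zip(*ta["rows"]))  # transpose rows into columns
    cols_b = list(zip(*tb["rows"]))
    for i, ca in enumerate(cols_a):
        for j, cb in enumerate(cols_b):
            score = jaccard(ca, cb)
            if score > 0.5:  # assumed cutoff
                print(f"{name_a}.{ta['header'][i]} ~ "
                      f"{name_b}.{tb['header'][j]} (Jaccard {score:.2f})")

In the full pipeline, the pairwise threshold above would be replaced by proper clustering over all header embeddings, and row correspondences would be established with the same embedding model before values are sampled and combined into answers.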
Keywords

Question generation
Generalization
Clustering
Multi-hop QA
Table QA
Files in this item:

File: DeBellis.ElenaMaria.pdf (under embargo until 08/04/2029)
Size: 1.72 MB
Format: Adobe PDF
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14251/5358