A vast number of tables can be found on the Web, and redundant data often appear as pairs of tables with substantial overlap. This overlap suggests important relationships between tables, such as duplicate or related tables. This thesis investigates the development of an efficient method to determine the largest overlap between two tables under order and contiguity constraints. These constraints limit our ability to freely rearrange rows and columns, making the overlap problem more representative of real-world scenarios. In fact, in many cases, a meaningful overlap can only be identified when the original order of rows and columns is preserved, ensuring that the structural and semantic relationships within the tables remain intact. To tackle this, two complementary approaches are proposed. The first exploits the classical Longest Common Subsequence and Longest Common Substring algorithms. The second constructs a mapping of all common elements between the two tables and, using pairs of indexes that respect both row and column constraints, identifies the largest common subtable. The final framework is designed to be flexible, allowing users to specify any desired constraints on rows and columns, such as ordering or contiguity, or even to set no constraints at all. In this way, the approach can adapt to different real-world requirements, ranging from fully unconstrained matching to stricter scenarios. We evaluated our methods using two real-world datasets, Wiki-tables and Git-tables, and demonstrated their efficacy in solving this task. We also observed that in some cases the first approach is more effective, while in other contexts the second approach yields better results. For this reason, we have decided to combine them in a hybrid approach, which dynamically selects which of the two methods to use based on the characteristics of the input tables.
Table Overlap under Order and Contiguity Constraints
LUPO, DAVIDE ROSARIO
2024/2025
Abstract
A vast number of tables can be found on the Web, and redundant data often appear as pairs of tables with substantial overlap. This overlap suggests important relationships between tables, such as duplicate or related tables. This thesis investigates the development of an efficient method to determine the largest overlap between two tables under order and contiguity constraints. These constraints limit our ability to freely rearrange rows and columns, making the overlap problem more representative of real-world scenarios. In fact, in many cases, a meaningful overlap can only be identified when the original order of rows and columns is preserved, ensuring that the structural and semantic relationships within the tables remain intact. To tackle this, two complementary approaches are proposed. The first exploits the classical Longest Common Subsequence and Longest Common Substring algorithms. The second constructs a mapping of all common elements between the two tables and, using pairs of indexes that respect both row and column constraints, identifies the largest common subtable. The final framework is designed to be flexible, allowing users to specify any desired constraints on rows and columns, such as ordering or contiguity, or even to set no constraints at all. In this way, the approach can adapt to different real-world requirements, ranging from fully unconstrained matching to stricter scenarios. We evaluated our methods using two real-world datasets, Wiki-tables and Git-tables, and demonstrated their efficacy in solving this task. We also observed that in some cases the first approach is more effective, while in other contexts the second approach yields better results. For this reason, we have decided to combine them in a hybrid approach, which dynamically selects which of the two methods to use based on the characteristics of the input tables.| File | Dimensione | Formato | |
|---|---|---|---|
|
Lupo.DavideRosario.pdf
accesso aperto
Dimensione
3.46 MB
Formato
Adobe PDF
|
3.46 MB | Adobe PDF | Visualizza/Apri |
I documenti in UNITESI sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.
https://hdl.handle.net/20.500.14251/4069