Table Overlap under Order and Contiguity Constraints

A vast number of tables can be found on the Web, and redundant data often appear as pairs of tables with substantial overlap. This overlap suggests important relationships between tables, such as duplicate or related tables. This thesis investigates the development of an efficient method to determine the largest overlap between two tables under order and contiguity constraints. These constraints limit our ability to freely rearrange rows and columns, making the overlap problem more representative of real-world scenarios. In fact, in many cases, a meaningful overlap can only be identified when the original order of rows and columns is preserved, ensuring that the structural and semantic relationships within the tables remain intact. To tackle this, two complementary approaches are proposed. The first exploits the classical Longest Common Subsequence and Longest Common Substring algorithms. The second constructs a mapping of all common elements between the two tables and, using pairs of indexes that respect both row and column constraints, identifies the largest common subtable. The final framework is designed to be flexible, allowing users to specify any desired constraints on rows and columns, such as ordering or contiguity, or even to set no constraints at all. In this way, the approach can adapt to different real-world requirements, ranging from fully unconstrained matching to stricter scenarios. We evaluated our methods using two real-world datasets, Wiki-tables and Git-tables, and demonstrated their efficacy in solving this task. We also observed that in some cases the first approach is more effective, while in other contexts the second approach yields better results. For this reason, we have decided to combine them in a hybrid approach, which dynamically selects which of the two methods to use based on the characteristics of the input tables.

Table Overlap under Order and Contiguity Constraints

LUPO, DAVIDE ROSARIO

2024/2025

Abstract

A vast number of tables can be found on the Web, and redundant data often appear as pairs of tables with substantial overlap. This overlap suggests important relationships between tables, such as duplicate or related tables. This thesis investigates the development of an efficient method to determine the largest overlap between two tables under order and contiguity constraints. These constraints limit our ability to freely rearrange rows and columns, making the overlap problem more representative of real-world scenarios. In fact, in many cases, a meaningful overlap can only be identified when the original order of rows and columns is preserved, ensuring that the structural and semantic relationships within the tables remain intact. To tackle this, two complementary approaches are proposed. The first exploits the classical Longest Common Subsequence and Longest Common Substring algorithms. The second constructs a mapping of all common elements between the two tables and, using pairs of indexes that respect both row and column constraints, identifies the largest common subtable. The final framework is designed to be flexible, allowing users to specify any desired constraints on rows and columns, such as ordering or contiguity, or even to set no constraints at all. In this way, the approach can adapt to different real-world requirements, ranging from fully unconstrained matching to stricter scenarios. We evaluated our methods using two real-world datasets, Wiki-tables and Git-tables, and demonstrated their efficacy in solving this task. We also observed that in some cases the first approach is more effective, while in other contexts the second approach yields better results. For this reason, we have decided to combine them in a hybrid approach, which dynamically selects which of the two methods to use based on the characteristics of the input tables.

Scheda breve

Scheda completa

Scheda completa (DC)

	Facoltà/Dipartimento
	
				Dipartimento di Ingegneria "Enzo Ferrari"
			
	Corso di studio
	
				Ingegneria informatica
			
	Anno Accademico
	
				2024
			
	Parola chiave
	
				Table overlap
Order and contiguity
Constraints
Data redundancy
Algorithms
			
	Relatore
	
				SIMONINI, GIOVANNI
			
	Appare nelle tipologie:
	
				Lauree Magistrali

File in questo prodotto:

File	Dimensione	Formato
Lupo.DavideRosario.pdf accesso aperto Dimensione 3.46 MB Formato Adobe PDF Visualizza/Apri	3.46 MB	Adobe PDF	Visualizza/Apri

I documenti in UNITESI sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/20.500.14251/4069