Analysis and Application of Large Language Models in Data Integration.
RISTORI, PAOLO
2024/2025
Abstract
With the increasing volume and variety of data generated across systems, integrating information from multiple heterogeneous sources has become a crucial task for many organizations. This thesis investigates the use of Large Language Models (LLMs) for automating key tasks in data integration, including schema matching, entity resolution, and data fusion. By leveraging the contextual understanding and generalization capabilities of state-of-the-art models such as GPT-4.1, the study demonstrates that LLMs can outperform several established approaches with little or no supervision. A particular focus is placed on the scalability challenges of entity matching, addressed through a custom blocking mechanism and a cost-efficient three-step LLM pipeline that reduces resource consumption by 70% without compromising accuracy. The data fusion stage further highlights the ability of LLMs to resolve conflicts and synthesize reliable values using semantic reasoning and context enhanced by Retrieval-Augmented Generation (RAG). The system is deployed within a modular multi-agent architecture, promoting automation while ensuring user control and transparency. While the results are promising, limitations such as high computational costs, inference latency, and the non-deterministic nature of LLMs pose challenges to industrial adoption. This work offers a foundational exploration of LLMs in data integration and outlines future directions for improving efficiency, scalability, and robustness in real-world applications.

| File | Size | Format |
|---|---|---|
| Ristori.Paolo.pdf (restricted access) | 1.28 MB | Adobe PDF |
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/20.500.14251/3658