Analysis and Application of Large Language Models in Data Integration.

RISTORI, PAOLO
2024/2025

Abstract

With the increasing volume and variety of data generated across systems, integrating information from multiple heterogeneous sources has become a crucial task for many organizations. This thesis investigates the use of Large Language Models (LLMs) for automating key tasks in data integration, including schema matching, entity resolution, and data fusion. By leveraging the contextual understanding and generalization capabilities of state-of-the-art models such as GPT-4.1, the study demonstrates that LLMs can outperform several established approaches with little or no supervision. A particular focus is placed on the scalability challenges of entity matching, addressed through a custom blocking mechanism and a cost-efficient three-step LLM pipeline that reduces resource consumption by 70% without compromising accuracy. The data fusion stage further highlights the ability of LLMs to resolve conflicts and synthesize reliable values using semantic reasoning and context enhanced by Retrieval-Augmented Generation (RAG). The system is deployed within a modular multi-agent architecture, promoting automation while ensuring user control and transparency. While the results are promising, limitations such as high computational costs, inference latency, and the non-deterministic nature of LLMs pose challenges to their industrial adoption. This work offers a foundational exploration of LLMs in data integration and outlines future directions for improving efficiency, scalability, and robustness in real-world applications.
LLM
Data Integration
Entity matching
Automation
RAG
Files in this item:
Ristori.Paolo.pdf, 1.28 MB, Adobe PDF (restricted access)

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14251/3658