Analysis and Application of Large Language Models in Data Integration.
RISTORI, PAOLO
2024/2025
Abstract
With the increasing volume and variety of data generated across systems, integrating information from multiple heterogeneous sources has become a crucial task for many organizations. This thesis investigates the use of Large Language Models (LLMs) for automating key tasks in data integration, including schema matching, entity resolution, and data fusion. By leveraging the contextual understanding and generalization capabilities of state-of-the-art models such as GPT-4.1, the study demonstrates that LLMs can outperform several established approaches with little or no supervision. A particular focus is placed on the scalability challenges of entity matching, addressed through a custom blocking mechanism and a cost-efficient three-step LLM pipeline that reduces resource consumption by 70% without compromising accuracy. The data fusion stage further highlights the ability of LLMs to resolve conflicts and synthesize reliable values using semantic reasoning and context enhanced by Retrieval-Augmented Generation (RAG). The system is deployed within a modular multi-agent architecture, promoting automation while ensuring user control and transparency. While the results are promising, limitations such as high computational costs, inference latency, and the non-deterministic nature of LLMs pose challenges to industrial adoption. This work offers a foundational exploration of LLMs in data integration and outlines future directions for improving efficiency, scalability, and robustness in real-world applications.

| File | Size | Format |
|---|---|---|
| Ristori.Paolo.pdf (restricted access) | 1.28 MB | Adobe PDF |
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/20.500.14251/3658