Enhancing Causal Discovery with Large Language Models and Retrieval-Augmented Generation: A Metadata-Driven Approach

BULGARELLI, MATTEO
2024/2025

Abstract

Causal discovery is a fundamental task in artificial intelligence, traditionally addressed through data-driven statistical methods such as PC, GES, LiNGAM, and NOTEARS. While these algorithms excel at identifying patterns in numerical data, they often struggle with interpretability and the integration of domain-specific context. Recent advancements have seen Large Language Models (LLMs) employed either as validators for statistically derived "skeleton" graphs or as standalone discovery tools leveraging their internal latent knowledge. However, these approaches remain limited by the models' parametric memory and potential hallucinations. This thesis proposes a novel pipeline for causal discovery that utilizes LLMs to extract causal graphs exclusively from metadata—specifically, the names and descriptions of variables within a given knowledge area (e.g., Asia, Cancer, Barley, Additive Manufacturing). The core innovation lies in the integration of Retrieval-Augmented Generation (RAG) to ground the model’s reasoning in specific scientific literature. By retrieving relevant context directly from the original papers associated with the Bayesian network datasets, the system enriches the initial metadata and guides the discovery process. The research explores two primary RAG-based strategies: first, using retrieved information to augment variable descriptions before prompting; and second, leveraging RAG to deduce causal relationships directly from the text by providing the model with both metadata and targeted paper snippets. Furthermore, the study evaluates the impact of different prompting strategies, including one-shot and few-shot learning, to optimize the structural accuracy of the generated graphs. 
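The first RAG-based strategy described above, enriching variable descriptions with retrieved literature snippets before prompting, can be sketched as follows. This is an illustrative sketch only: the function names, the prompt wording, and the toy variables are assumptions for demonstration, not the thesis's actual implementation, and the retrieval step is stubbed out with a precomputed snippet map.

```python
# Sketch of RAG strategy 1: augment variable metadata with retrieved
# paper snippets, then build a causal-discovery prompt for the LLM.
# All names and wording here are hypothetical.

def augment_descriptions(variables, snippets_by_var):
    """Merge each variable's description with retrieved literature context."""
    enriched = {}
    for name, description in variables.items():
        context = " ".join(snippets_by_var.get(name, []))
        enriched[name] = f"{description} Context: {context}".strip()
    return enriched

def build_discovery_prompt(enriched):
    """Assemble a prompt asking the model to list directed causal edges."""
    lines = [f"- {name}: {desc}" for name, desc in sorted(enriched.items())]
    return (
        "Given the following variables, list all direct causal edges "
        "as 'A -> B' pairs.\n" + "\n".join(lines)
    )

# Toy metadata in the spirit of the Cancer benchmark.
variables = {
    "smoking": "Whether the patient is a smoker.",
    "lung_cancer": "Presence of lung cancer.",
}
snippets = {"smoking": ["Smoking is a major risk factor for lung cancer."]}

prompt = build_discovery_prompt(augment_descriptions(variables, snippets))
```

In a full pipeline, `snippets_by_var` would be produced by a retriever over the original papers associated with each Bayesian network, and `prompt` would be sent to the LLM together with the chosen one-shot or few-shot examples.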
Preliminary results indicate that the integration of RAG significantly improves the alignment between the extracted graphs and the ground-truth Bayesian networks, demonstrating that grounded textual evidence is a powerful complement to traditional and purely LLM-based causal discovery methods.
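Alignment between an extracted graph and a ground-truth Bayesian network is commonly scored with edge precision, edge recall, and the structural Hamming distance (SHD) over directed edges. The sketch below shows one standard way to compute these metrics; the toy edges are illustrative stand-ins for the benchmark networks, not results from the thesis.

```python
# Minimal sketch of structural-accuracy scoring for a predicted causal
# graph against a ground-truth edge set. A reversed edge counts as a
# single SHD error rather than one deletion plus one insertion.

def edge_metrics(predicted, truth):
    predicted, truth = set(predicted), set(truth)
    tp = len(predicted & truth)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    # Edges present in both graphs but with flipped orientation.
    reversed_edges = {(b, a) for (a, b) in predicted} & truth
    shd = len(predicted ^ truth) - len(reversed_edges)
    return precision, recall, shd

truth = [("smoking", "lung_cancer"), ("lung_cancer", "dyspnoea")]
predicted = [("smoking", "lung_cancer"), ("dyspnoea", "lung_cancer")]
p, r, shd = edge_metrics(predicted, truth)
# One correct edge, one reversed edge: precision 0.5, recall 0.5, SHD 1.
```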
Causal Discovery
LLM
RAG
Bayesian Networks
Prompt Engineering
Files in this item:
Bulgarelli.Matteo.pdf — open access — 2.4 MB — Adobe PDF

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14251/5321