Enhancing Causal Discovery with Large Language Models and Retrieval-Augmented Generation: A Metadata-Driven Approach

BULGARELLI, MATTEO
2024/2025

Abstract

Causal discovery is a fundamental task in artificial intelligence, traditionally addressed through data-driven statistical methods such as PC, GES, LiNGAM, and NOTEARS. While these algorithms excel at identifying patterns in numerical data, they often struggle with interpretability and the integration of domain-specific context. Recent advancements have seen Large Language Models (LLMs) employed either as validators for statistically derived "skeleton" graphs or as standalone discovery tools leveraging their internal latent knowledge. However, these approaches remain limited by the models' parametric memory and potential hallucinations. This thesis proposes a novel pipeline for causal discovery that utilizes LLMs to extract causal graphs exclusively from metadata—specifically, the names and descriptions of variables within a given knowledge area (e.g., Asia, Cancer, Barley, Additive Manufacturing). The core innovation lies in the integration of Retrieval-Augmented Generation (RAG) to ground the model’s reasoning in specific scientific literature. By retrieving relevant context directly from the original papers associated with the Bayesian network datasets, the system enriches the initial metadata and guides the discovery process. The research explores two primary RAG-based strategies: first, using retrieved information to augment variable descriptions before prompting; and second, leveraging RAG to deduce causal relationships directly from the text by providing the model with both metadata and targeted paper snippets. Furthermore, the study evaluates the impact of different prompting strategies, including one-shot and few-shot learning, to optimize the structural accuracy of the generated graphs. 
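The first RAG-based strategy described above, enriching variable descriptions with retrieved literature snippets before prompting, can be sketched as follows. This is an illustrative sketch only: the function names, the prompt wording, and the toy variables are assumptions for demonstration, not the thesis's actual implementation, and the retrieval step is stubbed out with a precomputed snippet map.

```python
# Sketch of RAG strategy 1: augment variable metadata with retrieved
# paper snippets, then build a causal-discovery prompt for the LLM.
# All names and wording here are hypothetical.

def augment_descriptions(variables, snippets_by_var):
    """Merge each variable's description with retrieved literature context."""
    enriched = {}
    for name, description in variables.items():
        context = " ".join(snippets_by_var.get(name, []))
        enriched[name] = f"{description} Context: {context}".strip()
    return enriched

def build_discovery_prompt(enriched):
    """Assemble a prompt asking the model to list directed causal edges."""
    lines = [f"- {name}: {desc}" for name, desc in sorted(enriched.items())]
    return (
        "Given the following variables, list all direct causal edges "
        "as 'A -> B' pairs.\n" + "\n".join(lines)
    )

# Toy metadata in the spirit of the Cancer benchmark.
variables = {
    "smoking": "Whether the patient is a smoker.",
    "lung_cancer": "Presence of lung cancer.",
}
snippets = {"smoking": ["Smoking is a major risk factor for lung cancer."]}

prompt = build_discovery_prompt(augment_descriptions(variables, snippets))
```

In a full pipeline, `snippets_by_var` would be produced by a retriever over the original papers associated with each Bayesian network, and `prompt` would be sent to the LLM together with the chosen one-shot or few-shot examples.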
Preliminary results indicate that the integration of RAG significantly improves the alignment between the extracted graphs and the ground-truth Bayesian networks, demonstrating that grounded textual evidence is a powerful complement to traditional and purely LLM-based causal discovery methods.
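Alignment between an extracted graph and a ground-truth Bayesian network is commonly scored with edge precision, edge recall, and the structural Hamming distance (SHD) over directed edges. The sketch below shows one standard way to compute these metrics; the toy edges are illustrative stand-ins for the benchmark networks, not results from the thesis.

```python
# Minimal sketch of structural-accuracy scoring for a predicted causal
# graph against a ground-truth edge set. A reversed edge counts as a
# single SHD error rather than one deletion plus one insertion.

def edge_metrics(predicted, truth):
    predicted, truth = set(predicted), set(truth)
    tp = len(predicted & truth)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    # Edges present in both graphs but with flipped orientation.
    reversed_edges = {(b, a) for (a, b) in predicted} & truth
    shd = len(predicted ^ truth) - len(reversed_edges)
    return precision, recall, shd

truth = [("smoking", "lung_cancer"), ("lung_cancer", "dyspnoea")]
predicted = [("smoking", "lung_cancer"), ("dyspnoea", "lung_cancer")]
p, r, shd = edge_metrics(predicted, truth)
# One correct edge, one reversed edge: precision 0.5, recall 0.5, SHD 1.
```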
Causal Discovery
LLM
RAG
Bayesian Networks
Prompt Engineering
Files in this item:
Bulgarelli.Matteo.pdf — open access — 2.4 MB — Adobe PDF

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14251/5321