Optimization of Retrieval-Augmented Generation Pipelines: A Comparative Analysis of Chunking, Retrieval, and Enrichment Strategies

IMAD, AYOUB
2024/2025

Abstract

In this thesis, we investigate the design factors that shape the effectiveness of Retrieval-Augmented Generation (RAG) systems, with a particular focus on the interplay between chunking strategies, retrieval configurations, and enrichment mechanisms. The study aims to improve both the performance and adaptability of RAG pipelines through a systematic set of comparative experiments. We begin by implementing and evaluating a wide spectrum of chunking strategies, ranging from simple fixed-size segmentation to more advanced approaches that incorporate structural and semantic information. Each method is analyzed in terms of its downstream impact on retrieval, with specific attention to how chunk boundaries influence the relevance of the results. In parallel, we examine different retrieval configurations, including dense semantic search, hybrid models that combine sparse and dense signals, and recent methods such as HyDE and RAG Fusion. In addition to retrieval, we investigate enrichment mechanisms designed to improve the retrievability of individual chunks by appending contextual or metadata-related information. Despite their apparent simplicity, these augmentations frequently yield measurable improvements, underlining the importance of information framing in retrieval performance. To capture the interactions between chunking, retrieval, and enrichment, we conduct a series of empirical evaluations under varying conditions. Rather than seeking a universally optimal configuration, the objective is to characterize the trade-offs and contextual dependencies that emerge across different settings. The findings indicate that performance is often contingent on task-specific and structural factors, and that the most effective results are achieved when strategies are applied in a targeted and adaptive manner. Ultimately, this work contributes practical insights and guidelines for designing robust, efficient, and context-aware RAG pipelines.
2024
RAG
Chunking
Retrieval
Enrichment
GenAI
Files in this item:
File: Imad.Ayoub.pdf
Access: Restricted
Size: 10.72 MB
Format: Adobe PDF

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14251/3679