A Practical Framework for Literature-to-Dataset Construction and Structured Extraction

Polyphenol composition data are essential for nutrition research, exposure assessment, and food databases, yet quantitative evidence is overwhelmingly reported in heterogeneous scientific PDFs. This thesis addresses the practical bottleneck of turning literature into structured records by proposing an auditable and reproducible pipeline that converts papers into supervised training data for constrained information extraction. Starting from a curated seed list of Phenol-Explorer references, the pipeline resolves identifiers, retrieves available PDFs under realistic access constraints, and parses documents into a unified intermediate representation suitable for both tables and paragraphs. Evidence is then split into atomic “chunks” with stable provenance identifiers, triaged for annotation, and labeled under a strict line-based schema: food|compound|value|unit, or NONE when extraction is unsupported or unreliable. To reduce downstream ambiguity and enable deterministic evaluation, the supervision protocol enforces faithful copying of entities, values, and units as they appear in the extracted text, while explicitly incorporating negative examples to teach abstention and limit hallucinations. The resulting dataset is assembled with consistency checks, deduplication, and controlled balancing of NONE, then exported to chat-style JSONL for instruction tuning. On the modelling side, a domain-adapted extractor is obtained by parameter-efficient fine-tuning (QLoRA) of BioMistral-7B, leveraging 4-bit quantization to fit long-context training within practical hardware constraints. The outcome of this work is not a fully populated public database, but a modular framework and extractor model that enable faster human-in-the-loop curation cycles, scalable updates as new papers appear, and structured datasets for downstream analysis. The thesis also discusses key failure modes in PDF and table parsing that affect quantitative extraction reliability and motivates design choices that prioritize traceability over aggressive normalization.

A Practical Framework for Literature-to-Dataset Construction and Structured Extraction

MEOLI, ROCCO

2024/2025

Abstract

Polyphenol composition data are essential for nutrition research, exposure assessment, and food databases, yet quantitative evidence is overwhelmingly reported in heterogeneous scientific PDFs. This thesis addresses the practical bottleneck of turning literature into structured records by proposing an auditable and reproducible pipeline that converts papers into supervised training data for constrained information extraction. Starting from a curated seed list of Phenol-Explorer references, the pipeline resolves identifiers, retrieves available PDFs under realistic access constraints, and parses documents into a unified intermediate representation suitable for both tables and paragraphs. Evidence is then split into atomic “chunks” with stable provenance identifiers, triaged for annotation, and labeled under a strict line-based schema: food|compound|value|unit, or NONE when extraction is unsupported or unreliable. To reduce downstream ambiguity and enable deterministic evaluation, the supervision protocol enforces faithful copying of entities, values, and units as they appear in the extracted text, while explicitly incorporating negative examples to teach abstention and limit hallucinations. The resulting dataset is assembled with consistency checks, deduplication, and controlled balancing of NONE, then exported to chat-style JSONL for instruction tuning. On the modelling side, a domain-adapted extractor is obtained by parameter-efficient fine-tuning (QLoRA) of BioMistral-7B, leveraging 4-bit quantization to fit long-context training within practical hardware constraints. The outcome of this work is not a fully populated public database, but a modular framework and extractor model that enable faster human-in-the-loop curation cycles, scalable updates as new papers appear, and structured datasets for downstream analysis. The thesis also discusses key failure modes in PDF and table parsing that affect quantitative extraction reliability and motivates design choices that prioritize traceability over aggressive normalization.

Scheda breve

Scheda completa

Scheda completa (DC)

	Facoltà/Dipartimento
	
				Dipartimento di Ingegneria "Enzo Ferrari"
			
	Corso di studio
	
				Artificial intelligence engineering
			
	Anno Accademico
	
				2024
			
	Parola chiave
	
				Polyphenols
I.E.
QLoRA
PDF Parsing
Instruction Tuning
			
	Relatore
	
				LOVINO, MARTA
			
	Controrelatore
	
				FICARRA, ELISA
			
	Appare nelle tipologie:
	
				Lauree Magistrali

File in questo prodotto:

File	Dimensione	Formato
Meoli.Rocco.pdf embargo fino al 10/02/2029 Dimensione 1.38 MB Formato Adobe PDF	1.38 MB	Adobe PDF

I documenti in UNITESI sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/20.500.14251/4724