Polyphenol composition data are essential for nutrition research, exposure assessment, and food databases, yet quantitative evidence is overwhelmingly reported in heterogeneous scientific PDFs. This thesis addresses the practical bottleneck of turning literature into structured records by proposing an auditable and reproducible pipeline that converts papers into supervised training data for constrained information extraction. Starting from a curated seed list of Phenol-Explorer references, the pipeline resolves identifiers, retrieves available PDFs under realistic access constraints, and parses documents into a unified intermediate representation suitable for both tables and paragraphs. Evidence is then split into atomic “chunks” with stable provenance identifiers, triaged for annotation, and labeled under a strict line-based schema: food|compound|value|unit, or NONE when extraction is unsupported or unreliable. To reduce downstream ambiguity and enable deterministic evaluation, the supervision protocol enforces faithful copying of entities, values, and units as they appear in the extracted text, while explicitly incorporating negative examples to teach abstention and limit hallucinations. The resulting dataset is assembled with consistency checks, deduplication, and controlled balancing of NONE, then exported to chat-style JSONL for instruction tuning. On the modelling side, a domain-adapted extractor is obtained by parameter-efficient fine-tuning (QLoRA) of BioMistral-7B, leveraging 4-bit quantization to fit long-context training within practical hardware constraints. The outcome of this work is not a fully populated public database, but a modular framework and extractor model that enable faster human-in-the-loop curation cycles, scalable updates as new papers appear, and structured datasets for downstream analysis. The thesis also discusses key failure modes in PDF and table parsing that affect quantitative extraction reliability and motivates design choices that prioritize traceability over aggressive normalization.

A Practical Framework for Literature-to-Dataset Construction and Structured Extraction

MEOLI, ROCCO
2024/2025

Abstract

Polyphenol composition data are essential for nutrition research, exposure assessment, and food databases, yet quantitative evidence is overwhelmingly reported in heterogeneous scientific PDFs. This thesis addresses the practical bottleneck of turning literature into structured records by proposing an auditable and reproducible pipeline that converts papers into supervised training data for constrained information extraction. Starting from a curated seed list of Phenol-Explorer references, the pipeline resolves identifiers, retrieves available PDFs under realistic access constraints, and parses documents into a unified intermediate representation suitable for both tables and paragraphs. Evidence is then split into atomic “chunks” with stable provenance identifiers, triaged for annotation, and labeled under a strict line-based schema: food|compound|value|unit, or NONE when extraction is unsupported or unreliable. To reduce downstream ambiguity and enable deterministic evaluation, the supervision protocol enforces faithful copying of entities, values, and units as they appear in the extracted text, while explicitly incorporating negative examples to teach abstention and limit hallucinations. The resulting dataset is assembled with consistency checks, deduplication, and controlled balancing of NONE, then exported to chat-style JSONL for instruction tuning. On the modelling side, a domain-adapted extractor is obtained by parameter-efficient fine-tuning (QLoRA) of BioMistral-7B, leveraging 4-bit quantization to fit long-context training within practical hardware constraints. The outcome of this work is not a fully populated public database, but a modular framework and extractor model that enable faster human-in-the-loop curation cycles, scalable updates as new papers appear, and structured datasets for downstream analysis. The thesis also discusses key failure modes in PDF and table parsing that affect quantitative extraction reliability and motivates design choices that prioritize traceability over aggressive normalization.
2024
Polyphenols
I.E.
QLoRA
PDF Parsing
Instruction Tuning
File in questo prodotto:
File Dimensione Formato  
Meoli.Rocco.pdf

embargo fino al 10/02/2029

Dimensione 1.38 MB
Formato Adobe PDF
1.38 MB Adobe PDF

I documenti in UNITESI sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/20.500.14251/4724