Data Analysis and Predictive Modeling of Vineyard Soil Properties Using Machine Learning

GAMBARELLI, FRANCESCO
2024/2025

Abstract

The digitalization of the agricultural sector is one of the strategic pillars of European policy, aimed at increasing production efficiency and environmental sustainability through the adoption of advanced technologies. The AGRARIAN project, funded by the Emilia-Romagna Region within the PR-FESR 2021-2027 program, was launched in this context. The project develops an integrated system for precision agriculture based on cooperation between Unmanned Aerial Vehicles (UAVs) and mobile ground units for vineyard monitoring. The system integrates sensors for measuring environmental and soil physico-chemical parameters, vegetation indices, algorithms for aerial image processing, and machine learning models. Together, these tools support the classification of crop health status, the prediction of yield, and the modeling of how soil conditions evolve.

This work is structured in two main phases. The first phase analyzes data collected during three different periods from soils in the municipalities of Fabbrico and Mandrio, in the province of Reggio Emilia, Emilia-Romagna. The data include reflectance measurements across various wavelengths, obtained through visible (Specim) and near-infrared (NIR) spectroscopy, as well as chemical properties determined in the laboratory on the same samples. Following a literature review of the wavelengths most relevant to soil analysis and an initial reorganization of the datasets in Excel, an exploratory data analysis (EDA) was conducted in Python, following a standard data science workflow. This analysis described the datasets in terms of central tendency and dispersion, identified significant positive and negative correlations among chemical properties and between those properties and reflectance at specific wavelengths, verified the consistency of the collected data with the literature, and detected potential outliers. Some anomalous values were subsequently treated to reduce distortions in the data distribution.

The second phase focused on the development of machine learning models to predict soil chemical properties. The models were trained using the European LUCAS 2015 dataset, appropriately filtered, and tested on the in situ datasets produced by the first phase of analysis and cleaning. Several predictive algorithms were compared, including Partial Least Squares Regression (PLSR), Random Forest (RF), and Gradient Boosting (GB). The models were first trained and tested on the LUCAS benchmark dataset, where they achieved high predictive accuracy, particularly PLSR and GB. To evaluate their generalization to real-world conditions, the same models were then trained on LUCAS and tested on the soil samples collected in situ. Predictive capability on the in situ test set was limited: many R² values were negative, meaning that predictions were often worse than simply using the mean of the observed values, and ratio of performance to deviation (RPD) values were generally below 1.2. This drastic drop in performance was primarily due to moisture in the in situ soil samples, which significantly altered the spectral measurements and made the models' predictions unreliable. PLSR maintained relatively low errors for pH, organic carbon, and potassium, while GB provided more stable predictions across variables; RF exhibited higher variability and occasional extreme errors. These results highlight both the potential of benchmark-trained models and the difficulty of transferring predictive performance to independent, real-world datasets, suggesting that additional preprocessing or alternative modeling strategies may be required for reliable quantitative soil predictions.
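
The two evaluation metrics cited in the abstract can be illustrated with a minimal Python sketch. It is not taken from the thesis: the function names, the sample pH values, and the choice of the sample standard deviation (ddof=1) in the RPD are assumptions made for illustration only. R² is computed as one minus the ratio of the residual sum of squares to the total sum of squares, and RPD as the standard deviation of the observed values divided by the root mean squared error of the predictions.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def rpd(y_true, y_pred):
    """Ratio of performance to deviation: SD(observed) / RMSE.

    Assumption: the sample standard deviation (ddof=1) is used; some
    authors use the population standard deviation instead.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return np.std(y_true, ddof=1) / rmse


# Hypothetical pH observations (illustrative values, not thesis data).
observed = np.array([7.8, 8.0, 8.1, 7.9, 8.2])

# A hypothetical prediction with a constant offset of one pH unit.
offset_prediction = observed - 1.0

# Baseline: always predict the mean of the observed values.
mean_baseline = np.full_like(observed, observed.mean())

print("offset prediction: R2 =", r_squared(observed, offset_prediction),
      "RPD =", rpd(observed, offset_prediction))
print("mean baseline:     R2 =", r_squared(observed, mean_baseline),
      "RPD =", rpd(observed, mean_baseline))
```

In this toy case the mean baseline scores an R² of about zero, while the offset prediction scores a strongly negative R² and an RPD well below 1.2. This is the pattern the abstract describes for the in situ test set: a model can rank samples sensibly and still perform worse than the mean when its predictions are systematically shifted.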
2024
machine learning
data analysis
soil data
vineyard
LUCAS
Files in this item:
Gambarelli.Francesco.pdf (Adobe PDF, 11.73 MB), under embargo until 17/02/2027

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14251/4821