
Improving Reasoning and Generalization in Large Language Models through a Hybrid Reward Approach with Group-Relative Policy Optimization

Pottocar, Edoardo
Academic Year 2024/2025

Abstract

Recent advances in Large Language Models (LLMs) have highlighted their strong generative capabilities, while also revealing persistent challenges in aligning model outputs with structured reasoning and generalization requirements. Reinforcement Learning from Human Feedback (RLHF) has emerged as an effective approach for addressing these challenges, yet widely adopted methods such as Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO) often suffer from instability, sensitivity to reward design, or limited applicability to long-horizon tasks. Group-Relative Policy Optimization (GRPO) has been proposed as an alternative reinforcement learning method that improves training stability by normalizing rewards across groups of model-generated completions. This thesis investigates the effectiveness of GRPO in enhancing reasoning capabilities and generalization in large language models. Starting from a Qwen3-4B-Base model, supervised fine-tuning is first performed on a dataset of mathematical reasoning problems to establish a stable initialization. The model is then optimized with GRPO under varying training configurations, exploring the impact of dataset size and reward function design on in-domain mathematical performance. Building on these experiments, we introduce a novel hybrid reward function that combines internal and external reward signals and achieves superior out-of-domain generalization compared to state-of-the-art approaches. Finally, to test the capabilities of this best-performing model, we apply GRPO to multi-step API reasoning tasks, which require planning, decomposition, and structured interaction with external tools. Experimental results show that GRPO not only improves in-domain reasoning performance, but also supports generalization to unseen domains and enables effective adaptation to complex, multi-step, tool-based tasks.
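To make the two mechanisms central to the abstract concrete, the following minimal Python sketch illustrates (a) GRPO's group-relative advantage, computed by normalizing each completion's reward against the mean and standard deviation of its sampled group, and (b) a hybrid reward formed as a weighted mix of an internal and an external signal. The weighting alpha, the component scores, and the function names are illustrative assumptions, not the thesis's exact formulation.

    import numpy as np

    def grpo_advantages(rewards, eps=1e-8):
        """Group-relative advantages: each completion's reward is normalized
        against the mean and standard deviation of its sampled group."""
        rewards = np.asarray(rewards, dtype=np.float64)
        return (rewards - rewards.mean()) / (rewards.std() + eps)

    def hybrid_reward(internal_score, external_score, alpha=0.5):
        """Hypothetical hybrid reward: a weighted mix of an internal signal
        (e.g., a format/consistency check) and an external signal (e.g., a
        verifier on the final answer). alpha is an illustrative assumption."""
        return alpha * internal_score + (1.0 - alpha) * external_score

    # One group of four completions sampled for the same prompt;
    # the (internal_score, external_score) pairs are made-up numbers.
    scores = [(0.8, 1.0), (0.2, 0.0), (0.9, 1.0), (0.1, 0.0)]
    group_rewards = [hybrid_reward(i, e) for i, e in scores]
    print(grpo_advantages(group_rewards))  # above-average completions get positive advantage

In practice, each group would be the set of completions sampled for a single prompt, and the normalized advantages would weight the policy-gradient update in place of a learned value baseline.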
Keywords: LLM, RL, GRPO, Reasoning, Multi-Tool

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14251/4612