
Improving Reasoning and Generalization in Large Language Models through a Hybrid Reward Approach with Group-Relative Policy Optimization

Pottocar, Edoardo
Academic Year 2024/2025

Abstract

Recent advances in Large Language Models (LLMs) have highlighted their strong generative capabilities, while also revealing persistent challenges in aligning model outputs with structured reasoning and generalization requirements. Reinforcement Learning from Human Feedback (RLHF) has emerged as an effective approach for addressing these challenges, yet widely adopted methods such as Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO) often suffer from instability, sensitivity to reward design, or limited applicability to long-horizon tasks. Group-Relative Policy Optimization (GRPO) has been proposed as an alternative reinforcement learning method that improves training stability by normalizing rewards across groups of model-generated completions. This thesis investigates the effectiveness of GRPO in enhancing reasoning capabilities and generalization in large language models. Starting from a Qwen3-4B-Base model, supervised fine-tuning is first performed on a dataset of mathematical reasoning problems to establish a stable initialization. The model is then optimized with GRPO under varying training configurations, exploring the impact of dataset size and reward function design on in-domain mathematical performance. Building on these experiments, we introduce a novel hybrid reward function that combines internal and external reward signals and achieves superior out-of-domain generalization compared to state-of-the-art approaches. Finally, to test the capabilities of this best-performing model, we apply GRPO to multi-step API reasoning tasks, which require planning, decomposition, and structured interaction with external tools. Experimental results show that GRPO not only improves in-domain reasoning performance, but also supports generalization to unseen domains and enables effective adaptation to complex, multi-step, tool-based tasks.
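To make the two mechanisms central to the abstract concrete, the following minimal Python sketch illustrates (a) GRPO's group-relative advantage, computed by normalizing each completion's reward against the mean and standard deviation of its sampled group, and (b) a hybrid reward formed as a weighted mix of an internal and an external signal. The weighting alpha, the component scores, and the function names are illustrative assumptions, not the thesis's exact formulation.

    import numpy as np

    def grpo_advantages(rewards, eps=1e-8):
        """Group-relative advantages: each completion's reward is normalized
        against the mean and standard deviation of its sampled group."""
        rewards = np.asarray(rewards, dtype=np.float64)
        return (rewards - rewards.mean()) / (rewards.std() + eps)

    def hybrid_reward(internal_score, external_score, alpha=0.5):
        """Hypothetical hybrid reward: a weighted mix of an internal signal
        (e.g., a format/consistency check) and an external signal (e.g., a
        verifier on the final answer). alpha is an illustrative assumption."""
        return alpha * internal_score + (1.0 - alpha) * external_score

    # One group of four completions sampled for the same prompt;
    # the (internal_score, external_score) pairs are made-up numbers.
    scores = [(0.8, 1.0), (0.2, 0.0), (0.9, 1.0), (0.1, 0.0)]
    group_rewards = [hybrid_reward(i, e) for i, e in scores]
    print(grpo_advantages(group_rewards))  # above-average completions get positive advantage

In practice, each group would be the set of completions sampled for a single prompt, and the normalized advantages would weight the policy-gradient update in place of a learned value baseline.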
Keywords: LLM, RL, GRPO, Reasoning, Multi-Tool

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14251/4612