RL Fine-Tuning of Language Models with GenRM-CoT

Applying reinforcement learning techniques to improve language model reasoning through Generative Reward Models with Chain-of-Thought verification

Introduction

Large Language Models (LLMs) have demonstrated impressive capabilities, yet their mathematical reasoning often remains unreliable. This project explores how reinforcement learning fine-tuning can improve language model reasoning by training Generative Reward Models (GenRM) that verify solutions using Chain-of-Thought (CoT) reasoning.

We investigate a three-stage pipeline — Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and GenRM — applied to Qwen 2.5 0.5B, targeting mathematical problem-solving on the GSM8K and MATH benchmarks.

Background

Generative Reward Models

Traditional reward models in RLHF assign scalar scores to outputs. GenRM instead frames verification as a next-token prediction task — the model generates a CoT verification trace explaining why a solution is correct or incorrect, then emits a final judgment. This approach leverages the model's own reasoning capabilities for self-verification.
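To make the framing concrete, here is a minimal sketch of how verification-as-next-token-prediction can be wired up: a prompt template asks the verifier to reason before judging, and the final judgment is read off the trace. The template wording and function names are illustrative assumptions, not the project's actual API.

```python
def build_verification_prompt(problem: str, solution: str) -> str:
    # Hypothetical GenRM-CoT template: the verifier reasons step by
    # step, then commits to a final "Yes" or "No" judgment token.
    return (
        f"Problem: {problem}\n"
        f"Candidate solution: {solution}\n"
        "Let's verify step by step, then answer Yes or No.\n"
        "Verification:"
    )

def parse_judgment(verification_trace: str):
    # Read the binary judgment off the last Yes/No in the generated
    # trace; return None if the model never committed to an answer.
    for token in reversed(verification_trace.split()):
        word = token.strip(".,!:").lower()
        if word in ("yes", "no"):
            return word == "yes"
    return None
```

Because the judgment is just another generated token, its likelihood under the model can also be used as a soft correctness score rather than a hard binary label.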

Direct Preference Optimization

DPO simplifies RLHF by directly optimizing the policy from preference data without training a separate reward model. Given preferred and dispreferred response pairs, DPO adjusts the model to increase the likelihood of preferred responses relative to dispreferred ones:

\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l)} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]
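The loss above reduces to a simple expression per preference pair. A minimal sketch in plain Python, assuming each argument is the summed token log-probability of the full response under the policy or the frozen reference model (names are illustrative):

```python
import math

def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss: -log sigmoid(beta * (policy margin - reference margin)).

    logp_w / logp_l: log pi_theta(y_w | x) and log pi_theta(y_l | x)
    ref_logp_w / ref_logp_l: the same quantities under pi_ref.
    """
    logits = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(x)) written stably as log(1 + exp(-x))
    return math.log1p(math.exp(-logits))
```

When the policy matches the reference, the logits are zero and the loss is log 2; the loss falls as the policy shifts probability toward the preferred response relative to the reference.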

Methodology

Training Pipeline

Our three-stage approach:

  1. Stage 1 — SFT: Fine-tune Qwen 2.5 0.5B on curated math reasoning demonstrations to establish a strong baseline for structured mathematical output.

  2. Stage 2 — DPO: Train on preference pairs where correct solutions are preferred over incorrect ones, teaching the model to distinguish quality reasoning.

  3. Stage 3 — GenRM-CoT: Train the model to generate verification traces — given a problem and candidate solution, produce a CoT explanation evaluating correctness, then output a binary judgment.

Data Pipeline

  • SFT Data: Curated mathematical reasoning demonstrations with step-by-step solutions
  • DPO Pairs: Generated by sampling multiple solutions per problem, pairing correct solutions (verified by ground truth) against incorrect ones
  • GenRM Training Data: Problem-solution pairs annotated with verification CoT traces and correctness labels
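The DPO pair construction step can be sketched as follows: sample several solutions per problem, split them by ground-truth correctness, and pair winners against losers. The helper names and the `max_pairs` cap are assumptions for illustration, not the project's actual pipeline code.

```python
import itertools

def make_dpo_pairs(problem, samples, gold_answer, extract_answer, max_pairs=4):
    """Pair ground-truth-verified correct samples against incorrect ones.

    samples: candidate solutions drawn from the SFT model.
    extract_answer: pulls the final answer from a solution string
    (hypothetical helper; any answer-extraction routine works here).
    """
    correct = [s for s in samples if extract_answer(s) == gold_answer]
    incorrect = [s for s in samples if extract_answer(s) != gold_answer]
    pairs = [
        {"prompt": problem, "chosen": w, "rejected": l}
        for w, l in itertools.product(correct, incorrect)
    ]
    # Cap pairs per problem so easy problems don't dominate the dataset.
    return pairs[:max_pairs]
```

Capping the number of pairs per problem is one common way to keep problems with many correct samples from overwhelming the preference dataset.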

Infrastructure

Training was conducted on NVIDIA H100 GPUs using:

  • Ray.io for distributed data parallelism (DDP) across multiple GPUs
  • FP8 mixed-precision training for memory efficiency on H100 architecture
  • DeepSpeed ZeRO for optimizer state partitioning
  • Custom checkpointing with Google Cloud Storage for fault tolerance

Experiments

Evaluation Setup

Models were evaluated using:

  • GSM8K: Grade school math word problems (8.5K problems total; ~1.3K test examples)
  • MATH: Competition-level mathematics (5K test examples)
  • Win-rate: Head-to-head comparison against the base Qwen 2.5 0.5B Instruct model using GPT-4 as judge
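The two metric families above are straightforward to compute. A minimal sketch, where the tie-breaking convention (ties count as half a win) is an assumption rather than a documented detail of the evaluation framework:

```python
def exact_match_accuracy(predictions, references):
    """GSM8K/MATH-style accuracy: fraction of problems where the
    extracted final answer exactly matches the gold answer."""
    assert len(predictions) == len(references)
    hits = sum(p == r for p, r in zip(predictions, references))
    return hits / len(predictions)

def win_rate(judgments):
    """Head-to-head win rate from judge verdicts ("win"/"loss"/"tie").
    One common convention, assumed here: a tie counts as half a win."""
    wins = judgments.count("win")
    ties = judgments.count("tie")
    return (wins + 0.5 * ties) / len(judgments)
```
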

Results

Model                            GSM8K Acc.   MATH Acc.   Win Rate vs Base
Qwen 2.5 0.5B Instruct (Base)    baseline     baseline    —
+ SFT                            Improved     Improved    38%
+ SFT + DPO                      Improved     Improved    45%
+ SFT + DPO + GenRM              Improved     Improved    42%

Key finding: The SFT + DPO configuration achieved the strongest win rate, 45%, against the base Instruct model. The GenRM stage, while producing more structured verification outputs, did not consistently improve over DPO in automated evaluations, suggesting that for a small model (0.5B parameters) the verification overhead may exceed the available reasoning capacity.

Analysis

  • DPO was the most impactful stage, likely because preference optimization directly targets the model's generation distribution toward higher-quality reasoning patterns.
  • GenRM showed promise in verification quality — generated CoT traces were often logically sound — but the 0.5B model struggled to simultaneously reason and verify at this scale.
  • FP8 training on H100s reduced memory footprint by ~40% compared to FP16 with minimal accuracy degradation, enabling efficient experimentation.

Contributions

  • Prabhjot Singh Rai: Cloud training infrastructure, DDP strategy with Ray.io, FP8 training pipeline, experiment orchestration
  • Anirban Chatterjee: Data pipeline, SFT and DPO training loops, GenRM implementation, evaluation framework

References

  1. Qwen Team. "Qwen2.5 Technical Report." 2024.
  2. Rafailov et al. "Direct Preference Optimization: Your Language Model is Secretly a Reward Model." NeurIPS 2023.
  3. Zhang et al. "Generative Verifiers: Reward Modeling as Next-Token Prediction." 2024.
  4. Cobbe et al. "Training Verifiers to Solve Math Word Problems." 2021.