Introduction
Large Language Models (LLMs) have demonstrated impressive capabilities, yet their mathematical reasoning often remains unreliable. This project explores how reinforcement learning fine-tuning can improve language model reasoning by training Generative Reward Models (GenRM) that verify solutions using Chain-of-Thought (CoT) reasoning.
We investigate a three-stage pipeline — Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and GenRM — applied to Qwen 2.5 0.5B, targeting mathematical problem-solving on the GSM8K and MATH benchmarks.
Background
Generative Reward Models
Traditional reward models in RLHF assign scalar scores to outputs. GenRM instead frames verification as a next-token prediction task — the model generates a CoT verification trace explaining why a solution is correct or incorrect, then emits a final judgment. This approach leverages the model's own reasoning capabilities for self-verification.
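In this framing, scoring a candidate solution reduces to reading the probability the verifier assigns to its judgment token after the verification trace. A minimal sketch (function name and logit values are illustrative; in practice the logits come from a forward pass of the fine-tuned verifier):

```python
import math

def genrm_score(yes_logit: float, no_logit: float) -> float:
    """Correctness score from the final-judgment logits.

    GenRM frames verification as next-token prediction: after the model
    emits its CoT verification trace, we read the logits it assigns to
    the "Yes" / "No" judgment tokens and use P("Yes") as the reward.
    """
    # Softmax restricted to the two judgment tokens
    # (subtract the max for numerical stability).
    m = max(yes_logit, no_logit)
    e_yes = math.exp(yes_logit - m)
    e_no = math.exp(no_logit - m)
    return e_yes / (e_yes + e_no)

# A solution the verifier is confident about scores near 1.0.
print(genrm_score(yes_logit=4.0, no_logit=-1.0))  # ~0.993
```

A two-way softmax is equivalent to a sigmoid over the logit difference, which is why a confident "Yes" margin of 5 yields a score near 0.99.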
Direct Preference Optimization
DPO simplifies RLHF by optimizing the policy directly on preference data, without training a separate reward model. Given pairs of preferred and dispreferred responses, DPO adjusts the model to increase the likelihood of the preferred response relative to the dispreferred one, measured against a frozen reference policy.
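Concretely, with policy $\pi_\theta$, reference policy $\pi_{\mathrm{ref}}$, preferred response $y_w$, dispreferred response $y_l$, and temperature $\beta$, the DPO loss is:

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l)\sim \mathcal{D}}
    \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right) \right]
```

where $\sigma$ is the logistic function and $\beta$ controls how far the policy may drift from the reference.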
Methodology
Training Pipeline
Our three-stage approach:
- Stage 1 — SFT: Fine-tune Qwen 2.5 0.5B on curated math reasoning demonstrations to establish a strong baseline for structured mathematical output.
- Stage 2 — DPO: Train on preference pairs where correct solutions are preferred over incorrect ones, teaching the model to distinguish quality reasoning.
- Stage 3 — GenRM-CoT: Train the model to generate verification traces — given a problem and candidate solution, produce a CoT explanation evaluating correctness, then output a binary judgment.
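The Stage 2 preference objective reduces to a few lines once per-response token log-probabilities are summed. A numeric sketch (log-probability values and β are illustrative placeholders, not measured quantities):

```python
import math

def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss from summed token log-probabilities.

    logp_*     : log-prob of the chosen (w) / rejected (l) response
                 under the policy being trained
    ref_logp_* : the same quantities under the frozen reference policy
    """
    # Implicit reward margin: how much more the policy prefers the
    # correct solution than the reference policy does.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid)

# At initialization the policy equals the reference, so the loss is log 2.
print(round(dpo_loss(-5.0, -7.0, -5.0, -7.0), 4))  # 0.6931
```

As the policy learns to put more mass on correct solutions than the reference does, the margin grows and the loss falls below log 2.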
Data Pipeline
- SFT Data: Curated mathematical reasoning demonstrations with step-by-step solutions
- DPO Pairs: Generated by sampling multiple solutions per problem, pairing correct solutions (verified by ground truth) against incorrect ones
- GenRM Training Data: Problem-solution pairs annotated with verification CoT traces and correctness labels
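The DPO pair construction above can be sketched as follows. Function and field names are illustrative, and answer extraction (which is dataset-specific) is assumed to have already happened:

```python
import itertools
import random

def build_dpo_pairs(problem, samples, gold_answer, max_pairs=4):
    """Pair correct sampled solutions against incorrect ones.

    `samples` is a list of (solution_text, extracted_answer) tuples
    sampled from the SFT model; a solution counts as correct when its
    extracted final answer matches the ground truth.
    """
    correct = [s for s, a in samples if a == gold_answer]
    incorrect = [s for s, a in samples if a != gold_answer]
    pairs = [{"prompt": problem, "chosen": c, "rejected": r}
             for c, r in itertools.product(correct, incorrect)]
    # Cap pairs per problem so easy problems do not dominate the dataset.
    random.shuffle(pairs)
    return pairs[:max_pairs]

samples = [("... the answer is 42", "42"),
           ("... the answer is 41", "41"),
           ("... so we get 42", "42")]
pairs = build_dpo_pairs("Q: ...", samples, gold_answer="42")
print(len(pairs))  # 2 correct x 1 incorrect = 2 pairs
```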
Infrastructure
Training was conducted on NVIDIA H100 GPUs using:
- Ray.io for distributed data parallelism (DDP) across multiple GPUs
- FP8 mixed-precision training for memory efficiency on H100 architecture
- DeepSpeed ZeRO for optimizer state partitioning
- Custom checkpointing with Google Cloud Storage for fault tolerance
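An illustrative DeepSpeed configuration fragment for the ZeRO piece of this setup (all values are placeholders, not the exact configuration used). ZeRO stage 1 partitions optimizer states across the data-parallel workers; FP8 compute is enabled separately at the layer level rather than through this file:

```json
{
  "train_micro_batch_size_per_gpu": 8,
  "gradient_accumulation_steps": 4,
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 1,
    "overlap_comm": true
  }
}
```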
Experiments
Evaluation Setup
Models were evaluated using:
- GSM8K: Grade school math word problems (1,319 test examples)
- MATH: Competition-level mathematics (5K test examples)
- Win-rate: Head-to-head comparison against the base Qwen 2.5 0.5B Instruct model using GPT-4 as judge
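Accuracy scoring on GSM8K hinges on extracting the final answer: ground-truth solutions end in "#### &lt;answer&gt;", while model outputs may not, so a fallback is needed. A sketch of the kind of extraction logic assumed here (regexes and normalization are illustrative):

```python
import re

def extract_gsm8k_answer(text: str):
    """Pull the final numeric answer from a GSM8K-style completion.

    GSM8K ground-truth solutions end in "#### <answer>"; for model
    outputs we fall back to the last number in the text.
    """
    m = re.search(r"####\s*([\-0-9.,]+)", text)
    if m is not None:
        value = m.group(1)
    else:
        numbers = re.findall(r"-?\d[\d,]*\.?\d*", text)
        value = numbers[-1] if numbers else None
    # Normalize: drop thousands separators and a trailing period.
    return value.replace(",", "").rstrip(".") if value else None

def exact_match(pred: str, gold: str) -> bool:
    return extract_gsm8k_answer(pred) == extract_gsm8k_answer(gold)

print(exact_match("... so she earns $18 total in the end",
                  "She sells 9 eggs at $2. #### 18"))  # True
```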
Results
| Model | GSM8K Acc. | MATH Acc. | Win Rate vs Base |
|---|---|---|---|
| Qwen 2.5 0.5B Instruct (Base) | — | — | — |
| + SFT | Improved | Improved | 38% |
| + SFT + DPO | Improved | Improved | 45% |
| + SFT + DPO + GenRM | Improved | Improved | 42% |
Key finding: the SFT + DPO model achieved the strongest win rate, 45%, against the base Instruct model. The GenRM stage, while producing more structured verification outputs, did not consistently improve over DPO in automated evaluations, suggesting that at 0.5B parameters the added burden of verification may exceed the model's reasoning capacity.
Analysis
- DPO was the most impactful stage, likely because preference optimization directly targets the model's generation distribution toward higher-quality reasoning patterns.
- GenRM showed promise in verification quality — generated CoT traces were often logically sound — but the 0.5B model struggled to simultaneously reason and verify at this scale.
- FP8 training on H100s reduced memory footprint by ~40% compared to FP16 with minimal accuracy degradation, enabling efficient experimentation.
Contributions
- Prabhjot Singh Rai: Cloud training infrastructure, DDP strategy with Ray.io, FP8 training pipeline, experiment orchestration
- Anirban Chatterjee: Data pipeline, SFT and DPO training loops, GenRM implementation, evaluation framework
References
- Qwen Team. "Qwen2.5 Technical Report." 2024.
- Rafailov et al. "Direct Preference Optimization: Your Language Model is Secretly a Reward Model." NeurIPS 2023.
- Zhang et al. "Generative Verifiers: Reward Modeling as Next-Token Prediction." 2024.
- Cobbe et al. "Training Verifiers to Solve Math Word Problems." 2021.