RL Fine-Tuning of Language Models with GenRM-CoT

Applying reinforcement learning techniques to improve language model reasoning through Generative Reward Models with Chain-of-Thought verification

Introduction

Large Language Models (LLMs) have demonstrated impressive capabilities, yet their mathematical reasoning often remains unreliable. This project explores how reinforcement learning fine-tuning can improve language model reasoning by training Generative Reward Models (GenRM) that verify solutions using Chain-of-Thought (CoT) reasoning.

We investigate a three-stage pipeline — Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and GenRM — applied to Qwen 2.5 0.5B, targeting mathematical problem-solving on the GSM8K and MATH benchmarks.

Background

Generative Reward Models

Traditional reward models in RLHF assign scalar scores to outputs. GenRM instead frames verification as a next-token prediction task — the model generates a CoT verification trace explaining why a solution is correct or incorrect, then emits a final judgment. This approach leverages the model's own reasoning capabilities for self-verification.
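To make the framing concrete, here is a minimal sketch of how verification-as-next-token-prediction can be wired up: a prompt template asks the verifier to reason before judging, and the final judgment is read off the trace. The template wording and function names are illustrative assumptions, not the project's actual API.

```python
def build_verification_prompt(problem: str, solution: str) -> str:
    # Hypothetical GenRM-CoT template: the verifier reasons step by
    # step, then commits to a final "Yes" or "No" judgment token.
    return (
        f"Problem: {problem}\n"
        f"Candidate solution: {solution}\n"
        "Let's verify step by step, then answer Yes or No.\n"
        "Verification:"
    )

def parse_judgment(verification_trace: str):
    # Read the binary judgment off the last Yes/No in the generated
    # trace; return None if the model never committed to an answer.
    for token in reversed(verification_trace.split()):
        word = token.strip(".,!:").lower()
        if word in ("yes", "no"):
            return word == "yes"
    return None
```

Because the judgment is just another generated token, its likelihood under the model can also be used as a soft correctness score rather than a hard binary label.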

Direct Preference Optimization

DPO simplifies RLHF by directly optimizing the policy from preference data without training a separate reward model. Given preferred and dispreferred response pairs, DPO adjusts the model to increase the likelihood of preferred responses relative to dispreferred ones:

\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l)} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]
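The loss above reduces to a simple expression per preference pair. A minimal sketch in plain Python, assuming each argument is the summed token log-probability of the full response under the policy or the frozen reference model (names are illustrative):

```python
import math

def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss: -log sigmoid(beta * (policy margin - reference margin)).

    logp_w / logp_l: log pi_theta(y_w | x) and log pi_theta(y_l | x)
    ref_logp_w / ref_logp_l: the same quantities under pi_ref.
    """
    logits = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(x)) written stably as log(1 + exp(-x))
    return math.log1p(math.exp(-logits))
```

When the policy matches the reference, the logits are zero and the loss is log 2; the loss falls as the policy shifts probability toward the preferred response relative to the reference.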

Methodology

Training Pipeline

Our three-stage approach:

  1. Stage 1 — SFT: Fine-tune Qwen 2.5 0.5B on curated math reasoning demonstrations to establish a strong baseline for structured mathematical output.

  2. Stage 2 — DPO: Train on preference pairs where correct solutions are preferred over incorrect ones, teaching the model to distinguish quality reasoning.

  3. Stage 3 — GenRM-CoT: Train the model to generate verification traces — given a problem and candidate solution, produce a CoT explanation evaluating correctness, then output a binary judgment.

Data Pipeline

  • SFT Data: Curated mathematical reasoning demonstrations with step-by-step solutions
  • DPO Pairs: Generated by sampling multiple solutions per problem, pairing correct solutions (verified by ground truth) against incorrect ones
  • GenRM Training Data: Problem-solution pairs annotated with verification CoT traces and correctness labels
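The DPO pair construction step can be sketched as follows: sample several solutions per problem, split them by ground-truth correctness, and pair winners against losers. The helper names and the `max_pairs` cap are assumptions for illustration, not the project's actual pipeline code.

```python
import itertools

def make_dpo_pairs(problem, samples, gold_answer, extract_answer, max_pairs=4):
    """Pair ground-truth-verified correct samples against incorrect ones.

    samples: candidate solutions drawn from the SFT model.
    extract_answer: pulls the final answer from a solution string
    (hypothetical helper; any answer-extraction routine works here).
    """
    correct = [s for s in samples if extract_answer(s) == gold_answer]
    incorrect = [s for s in samples if extract_answer(s) != gold_answer]
    pairs = [
        {"prompt": problem, "chosen": w, "rejected": l}
        for w, l in itertools.product(correct, incorrect)
    ]
    # Cap pairs per problem so easy problems don't dominate the dataset.
    return pairs[:max_pairs]
```

Capping the number of pairs per problem is one common way to keep problems with many correct samples from overwhelming the preference dataset.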

Infrastructure

Training was conducted on NVIDIA H100 GPUs using:

  • Ray.io for distributed data parallelism (DDP) across multiple GPUs
  • FP8 mixed-precision training for memory efficiency on H100 architecture
  • DeepSpeed ZeRO for optimizer state partitioning
  • Custom checkpointing with Google Cloud Storage for fault tolerance

Experiments

Evaluation Setup

Models were evaluated using:

  • GSM8K: Grade school math word problems (8.5K problems total; ~1.3K test examples)
  • MATH: Competition-level mathematics (5K test examples)
  • Win-rate: Head-to-head comparison against the base Qwen 2.5 0.5B Instruct model using GPT-4 as judge
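The two metric families above are straightforward to compute. A minimal sketch, where the tie-breaking convention (ties count as half a win) is an assumption rather than a documented detail of the evaluation framework:

```python
def exact_match_accuracy(predictions, references):
    """GSM8K/MATH-style accuracy: fraction of problems where the
    extracted final answer exactly matches the gold answer."""
    assert len(predictions) == len(references)
    hits = sum(p == r for p, r in zip(predictions, references))
    return hits / len(predictions)

def win_rate(judgments):
    """Head-to-head win rate from judge verdicts ("win"/"loss"/"tie").
    One common convention, assumed here: a tie counts as half a win."""
    wins = judgments.count("win")
    ties = judgments.count("tie")
    return (wins + 0.5 * ties) / len(judgments)
```
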

Results

Model                            GSM8K Acc.   MATH Acc.   Win Rate vs Base
Qwen 2.5 0.5B Instruct (Base)    baseline     baseline    —
+ SFT                            Improved     Improved    38%
+ SFT + DPO                      Improved     Improved    45%
+ SFT + DPO + GenRM              Improved     Improved    42%

Key finding: The SFT + DPO configuration achieved the strongest win rate, 45%, against the base Instruct model. The GenRM stage, while producing more structured verification outputs, did not consistently improve over DPO in automated evaluations, suggesting that for a small model (0.5B parameters) the verification overhead may exceed the available reasoning capacity.

Analysis

  • DPO was the most impactful stage, likely because preference optimization directly targets the model's generation distribution toward higher-quality reasoning patterns.
  • GenRM showed promise in verification quality — generated CoT traces were often logically sound — but the 0.5B model struggled to simultaneously reason and verify at this scale.
  • FP8 training on H100s reduced memory footprint by ~40% compared to FP16 with minimal accuracy degradation, enabling efficient experimentation.

Contributions

  • Prabhjot Singh Rai: Cloud training infrastructure, DDP strategy with Ray.io, FP8 training pipeline, experiment orchestration
  • Anirban Chatterjee: Data pipeline, SFT and DPO training loops, GenRM implementation, evaluation framework

References

  1. Qwen Team. "Qwen2.5 Technical Report." 2024.
  2. Rafailov et al. "Direct Preference Optimization: Your Language Model is Secretly a Reward Model." NeurIPS 2023.
  3. Zhang et al. "Generative Verifiers: Reward Modeling as Next-Token Prediction." 2024.
  4. Cobbe et al. "Training Verifiers to Solve Math Word Problems." 2021.