Introduction
Small Language Models (SLMs) struggle with complex multi-step reasoning tasks despite advances in pre-training and fine-tuning. PocketSheet introduces a three-stage pipeline that enhances test-time reasoning in small models through efficient memory augmentation — enabling a 7B parameter model to dramatically improve on tasks that typically require much larger models.
Our approach combines teacher-trajectory supervised fine-tuning, cheatsheet summarization, and Group Relative Policy Optimization (GRPO) to raise Qwen-7B performance on the Game of 24 dataset from 4% to 55% on the DC-Cu (Decompose-Compose Curriculum) split.
Background
The Problem with Small Models
While large models (70B+) can leverage extensive chain-of-thought reasoning, small models lack the capacity to discover effective reasoning strategies through prompting alone. Test-time compute scaling — spending more inference-time tokens on harder problems — helps, but small models often don't know what to compute.
Game of 24
The Game of 24 is a mathematical reasoning task: given four numbers, use basic arithmetic operations (+, −, ×, ÷) to make exactly 24. Despite its apparent simplicity, it requires combinatorial search over operator and operand orderings — making it an ideal benchmark for structured reasoning.
Example: Given [1, 5, 5, 5] → Solution: 5 × (5 − 1 / 5) = 24
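To make the combinatorial search concrete, here is a minimal brute-force solver sketch (illustrative only, not part of PocketSheet's pipeline). Repeatedly combining any ordered pair of operands with one operation and recursing covers every operand ordering and parenthesization; exact `Fraction` arithmetic avoids false negatives from floating-point error.

```python
from fractions import Fraction

# The four operations; division by zero is skipped at the call site.
OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b,
}

def solve24(nums, target=24):
    """Return a solving expression as a string, or None if unsolvable."""
    def search(items):
        if len(items) == 1:
            value, expr = items[0]
            return expr if value == target else None
        # Ordered pairs cover both a-b and b-a (likewise a/b and b/a).
        for i in range(len(items)):
            for j in range(len(items)):
                if i == j:
                    continue
                (a, ea), (b, eb) = items[i], items[j]
                rest = [items[k] for k in range(len(items)) if k not in (i, j)]
                for sym, op in OPS.items():
                    if sym == "/" and b == 0:
                        continue
                    found = search(rest + [(op(a, b), f"({ea} {sym} {eb})")])
                    if found:
                        return found
        return None

    return search([(Fraction(n), str(n)) for n in nums])
```

The recursion explores at most 4 × 3 × 4 operator/operand choices per level, which is tiny for four numbers but illustrates why unguided small models struggle: the model must implicitly prune this tree.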
Methodology
Three-Stage Pipeline
Stage 1: Teacher-Trajectory SFT
We fine-tune Qwen-7B on successful reasoning trajectories generated by a stronger teacher model. The teacher provides step-by-step solutions demonstrating effective decomposition strategies for Game of 24 problems.
- Full-parameter SFT on NVIDIA H100 GPUs
- Training on curated teacher trajectories that demonstrate the decompose-compose strategy
- Loss computed only on reasoning tokens (not input prompts)
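The prompt-masking in the last bullet can be sketched as follows. This uses the common convention of an ignore index of -100 for positions excluded from the cross-entropy loss; the actual training code is not shown in this report.

```python
IGNORE_INDEX = -100  # conventional ignore index for cross-entropy losses

def build_sft_example(prompt_ids, response_ids):
    """Concatenate prompt and teacher-trajectory token ids, masking the
    prompt positions so loss is computed only on reasoning tokens."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels
```

Frameworks such as Hugging Face Transformers skip label positions set to -100 when computing the loss, so only the teacher's reasoning tokens contribute gradients.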
Stage 2: Cheatsheet Summarization
The key innovation of PocketSheet: we train the model to generate compressed "cheatsheets" — concise memory artifacts that capture useful intermediate computations and patterns.
Given a problem, the model first generates a cheatsheet summarizing relevant strategies and partial computations, then uses this cheatsheet as additional context during reasoning. This acts as an efficient external memory that the model can consult during test-time inference.
Formally, the model first samples a cheatsheet c ~ p_θ(c | x) from the problem x, then the answer y ~ p_θ(y | x, c), where c represents the cheatsheet and y the final answer.
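The two-pass decoding described above can be sketched as follows. Here `generate` stands in for any decoding function, and the prompt templates and token budgets are illustrative assumptions, not the ones used in training.

```python
def cheatsheet_inference(generate, problem, sheet_budget=128):
    """Two-pass decoding: draft a cheatsheet from the problem, then
    condition the final answer on both the problem and the cheatsheet."""
    # Pass 1: compressed strategies and partial computations.
    sheet = generate(f"Problem: {problem}\nCheatsheet:",
                     max_new_tokens=sheet_budget)
    # Pass 2: the cheatsheet rides along as extra context for reasoning.
    answer = generate(f"Problem: {problem}\nCheatsheet: {sheet}\nSolution:",
                      max_new_tokens=256)
    return sheet, answer
```

The cheatsheet thus acts as an external memory written and read within a single inference call, at the cost of one extra decoding pass.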
Stage 3: GRPO (Group Relative Policy Optimization)
We apply GRPO to further refine the model's reasoning policy. GRPO extends PPO-style optimization by:
- Sampling a group of responses for each problem
- Computing relative advantages within the group
- Optimizing the policy to increase likelihood of higher-reward responses relative to the group mean
The reward signal is binary: +1 if the generated expression evaluates to 24, −1 otherwise.
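A minimal sketch of the binary reward and the group-relative advantage computation; the tolerance and error handling are illustrative, and a production pipeline would parse model output rather than `eval` it.

```python
import statistics

def game24_reward(expr):
    """Binary reward: +1 if the expression evaluates to 24, else -1."""
    try:
        return 1.0 if abs(eval(expr) - 24) < 1e-6 else -1.0
    except (SyntaxError, ZeroDivisionError, TypeError, NameError):
        return -1.0

def group_advantages(rewards):
    """GRPO-style advantages: center each sampled response's reward on
    the group mean and scale by the group's standard deviation."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard all-equal groups
    return [(r - mean) / std for r in rewards]
```

Because advantages are computed relative to the group rather than to a learned value function, GRPO needs no critic network, which keeps the memory footprint small enough for full-parameter RL on a 7B model.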
Infrastructure
- Ray Train: Distributed training across multiple H100 GPUs with automatic scaling
- Google Cloud Storage: Checkpoint persistence for fault-tolerant training
- Full-parameter SFT: No LoRA or adapter methods — full fine-tuning for maximum capacity
- Custom evaluation pipeline: Automated symbolic verification of Game of 24 solutions
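The report does not detail the verifier, but symbolic verification of a Game of 24 answer might look like the sketch below: an AST walk that rejects anything other than arithmetic, checks that the four given numbers are each used exactly once, and only then evaluates the expression.

```python
import ast

# Whitelist of AST node types for a pure arithmetic expression.
ALLOWED = (ast.Expression, ast.BinOp, ast.UnaryOp, ast.Constant,
           ast.Add, ast.Sub, ast.Mult, ast.Div, ast.USub)

def verify_solution(expr, nums, target=24, tol=1e-6):
    """True iff expr is pure arithmetic, uses exactly the given numbers
    once each, and evaluates to the target."""
    try:
        tree = ast.parse(expr, mode="eval")
    except SyntaxError:
        return False
    if not all(isinstance(node, ALLOWED) for node in ast.walk(tree)):
        return False  # reject names, calls, attribute access, etc.
    used = sorted(node.value for node in ast.walk(tree)
                  if isinstance(node, ast.Constant))
    if used != sorted(nums):
        return False
    try:
        return abs(eval(compile(tree, "<expr>", "eval")) - target) < tol
    except ZeroDivisionError:
        return False
```

Whitelisting AST nodes before evaluation makes the checker safe to run on arbitrary model output, unlike a bare `eval`.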
Experiments
Dataset
- DC-Cu (Decompose-Compose Curriculum): A curriculum-ordered split of Game of 24 problems arranged by difficulty
- Problems range from straightforward arithmetic to complex multi-step decompositions
Results
| Model | DC-Cu Accuracy |
|---|---|
| Qwen-7B (Base) | 4% |
| + Teacher-Trajectory SFT | 32% |
| + Cheatsheet Summarization | 41% |
| + GRPO | 55% |
Each stage of the pipeline contributed meaningful improvements:
- SFT (+28 points): Established structured reasoning capability through imitation learning
- Cheatsheet (+9 points): Memory augmentation provided useful intermediate context, particularly on harder problems requiring multi-step decomposition
- GRPO (+14 points): Reinforcement learning refined the policy by directly optimizing for correctness, moving beyond imitation toward discovery of novel solution strategies
Ablation Studies
- Cheatsheet length: Shorter cheatsheets (128 tokens) performed nearly as well as longer ones (512 tokens), suggesting the model learns to extract the most salient patterns efficiently
- GRPO group size: Larger groups improved advantage estimation but cost more compute; a moderate group size balanced the two
- Teacher model choice: Larger teachers produced better SFT trajectories, but diminishing returns beyond 70B parameters
Contributions
- Sakthivel Sivaraman: Teacher-trajectory data generation, cheatsheet summarization methodology, GRPO reward design
- Prabhjot Singh Rai: Cloud training infrastructure (Ray Train, GCS checkpointing), cheatsheet summarization support, evaluation pipeline, full-parameter SFT on H100, GRPO implementation, model serving
References
- Qwen Team. "Qwen2.5 Technical Report." 2024.
- Shao et al. "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models." 2024.
- Yao et al. "Tree of Thoughts: Deliberate Problem Solving with Large Language Models." NeurIPS 2023.
- Lightman et al. "Let's Verify Step by Step." ICLR 2024.