Introduction
Small Language Models (SLMs) struggle with complex multi-step reasoning tasks despite advances in pre-training and fine-tuning. PocketSheet introduces a three-stage pipeline that enhances test-time reasoning in small models through efficient memory augmentation — enabling a 7B parameter model to dramatically improve on tasks that typically require much larger models.
Our approach combines teacher-trajectory supervised fine-tuning, cheatsheet summarization, and Group Relative Policy Optimization (GRPO) to raise Qwen-7B performance on the Game of 24 dataset from 4% to 55% on the DC-Cu (Decompose-Compose Curriculum) split.
Background
The Problem with Small Models
While large models (70B+) can leverage extensive chain-of-thought reasoning, small models lack the capacity to discover effective reasoning strategies through prompting alone. Test-time compute scaling — spending more inference-time tokens on harder problems — helps, but small models often don't know what to compute.
Game of 24
The Game of 24 is a mathematical reasoning task: given four numbers, use basic arithmetic operations (+, −, ×, ÷) to make exactly 24. Despite its apparent simplicity, it requires combinatorial search over operator and operand orderings — making it an ideal benchmark for structured reasoning.
Example: Given [1, 5, 5, 5] → Solution: 5 × (5 − 1 / 5) = 24
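To make the combinatorial search concrete, here is a minimal brute-force solver sketch (illustrative only, not part of PocketSheet's pipeline). Repeatedly combining any ordered pair of operands with one operation and recursing covers every operand ordering and parenthesization; exact `Fraction` arithmetic avoids false negatives from floating-point error.

```python
from fractions import Fraction

# The four operations; division by zero is skipped at the call site.
OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b,
}

def solve24(nums, target=24):
    """Return a solving expression as a string, or None if unsolvable."""
    def search(items):
        if len(items) == 1:
            value, expr = items[0]
            return expr if value == target else None
        # Ordered pairs cover both a-b and b-a (likewise a/b and b/a).
        for i in range(len(items)):
            for j in range(len(items)):
                if i == j:
                    continue
                (a, ea), (b, eb) = items[i], items[j]
                rest = [items[k] for k in range(len(items)) if k not in (i, j)]
                for sym, op in OPS.items():
                    if sym == "/" and b == 0:
                        continue
                    found = search(rest + [(op(a, b), f"({ea} {sym} {eb})")])
                    if found:
                        return found
        return None

    return search([(Fraction(n), str(n)) for n in nums])
```

The recursion explores at most 4 × 3 × 4 operator/operand choices per level, which is tiny for four numbers but illustrates why unguided small models struggle: the model must implicitly prune this tree.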
Methodology
Three-Stage Pipeline
Stage 1: Teacher-Trajectory SFT
We fine-tune Qwen-7B on successful reasoning trajectories generated by a stronger teacher model. The teacher provides step-by-step solutions demonstrating effective decomposition strategies for Game of 24 problems.
- Full-parameter SFT on NVIDIA H100 GPUs
- Training on curated teacher trajectories that demonstrate the decompose-compose strategy
- Loss computed only on reasoning tokens (not input prompts)
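The prompt-masking in the last bullet can be sketched as follows. This uses the common convention of an ignore index of -100 for positions excluded from the cross-entropy loss; the actual training code is not shown in this report.

```python
IGNORE_INDEX = -100  # conventional ignore index for cross-entropy losses

def build_sft_example(prompt_ids, response_ids):
    """Concatenate prompt and teacher-trajectory token ids, masking the
    prompt positions so loss is computed only on reasoning tokens."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels
```

Frameworks such as Hugging Face Transformers skip label positions set to -100 when computing the loss, so only the teacher's reasoning tokens contribute gradients.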
Stage 2: Cheatsheet Summarization
The key innovation of PocketSheet: we train the model to generate compressed "cheatsheets" — concise memory artifacts that capture useful intermediate computations and patterns.
Given a problem, the model first generates a cheatsheet summarizing relevant strategies and partial computations, then uses this cheatsheet as additional context during reasoning. This acts as an efficient external memory that the model can consult during test-time inference.
Formally, the model first samples a cheatsheet c ~ p_θ(c | x) from the problem x, then the answer y ~ p_θ(y | x, c), where c represents the cheatsheet and y the final answer.
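The two-pass decoding described above can be sketched as follows. Here `generate` stands in for any decoding function, and the prompt templates and token budgets are illustrative assumptions, not the ones used in training.

```python
def cheatsheet_inference(generate, problem, sheet_budget=128):
    """Two-pass decoding: draft a cheatsheet from the problem, then
    condition the final answer on both the problem and the cheatsheet."""
    # Pass 1: compressed strategies and partial computations.
    sheet = generate(f"Problem: {problem}\nCheatsheet:",
                     max_new_tokens=sheet_budget)
    # Pass 2: the cheatsheet rides along as extra context for reasoning.
    answer = generate(f"Problem: {problem}\nCheatsheet: {sheet}\nSolution:",
                      max_new_tokens=256)
    return sheet, answer
```

The cheatsheet thus acts as an external memory written and read within a single inference call, at the cost of one extra decoding pass.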
Stage 3: GRPO (Group Relative Policy Optimization)
We apply GRPO to further refine the model's reasoning policy. GRPO extends PPO-style optimization by:
- Sampling a group of responses for each problem
- Computing relative advantages within the group
- Optimizing the policy to increase likelihood of higher-reward responses relative to the group mean
The reward signal is binary: +1 if the generated expression evaluates to 24, −1 otherwise.
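A minimal sketch of the binary reward and the group-relative advantage computation; the tolerance and error handling are illustrative, and a production pipeline would parse model output rather than `eval` it.

```python
import statistics

def game24_reward(expr):
    """Binary reward: +1 if the expression evaluates to 24, else -1."""
    try:
        return 1.0 if abs(eval(expr) - 24) < 1e-6 else -1.0
    except (SyntaxError, ZeroDivisionError, TypeError, NameError):
        return -1.0

def group_advantages(rewards):
    """GRPO-style advantages: center each sampled response's reward on
    the group mean and scale by the group's standard deviation."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard all-equal groups
    return [(r - mean) / std for r in rewards]
```

Because advantages are computed relative to the group rather than to a learned value function, GRPO needs no critic network, which keeps the memory footprint small enough for full-parameter RL on a 7B model.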
Infrastructure
- Ray Train: Distributed training across multiple H100 GPUs with automatic scaling
- Google Cloud Storage: Checkpoint persistence for fault-tolerant training
- Full-parameter SFT: No LoRA or adapter methods — full fine-tuning for maximum capacity
- Custom evaluation pipeline: Automated symbolic verification of Game of 24 solutions
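The report does not detail the verifier, but symbolic verification of a Game of 24 answer might look like the sketch below: an AST walk that rejects anything other than arithmetic, checks that the four given numbers are each used exactly once, and only then evaluates the expression.

```python
import ast

# Whitelist of AST node types for a pure arithmetic expression.
ALLOWED = (ast.Expression, ast.BinOp, ast.UnaryOp, ast.Constant,
           ast.Add, ast.Sub, ast.Mult, ast.Div, ast.USub)

def verify_solution(expr, nums, target=24, tol=1e-6):
    """True iff expr is pure arithmetic, uses exactly the given numbers
    once each, and evaluates to the target."""
    try:
        tree = ast.parse(expr, mode="eval")
    except SyntaxError:
        return False
    if not all(isinstance(node, ALLOWED) for node in ast.walk(tree)):
        return False  # reject names, calls, attribute access, etc.
    used = sorted(node.value for node in ast.walk(tree)
                  if isinstance(node, ast.Constant))
    if used != sorted(nums):
        return False
    try:
        return abs(eval(compile(tree, "<expr>", "eval")) - target) < tol
    except ZeroDivisionError:
        return False
```

Whitelisting AST nodes before evaluation makes the checker safe to run on arbitrary model output, unlike a bare `eval`.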
Experiments
Dataset
- DC-Cu (Decompose-Compose Curriculum): A curriculum-ordered split of Game of 24 problems arranged by difficulty
- Problems range from straightforward arithmetic to complex multi-step decompositions
Results
| Model | DC-Cu Accuracy |
|---|---|
| Qwen-7B (Base) | 4% |
| + Teacher-Trajectory SFT | 32% |
| + Cheatsheet Summarization | 41% |
| + GRPO | 55% |
Each stage of the pipeline contributed meaningful improvements:
- SFT (+28 points): Established structured reasoning capability through imitation learning
- Cheatsheet (+9 points): Memory augmentation provided useful intermediate context, particularly on harder problems requiring multi-step decomposition
- GRPO (+14 points): Reinforcement learning refined the policy by directly optimizing for correctness, moving beyond imitation toward discovery of novel solution strategies
Ablation Studies
- Cheatsheet length: Shorter cheatsheets (128 tokens) performed nearly as well as longer ones (512 tokens), suggesting the model learns to extract the most salient patterns efficiently
- GRPO group size: Larger groups improved advantage estimation but cost more compute; a moderate group size balanced the two
- Teacher model choice: Larger teachers produced better SFT trajectories, but diminishing returns beyond 70B parameters
Contributions
- Sakthivel Sivaraman: Teacher-trajectory data generation, cheatsheet summarization methodology, GRPO reward design
- Prabhjot Singh Rai: Cloud training infrastructure (Ray Train, GCS checkpointing), cheatsheet summarization support, evaluation pipeline, full-parameter SFT on H100, GRPO implementation, model serving
References
- Qwen Team. "Qwen2.5 Technical Report." 2024.
- Shao et al. "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models." 2024.
- Yao et al. "Tree of Thoughts: Deliberate Problem Solving with Large Language Models." NeurIPS 2023.
- Lightman et al. "Let's Verify Step by Step." ICLR 2024.