PocketSheet: Enhancing Test-Time Learning with Memory Augmentation

Improving small language model reasoning through efficient memory augmentation using teacher trajectories and cheatsheet summarization

Introduction

Small Language Models (SLMs) struggle with complex multi-step reasoning tasks despite advances in pre-training and fine-tuning. PocketSheet introduces a three-stage pipeline that enhances test-time reasoning in small models through efficient memory augmentation — enabling a 7B parameter model to dramatically improve on tasks that typically require much larger models.

Our approach combines teacher-trajectory supervised fine-tuning, cheatsheet summarization, and Group Relative Policy Optimization (GRPO) to raise Qwen-7B performance on the Game of 24 dataset from 4% to 55% on the DC-Cu (Decompose-Compose Curriculum) split.

Background

The Problem with Small Models

While large models (70B+) can leverage extensive chain-of-thought reasoning, small models lack the capacity to discover effective reasoning strategies through prompting alone. Test-time compute scaling — spending more inference-time tokens on harder problems — helps, but small models often don't know what to compute.

Game of 24

The Game of 24 is a mathematical reasoning task: given four numbers, use basic arithmetic operations (+, −, ×, ÷) to make exactly 24. Despite its apparent simplicity, it requires combinatorial search over operator and operand orderings — making it an ideal benchmark for structured reasoning.

Example: Given [1, 5, 5, 5] → Solution: 5 × (5 − 1 / 5) = 24
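
The combinatorial search this implies can be made concrete with a short brute-force solver. This is a minimal sketch for illustration, not part of the PocketSheet pipeline; the `solve_24` helper is ours:

```python
from itertools import permutations, product

def solve_24(nums, target=24, eps=1e-6):
    """Brute-force Game of 24 solver: try every operand ordering,
    operator choice, and parenthesization (4! * 4**3 * 5 = 7,680
    candidate expressions). Returns one solving expression, or None."""
    # The five distinct parenthesizations of four operands
    shapes = [
        '(({a}{p}{b}){q}{c}){r}{d}',
        '({a}{p}({b}{q}{c})){r}{d}',
        '({a}{p}{b}){q}({c}{r}{d})',
        '{a}{p}(({b}{q}{c}){r}{d})',
        '{a}{p}({b}{q}({c}{r}{d}))',
    ]
    for a, b, c, d in permutations(nums):
        for p, q, r in product('+-*/', repeat=3):
            for shape in shapes:
                expr = shape.format(a=a, b=b, c=c, d=d, p=p, q=q, r=r)
                try:
                    if abs(eval(expr) - target) < eps:
                        return expr
                except ZeroDivisionError:
                    continue
    return None
```

Even at a few thousand candidates per instance, this exhaustive search is what a model must implicitly prune; some number sets (e.g. [1, 1, 1, 1]) have no solution at all.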

Methodology

Three-Stage Pipeline

Stage 1: Teacher-Trajectory SFT

We fine-tune Qwen-7B on successful reasoning trajectories generated by a stronger teacher model. The teacher provides step-by-step solutions demonstrating effective decomposition strategies for Game of 24 problems.

  • Full-parameter SFT on NVIDIA H100 GPUs
  • Training on curated teacher trajectories that demonstrate the decompose-compose strategy
  • Loss computed only on reasoning tokens (not input prompts)
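
The loss masking in the last bullet can be sketched as follows, assuming the common convention of an ignore index of −100 for positions excluded from cross-entropy; the `mask_prompt_tokens` helper and the token ids are illustrative, not the actual training code:

```python
IGNORE_INDEX = -100  # convention used by common cross-entropy implementations

def mask_prompt_tokens(input_ids, prompt_len):
    """Build SFT labels: copy the sequence, but mark prompt positions
    with IGNORE_INDEX so the loss is computed only on reasoning tokens."""
    return [IGNORE_INDEX if i < prompt_len else tok
            for i, tok in enumerate(input_ids)]

# Hypothetical token ids: 4 prompt tokens followed by 3 reasoning tokens
labels = mask_prompt_tokens([11, 12, 13, 14, 21, 22, 23], prompt_len=4)
# → [-100, -100, -100, -100, 21, 22, 23]
```

Masking the prompt keeps the gradient focused on imitating the teacher's reasoning rather than on reproducing the problem statement.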

Stage 2: Cheatsheet Summarization

The key innovation of PocketSheet: we train the model to generate compressed "cheatsheets" — concise memory artifacts that capture useful intermediate computations and patterns.

Given a problem, the model first generates a cheatsheet summarizing relevant strategies and partial computations, then uses this cheatsheet as additional context during reasoning. This acts as an efficient external memory that the model can consult during test-time inference.

p(y \mid x) = \sum_{c} p(y \mid x, c) \cdot p(c \mid x)

where c represents the cheatsheet and y the final answer.
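
In practice the sum over cheatsheets is approximated by a single sampled cheatsheet per problem. A two-pass inference sketch, where `generate` stands for a hypothetical prompt-to-text wrapper around the model and the prompt templates are illustrative:

```python
def cheatsheet_inference(generate, problem):
    """Two-pass inference: first draft a cheatsheet, then reason with it
    prepended as extra context. `generate` maps a prompt string to text."""
    # Pass 1: sample a cheatsheet, c ~ p(c | x)
    cheatsheet = generate(f"Summarize strategies and partial computations for: {problem}")
    # Pass 2: answer conditioned on problem and cheatsheet, y ~ p(y | x, c)
    answer = generate(f"Cheatsheet:\n{cheatsheet}\n\nProblem: {problem}\nAnswer:")
    return cheatsheet, answer
```

The cheatsheet thus functions as an external memory written and read within a single inference episode.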

Stage 3: GRPO (Group Relative Policy Optimization)

We apply GRPO to further refine the model's reasoning policy. GRPO extends PPO-style optimization by:

  1. Sampling a group of K responses for each problem
  2. Computing relative advantages within the group
  3. Optimizing the policy to increase likelihood of higher-reward responses relative to the group mean

\mathcal{L}_{\text{GRPO}} = -\mathbb{E}_{x,\, \{y_i\}_{i=1}^{K}} \left[ \sum_{i=1}^{K} \hat{A}_i \cdot \log \pi_\theta(y_i \mid x) \right]

The reward signal is binary: +1 if the generated expression evaluates to 24, −1 otherwise.
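
The group-relative advantage in step 2 can be sketched as follows; centering each reward on the group mean and dividing by the group standard deviation is a common GRPO formulation, and the exact normalization used here is our assumption:

```python
def grpo_advantages(rewards):
    """Group-relative advantages: center each reward on the group mean
    and normalize by the group standard deviation. Responses better
    than the group average get positive advantage, worse get negative."""
    k = len(rewards)
    mean = sum(rewards) / k
    var = sum((r - mean) ** 2 for r in rewards) / k
    std = var ** 0.5 or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# Binary rewards for a group of K = 8 sampled responses (+1 correct, -1 not)
adv = grpo_advantages([1, -1, -1, 1, 1, -1, -1, -1])
```

Because advantages are computed within the group, GRPO needs no learned value function: the group itself serves as the baseline.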

Infrastructure

  • Ray Train: Distributed training across multiple H100 GPUs with automatic scaling
  • Google Cloud Storage: Checkpoint persistence for fault-tolerant training
  • Full-parameter SFT: No LoRA or adapter methods — full fine-tuning for maximum capacity
  • Custom evaluation pipeline: Automated symbolic verification of Game of 24 solutions
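
A minimal sketch of what such symbolic verification might look like: check that a candidate expression uses exactly the given numbers (as a multiset) and evaluates to 24. The `verify_24` helper is illustrative, not the actual pipeline:

```python
import ast

def verify_24(expr, nums, target=24, eps=1e-6):
    """Verify a Game of 24 answer: the expression must be pure
    arithmetic, use exactly the given numbers, and evaluate to target."""
    tree = ast.parse(expr, mode='eval')
    allowed = (ast.Expression, ast.BinOp, ast.UnaryOp, ast.Constant,
               ast.Add, ast.Sub, ast.Mult, ast.Div, ast.USub)
    if not all(isinstance(node, allowed) for node in ast.walk(tree)):
        return False  # reject calls, names, anything non-arithmetic
    used = sorted(n.value for n in ast.walk(tree) if isinstance(n, ast.Constant))
    if used != sorted(nums):
        return False  # wrong numbers, or numbers reused/omitted
    try:
        value = eval(compile(tree, '<expr>', 'eval'))
    except ZeroDivisionError:
        return False
    return abs(value - target) < eps
```

Parsing with `ast` before evaluating keeps the checker robust to model outputs that are syntactically valid Python but not arithmetic.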

Experiments

Dataset

  • DC-Cu (Decompose-Compose Curriculum): A curriculum-ordered split of Game of 24 problems arranged by difficulty
  • Problems range from straightforward arithmetic to complex multi-step decompositions

Results

  Model                          DC-Cu Accuracy
  Qwen-7B (Base)                 4%
  + Teacher-Trajectory SFT       32%
  + Cheatsheet Summarization     41%
  + GRPO                         55%

Each stage of the pipeline contributed meaningful improvements:

  • SFT (+28 points): Established structured reasoning capability through imitation learning
  • Cheatsheet (+9 points): Memory augmentation supplied useful intermediate context, particularly on harder problems requiring multi-step decomposition
  • GRPO (+14 points): Reinforcement learning refined the policy by directly optimizing for correctness, moving beyond imitation toward discovering novel solution strategies

Ablation Studies

  • Cheatsheet length: Shorter cheatsheets (128 tokens) performed nearly as well as longer ones (512 tokens), suggesting the model learns to extract the most salient patterns efficiently
  • GRPO group size: Groups of K = 8 balanced compute cost with advantage estimation quality
  • Teacher model choice: Larger teachers produced better SFT trajectories, but diminishing returns beyond 70B parameters

Contributions

  • Sakthivel Sivaraman: Teacher-trajectory data generation, cheatsheet summarization methodology, GRPO reward design
  • Prabhjot Singh Rai: Cloud training infrastructure (Ray Train, GCS checkpointing), cheatsheet summarization support, evaluation pipeline, full-parameter SFT on H100, GRPO implementation, model serving

References

  1. Qwen Team. "Qwen2.5 Technical Report." 2024.
  2. Shao et al. "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models." 2024.
  3. Yao et al. "Tree of Thoughts: Deliberate Problem Solving with Large Language Models." NeurIPS 2023.
  4. Lightman et al. "Let's Verify Step by Step." ICLR 2024.