LLMs · Fine-Tuning · Mathematical Reasoning · Self-Supervised Chain-of-Thought

Enhancing Mathematical Reasoning in LLMs via Self-Supervised Fine-Tuning

Fine-tuned Qwen 2.5-32B on curated math problems using a novel "Wait" token technique to extend reasoning chains. Self-supervised Chain-of-Thought generation enables the model to produce its own training data iteratively, dramatically improving multi-step mathematical reasoning without massive human-annotated datasets.

56.7% AIME 2024
32B Parameters
1,000 Problems
Google Cloud Run Deployed

Why LLMs Fail at Math

Large Language Models tend to rush through mathematical problems, committing errors in multi-step reasoning that compound at each stage. When faced with a competition-level problem requiring 8-12 logical steps, even frontier models frequently skip intermediate reasoning, make arithmetic mistakes, or lose track of constraints established earlier in the solution chain.

Standard fine-tuning approaches don't address the root cause: they teach the model what to say, but not how long to think. The model never learns to slow down, revisit assumptions, or extend its internal reasoning before committing to an answer. We need a mechanism that forces the model to deliberate -- to pause and think step-by-step more carefully before arriving at a conclusion.

Our approach introduces the "Wait" token technique: injecting special tokens into the reasoning chain that signal the model to extend its thinking, combined with a self-supervised loop where the model generates its own Chain-of-Thought training data. This creates a virtuous cycle of increasingly deliberate mathematical reasoning.

Self-Supervised Fine-Tuning Loop

The architecture forms a closed loop: the model generates extended reasoning chains, which are filtered and used as training data for the next iteration. The "Wait" token injection forces deeper reasoning at each cycle.

Problem → Tokenize → Reason → Extend → Self-Train → Deploy

Problem Input: competition-grade math problem
1,000 curated competition-grade problems spanning algebra, geometry, number theory, and combinatorics, drawn from AIME and MATH benchmark sources.

Tokenizer: BPE + Math

Base Model: Qwen 2.5-32B
32-billion-parameter decoder-only transformer. Generates initial reasoning tokens autoregressively with multi-head attention over the full problem context.

"Wait" Token Engine: inject, pause, and think
Special tokens injected mid-generation force the model to pause and extend its reasoning chain before committing to the next logical step, mimicking human deliberation.

Extended Reasoning: 11.8 avg steps
Longer Chain-of-Thought with 11.8 steps on average (vs. 4.2 at baseline). Each step explicitly references prior conclusions and checks constraints.

Solution: answer + reasoning trace

Self-Supervised Training Loop
CoT Generator: the model generates its own Chain-of-Thought training data; extended reasoning chains from correct solutions become training examples for the next iteration.
Filter & Curate: quality gate over self-generated chains.
Fine-Tuning: iterate until convergence (6 rounds), then feed the improved model back into the generation stage.

Evaluation
AIME 2024: 56.7%. 30 competition-level problems; accuracy improved from a 30% baseline to 56.7%, a +89% relative improvement.
MATH 500: 68.9%.

Cloud Deployment
vLLM serving on Google Cloud Run: serverless deployment with auto-scaling and sub-2-second response times for real-time mathematical tutoring applications.
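In code, the closed loop is a few lines of orchestration. The sketch below is a simplified outline rather than the production pipeline: generate_cot and fine_tune are hypothetical stand-ins for the generation and training stages detailed in the next section, and answer checking is reduced to exact string match.

```python
from typing import Callable, Dict, List, Tuple

def self_supervised_loop(
    problems: List[Dict[str, str]],
    generate_cot: Callable[[str], Tuple[str, str]],   # question -> (reasoning chain, final answer)
    fine_tune: Callable[[List[Dict[str, str]]], None],
    rounds: int = 6,
) -> None:
    """Closed loop: generate extended reasoning, keep correct chains, fine-tune, repeat."""
    for _ in range(rounds):
        kept = []
        for p in problems:
            chain, answer = generate_cot(p["question"])
            # Only chains that reach the known correct answer become training data.
            if answer.strip() == p["answer"].strip():
                kept.append({"prompt": p["question"], "completion": chain})
        fine_tune(kept)   # the next round's generations come from the updated model
```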

Training Pipeline

A six-stage pipeline that combines careful data curation with a novel self-supervised training loop. The "Wait" token technique is the key innovation that enables progressively deeper reasoning across iterations.

Dataset Curation

Assembled 1,000 mathematical problems spanning algebra, geometry, number theory, combinatorics, and probability across multiple difficulty levels from competition-grade sources.
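As a rough illustration of the curation input, a small loader like the one below could read the problem set; the JSONL layout with question, answer, and topic fields and the file name are assumptions, not the actual storage format used.

```python
import json
from collections import Counter

def load_problems(path: str) -> list[dict]:
    """Load curated problems from a JSONL file; each line is assumed to hold
    'question', 'answer', and 'topic' fields."""
    problems = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            if record.get("question") and record.get("answer"):   # drop incomplete records
                problems.append(record)
    return problems

problems = load_problems("curated_math_1000.jsonl")               # hypothetical file name
print(len(problems), Counter(p["topic"] for p in problems))       # check topic balance
```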

"Wait" Token Technique

Injected special Wait tokens into the generation process, forcing the model to pause and extend its reasoning chain before committing to the next logical step.
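The write-up does not spell out the injection mechanics, so the following is an assumed sketch using the Hugging Face transformers API: reasoning is generated in chunks and a plain-text "Wait, ..." cue is appended between chunks, so the model must keep deliberating before the final answer is requested. The prompt wording, chunk size, and number of injections are placeholders.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-32B"   # base model named above
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype="auto", device_map="auto")

def generate_with_wait(question: str, num_waits: int = 2, step_tokens: int = 400) -> str:
    """Generate reasoning in chunks; between chunks, append a 'Wait' cue so the
    model keeps deliberating before it is allowed to commit to a final answer."""
    text = f"Problem: {question}\nLet's think step by step.\n"
    for i in range(num_waits + 1):
        inputs = tok(text, return_tensors="pt").to(model.device)
        out = model.generate(**inputs, max_new_tokens=step_tokens, do_sample=False)
        text = tok.decode(out[0], skip_special_tokens=True)
        if i < num_waits:
            # The injected cue forces the model to revisit and extend its chain.
            text += "\nWait, let me re-examine the previous steps more carefully.\n"
    # Request the committed answer only after the deliberation budget is spent.
    inputs = tok(text + "\nFinal answer:", return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=32, do_sample=False)
    return tok.decode(out[0], skip_special_tokens=True)
```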

Self-Supervised CoT Generation

The model generates its own Chain-of-Thought reasoning as training data. Correct solutions with extended reasoning are filtered and retained for the next training round.
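A minimal version of that filter might look like the sketch below, reusing the "Final answer:" convention from the generation sketch above; the answer extraction and step-count heuristic are simplified assumptions rather than the exact quality gate used.

```python
import re

def extract_answer(chain: str) -> str | None:
    """Pull the stated final answer out of a reasoning chain (assumes the
    chain ends with a 'Final answer: ...' line, as in the generation sketch)."""
    matches = re.findall(r"Final answer:\s*(.+)", chain)
    return matches[-1].strip() if matches else None

def filter_chains(samples: list[dict], min_steps: int = 6) -> list[dict]:
    """Keep only chains that reach the gold answer and show enough explicit steps."""
    kept = []
    for s in samples:                                   # each: {"question", "gold", "chain"}
        answer = extract_answer(s["chain"])
        depth = s["chain"].count("\n")                  # crude proxy for reasoning depth
        if answer is not None and answer == s["gold"] and depth >= min_steps:
            kept.append({"prompt": s["question"], "completion": s["chain"]})
    return kept
```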

Iterative Fine-Tuning

Multiple rounds of fine-tuning using self-generated reasoning data. Each iteration produces deeper, more deliberate reasoning chains with fewer logical errors.
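Each round could be run as a parameter-efficient fine-tune. The sketch below assumes LoRA adapters via PEFT and the Hugging Face Trainer; PEFT is in the project stack, but the adapter config and hyperparameters are not documented, so the values here are placeholders, and cot_dataset stands for the tokenized prompt/completion pairs produced by the filter stage.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

# One fine-tuning round on the filtered, self-generated chains.
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-32B",
                                            torch_dtype="auto", device_map="auto")
lora = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM",
                  target_modules=["q_proj", "k_proj", "v_proj", "o_proj"])
model = get_peft_model(base, lora)   # only the adapter weights are trained

args = TrainingArguments(output_dir="wait-cot-round-1",
                         per_device_train_batch_size=1,
                         gradient_accumulation_steps=16,
                         num_train_epochs=2,
                         learning_rate=1e-4,
                         bf16=True,
                         logging_steps=10)
# `cot_dataset`: tokenized prompt+completion examples from the filter stage (assumed).
trainer = Trainer(model=model, args=args, train_dataset=cot_dataset)
trainer.train()
```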

Benchmark Evaluation

Evaluated on AIME 2024 (competition-level) and MATH 500 (broad coverage) benchmarks at each iteration to track progress and prevent overfitting.
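Scoring reduces to answer-level accuracy. A hedged sketch, reusing extract_answer and generate_with_wait from above: aime_2024 and math_500 stand for hypothetical lists of question/answer records, and real grading would normalize answers more carefully than exact string match.

```python
def evaluate(generate_fn, benchmark: list[dict]) -> float:
    """Benchmark accuracy: fraction of problems whose extracted final answer
    exactly matches the reference after whitespace normalization."""
    correct = 0
    for item in benchmark:                                    # each: {"question", "answer"}
        prediction = extract_answer(generate_fn(item["question"]))
        if prediction is not None and prediction.strip() == item["answer"].strip():
            correct += 1
    return correct / len(benchmark)

# e.g. aime_accuracy = evaluate(generate_with_wait, aime_2024)   # 30 problems
#      math_accuracy = evaluate(generate_with_wait, math_500)    # 500 problems
```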

Cloud Deployment

Optimized the fine-tuned model for production inference on Google Cloud Run with vLLM serving, achieving sub-2-second response times for real-time applications.
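For serving, a minimal vLLM call looks like the following; the model path and tensor_parallel_size are placeholders, and on Cloud Run the engine would normally run behind vLLM's OpenAI-compatible HTTP server rather than be invoked in-process.

```python
from vllm import LLM, SamplingParams

# Batched inference with vLLM against the fine-tuned checkpoint.
llm = LLM(model="./wait-cot-final",        # placeholder path to the merged fine-tuned model
          dtype="bfloat16",
          tensor_parallel_size=2)          # GPU count is a placeholder
params = SamplingParams(temperature=0.0, max_tokens=2048)

outputs = llm.generate(["Problem: <competition problem>\nLet's think step by step.\n"], params)
print(outputs[0].outputs[0].text)
```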

Interactive Results

Detailed performance breakdowns across training iterations, mathematical domains, wait-token ablations, and model comparisons.

AIME 2024 Accuracy Over Training Iterations (line chart)
Accuracy improves steadily with each self-supervised iteration.

MATH 500 Accuracy by Topic (bar chart)
Baseline vs. fine-tuned performance across mathematical domains.

Effect of Wait Tokens on Reasoning Depth (dual-axis chart)
Accuracy vs. number of Wait tokens, plotted alongside reasoning chain length.

Model Comparison on AIME 2024 (horizontal bar chart)
The fine-tuned model outperforms frontier models on competition math.

Approach Comparison

Comparing our Wait + Self-Supervised method against baseline and alternative fine-tuning strategies across key metrics.

Method | AIME 2024 | MATH 500 | Avg Reasoning Steps | Training Data Needed
Base Qwen 2.5-32B | 30.0% | 42.3% | 4.2 | N/A
+ Standard Fine-Tuning | 39.5% | 51.8% | 5.1 | 10,000+ annotated
+ CoT Fine-Tuning | 46.2% | 58.4% | 7.3 | 5,000+ CoT pairs
Our Method (Wait + Self-Supervised) | 56.7% | 68.9% | 11.8 | 1,000 problems only

Key Results

56.7%
AIME 2024

Competition-level mathematical reasoning benchmark accuracy

+89%
vs Baseline

Relative improvement over the base Qwen 2.5-32B model

1,000
Training Problems

Only 1,000 curated problems needed for the entire pipeline

<2s
Inference

Sub-2-second response time on Google Cloud Run with vLLM

What We Learned

"Wait" Token Innovation

Injecting Wait tokens forces the model to extend its reasoning chain before committing to an answer. This simple technique mimics human deliberation and leads to dramatically fewer logical errors in multi-step problems.

🔁

Self-Supervised Learning Loop

The model generates its own training data through extended reasoning, creating a virtuous cycle. Each iteration produces higher-quality Chain-of-Thought examples, eliminating the need for expensive human annotation at scale.

📈

Efficient Data Usage

With only 1,000 seed problems, our method achieves results comparable to approaches requiring 10x more annotated data. The self-supervised loop amplifies a small dataset into a powerful training signal through iterative refinement.

Technologies Used

Core Stack
Qwen 2.5-32B · PyTorch · Transformers · PEFT · Google Cloud Run · vLLM · Weights & Biases

Business Impact and Delivery Scope

Problem Solved

Reasoning-heavy workflows fail when base models cannot sustain deep multi-step inference reliably.

What I Deliver

Reasoning-focused fine-tuning pipeline with controlled chain depth, eval tracking, and serving architecture.

Expected Impact

Stronger solution quality on complex tasks and consistent behavior under longer reasoning sequences.

Hire Me for Reasoning-Focused LLM Tuning

I can build model tuning and eval programs for domains requiring precise, multi-step reasoning outputs.

MVP Delivery

Task-specific reasoning benchmark and fine-tuned model candidate for one priority workflow.

Production Hardening

Ablation testing, safety checks, and inference cost controls before broad rollout.

Advisory + Build

Model selection and training strategy for teams scaling reasoning capabilities.

Other Projects

Real-Time Multi-Sensor Fusion

Unified perception pipeline fusing camera, LiDAR, and radar for autonomous driving with sub-50ms latency.

Instruction-Tuned Multimodal LLM

Vision Transformer + LLM integration for conversational VQA, scene understanding, and multimodal grounding.

Knowledge-Augmented Reasoning Engine

Fine-tuned LLM with Knowledge Graph, PEFT, and RAG pipeline for factual multi-hop question answering.

Multimodal Emotion Recognition

Real-time emotion recognition combining computer vision, speech processing, and NLP for human-robot interaction.

Audio-Visual Pedestrian Awareness

Self-supervised audio-visual fusion for pedestrian detection achieving LiDAR-comparable performance on edge devices.