LLMs · Fine-Tuning · Mathematical Reasoning · Self-Supervised Chain-of-Thought

Enhancing Mathematical Reasoning in LLMs via Self-Supervised Fine-Tuning

Fine-tuned Qwen 2.5-32B on curated math problems using a novel "Wait" token technique to extend reasoning chains. Self-supervised Chain-of-Thought generation enables the model to produce its own training data iteratively, dramatically improving multi-step mathematical reasoning without massive human-annotated datasets.

56.7% AIME 2024
32B Parameters
1,000 Problems
Google Cloud Run Deployed

Why LLMs Fail at Math

Large Language Models tend to rush through mathematical problems, committing errors in multi-step reasoning that compound at each stage. When faced with a competition-level problem requiring 8-12 logical steps, even frontier models frequently skip intermediate reasoning, make arithmetic mistakes, or lose track of constraints established earlier in the solution chain.

Standard fine-tuning approaches don't address the root cause: they teach the model what to say, but not how long to think. The model never learns to slow down, revisit assumptions, or extend its internal reasoning before committing to an answer. We need a mechanism that forces the model to deliberate -- to pause and think step-by-step more carefully before arriving at a conclusion.

Our approach introduces the "Wait" token technique: injecting special tokens into the reasoning chain that signal the model to extend its thinking, combined with a self-supervised loop where the model generates its own Chain-of-Thought training data. This creates a virtuous cycle of increasingly deliberate mathematical reasoning.

Self-Supervised Fine-Tuning Loop

The architecture forms a closed loop: the model generates extended reasoning chains, which are filtered and used as training data for the next iteration. The "Wait" token injection forces deeper reasoning at each cycle.

Problem → Tokenize → Reason → Extend → Self-Train → Deploy

Problem Input: competition-grade math problem
1,000 curated competition-grade problems spanning algebra, geometry, number theory, and combinatorics, drawn from AIME and MATH benchmark sources.

Tokenizer: BPE + Math

Base Model: Qwen 2.5-32B
32-billion-parameter decoder-only transformer. Generates initial reasoning tokens autoregressively with multi-head attention over the full problem context.

"Wait" Token Engine: inject, pause, and think
Special tokens injected mid-generation force the model to pause and extend its reasoning chain before committing to the next logical step, mimicking human deliberation.

Extended Reasoning: 11.8 avg steps
Longer Chain-of-Thought with 11.8 steps on average (vs. 4.2 at baseline). Each step explicitly references prior conclusions and checks constraints.

Solution: answer + reasoning trace

Self-Supervised Training Loop
CoT Generator: the model generates its own Chain-of-Thought training data; extended reasoning chains from correct solutions become training examples for the next iteration.
Filter & Curate: quality gate over self-generated chains.
Fine-Tuning: iterate until convergence (6 rounds), then feed the improved model back into the generation stage.

Evaluation
AIME 2024: 56.7%. 30 competition-level problems; accuracy improved from a 30% baseline to 56.7%, a +89% relative improvement.
MATH 500: 68.9%.

Cloud Deployment
vLLM serving on Google Cloud Run: serverless deployment with auto-scaling and sub-2-second response times for real-time mathematical tutoring applications.
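In code, the closed loop is a few lines of orchestration. The sketch below is a simplified outline rather than the production pipeline: generate_cot and fine_tune are hypothetical stand-ins for the generation and training stages detailed in the next section, and answer checking is reduced to exact string match.

```python
from typing import Callable, Dict, List, Tuple

def self_supervised_loop(
    problems: List[Dict[str, str]],
    generate_cot: Callable[[str], Tuple[str, str]],   # question -> (reasoning chain, final answer)
    fine_tune: Callable[[List[Dict[str, str]]], None],
    rounds: int = 6,
) -> None:
    """Closed loop: generate extended reasoning, keep correct chains, fine-tune, repeat."""
    for _ in range(rounds):
        kept = []
        for p in problems:
            chain, answer = generate_cot(p["question"])
            # Only chains that reach the known correct answer become training data.
            if answer.strip() == p["answer"].strip():
                kept.append({"prompt": p["question"], "completion": chain})
        fine_tune(kept)   # the next round's generations come from the updated model
```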

Training Pipeline

A six-stage pipeline that combines careful data curation with a novel self-supervised training loop. The "Wait" token technique is the key innovation that enables progressively deeper reasoning across iterations.

Dataset Curation

Assembled 1,000 mathematical problems spanning algebra, geometry, number theory, combinatorics, and probability across multiple difficulty levels from competition-grade sources.
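As a rough illustration of the curation input, a small loader like the one below could read the problem set; the JSONL layout with question, answer, and topic fields and the file name are assumptions, not the actual storage format used.

```python
import json
from collections import Counter

def load_problems(path: str) -> list[dict]:
    """Load curated problems from a JSONL file; each line is assumed to hold
    'question', 'answer', and 'topic' fields."""
    problems = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            if record.get("question") and record.get("answer"):   # drop incomplete records
                problems.append(record)
    return problems

problems = load_problems("curated_math_1000.jsonl")               # hypothetical file name
print(len(problems), Counter(p["topic"] for p in problems))       # check topic balance
```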

"Wait" Token Technique

Injected special Wait tokens into the generation process, forcing the model to pause and extend its reasoning chain before committing to the next logical step.
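The write-up does not spell out the injection mechanics, so the following is an assumed sketch using the Hugging Face transformers API: reasoning is generated in chunks and a plain-text "Wait, ..." cue is appended between chunks, so the model must keep deliberating before the final answer is requested. The prompt wording, chunk size, and number of injections are placeholders.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-32B"   # base model named above
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype="auto", device_map="auto")

def generate_with_wait(question: str, num_waits: int = 2, step_tokens: int = 400) -> str:
    """Generate reasoning in chunks; between chunks, append a 'Wait' cue so the
    model keeps deliberating before it is allowed to commit to a final answer."""
    text = f"Problem: {question}\nLet's think step by step.\n"
    for i in range(num_waits + 1):
        inputs = tok(text, return_tensors="pt").to(model.device)
        out = model.generate(**inputs, max_new_tokens=step_tokens, do_sample=False)
        text = tok.decode(out[0], skip_special_tokens=True)
        if i < num_waits:
            # The injected cue forces the model to revisit and extend its chain.
            text += "\nWait, let me re-examine the previous steps more carefully.\n"
    # Request the committed answer only after the deliberation budget is spent.
    inputs = tok(text + "\nFinal answer:", return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=32, do_sample=False)
    return tok.decode(out[0], skip_special_tokens=True)
```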

Self-Supervised CoT Generation

The model generates its own Chain-of-Thought reasoning as training data. Correct solutions with extended reasoning are filtered and retained for the next training round.
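A minimal version of that filter might look like the sketch below, reusing the "Final answer:" convention from the generation sketch above; the answer extraction and step-count heuristic are simplified assumptions rather than the exact quality gate used.

```python
import re

def extract_answer(chain: str) -> str | None:
    """Pull the stated final answer out of a reasoning chain (assumes the
    chain ends with a 'Final answer: ...' line, as in the generation sketch)."""
    matches = re.findall(r"Final answer:\s*(.+)", chain)
    return matches[-1].strip() if matches else None

def filter_chains(samples: list[dict], min_steps: int = 6) -> list[dict]:
    """Keep only chains that reach the gold answer and show enough explicit steps."""
    kept = []
    for s in samples:                                   # each: {"question", "gold", "chain"}
        answer = extract_answer(s["chain"])
        depth = s["chain"].count("\n")                  # crude proxy for reasoning depth
        if answer is not None and answer == s["gold"] and depth >= min_steps:
            kept.append({"prompt": s["question"], "completion": s["chain"]})
    return kept
```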

Iterative Fine-Tuning

Multiple rounds of fine-tuning using self-generated reasoning data. Each iteration produces deeper, more deliberate reasoning chains with fewer logical errors.
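Each round could be run as a parameter-efficient fine-tune. The sketch below assumes LoRA adapters via PEFT and the Hugging Face Trainer; PEFT is in the project stack, but the adapter config and hyperparameters are not documented, so the values here are placeholders, and cot_dataset stands for the tokenized prompt/completion pairs produced by the filter stage.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

# One fine-tuning round on the filtered, self-generated chains.
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-32B",
                                            torch_dtype="auto", device_map="auto")
lora = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM",
                  target_modules=["q_proj", "k_proj", "v_proj", "o_proj"])
model = get_peft_model(base, lora)   # only the adapter weights are trained

args = TrainingArguments(output_dir="wait-cot-round-1",
                         per_device_train_batch_size=1,
                         gradient_accumulation_steps=16,
                         num_train_epochs=2,
                         learning_rate=1e-4,
                         bf16=True,
                         logging_steps=10)
# `cot_dataset`: tokenized prompt+completion examples from the filter stage (assumed).
trainer = Trainer(model=model, args=args, train_dataset=cot_dataset)
trainer.train()
```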

Benchmark Evaluation

Evaluated on AIME 2024 (competition-level) and MATH 500 (broad coverage) benchmarks at each iteration to track progress and prevent overfitting.
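Scoring reduces to answer-level accuracy. A hedged sketch, reusing extract_answer and generate_with_wait from above: aime_2024 and math_500 stand for hypothetical lists of question/answer records, and real grading would normalize answers more carefully than exact string match.

```python
def evaluate(generate_fn, benchmark: list[dict]) -> float:
    """Benchmark accuracy: fraction of problems whose extracted final answer
    exactly matches the reference after whitespace normalization."""
    correct = 0
    for item in benchmark:                                    # each: {"question", "answer"}
        prediction = extract_answer(generate_fn(item["question"]))
        if prediction is not None and prediction.strip() == item["answer"].strip():
            correct += 1
    return correct / len(benchmark)

# e.g. aime_accuracy = evaluate(generate_with_wait, aime_2024)   # 30 problems
#      math_accuracy = evaluate(generate_with_wait, math_500)    # 500 problems
```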

Cloud Deployment

Optimized the fine-tuned model for production inference on Google Cloud Run with vLLM serving, achieving sub-2-second response times for real-time applications.
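For serving, a minimal vLLM call looks like the following; the model path and tensor_parallel_size are placeholders, and on Cloud Run the engine would normally run behind vLLM's OpenAI-compatible HTTP server rather than be invoked in-process.

```python
from vllm import LLM, SamplingParams

# Batched inference with vLLM against the fine-tuned checkpoint.
llm = LLM(model="./wait-cot-final",        # placeholder path to the merged fine-tuned model
          dtype="bfloat16",
          tensor_parallel_size=2)          # GPU count is a placeholder
params = SamplingParams(temperature=0.0, max_tokens=2048)

outputs = llm.generate(["Problem: <competition problem>\nLet's think step by step.\n"], params)
print(outputs[0].outputs[0].text)
```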

Interactive Results

Detailed performance breakdowns across training iterations, mathematical domains, wait-token ablations, and model comparisons.

AIME 2024 Accuracy Over Training Iterations (line chart)
Accuracy improves steadily with each self-supervised iteration.

MATH 500 Accuracy by Topic (bar chart)
Baseline vs. fine-tuned performance across mathematical domains.

Effect of Wait Tokens on Reasoning Depth (dual-axis chart)
Accuracy vs. number of Wait tokens, plotted alongside reasoning chain length.

Model Comparison on AIME 2024 (horizontal bar chart)
The fine-tuned model outperforms frontier models on competition math.

Approach Comparison

Comparing our Wait + Self-Supervised method against baseline and alternative fine-tuning strategies across key metrics.

Method | AIME 2024 | MATH 500 | Avg Reasoning Steps | Training Data Needed
Base Qwen 2.5-32B | 30.0% | 42.3% | 4.2 | N/A
+ Standard Fine-Tuning | 39.5% | 51.8% | 5.1 | 10,000+ annotated
+ CoT Fine-Tuning | 46.2% | 58.4% | 7.3 | 5,000+ CoT pairs
Our Method (Wait + Self-Supervised) | 56.7% | 68.9% | 11.8 | 1,000 problems only

Key Results

56.7%
AIME 2024

Competition-level mathematical reasoning benchmark accuracy

+89%
vs Baseline

Relative improvement over the base Qwen 2.5-32B model

1,000
Training Problems

Only 1,000 curated problems needed for the entire pipeline

<2s
Inference

Sub-2-second response time on Google Cloud Run with vLLM

What We Learned

"Wait" Token Innovation

Injecting Wait tokens forces the model to extend its reasoning chain before committing to an answer. This simple technique mimics human deliberation and leads to dramatically fewer logical errors in multi-step problems.

🔁

Self-Supervised Learning Loop

The model generates its own training data through extended reasoning, creating a virtuous cycle. Each iteration produces higher-quality Chain-of-Thought examples, eliminating the need for expensive human annotation at scale.

📈

Efficient Data Usage

With only 1,000 seed problems, our method achieves results comparable to approaches requiring 10x more annotated data. The self-supervised loop amplifies a small dataset into a powerful training signal through iterative refinement.

Technologies Used

Core Stack
Qwen 2.5-32B · PyTorch · Transformers · PEFT · Google Cloud Run · vLLM · Weights & Biases

Business Impact and Delivery Scope

Problem Solved

Reasoning-heavy workflows fail when base models cannot sustain deep multi-step inference reliably.

What I Deliver

Reasoning-focused fine-tuning pipeline with controlled chain depth, eval tracking, and serving architecture.

Expected Impact

Stronger solution quality on complex tasks and consistent behavior under longer reasoning sequences.

Hire Me for Reasoning-Focused LLM Tuning

I can build model tuning and eval programs for domains requiring precise, multi-step reasoning outputs.

MVP Delivery

Task-specific reasoning benchmark and fine-tuned model candidate for one priority workflow.

Production Hardening

Ablation testing, safety checks, and inference cost controls before broad rollout.

Advisory + Build

Model selection and training strategy for teams scaling reasoning capabilities.

Other Projects

Real-Time Multi-Sensor Fusion

Unified perception pipeline fusing camera, LiDAR, and radar for autonomous driving with sub-50ms latency.

Instruction-Tuned Multimodal LLM

Vision Transformer + LLM integration for conversational VQA, scene understanding, and multimodal grounding.

Knowledge-Augmented Reasoning Engine

Fine-tuned LLM with Knowledge Graph, PEFT, and RAG pipeline for factual multi-hop question answering.

Multimodal Emotion Recognition

Real-time emotion recognition combining computer vision, speech processing, and NLP for human-robot interaction.

Audio-Visual Pedestrian Awareness

Self-supervised audio-visual fusion for pedestrian detection achieving LiDAR-comparable performance on edge devices.