Fine-tuned Qwen 2.5-32B on curated math problems using a novel "Wait" token technique to extend reasoning chains. Self-supervised Chain-of-Thought generation enables the model to produce its own training data iteratively, dramatically improving multi-step mathematical reasoning without massive human-annotated datasets.
Large Language Models tend to rush through mathematical problems, committing errors in multi-step reasoning that compound at each stage. When faced with a competition-level problem requiring 8-12 logical steps, even frontier models frequently skip intermediate reasoning, make arithmetic mistakes, or lose track of constraints established earlier in the solution chain.
Standard fine-tuning approaches don't address the root cause: they teach the model what to say, but not how long to think. The model never learns to slow down, revisit assumptions, or extend its internal reasoning before committing to an answer. We need a mechanism that forces the model to deliberate: to pause and think step by step more carefully before arriving at a conclusion.
Our approach introduces the "Wait" token technique: injecting special tokens into the reasoning chain that signal the model to extend its thinking, combined with a self-supervised loop where the model generates its own Chain-of-Thought training data. This creates a virtuous cycle of increasingly deliberate mathematical reasoning.
The architecture forms a closed loop: the model generates extended reasoning chains, which are filtered and used as training data for the next iteration. The "Wait" token injection forces deeper reasoning at each cycle.
1,000 curated competition-grade problems spanning algebra, geometry, number theory, combinatorics, and probability, drawn from AIME and MATH benchmark sources.
32-billion-parameter decoder-only transformer. Generates initial reasoning tokens autoregressively with multi-head attention over the full problem context.
Special tokens injected mid-generation force the model to pause and extend its reasoning chain before committing to the next logical step. Mimics human deliberation.
Longer Chain-of-Thought with 11.8 avg steps (vs 4.2 baseline). Each step explicitly references prior conclusions and checks constraints.
Model generates its own Chain-of-Thought training data. Extended reasoning chains from correct solutions become training examples for next iteration.
AIME 2024: 30 competition-level problems. Accuracy improved from 30.0% at baseline to 56.7%, an 89% relative improvement.
Serverless deployment with auto-scaling. Sub-2-second response time for real-time mathematical tutoring applications.
A six-stage pipeline that combines careful data curation with a novel self-supervised training loop. The "Wait" token technique is the key innovation that enables progressively deeper reasoning across iterations.
Assembled 1,000 mathematical problems spanning algebra, geometry, number theory, combinatorics, and probability across multiple difficulty levels from competition-grade sources.
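A minimal sketch of this curation step, assuming each problem is stored as a JSONL record with `problem`, `answer`, `domain`, and `difficulty` fields; the field names and file path are illustrative rather than the project's actual schema:

```python
import json
from collections import Counter

# Illustrative schema; the actual curation format is not specified in the writeup.
REQUIRED_FIELDS = {"problem", "answer", "domain", "difficulty"}
DOMAINS = {"algebra", "geometry", "number_theory", "combinatorics", "probability"}

def load_curated_problems(path: str) -> list:
    """Load problems from a JSONL file, keeping only complete, in-scope, deduplicated records."""
    problems, seen = [], set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            # Skip records missing fields or outside the five target domains.
            if not REQUIRED_FIELDS.issubset(record) or record["domain"] not in DOMAINS:
                continue
            key = record["problem"].strip()
            if key in seen:
                continue
            seen.add(key)
            problems.append(record)
    return problems

if __name__ == "__main__":
    problems = load_curated_problems("problems.jsonl")  # placeholder path
    print(len(problems), Counter(p["domain"] for p in problems))
```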
Injected special Wait tokens into the generation process, forcing the model to pause and extend its reasoning chain before committing to the next logical step.
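The injection mechanics aren't spelled out here, so the following is a minimal sketch of one plausible implementation using Hugging Face `transformers`: whenever the model emits its end-of-sequence token, the sketch strips it, appends the tokens for "Wait", and resumes generation, up to a fixed injection budget. The sampling settings and budget are illustrative, not the project's actual configuration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Qwen/Qwen2.5-32B-Instruct"  # base model named in the writeup

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

def generate_with_wait(problem: str, num_waits: int = 2, max_new_tokens: int = 1024) -> str:
    """Generate a solution, appending "Wait" each time the model tries to stop,
    forcing it to extend the reasoning chain before committing to an answer."""
    prompt_ids = tokenizer.apply_chat_template(
        [{"role": "user", "content": problem}],
        tokenize=True, add_generation_prompt=True, return_tensors="pt",
    ).to(model.device)
    wait_ids = tokenizer("Wait", add_special_tokens=False, return_tensors="pt").input_ids.to(model.device)

    ids = prompt_ids
    for i in range(num_waits + 1):
        ids = model.generate(
            ids, max_new_tokens=max_new_tokens, do_sample=True,
            temperature=0.7, pad_token_id=tokenizer.eos_token_id,
        )
        if i < num_waits:
            # Strip the end-of-sequence token the model just emitted (if any) and
            # append "Wait" so the next call continues reasoning instead of stopping.
            if ids[0, -1].item() == tokenizer.eos_token_id:
                ids = ids[:, :-1]
            ids = torch.cat([ids, wait_ids], dim=-1)
    return tokenizer.decode(ids[0, prompt_ids.shape[-1]:], skip_special_tokens=True)
```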
The model generates its own Chain-of-Thought reasoning as training data. Correct solutions with extended reasoning are filtered and retained for the next training round.
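A hedged sketch of the filtering loop, reusing `generate_with_wait` from the previous sketch and assuming correctness is checked by exact match against a final `\boxed{...}` expression (the project's actual verification rule is not stated):

```python
import json
import re

def extract_final_answer(solution: str):
    """Pull the last \\boxed{...} expression from a generated solution (a common
    convention for math benchmarks; the project's actual answer format is not stated)."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", solution)
    return matches[-1].strip() if matches else None

def build_next_round_dataset(problems, generate_fn, samples_per_problem=4):
    """Generate candidate chains, keep those whose final answer matches the reference,
    and retain the longest correct chain per problem as the next round's training example."""
    examples = []
    for p in problems:
        correct = [
            chain
            for chain in (generate_fn(p["problem"]) for _ in range(samples_per_problem))
            if extract_final_answer(chain) == str(p["answer"]).strip()
        ]
        if correct:
            best = max(correct, key=len)  # prefer the most extended correct chain
            examples.append({"messages": [
                {"role": "user", "content": p["problem"]},
                {"role": "assistant", "content": best},
            ]})
    return examples

# dataset = build_next_round_dataset(problems, generate_with_wait)
# with open("round_1_sft.jsonl", "w", encoding="utf-8") as f:
#     f.writelines(json.dumps(ex) + "\n" for ex in dataset)
```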
Multiple rounds of fine-tuning using self-generated reasoning data. Each iteration produces deeper, more deliberate reasoning chains with fewer logical errors.
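One way to wire the rounds together is with `trl`'s `SFTTrainer`; the sketch below is illustrative only (hyperparameters, round count, and checkpoint paths are placeholders, and full fine-tuning of a 32B model would in practice require multi-GPU training or parameter-efficient methods):

```python
from datasets import Dataset
from trl import SFTConfig, SFTTrainer

def run_sft_round(model_name_or_path: str, examples: list, output_dir: str) -> str:
    """One supervised fine-tuning round on self-generated reasoning chains
    (conversational {"messages": [...]} format)."""
    train_dataset = Dataset.from_list(examples)
    config = SFTConfig(
        output_dir=output_dir,
        num_train_epochs=2,               # illustrative hyperparameters
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        learning_rate=1e-5,
        bf16=True,
        logging_steps=10,
    )
    trainer = SFTTrainer(model=model_name_or_path, args=config, train_dataset=train_dataset)
    trainer.train()
    trainer.save_model(output_dir)
    return output_dir

# Self-supervised loop: each round trains on the previous round's filtered chains.
# model_path = "Qwen/Qwen2.5-32B-Instruct"
# for round_idx in range(3):  # number of rounds is illustrative
#     dataset = build_next_round_dataset(problems, generate_with_wait)
#     model_path = run_sft_round(model_path, dataset, f"checkpoints/round_{round_idx}")
```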
Evaluated on AIME 2024 (competition-level) and MATH 500 (broad coverage) benchmarks at each iteration to track progress and prevent overfitting.
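Evaluation can reuse the same answer-extraction helper; the sketch below computes exact-match accuracy per benchmark (the eval file paths are placeholders):

```python
def evaluate(problems, generate_fn) -> float:
    """Exact-match accuracy: fraction of problems whose extracted final answer
    equals the reference answer (reuses extract_final_answer from the earlier sketch)."""
    correct = sum(
        extract_final_answer(generate_fn(p["problem"])) == str(p["answer"]).strip()
        for p in problems
    )
    return correct / len(problems)

# aime_2024 = load_curated_problems("aime_2024.jsonl")  # placeholder eval files
# math_500 = load_curated_problems("math_500.jsonl")
# print("AIME 2024:", evaluate(aime_2024, generate_with_wait))
# print("MATH 500:", evaluate(math_500, generate_with_wait))
```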
Optimized the fine-tuned model for production inference on Google Cloud Run with vLLM serving, achieving sub-2-second response times for real-time applications.
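vLLM exposes an OpenAI-compatible API, so a client for the deployed service might look like the following sketch; the Cloud Run URL, model path, and prompt are placeholders:

```python
# Server side (one option): vllm serve ./checkpoints/round_2 --port 8000
from openai import OpenAI

# Placeholder Cloud Run URL; the api_key is unused by a default vLLM deployment.
client = OpenAI(base_url="https://<cloud-run-service-url>/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="./checkpoints/round_2",  # must match the model name/path the server was started with
    messages=[{"role": "user", "content": "<competition math problem>"}],
    temperature=0.2,
    max_tokens=2048,
)
print(response.choices[0].message.content)
```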
Detailed performance breakdowns across training iterations, mathematical domains, wait-token ablations, and model comparisons. All charts support zoom and pan for closer inspection.
Accuracy improves steadily with each self-supervised iteration
Baseline vs Fine-Tuned performance across mathematical domains
Accuracy vs. number of injected Wait tokens, alongside reasoning chain length
Our fine-tuned model outperforms frontier models on competition math
Comparing our Wait + Self-Supervised method against baseline and alternative fine-tuning strategies across key metrics.
| Method | AIME 2024 | MATH 500 | Avg Reasoning Steps | Training Data Needed |
|---|---|---|---|---|
| Base Qwen 2.5-32B | 30.0% | 42.3% | 4.2 | N/A |
| + Standard Fine-Tuning | 39.5% | 51.8% | 5.1 | 10,000+ annotated |
| + CoT Fine-Tuning | 46.2% | 58.4% | 7.3 | 5,000+ CoT pairs |
| Our Method (Wait + Self-Supervised) | 56.7% | 68.9% | 11.8 | 1,000 problems only |
Injecting Wait tokens forces the model to extend its reasoning chain before committing to an answer. This simple technique mimics human deliberation and leads to dramatically fewer logical errors in multi-step problems.
The model generates its own training data through extended reasoning, creating a virtuous cycle. Each iteration produces higher-quality Chain-of-Thought examples, eliminating the need for expensive human annotation at scale.
With only 1,000 seed problems, our method achieves results comparable to approaches requiring 10x more annotated data. The self-supervised loop amplifies a small dataset into a powerful training signal through iterative refinement.
Reasoning-heavy workflows fail when base models cannot sustain deep multi-step inference reliably.
Reasoning-focused fine-tuning pipeline with controlled chain depth, eval tracking, and serving architecture.
Stronger solution quality on complex tasks and consistent behavior under longer reasoning sequences.
I can build model tuning and eval programs for domains requiring precise, multi-step reasoning outputs.
Task-specific reasoning benchmark and fine-tuned model candidate for one priority workflow.
Ablation testing, safety checks, and inference cost controls before broad rollout.
Model selection and training strategy for teams scaling reasoning capabilities.