DeepSeek-R1: A New Era in Deep Thinking (Including Mathematical Reasoning) Through Reinforcement Learning
Introduction
DeepSeek AI has recently released DeepSeek-R1, a breakthrough LLM that delivers exceptionally strong performance on complex mathematical and logical reasoning tasks. Using a novel multi-stage training approach and reinforcement learning, it has begun to match the capabilities of leading systems like OpenAI’s o1-1217. In this post, we’ll explore how DeepSeek-R1 was developed, the key innovations that simplify the training process, and how it compares with other notable initiatives in AI-driven mathematics.
1. DeepSeek-R1-Zero: Learning From Scratch
DeepSeek-R1 builds upon an earlier experiment called DeepSeek-R1-Zero, where the team used only reinforcement learning without any supervised data. They employed a technique called Group Relative Policy Optimization (GRPO), chosen for its efficiency over more conventional algorithms like Proximal Policy Optimization (PPO).
Despite having no human-labeled training data, DeepSeek-R1-Zero rapidly acquired solid reasoning abilities, offering compelling evidence that reinforcement learning alone can teach AI complex mathematical concepts and problem-solving methods from scratch.
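To make the GRPO idea more concrete, here is a minimal sketch (my own illustration, not DeepSeek’s code) of the part that removes the need for a learned critic: each sampled completion’s reward is standardized against the mean and standard deviation of the rewards within its own group of samples for the same prompt, and that standardized score serves as the advantage. The function name `group_relative_advantages` is hypothetical.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Compute GRPO-style advantages for one prompt.

    `rewards` holds the scalar reward of each of the G completions
    sampled for the same prompt. Each completion's advantage is its
    reward standardized against the group's mean and std, so no
    learned value network is needed as a baseline.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    baseline = rewards.mean()
    scale = rewards.std() + eps  # eps avoids division by zero
    return (rewards - baseline) / scale

# Example: 4 completions for one math problem, rewarded 1 if the
# final answer is correct and 0 otherwise.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
# -> roughly [ 1.0, -1.0, -1.0,  1.0]
```

In the full GRPO objective these advantages weight a clipped policy-gradient term, much as in PPO, but no separate value network is needed to compute them.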
2. DeepSeek-R1: A Multi-Stage Training Approach
Following the success of DeepSeek-R1-Zero, the DeepSeek team incorporated a small amount of supervised data and designed a four-stage training pipeline for DeepSeek-R1:
Supervised Fine-Tuning (SFT) of the base model:
The base model is fine-tuned using a small, high-quality dataset of long chain-of-thought examples.
Reinforcement Learning for Reasoning:
A reinforcement learning process similar to the one used for DeepSeek-R1-Zero is applied, focusing the model on reasoning-intensive tasks.
Rejection Sampling and SFT:
The model generates a large synthetic dataset via rejection sampling, which is then used for additional fine-tuning (a sketch of this stage follows the list). This stage also incorporates data from diverse domains such as writing and role-playing.
Reinforcement Learning for Helpfulness:
A final RL stage further refines the model’s helpfulness, correctness, and overall reasoning capabilities.
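As a rough illustration of the rejection-sampling stage: sample several candidate answers per prompt, keep only those that pass an answer check, and use the survivors as new SFT data. The `generate` and `is_correct` callables below are hypothetical placeholders for the model’s sampling call and a verifier; this is a sketch of the pattern, not DeepSeek’s actual pipeline.

```python
def build_rejection_sampling_dataset(prompts, generate, is_correct,
                                     samples_per_prompt=8):
    """Collect (prompt, completion) pairs for a later SFT pass.

    generate(prompt) -> str          : samples one completion (placeholder)
    is_correct(prompt, completion)   : checks the final answer (placeholder)
    """
    dataset = []
    for prompt in prompts:
        for _ in range(samples_per_prompt):
            completion = generate(prompt)
            if is_correct(prompt, completion):  # reject failed attempts
                dataset.append({"prompt": prompt, "completion": completion})
    return dataset
```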

Through these steps, DeepSeek-R1 achieves robust results not only in mathematical domains but also in coding, science, and logical reasoning.
3. Simplifying Key Training Techniques
Alongside the multi-stage framework, the DeepSeek-R1 team and other researchers (such as those behind the Kimi project) have highlighted three key simplifications that make training modern reasoning models more efficient, plus one notable emergent behavior:
Linearizing Thought Processes (Moving Away from MCTS):
Traditional AI, especially in strategy games like Go or chess, relies on Monte Carlo Tree Search (MCTS). DeepSeek-R1 instead leverages an autoregressive “Chain of Thought” approach, avoiding the complexity of exhaustive tree searches. This is akin to a student reasoning step by step instead of branching out dozens of possible moves at once.
Eliminating the Need for a Separate Value Network:
Classical RL often maintains two networks: one for policy (decisions) and another for value (estimating how good a state is). DeepSeek-R1 and similar models find that a single, stronger policy network or simpler evaluation methods can replace a separate value network, lowering resource requirements and simplifying training.
Focusing on End Results Rather Than Dense Rewards:
Instead of giving small rewards after each minor step (dense rewards), DeepSeek-R1 looks at overall outcomes: correct or incorrect, successful or not. This minimal-reward approach, inspired by AlphaZero’s focus on win/loss, helps the model learn “ways of thinking” more naturally, without burdensome step-by-step supervision (a sketch of this outcome-level reward follows the list).
Increased Thinking Time:
As training progresses, DeepSeek-R1-Zero naturally learns to allocate more “thinking time” to a problem by generating longer sequences of reasoning tokens. This is not externally imposed but an emergent behavior, meaning the model begins to explore and refine its thought processes in greater depth.
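Here is what an outcome-level reward can look like in code, assuming a simple rule-based checker: the whole reasoning trace earns a single score based only on whether the final answer matches the reference, plus a small bonus for following the expected output format, with no per-step scoring. The specific format and accuracy rules below are illustrative, not the exact rewards used in the paper.

```python
import re

def outcome_reward(response: str, reference_answer: str) -> float:
    """Score a full reasoning trace with a single outcome-level reward.

    No intermediate steps are rewarded; only the final answer and a
    simple format check matter (a simplified stand-in for the
    rule-based rewards described in the paper).
    """
    reward = 0.0
    # Format reward: reasoning should be wrapped in <think>...</think>.
    if re.search(r"<think>.*</think>", response, flags=re.DOTALL):
        reward += 0.1
    # Accuracy reward: the final boxed answer must match the reference.
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    if match and match.group(1).strip() == reference_answer.strip():
        reward += 1.0
    return reward
```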
4. Performance and Benchmarks
DeepSeek-R1’s performance on recognized benchmarks matches or exceeds that of state-of-the-art models:
Education-oriented Knowledge: MMLU, MMLU-Pro, and GPQA Diamond
Math & Coding: MATH-500, LiveCodeBench, and Codeforces
DeepSeek-R1 achieves a score of 79.8% Pass@1 on AIME 2024, slightly surpassing OpenAI-o1-1217. On MATH-500, it achieves an impressive 97.3%, performing on par with OpenAI-o1-1217 and significantly outperforming other models.
DeepSeek-R1 demonstrates expert-level coding skills, achieving a 2,029 Elo rating on Codeforces, outperforming 96.3% of human participants.
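For readers unfamiliar with the Pass@1 numbers quoted above, a common way to estimate pass@k (the probability that at least one of k sampled solutions is correct) is the unbiased estimator popularized by code-generation benchmarks; with k = 1 it reduces to the plain fraction of correct samples. This is the generic formula, not DeepSeek’s evaluation harness.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator.

    n: number of samples drawn per problem
    c: number of those samples that are correct
    k: budget of attempts being scored
    Returns the probability that at least one of k samples is correct.
    """
    if n - c < k:
        return 1.0  # not enough failures left to fill k draws
    return 1.0 - comb(n - c, k) / comb(n, k)

# With k = 1 this is just the fraction of correct samples:
print(pass_at_k(n=16, c=12, k=1))  # 0.75
```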
Compared to its predecessor, DeepSeek-V3, DeepSeek-R1 shows measurable improvements in handling more intricate reasoning tasks, underscoring the success of its combined supervised and reinforcement learning strategy.
5. Bonus: Comparisons With AlphaGeometry
Another paper I really like is AlphaGeometry, which can solve Olympiad-level mathematics problems. Even though it was published about a year before DeepSeek-R1 and its focus is much narrower (AlphaGeometry is a specialized theorem prover designed specifically for Euclidean plane geometry, while DeepSeek-R1 is a versatile, general-purpose system), its reasoning capability is still commendable.
AlphaGeometry’s training pipeline relies on synthetic data generation: it samples geometry premises, uses a symbolic engine to prove them, and then trains a language model on the resulting synthetic proofs. In short, it takes a neuro-symbolic approach, combining a neural language model with a symbolic deduction engine.
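A highly simplified sketch of that synthetic-data loop, with hypothetical placeholders standing in for AlphaGeometry’s premise sampler and symbolic deduction engine:

```python
def build_alphageometry_style_data(sample_premises, symbolic_prove,
                                   num_problems=1000):
    """Generate synthetic (premises, proof) pairs, AlphaGeometry-style.

    sample_premises() -> premises     : random geometry premises (placeholder)
    symbolic_prove(premises) -> proof : symbolic deduction engine (placeholder),
                                        returns None if no proof is found
    """
    corpus = []
    for _ in range(num_problems):
        premises = sample_premises()
        proof = symbolic_prove(premises)
        if proof is not None:
            corpus.append({"premises": premises, "proof": proof})
    return corpus  # used to train the neural language model on proofs
```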
In contrast, DeepSeek-R1’s power spans multiple domains—not just geometry—thanks to its multi-stage training process that integrates small supervised data with large-scale reinforcement learning. Both approaches provide valuable insights: AlphaGeometry highlights how specialized synthetic data can surpass human-level performance in niche domains, while DeepSeek-R1’s broader scope offers strong mathematical reasoning across a variety of tasks.
Conclusion
DeepSeek-R1 marks a notable milestone in AI-driven mathematical reasoning, blending reinforcement learning and carefully staged training to produce a model that excels in diverse mathematical and logical domains. By streamlining its training approach (abstaining from MCTS, avoiding a separate value network, and focusing on end results), DeepSeek-R1 showcases the power of simpler yet highly principled RL methods.
As research continues to expand, DeepSeek-R1 points the way toward more general AI systems that can handle broad areas of mathematics with nuance and precision. With its blend of robust performance, efficient training, and multi-domain flexibility, DeepSeek-R1 is a promising contender sure to inspire the next wave of breakthroughs in AI-based reasoning.
Want to build a model capable of step-by-step thinking yourself? One approach is to combine the ReAct framework with mechanistic interpretability techniques to develop your own customized language model for step-by-step thinking and deep reasoning.
While it may not match the performance of DeepSeek or o1, such a model could still be highly effective in targeted domains and offer greater interpretability and trustworthiness.
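If you want to experiment with that direction, the core of ReAct is a loop that interleaves free-form reasoning (“Thought”), a tool call (“Action”), and the tool’s result (“Observation”) until the model emits a final answer. The `llm` and `tools` arguments below are placeholders for whichever model and tools you plug in; this is a sketch of the pattern, not a full agent framework.

```python
def react_loop(question, llm, tools, max_steps=8):
    """Minimal ReAct-style loop: Thought -> Action -> Observation.

    llm(prompt) -> str : returns the next Thought/Action block (placeholder)
    tools              : dict mapping action names to callables (placeholder)
    """
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript)          # model writes a Thought and an Action
        transcript += step + "\n"
        if "Final Answer:" in step:     # model decided it is done
            return step.split("Final Answer:", 1)[1].strip()
        if "Action:" in step:
            name, _, arg = step.split("Action:", 1)[1].strip().partition(" ")
            result = tools[name](arg) if name in tools else "unknown tool"
            transcript += f"Observation: {result}\n"
    return None  # gave up after max_steps
```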