LeVanLoi'log, ⌚ 2025-03-25
***
A Review of DeepSeek Models
Authors: Chengen Wang (University of Texas at Dallas - chengen.wang@utdallas.edu), Murat Kantarcioglu (Virginia Tech - muratk@vt.edu)
DeepSeek-V3 and DeepSeek-R1 are leading open-source Large Language Models (LLMs) for general-purpose tasks and reasoning, achieving performance comparable to state-of-the-art closed-source models from companies like OpenAI and Anthropic—while requiring only a fraction of their training costs. Understanding the key innovative techniques behind DeepSeek's success is crucial for advancing LLM research. In this paper, we review the core techniques driving the remarkable effectiveness and efficiency of these models, including refinements to the transformer architecture, innovations such as Multi-Head Latent Attention and Mixture of Experts, Multi-Token Prediction, the co-design of algorithms, frameworks, and hardware, the Group Relative Policy Optimization algorithm, and post-training with pure reinforcement learning and iterative training alternating between supervised fine-tuning and reinforcement learning. Additionally, we identify several open questions and highlight potential research opportunities in this rapidly advancing field.
This paper provides an in-depth review of the cutting-edge techniques behind DeepSeek's open-source LLMs—DeepSeek-V3 and DeepSeek-R1. These models achieve state-of-the-art performance with significantly lower resource requirements compared to proprietary counterparts. Key highlights include:
- Multi-Head Latent Attention (MLA) – Introduces efficient attention by compressing keys and values into a latent vector, dramatically reducing memory consumption for long-context tasks without sacrificing performance. MLA employs low-rank compression and decoupled Rotary Position Embeddings, outperforming standard multi-head attention (see the MLA sketch after this list).
- Advanced Mixture of Experts (MoE) – Incorporates fine-grained expert segmentation and dedicated shared experts, significantly enhancing the flexibility of expert combinations. An innovative load-balancing strategy further optimizes computational efficiency and model performance (see the MoE sketch below).
- Multi-Token Prediction (MTP) – Enhances training efficiency by predicting multiple subsequent tokens simultaneously. Although effective, the additional training overhead warrants further optimization (see the MTP loss sketch below).
- Algorithm-Hardware Co-design – Presents engineering advancements like DualPipe scheduling, an algorithm designed to eliminate pipeline bubbles, and FP8 mixed-precision training, maximizing computational efficiency and reducing training resources (see the FP8 scaling sketch below).
- Group Relative Policy Optimization (GRPO) – Offers a streamlined RL algorithm that eliminates PPO's value-function approximation, directly estimating advantages from grouped outputs and drastically reducing GPU memory usage (see the GRPO advantage sketch below).
- Post-Training Reinforcement Learning – Demonstrates pure RL's capability in DeepSeek-R1-Zero, which learns advanced reasoning without supervised fine-tuning. DeepSeek-R1 further improves this approach via iterative cold-start fine-tuning, rejection sampling, and RL alignment to enhance reasoning quality and language consistency (see the rejection-sampling sketch below).
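To make these ideas concrete, the sketches below illustrate each technique in PyTorch. All class, function, and parameter names are illustrative, not DeepSeek's actual code. First, a minimal sketch of MLA's core idea: keys and values are reconstructed on the fly from a small cached latent vector via low-rank down- and up-projections, so only the latent vector needs to live in the KV cache (the decoupled RoPE branch is omitted for brevity):

```python
import torch
import torch.nn as nn

class SimplifiedMLA(nn.Module):
    """Minimal MLA sketch: keys/values are rebuilt from a shared latent
    vector instead of being cached at full width. Decoupled Rotary
    Position Embeddings are omitted for brevity."""
    def __init__(self, d_model=512, n_heads=8, d_latent=64):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.w_q = nn.Linear(d_model, d_model)
        # Down-projection: only this d_latent-wide vector is cached per token.
        self.w_down_kv = nn.Linear(d_model, d_latent)
        # Up-projections restore full-width keys/values on the fly.
        self.w_up_k = nn.Linear(d_latent, d_model)
        self.w_up_v = nn.Linear(d_latent, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, x):  # x: (batch, seq, d_model)
        b, t, _ = x.shape
        split = lambda z: z.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        c_kv = self.w_down_kv(x)  # (b, t, d_latent) -- the entire KV cache
        q, k, v = split(self.w_q(x)), split(self.w_up_k(c_kv)), split(self.w_up_v(c_kv))
        out = nn.functional.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.w_o(out.transpose(1, 2).reshape(b, t, -1))

x = torch.randn(2, 16, 512)
print(SimplifiedMLA()(x).shape)  # torch.Size([2, 16, 512])
```

With d_latent = 64 instead of the full 512-wide keys and values, the cached state per token shrinks by roughly 16x in this toy configuration, which is the memory saving MLA targets.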
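Next, a sketch of the MoE layout with many fine-grained routed experts plus an always-on shared expert. The routing here is a plain top-k softmax with a naive per-token loop for readability; the load-balancing strategy and efficient batched dispatch are omitted:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimplifiedMoE(nn.Module):
    """MoE sketch: fine-grained routed experts plus a dedicated shared
    expert that every token passes through. Load balancing is omitted."""
    def __init__(self, d_model=256, n_experts=16, top_k=4, d_ff=128):
        super().__init__()
        make_ffn = lambda: nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.experts = nn.ModuleList(make_ffn() for _ in range(n_experts))
        self.shared = make_ffn()  # always-on shared expert
        self.router = nn.Linear(d_model, n_experts)
        self.top_k = top_k

    def forward(self, x):  # x: (n_tokens, d_model)
        scores = F.softmax(self.router(x), dim=-1)
        weights, idx = scores.topk(self.top_k, dim=-1)  # per-token expert choice
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize top-k
        routed = torch.zeros_like(x)
        for t in range(x.size(0)):  # naive loop for clarity, not efficiency
            for w, e in zip(weights[t], idx[t]):
                routed[t] += w * self.experts[int(e)](x[t])
        return self.shared(x) + routed  # shared expert is always applied

tokens = torch.randn(8, 256)
print(SimplifiedMoE()(tokens).shape)  # torch.Size([8, 256])
```

Splitting capacity across 16 small experts with 4 active per token, rather than a few large ones, is what the review means by fine-grained segmentation: the number of possible expert combinations grows sharply while the per-token compute stays fixed.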
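A sketch of an MTP-style training loss follows, assuming each extra prediction depth is a simple linear head over the trunk's hidden states (DeepSeek-V3's actual MTP modules chain additional transformer blocks; the averaging over depths is the part shown here):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def mtp_loss(hidden, heads, targets):
    """MTP loss sketch: head d predicts the token d steps ahead from the
    same hidden states; per-depth cross-entropies are averaged."""
    total = 0.0
    for d, head in enumerate(heads, start=1):
        logits = head(hidden[:, :-d])  # only positions with a target d ahead
        total = total + F.cross_entropy(
            logits.reshape(-1, logits.size(-1)), targets[:, d:].reshape(-1))
    return total / len(heads)

vocab, d_model = 1000, 64
hidden = torch.randn(2, 16, d_model)            # trunk output (batch, seq, d_model)
targets = torch.randint(0, vocab, (2, 16))      # next-token ids
heads = nn.ModuleList(nn.Linear(d_model, vocab) for _ in range(2))  # depths 1, 2
print(mtp_loss(hidden, heads, targets))
```

The extra heads densify the training signal (each position supervises several future tokens), which is the efficiency gain the bullet describes; the heads themselves are the additional training overhead it flags.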
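For the hardware side, here is a sketch of the FP8 scaling arithmetic, assuming a recent PyTorch that exposes the float8_e4m3fn dtype. DeepSeek-V3 applies finer-grained tile/block-wise scales and runs the matmuls on FP8 tensor cores; a single per-tensor scale with a simulated matmul is shown for brevity:

```python
import torch

FP8_E4M3_MAX = 448.0  # largest finite value representable in e4m3

def fp8_quantize(x):
    """Pick a per-tensor scale so values fit the e4m3 range, cast down,
    and return the scale for dequantizing the matmul result."""
    scale = x.abs().max().clamp(min=1e-12) / FP8_E4M3_MAX
    return (x / scale).to(torch.float8_e4m3fn), scale

a, b = torch.randn(64, 64), torch.randn(64, 64)
qa, sa = fp8_quantize(a)
qb, sb = fp8_quantize(b)
# Accumulate the low-precision product in higher precision, then rescale.
approx = (qa.to(torch.float32) @ qb.to(torch.float32)) * (sa * sb)
print((approx - a @ b).abs().max())  # small quantization error
```

Storing activations and weights in 8 bits roughly halves memory traffic relative to BF16, which, combined with DualPipe's overlap of computation and communication, is where the training-cost savings come from.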
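GRPO's advantage estimate itself is simple enough to show directly: each sampled output's reward is normalized against the mean and standard deviation of its own group, so no learned value function (and none of its GPU memory) is required:

```python
import torch

def grpo_advantages(rewards):
    """GRPO advantage sketch: sample a group of outputs per prompt, score
    them, and normalize each reward against its own group's statistics."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True).clamp(min=1e-8)
    return (rewards - mean) / std

# 2 prompts, a group of 4 sampled outputs each, e.g. rule-based reward scores.
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.2, 0.9, 0.4, 0.5]])
print(grpo_advantages(rewards))
```

In the full algorithm these advantages weight a PPO-style clipped policy-gradient objective with a KL penalty against a reference policy; only the critic network is gone.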
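Finally, a sketch of the rejection-sampling step used between DeepSeek-R1's training rounds: sample several candidates per prompt and keep the highest-reward one as new supervised fine-tuning data. The generate and reward callables are hypothetical stand-ins for the model's sampler and a reward model or rule-based checker:

```python
import random

def rejection_sample(prompt, generate, reward, n=16):
    """Sample n candidate responses and keep only the best-scoring one
    for the next supervised fine-tuning dataset."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=reward)

# Toy usage with placeholder sampler and scorer.
generate = lambda p: f"{p} -> answer {random.randint(0, 9)}"
reward = lambda resp: int(resp.endswith("7"))  # pretend only '7' is correct
print(rejection_sample("2+5=?", generate, reward))
```

Alternating this filtered fine-tuning with further RL rounds is how DeepSeek-R1 improves on R1-Zero's pure-RL recipe while fixing its readability and language-mixing issues.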
Paper