DeepSeek V4 has revamped its training methodology by replacing the mixed reinforcement learning phase from V3.2 with On-Policy Distillation (OPD). The new approach has two key steps. First, domain-specific expert models are trained individually on the V3.2 pipeline, each focusing on an area such as mathematics, coding, or instruction following; each expert is fine-tuned and then further trained with GRPO reinforcement learning. Second, OPD distills the capabilities of more than ten such experts into a single unified model, using reverse KL divergence to align the student's output distribution with each teacher's, so that expert capabilities can be merged without conflicting with one another.

DeepSeek V4 also introduces a Generative Reward Model (GRM) for tasks that are hard to validate with rules. In place of a traditional scalar reward model, GRM uses rubric-guided reinforcement learning data, allowing the actor network to generate outputs and evaluate them at the same time. This approach generalizes to complex tasks while requiring only a small amount of diverse human annotation.
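To make the distillation objective concrete, here is a minimal toy sketch of the reverse KL divergence used to align a student's output distribution with a teacher's. The numbers and vocabulary size are illustrative assumptions, not DeepSeek's actual configuration; a real implementation would compute this over logits for sequences sampled on-policy from the student.

```python
import math

def reverse_kl(student_probs, teacher_probs):
    """D_KL(student || teacher) = sum_i q_i * log(q_i / p_i).

    Reverse KL is mode-seeking: it heavily penalizes the student for
    placing probability mass on tokens the teacher considers unlikely,
    which encourages the student to commit to the teacher's modes.
    """
    return sum(q * math.log(q / p)
               for q, p in zip(student_probs, teacher_probs) if q > 0)

# Hypothetical next-token distributions over a 4-token vocabulary.
teacher = [0.70, 0.20, 0.05, 0.05]
sharp_student = [0.85, 0.10, 0.03, 0.02]   # concentrated on the teacher's mode
spread_student = [0.25, 0.25, 0.25, 0.25]  # mass spilled onto unlikely tokens

# The concentrated student incurs a much smaller reverse-KL penalty than
# the spread-out one, even though both differ from the teacher.
print(reverse_kl(sharp_student, teacher))
print(reverse_kl(spread_student, teacher))
```

In on-policy distillation the expectation is taken over the student's own samples rather than a fixed dataset, so this penalty is applied exactly where the student actually puts probability mass during generation.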