DeepSeek V4 Unveiled: 1.6 Trillion Parameters and Advanced Architecture

Technical specifications attributed to DeepSeek V4 have surfaced, describing a model with 1.6 trillion parameters and a sophisticated architecture. Princeton PhD student Yifan Zhang disclosed the details, highlighting the model's reported use of DSA2, an attention scheme said to integrate DeepSeek Sparse Attention (DSA) with the newer Native Sparse Attention (NSA). The model is described as having a head dimension of 512, sparse multi-query attention (MQA), and sliding-window attention (SWA), along with a Mixture-of-Experts (MoE) layer comprising 384 experts, six of which are activated per token.
A lightweight variant, V4-Lite, with 285 billion parameters, was also mentioned. Reported training details include the Muon optimizer, a pretraining context length of 32K tokens, and a final context length of 1M tokens. The model is described as text-only. DeepSeek has not commented on the information shared by Zhang, who is not affiliated with the company.
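The reported expert count (six of 384 experts activated per token) matches the standard top-k MoE gating pattern: a router scores every expert, keeps the k highest-scoring ones, and normalizes their gate weights. A minimal sketch of that pattern, assuming a softmax over the selected experts' scores; this is a generic illustration, not DeepSeek's published router:

```python
import math
import random

def moe_route(logits, top_k=6):
    """Top-k MoE routing sketch: pick the top_k highest-scoring experts
    and softmax-normalize their gate weights.
    Generic illustration only -- DeepSeek's actual gating is not public.
    """
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    top_idx = ranked[:top_k]                      # the k experts this token uses
    top = [logits[i] for i in top_idx]
    m = max(top)                                  # subtract max for stability
    exps = [math.exp(x - m) for x in top]
    total = sum(exps)
    weights = [e / total for e in exps]           # sum to 1 over chosen experts
    return top_idx, weights

random.seed(0)
logits = [random.gauss(0, 1) for _ in range(384)]  # router scores, one per expert
idx, w = moe_route(logits)
# The token's output would be the weighted sum of the 6 chosen experts' outputs.
```

Only the six selected experts run for a given token, which is how a 1.6T-parameter model keeps its per-token compute far below its total parameter count.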
Disclaimer: The content provided on Phemex News is for informational purposes only. We do not guarantee the quality, accuracy, or completeness of the information sourced from third-party articles. The content on this page does not constitute financial or investment advice. We strongly encourage you to conduct your own research and consult with a qualified financial advisor before making any investment decisions.
