Purported technical specifications for DeepSeek V4 have surfaced, describing a model with 1.6 trillion parameters and a sophisticated architecture. Princeton PhD student Yifan Zhang disclosed the details, highlighting the model's use of DSA2, which integrates DeepSeek Sparse Attention (DSA) with Native Sparse Attention (NSA). The model reportedly features a head dimension of 512, Sparse MQA (multi-query attention), and SWA (sliding-window attention), along with a Mixture-of-Experts (MoE) layer of 384 experts, six of which are activated per token. A lightweight variant, V4-Lite, with 285 billion parameters, was also described. Training details include the Muon optimizer, a pretraining context length of 32K tokens, and a final context length of 1M tokens. The model is said to be text-only. DeepSeek has not commented on the information shared by Zhang, who is not affiliated with the company.
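To make the reported mixture-of-experts figures concrete, the sketch below is a minimal illustration, not DeepSeek code: the names `ReportedV4Config` and `route_tokens` are hypothetical, and the routing function shows generic top-k expert routing, in which each token is dispatched to its six highest-scoring experts out of 384, as the leaked figures describe.

```python
# Minimal sketch of the reported (unconfirmed) V4 figures and top-k MoE routing.
# All names here are hypothetical; only the numbers come from the leaked specs.
from dataclasses import dataclass

import numpy as np


@dataclass
class ReportedV4Config:
    """Figures as reported by Yifan Zhang; not confirmed by DeepSeek."""
    total_params: float = 1.6e12     # 1.6T parameters
    head_dim: int = 512              # attention head dimension
    num_experts: int = 384           # experts per MoE layer
    experts_per_token: int = 6       # experts activated per token
    pretrain_context: int = 32_768   # 32K pretraining context
    final_context: int = 1_000_000   # 1M final context


def route_tokens(router_logits: np.ndarray, k: int) -> tuple[np.ndarray, np.ndarray]:
    """Generic top-k expert routing: each token goes to its k highest-scoring experts.

    router_logits: (num_tokens, num_experts) scores from a learned router.
    Returns chosen expert indices and their softmax-normalized weights.
    """
    top_k = np.argsort(router_logits, axis=-1)[:, -k:]              # (tokens, k)
    top_logits = np.take_along_axis(router_logits, top_k, axis=-1)
    weights = np.exp(top_logits - top_logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return top_k, weights


cfg = ReportedV4Config()
logits = np.random.randn(4, cfg.num_experts)           # 4 example tokens
experts, weights = route_tokens(logits, cfg.experts_per_token)
print(experts.shape)  # (4, 6): each token activates 6 of the 384 experts
```

The point of the example is the sparsity ratio: only 6 of 384 expert networks run per token, so the active parameter count per forward pass is a small fraction of the 1.6T total, which is how a model of this size can remain practical to serve.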