Xiaomi's MiMo-V2.5 Model Cuts Costs with Advanced Attention Computation

Xiaomi has announced price cuts for its MiMo-V2.5 series APIs, which it attributes to algorithmic efficiency gains. Luo Fuli, head of Xiaomi's large model team, described the two main techniques behind them: a hybrid attention architecture and hierarchical KV cache optimizations. According to Luo, increased token cache capacity and overlapping cache reads have produced a 99% reduction in cache-hit costs and an 80% decrease in cache costs. The MiMo-V2.5-Pro model also uses a 1:7 inter-layer sparsity ratio, so although it has 70 layers, its attention computation is equivalent to that of a traditional 10-layer model. Luo said the optimizations have halved Xiaomi's inference costs, allowing the company to lower prices without sacrificing profitability. She emphasized strategic cost management over price wars, arguing that sustainable, low-cost inference services will boost demand for intelligent applications.
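The layer arithmetic behind the "70 layers, equivalent to 10" figure can be checked with a short sketch. It assumes that "1:7" means one full-attention layer in every seven, which is the reading consistent with the numbers quoted above. The function names and the `sparse_cost_fraction` parameter are hypothetical, and the article does not say how much attention compute the remaining layers use.

```python
def full_attention_layers(total_layers: int, period: int) -> int:
    """Number of layers doing full attention if one in every `period` layers does.

    Assumption: "1:7" is read as one full-attention layer per seven layers.
    """
    if total_layers < 0 or period <= 0:
        raise ValueError("total_layers must be >= 0 and period must be > 0")
    return total_layers // period


def relative_attention_cost(total_layers: int, period: int,
                            sparse_cost_fraction: float = 0.0) -> float:
    """Attention cost relative to a model where every layer uses full attention.

    `sparse_cost_fraction` is a hypothetical knob: the cost of one sparse layer
    as a fraction of one full-attention layer. The default of 0.0 treats the
    sparse layers as negligible, which is what the "10-layer" comparison implies.
    """
    full = full_attention_layers(total_layers, period)
    sparse = total_layers - full
    return (full + sparse * sparse_cost_fraction) / total_layers


if __name__ == "__main__":
    print(full_attention_layers(70, 7))               # full-attention layer count
    print(round(relative_attention_cost(70, 7), 3))   # share of dense attention cost
```

Under these assumptions, 70 layers at a period of 7 give 10 full-attention layers, or roughly one seventh of the attention compute of a fully dense 70-layer model. Any nonzero cost for the sparse layers would raise that share.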


