Xiaomi has announced significant cost reductions for its MiMo-V2.5 series APIs, achieved through algorithmic optimizations. Luo Fuli, head of Xiaomi's large model team, detailed the changes, highlighting a hybrid attention architecture and hierarchical KV cache optimizations. According to Luo, these techniques have led to a 99% reduction in cache hit costs and an 80% decrease in cache costs, thanks to increased token cache capacity and overlapping cache reads. The MiMo-V2.5-Pro model's efficiency is further enhanced by a 1:7 inter-layer sparsity ratio: although the model has 70 layers, its attention computation is equivalent to that of a traditional 10-layer model. Together, the optimizations have halved Xiaomi's inference costs, enabling a price reduction without sacrificing profitability. Luo emphasized the importance of strategic cost management over price wars, advocating sustainable, low-cost inference services as a way to boost demand for intelligent applications.
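
The layer arithmetic behind the "10-layer equivalent" figure can be checked with a short sketch. It assumes the 1:7 ratio means that one layer in every seven runs full attention, an interpretation chosen because it reproduces the reported number, and it ignores whatever the remaining sparse layers cost. The function names are hypothetical and are not part of any Xiaomi API.

```python
from fractions import Fraction


def full_attention_layers(total_layers: int, period: int) -> int:
    """Count full-attention layers, assuming one layer in every `period` uses it."""
    if total_layers <= 0 or period <= 0:
        raise ValueError("total_layers and period must be positive")
    return total_layers // period


def attention_share(total_layers: int, period: int) -> Fraction:
    """Fraction of layers that run full attention under the same assumption."""
    return Fraction(full_attention_layers(total_layers, period), total_layers)


# Figures quoted in the article: 70 layers, 1:7 inter-layer sparsity ratio.
TOTAL_LAYERS = 70
PERIOD = 7  # assumed reading: 1 full-attention layer per 7 layers

print(full_attention_layers(TOTAL_LAYERS, PERIOD))  # 10
print(attention_share(TOTAL_LAYERS, PERIOD))        # 1/7
```

Under this reading, only a seventh of the layers pay the full attention cost, which is consistent with the 10-layer comparison in the article.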