MetaEra has released FlashKDA, an open-source toolset designed to accelerate model inference on NVIDIA Hopper-series GPUs, such as the H100 and H20. FlashKDA, available on GitHub under the MIT license, is tailored for KDA, a novel attention mechanism introduced by Moonshot AI. This mechanism, part of the Kimi Linear model architecture, alternates between KDA and traditional attention layers to optimize computational efficiency.
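To make the hybrid design concrete, here is a minimal sketch of how a layer schedule might interleave the two mechanisms. This is illustrative only, not Moonshot AI's actual code: the function name and the strict 1:1 interleaving ratio are assumptions; the article states only that the architecture alternates between KDA and traditional attention layers.

```python
# Illustrative sketch (not the actual Kimi Linear implementation) of a
# hybrid layer schedule alternating KDA and traditional attention layers.

def build_layer_pattern(num_layers: int) -> list[str]:
    """Return a hypothetical layer schedule mixing KDA and full attention."""
    # Even-indexed layers use the linear-cost KDA mechanism; odd-indexed
    # layers fall back to traditional (quadratic) attention.
    return ["kda" if i % 2 == 0 else "full_attention" for i in range(num_layers)]

print(build_layer_pattern(4))
# -> ['kda', 'full_attention', 'kda', 'full_attention']
```

The appeal of such a mix is that most layers pay the cheaper linear-attention cost while the interleaved traditional attention layers preserve modeling quality.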
FlashKDA has been rewritten using NVIDIA's CUTLASS library to maximize performance on Hopper GPUs, achieving 1.7x to 2.2x faster forward inference compared to its previous Triton implementation. The tool is particularly effective in scenarios with variable input lengths and batched processing. However, it currently supports only the forward pass, requiring the original Triton version for training. FlashKDA requires Hopper or newer GPUs, CUDA 12.9+, and PyTorch 2.4+, and has been integrated into the flash-linear-attention repository, allowing users to switch with a simple configuration change.
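The constraints above suggest a simple dispatch rule for choosing a kernel at runtime. The sketch below is a hypothetical illustration, not the actual flash-linear-attention API: the function and backend names are invented for clarity. The grounded facts are that the CUTLASS kernel is forward-only (training still needs Triton) and requires Hopper-class GPUs with CUDA 12.9+.

```python
# Hypothetical backend-selection logic reflecting FlashKDA's stated
# constraints. Names are illustrative, not the real library's API.

def select_kda_backend(training: bool, is_hopper_gpu: bool,
                       cuda_version: tuple[int, int]) -> str:
    """Pick a KDA kernel backend under FlashKDA's published constraints."""
    if training:
        # FlashKDA implements only the forward pass, so training
        # must fall back to the original Triton kernels.
        return "triton"
    if is_hopper_gpu and cuda_version >= (12, 9):
        # CUTLASS path: 1.7x-2.2x faster forward inference on Hopper.
        return "flashkda_cutlass"
    return "triton"

print(select_kda_backend(training=False, is_hopper_gpu=True,
                         cuda_version=(12, 9)))
# -> flashkda_cutlass
```

In the actual integration, the article notes the switch is exposed as a simple configuration change rather than explicit dispatch code.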
MetaEra Open-Sources FlashKDA, Enhancing Kimi's Inference Speed by Up to 2.2x
Disclaimer: The content provided on Phemex News is for informational purposes only. We do not guarantee the quality, accuracy, or completeness of the information sourced from third-party articles. The content on this page does not constitute financial or investment advice. We strongly encourage you to conduct your own research and consult with a qualified financial advisor before making any investment decisions.
