MetaEra has released FlashKDA, an open-source toolset designed to accelerate model inference on NVIDIA Hopper-series GPUs, such as the H100 and H20. FlashKDA, available on GitHub under the MIT license, is tailored for KDA, a novel attention mechanism introduced by Moonshot AI. This mechanism, part of the Kimi Linear model architecture, alternates between KDA and traditional attention layers to optimize computational efficiency.
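To make the hybrid design concrete, here is a minimal sketch of how a layer schedule might interleave the two mechanisms. This is illustrative only, not Moonshot AI's actual code: the function name and the strict 1:1 interleaving ratio are assumptions; the article states only that the architecture alternates between KDA and traditional attention layers.

```python
# Illustrative sketch (not the actual Kimi Linear implementation) of a
# hybrid layer schedule alternating KDA and traditional attention layers.

def build_layer_pattern(num_layers: int) -> list[str]:
    """Return a hypothetical layer schedule mixing KDA and full attention."""
    # Even-indexed layers use the linear-cost KDA mechanism; odd-indexed
    # layers fall back to traditional (quadratic) attention.
    return ["kda" if i % 2 == 0 else "full_attention" for i in range(num_layers)]

print(build_layer_pattern(4))
# -> ['kda', 'full_attention', 'kda', 'full_attention']
```

The appeal of such a mix is that most layers pay the cheaper linear-attention cost while the interleaved traditional attention layers preserve modeling quality.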
FlashKDA has been rewritten using NVIDIA's CUTLASS library to maximize performance on Hopper GPUs, achieving 1.7x to 2.2x faster forward inference compared to its previous Triton implementation. The tool is particularly effective in scenarios with variable input lengths and batched processing. However, it currently supports only the forward pass, requiring the original Triton version for training. FlashKDA requires Hopper or newer GPUs, CUDA 12.9+, and PyTorch 2.4+, and has been integrated into the flash-linear-attention repository, allowing users to switch with a simple configuration change.
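The constraints above suggest a simple dispatch rule for choosing a kernel at runtime. The sketch below is a hypothetical illustration, not the actual flash-linear-attention API: the function and backend names are invented for clarity. The grounded facts are that the CUTLASS kernel is forward-only (training still needs Triton) and requires Hopper-class GPUs with CUDA 12.9+.

```python
# Hypothetical backend-selection logic reflecting FlashKDA's stated
# constraints. Names are illustrative, not the real library's API.

def select_kda_backend(training: bool, is_hopper_gpu: bool,
                       cuda_version: tuple[int, int]) -> str:
    """Pick a KDA kernel backend under FlashKDA's published constraints."""
    if training:
        # FlashKDA implements only the forward pass, so training
        # must fall back to the original Triton kernels.
        return "triton"
    if is_hopper_gpu and cuda_version >= (12, 9):
        # CUTLASS path: 1.7x-2.2x faster forward inference on Hopper.
        return "flashkda_cutlass"
    return "triton"

print(select_kda_backend(training=False, is_hopper_gpu=True,
                         cuda_version=(12, 9)))
# -> flashkda_cutlass
```

In the actual integration, the article notes the switch is exposed as a simple configuration change rather than explicit dispatch code.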
MetaEra Open-Sources FlashKDA, Enhancing Kimi's Inference Speed by Up to 2.2x
Disclaimer: The content provided on Phemex News is for informational purposes only. We do not guarantee the quality, accuracy, or completeness of the information sourced from third-party articles. The content on this page does not constitute financial or investment advice. We strongly encourage you to conduct your own research and consult with a qualified financial advisor before making any investment decisions.
