A recent paper from AMD and Pennsylvania State University attributes the instability of FP4 training to structural scaling errors rather than insufficient randomness. The team pre-trained the Llama 3.1-8B model on AMD's Instinct MI355X GPU in the MXFP4 format, achieving a 9–10% speedup over FP8 at the cost of an 8–9% increase in token overhead. It is reported as the first complete large-model pre-training run on native FP4 hardware.
The research traces the instability to structural errors that accumulate along sensitive gradient paths, particularly in weight-gradient computations. Earlier methods that injected randomness failed to stabilize training, whereas a deterministic Hadamard rotation reduced token overhead and kept convergence quality close to FP8. The result suggests that FP4 is viable for training, potentially doubling the effective training compute available on existing hardware.
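To make the idea of a Hadamard rotation before block-scaled FP4 quantization concrete, here is a minimal NumPy sketch. It is a toy illustration, not the paper's implementation: the article does not say which tensors are rotated or at what granularity, so rotating each 32-element block is an assumption. The helper names (`quantize_mxfp4`, `hadamard`, `crest_factor`) are hypothetical, and the quantizer only simulates MXFP4 in float64 (32 elements sharing one power-of-two scale, E2M1 element values), with simplified tie handling.

```python
import numpy as np

# Magnitudes representable by an FP4 E2M1 element (sign handled separately).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
BLOCK = 32  # assumed MX block size: 32 elements share one power-of-two scale


def quantize_mxfp4(x):
    """Simulate MXFP4 quantize/dequantize on an array whose size divides by BLOCK."""
    blocks = np.asarray(x, dtype=np.float64).reshape(-1, BLOCK)
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    # frexp gives amax = m * 2**e with m in [0.5, 1), so floor(log2(amax)) = e - 1.
    # Subtracting 2 more (the largest E2M1 exponent) gives the shared scale 2**(e - 3).
    _, e = np.frexp(amax)
    scale = np.ldexp(1.0, e - 3)
    scaled = np.clip(blocks / scale, -6.0, 6.0)
    # Round each magnitude to the nearest grid point (ties go to the smaller one,
    # a simplification compared with hardware rounding modes).
    idx = np.abs(np.abs(scaled)[..., None] - FP4_GRID).argmin(axis=-1)
    return (np.sign(scaled) * FP4_GRID[idx] * scale).reshape(np.shape(x))


def hadamard(n):
    """Orthonormal Hadamard matrix (Sylvester construction); n must be a power of two."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = np.ones((1, 1))
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)


def crest_factor(v):
    """Peak-to-RMS ratio: large when a few outliers dominate the block."""
    return np.abs(v).max() / np.sqrt(np.mean(np.square(v)))


H = hadamard(BLOCK)

# Hand-picked block: one large outlier among small values.
x = np.full(BLOCK, 0.3)
x[0] = 16.0

x_plain = quantize_mxfp4(x)           # quantize directly
x_rot = quantize_mxfp4(x @ H) @ H.T   # rotate, quantize, rotate back

err_plain = np.sum((x_plain - x) ** 2)
err_rot = np.sum((x_rot - x) ** 2)

print("crest factor before / after rotation:", crest_factor(x), crest_factor(x @ H))
print("squared error plain / rotated:", err_plain, err_rot)
```

In this hand-picked block the outlier alone sets the shared scale, so every small value rounds to zero. The rotation is deterministic and orthogonal, so it can be undone exactly, and it spreads the outlier's energy across the block. The lower error here is not guaranteed for every input: a block containing a single spike and exact zeros, for instance, quantizes without error unrotated but not after rotation.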
AMD Paper Identifies Structural Errors as Cause of FP4 Training Instability
