CODA Optimizes Transformer Training with GEMM-Epilogue Programming

Researchers from MIT, Princeton, Together AI, and Meta have introduced CODA, a programming abstraction for speeding up Transformer training. The study, titled "CODA: Rewriting Transformer Blocks as GEMM-Epilogue Programs," targets the memory-intensive operations that take up a large share of training time. Its central idea is GEMM-epilogue programming: additional computation runs in the brief window when the results of a matrix multiplication (GEMM) are still in on-chip registers, so intermediate values do not have to be written out to memory and read back. CODA exposes five composable primitive operations at the epilogue, which together cover nearly every operation in a Transformer's forward and backward passes except attention. The study reports speedups of up to 1.8x in backpropagation and 5% to 20% for a full Transformer layer. The result points to the potential for AI models to optimize their own training infrastructure through well-designed programming abstractions.
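To make the epilogue idea concrete, here is a minimal pure-Python sketch of a tiled matrix multiplication that applies an extra step to each output tile before the tile is stored. It is an illustration only, not CODA's API: the names `gemm_with_epilogue`, `bias_relu`, and `unfused_reference` are hypothetical, the bias-plus-ReLU epilogue is an assumed example rather than one of CODA's five primitives, and the `acc` dictionary merely stands in for on-chip registers.

```python
def gemm_with_epilogue(A, B, epilogue, tile=2):
    """Compute epilogue(A @ B) tile by tile (hypothetical helper, not CODA's API)."""
    m, k, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]  # stands in for the output in global memory
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            rows = range(i0, min(i0 + tile, m))
            cols = range(j0, min(j0 + tile, n))
            # Accumulator for one output tile; plays the role of on-chip registers.
            acc = {(i, j): 0.0 for i in rows for j in cols}
            for p in range(k):
                for i in rows:
                    a = A[i][p]
                    for j in cols:
                        acc[(i, j)] += a * B[p][j]
            # Epilogue: transform each value while it is still "on chip",
            # so the raw product is never stored and re-read.
            for (i, j), value in acc.items():
                C[i][j] = epilogue(value, i, j)
    return C


def bias_relu(bias):
    """Build an example epilogue: add a per-column bias, then apply ReLU."""
    def epilogue(value, row, col):
        return max(value + bias[col], 0.0)
    return epilogue


def unfused_reference(A, B, bias):
    """Same math in two passes: store the full product, then post-process it."""
    m, k, n = len(A), len(B), len(B[0])
    product = [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(n)]
               for i in range(m)]
    return [[max(product[i][j] + bias[j], 0.0) for j in range(n)]
            for i in range(m)]


if __name__ == "__main__":
    A = [[1.0, -2.0], [3.0, 4.0], [0.0, 1.0]]
    B = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
    bias = [0.0, 1.0, -10.0]
    print(gemm_with_epilogue(A, B, bias_relu(bias)))
```

Both functions produce the same numbers; the difference is that the unfused version materializes the full product before the second pass touches it. On a GPU, that intermediate write and read is the memory traffic the article describes the epilogue as avoiding.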
