The PyTorch team has optimized LayerNorm and RMSNorm performance on NVIDIA H100 and B200 GPUs. Announced on April 8, the work targets near state-of-the-art performance at the kernel level and relies on torch.compile to fuse operations automatically rather than on hand-written kernels for each case. Users running workloads on these GPUs should see more efficient normalization layers as a result.
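For readers unfamiliar with the two operations, the following is a minimal pure-Python sketch of what LayerNorm and RMSNorm compute on a single vector. It is not the PyTorch kernel code; the function names and the `eps` defaults are illustrative assumptions. It is meant only to show that each operation is a reduction over the vector followed by elementwise scaling, which is the kind of pattern that benefits from being fused into one kernel.

```python
import math


def layer_norm(x, weight, bias, eps=1e-5):
    """Reference LayerNorm for one vector (illustrative, not PyTorch's kernel)."""
    n = len(x)
    mean = sum(x) / n                               # reduction 1: mean
    var = sum((v - mean) ** 2 for v in x) / n       # reduction 2: biased variance
    inv_std = 1.0 / math.sqrt(var + eps)
    # Elementwise step: center, normalize, then apply learned scale and shift.
    return [(v - mean) * inv_std * w + b for v, w, b in zip(x, weight, bias)]


def rms_norm(x, weight, eps=1e-6):
    """Reference RMSNorm for one vector (illustrative, not PyTorch's kernel)."""
    n = len(x)
    mean_sq = sum(v * v for v in x) / n             # single reduction: mean of squares
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    # Elementwise step: no mean subtraction and no bias, only a learned scale.
    return [v * inv_rms * w for v, w in zip(x, weight)]


x = [1.0, 2.0, 3.0, 4.0]
ones = [1.0] * len(x)
zeros = [0.0] * len(x)

ln_out = layer_norm(x, ones, zeros)
rms_out = rms_norm(x, ones)
```

In PyTorch itself, the corresponding modules are `torch.nn.LayerNorm` and `torch.nn.RMSNorm`, and wrapping a model or module with `torch.compile` is how the compiler gets the chance to fuse the reduction and elementwise steps.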