Zhipu has introduced the GLM-5.1 High-Speed API, which reaches an output speed of 400 tokens per second and is presented as a new global benchmark for large-model APIs. The high-speed API is available to select enterprise clients and is powered by a high-performance inference engine developed in collaboration with the TileRT team. The engine compiles models into persistent Engine Kernels to optimize GPU scheduling, which significantly reduces latency.
In multi-GPU environments, TileRT improves efficiency by specializing GPU nodes within an 8-GPU NVL topology, which speeds up attention-layer computation and inter-GPU communication. Zhipu plans to further optimize FP8 inference and extend context length to support low-latency applications such as AI programming and real-time interaction.
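To put the 400 tokens per second figure in perspective, the following minimal sketch estimates how long a response of a given length would take to generate. It assumes a steady output rate and ignores time to first token, network overhead, and queuing, none of which are reported in the article; the function name `generation_seconds` is hypothetical and is not part of any Zhipu API.

```python
def generation_seconds(output_tokens: int, tokens_per_second: float = 400.0) -> float:
    """Estimate decode time for a response, assuming a constant output rate.

    The 400 tokens/s default is the figure quoted for the GLM-5.1
    High-Speed API; real-world rates may vary (assumption: steady rate).
    """
    if output_tokens < 0:
        raise ValueError("output_tokens must be non-negative")
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return output_tokens / tokens_per_second


if __name__ == "__main__":
    # Illustrative response lengths, chosen for this example only.
    for n in (500, 2000, 8000):
        print(f"{n} tokens: {generation_seconds(n):.2f} s")
```

Under these assumptions, a 2,000-token response would take about 5 seconds of generation time at the quoted rate.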
Zhipu Unveils GLM-5.1 API with Record 400 Tokens/s Output
Disclaimer: The content provided on Phemex News is for informational purposes only. We do not guarantee the quality, accuracy, or completeness of the information sourced from third-party articles. The content on this page does not constitute financial or investment advice. We strongly encourage you to conduct your own research and consult with a qualified financial advisor before making any investment decisions.
