Shanghai-based AI lab StepFun has outperformed larger tech competitors with its StepAudio 2.5 Realtime model, which excelled on all five major voice AI benchmarks from April 2026. The model surpassed GPT Realtime 1.5 and Gemini Live, demonstrating stronger understanding of tone, emotion, and speech rate; its paralinguistic comprehension score of 82.18 reflects its ability to detect emotional cues. Other key scores include 80.41 in human evaluation, 86.36 in general dialogue performance, and 84.80 in automotive scenario testing. StepAudio 2.5 Realtime integrates Automatic Speech Recognition (ASR), Text-to-Speech (TTS), and real-time dialogue processing into a single unified system, which reduces latency and preserves nuance. The model is trained with persona-specific Reinforcement Learning from Human Feedback (RLHF), which helps it maintain consistent character traits. It supports both Chinese and English and is accessible via StepFun's platform API.
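To illustrate why a unified design can preserve nuance, the toy sketch below contrasts a cascaded pipeline, where only a transcript passes from speech recognition to the dialogue stage, with a single model that sees the whole utterance. It is not StepFun's implementation or API: the `Utterance` type, both reply functions, and the sample values are hypothetical names invented purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    """Hypothetical stand-in for an audio clip: words plus paralinguistic cues."""
    words: str
    emotion: str          # e.g. "frustrated" or "neutral" (illustrative labels)
    speech_rate_wpm: int  # words per minute (illustrative value)


def cascaded_reply(utterance: Utterance) -> str:
    # Stage 1 (speech recognition): only the transcript crosses the boundary.
    transcript = utterance.words
    # Stage 2 (dialogue): the text-only stage never sees emotion or speech rate.
    return f"You said: {transcript}"


def unified_reply(utterance: Utterance) -> str:
    # One model receives the whole signal, so it can condition on the cues.
    prefix = "Sorry about that. " if utterance.emotion == "frustrated" else ""
    return f"{prefix}You said: {utterance.words}"


clip = Utterance(
    words="the navigation is wrong again",
    emotion="frustrated",
    speech_rate_wpm=190,
)
print(cascaded_reply(clip))
print(unified_reply(clip))
```

In this sketch the cascaded path gives the same reply to a frustrated and a neutral speaker, because the emotional cue is discarded at the transcript boundary, while the unified path can adapt its response.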