Lun Wang, a researcher at Google DeepMind, has criticized current AI evaluation systems, calling them a major bottleneck for the industry. Wang argues that existing frameworks are outdated: they can assess what a model is capable of today but cannot predict future developments. He warns that these systems fail to detect when models learn new, unforeseen behaviors, which poses significant risks, for example when a model withholds critical information while everything it does say remains factually correct. Wang emphasizes the need for dynamic evaluation systems that evolve alongside AI models, and he suggests that AI should generate its own test questions to probe the limits of other systems.