A new benchmark, CUSP, developed by Stanford University, the University of Oxford, and the Allen Institute for Artificial Intelligence, reveals significant limitations in AI models' ability to predict scientific progress. The evaluation tested leading AI models, including GPT-5.4, Claude Sonnet 4.5, and DeepSeek R1, and found that while these models excel at mechanistic reasoning, their accuracy in predicting new scientific discoveries is close to random guessing.
The CUSP benchmark, which includes 4,760 scientific milestones and 17,429 evaluation tasks, applies temporal knowledge cutoff constraints to assess true predictive capabilities. Results show that models such as GPT-5.4 and Claude Sonnet 4.5 consistently overestimate breakthrough timelines, with delays ranging from 14 to 26 months. Although the models achieve high accuracy in identifying plausible research directions, they struggle with feasibility assessments, reaching only 45% to 52% accuracy. This highlights a significant gap in AI's ability to provide reliable guidance in scientific exploration.
CUSP Benchmark Exposes AI Models' Limitations in Scientific Forecasting
Disclaimer: The content provided on Phemex News is for informational purposes only. We do not guarantee the quality, accuracy, or completeness of the information sourced from third-party articles. The content on this page does not constitute financial or investment advice. We strongly encourage you to conduct your own research and consult with a qualified financial advisor before making any investment decisions.
