The National University of Singapore (NUS) has launched GameWorld, a new benchmark aimed at standardizing the evaluation of multimodal large language models (MLLMs) as general agents in video games. GameWorld encompasses 34 browser games and 170 tasks, each with verifiable metrics for objectively assessing outcomes. The initiative addresses two limitations of current evaluations: inconsistent input interfaces and reliance on manual verification.
The NUS team tested two agent interfaces: a "computer-use" agent that outputs raw keyboard and mouse commands, and a general multimodal agent that acts through semantic parsing. In a large-scale evaluation spanning 18 model-interface combinations, results indicated that current AI agents still fall short of human-level gaming ability. The study highlights challenges such as real-time interaction latency and sensitivity to contextual memory. The research paper and project code are available on Hugging Face and GitHub.
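The distinction between the two interfaces can be illustrated with a minimal sketch. The class names, fields, and keymap below are hypothetical and are not taken from the GameWorld codebase; they only show how a high-level semantic command might be lowered to the raw keyboard events a "computer-use" agent would emit directly.

```python
from dataclasses import dataclass

@dataclass
class ComputerUseAction:
    """Hypothetical low-level action: a raw keyboard/mouse command."""
    kind: str      # e.g. "key_press", "mouse_move", "mouse_click"
    payload: dict  # e.g. {"key": "ArrowLeft"} or {"x": 120, "y": 64}

@dataclass
class SemanticAction:
    """Hypothetical high-level action: a named game command."""
    command: str   # e.g. "move_left", "jump"

def to_computer_use(action: SemanticAction, keymap: dict) -> ComputerUseAction:
    """Lower a semantic command to a key press via a per-game keymap."""
    return ComputerUseAction(kind="key_press",
                             payload={"key": keymap[action.command]})

# Illustrative per-game keymap (invented for this sketch).
keymap = {"move_left": "ArrowLeft", "jump": "Space"}
low = to_computer_use(SemanticAction("jump"), keymap)
print(low.kind, low.payload["key"])  # key_press Space
```

The practical difference is where the burden falls: a semantic interface needs a per-game mapping like `keymap` but shields the model from pixel-level control, while a computer-use interface is game-agnostic but requires the model itself to produce correct low-level events.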
Disclaimer: The content provided on Phemex News is for informational purposes only. We do not guarantee the quality, accuracy, or completeness of the information sourced from third-party articles. The content on this page does not constitute financial or investment advice. We strongly encourage you to conduct your own research and consult with a qualified financial advisor before making any investment decisions.
