Microsoft has open-sourced its Phi-Ground model family, designed to improve an AI agent's ability to pinpoint exact click locations on a screen. Combined with a larger model for instruction planning, the 4-billion-parameter model beat OpenAI's Operator and Claude Computer Use in click accuracy on the Showdown benchmark, and it ranked first among models under 10 billion parameters across five evaluations, including ScreenSpot-Pro.

The team validated its training recipe on more than 40 million data points and found that conventional training techniques failed at this scale. What worked instead was outputting coordinates as ordinary numbers in plain text and placing the textual instruction before the image in the input sequence. Reinforcement learning, more commonly applied to language tasks, also proved effective for this visual task, improving accuracy through contrastive training.

To handle high-resolution screens, the team scaled screenshots down and pasted them onto a large white canvas during training, which improved performance on complex software such as Photoshop.
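The article does not publish Phi-Ground's preprocessing code, but the described idea of downscaling a high-resolution screenshot onto a fixed-size white canvas can be sketched roughly as follows. This is a minimal illustration using Pillow; the canvas size, the top-left placement, and the helper names (`pad_to_canvas`, `to_original_coords`) are assumptions for the example, not details from the paper:

```python
from PIL import Image

def pad_to_canvas(img, canvas_size=(1024, 1024)):
    """Downscale img to fit inside canvas_size (preserving aspect ratio),
    then paste it top-left onto a white canvas of that fixed size.
    Returns the canvas and the scale factor for mapping clicks back.
    NOTE: canvas size and placement are illustrative assumptions."""
    cw, ch = canvas_size
    scale = min(cw / img.width, ch / img.height, 1.0)  # never upscale
    resized = img.resize((round(img.width * scale), round(img.height * scale)),
                         Image.LANCZOS)
    canvas = Image.new("RGB", canvas_size, "white")
    canvas.paste(resized, (0, 0))
    return canvas, scale

def to_original_coords(x, y, scale):
    """Map a click predicted on the canvas back to original screen pixels."""
    return x / scale, y / scale
```

For example, a 2560x1440 screenshot would be scaled by 0.4 to 1024x576 on the canvas, and a predicted click at (512, 288) would map back to (1280, 720) on the real screen. Keeping the scale factor around is the key point: the model sees a uniform input size, while the agent still clicks at full-resolution coordinates.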