Microsoft has open-sourced its Phi-Ground model family, designed to improve AI's ability to determine precise click locations on a screen. The 4-billion-parameter model, when combined with a large model for instruction planning, outperformed OpenAI's Operator and Claude Computer Use in click accuracy on the Showdown benchmark. The model also ranked first among models under 10 billion parameters across five evaluations, including ScreenSpot-Pro.
The development team validated the model using more than 40 million data points and found that traditional training techniques failed at scale. Two simpler choices worked instead: outputting coordinates as ordinary numbers, and placing the textual instruction before the image in the input sequence. Reinforcement learning, typically used for language tasks, also proved effective for this visual task, improving accuracy through contrastive training. To handle high-resolution screens, the team scaled down screenshots and used a large white canvas during training, which improved performance on complex software such as Photoshop.
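As a rough illustration of the input and output conventions described above, the following Python sketch shows an instruction placed before an image placeholder, a click point read back as ordinary numbers, and a coordinate mapped from a downscaled canvas to the original screen. It is not Microsoft's implementation: the `<image>` placeholder, the `(x, y)` output format, the top-left anchoring, and every function name here are assumptions made for the example.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    """Where a downscaled screenshot sits on the canvas (hypothetical helper)."""
    scale: float   # multiply screenshot pixels by this to get canvas pixels
    offset_x: int  # left edge of the screenshot on the canvas
    offset_y: int  # top edge of the screenshot on the canvas


def place_on_canvas(shot_w, shot_h, canvas_w, canvas_h):
    """Shrink a screenshot so it fits the canvas; anchor it at the top-left.

    Assumption: the rest of the canvas stays white and the image is never upscaled.
    """
    scale = min(canvas_w / shot_w, canvas_h / shot_h, 1.0)
    return Placement(scale=scale, offset_x=0, offset_y=0)


def build_prompt(instruction, image_placeholder="<image>"):
    """Put the textual instruction before the image in the input sequence."""
    return f"{instruction}\n{image_placeholder}"


# Coordinates are plain decimal numbers in the text, not special tokens.
_POINT = re.compile(r"(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)")


def parse_click(model_output):
    """Read the first 'x, y' pair of ordinary numbers from the model's text."""
    match = _POINT.search(model_output)
    if match is None:
        raise ValueError(f"no coordinate pair found in: {model_output!r}")
    return float(match.group(1)), float(match.group(2))


def canvas_to_screen(x, y, placement):
    """Map a predicted canvas point back to original screen pixels."""
    return (
        (x - placement.offset_x) / placement.scale,
        (y - placement.offset_y) / placement.scale,
    )


if __name__ == "__main__":
    placement = place_on_canvas(3840, 2160, 1920, 1080)
    print(build_prompt("Click the Save button"))
    x, y = parse_click("click at (960, 540)")
    print(canvas_to_screen(x, y, placement))
```

Keeping the placement alongside each prediction means a click on the shrunken image can be translated back to the pixel the user would actually press on a high-resolution display.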
Microsoft Open-Sources Phi-Ground Model, Surpassing Competitors in Click Accuracy
