Google's Vision Banana Model Surpasses Specialized Visual Models

Google's research team has introduced the Vision Banana model, which outperforms specialized visual understanding models in several key areas. By applying lightweight instruction tuning to their image generation model, Nano Banana Pro, the team transformed it into a versatile visual understanding tool. The model uniformly parameterizes the outputs of visual tasks as RGB images, so tasks such as segmentation and depth estimation can be performed through image generation, without task-specific architectures.
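The article does not describe the exact parameterization, but the core idea of expressing dense prediction targets as ordinary RGB images can be illustrated with a minimal sketch. Here a metric depth map is packed into a 24-bit fixed-point code spread across the R, G, B channels, so a model that only emits images can still express per-pixel depth; the names (depth_to_rgb, rgb_to_depth, MAX_DEPTH) and the specific packing scheme are assumptions for illustration, not the paper's method.

```python
import numpy as np

MAX_DEPTH = 100.0  # assumed working range in meters (hypothetical)

def depth_to_rgb(depth: np.ndarray) -> np.ndarray:
    """Quantize depth in [0, MAX_DEPTH] to a 24-bit code split over R/G/B."""
    code = np.round(np.clip(depth / MAX_DEPTH, 0.0, 1.0) * (2**24 - 1)).astype(np.uint32)
    r = (code >> 16) & 0xFF
    g = (code >> 8) & 0xFF
    b = code & 0xFF
    return np.stack([r, g, b], axis=-1).astype(np.uint8)

def rgb_to_depth(rgb: np.ndarray) -> np.ndarray:
    """Invert the channel packing to recover metric depth."""
    r = rgb[..., 0].astype(np.uint32)
    g = rgb[..., 1].astype(np.uint32)
    b = rgb[..., 2].astype(np.uint32)
    code = (r << 16) | (g << 8) | b
    return code.astype(np.float64) / (2**24 - 1) * MAX_DEPTH

# Round-trip check on a synthetic depth map: quantization error is
# bounded by roughly MAX_DEPTH / 2**24 per pixel.
depth = np.random.uniform(0.5, 80.0, size=(4, 4))
recovered = rgb_to_depth(depth_to_rgb(depth))
assert np.allclose(depth, recovered, atol=MAX_DEPTH / 2**24 * 2)
```

The same trick works for segmentation, where each class can be assigned a fixed color in a palette; the appeal is that a single image-generation head covers every task, with task identity carried by the instruction rather than the architecture.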
In evaluations, Vision Banana excelled at semantic segmentation on the Cityscapes dataset, outperforming the SAM 3 model by 4.7 percentage points. It also surpassed SAM 3 at referring expression segmentation, although it lagged in instance segmentation. For 3D tasks, Vision Banana achieved an average accuracy of 0.929 in metric depth estimation, exceeding the Depth Anything V3 model despite being trained only on synthetic data, and it set new benchmarks in surface normal estimation. The researchers argue that image generation pretraining is crucial for developing the internal representations needed for visual understanding, much as text generation pretraining is for language models.
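The article does not define "average accuracy" for depth estimation; a common convention in depth benchmarks is the delta-1 score, the fraction of pixels whose predicted-to-ground-truth ratio stays below 1.25 in either direction. The sketch below assumes that convention; the function name delta1_accuracy is hypothetical.

```python
import numpy as np

def delta1_accuracy(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-6) -> float:
    """Fraction of valid pixels with max(pred/gt, gt/pred) < 1.25."""
    valid = gt > eps  # ignore pixels with no ground-truth depth
    ratio = np.maximum(pred[valid] / gt[valid], gt[valid] / pred[valid])
    return float(np.mean(ratio < 1.25))

# Example: a prediction within 10% of ground truth everywhere scores 1.0.
gt = np.random.uniform(1.0, 50.0, size=(8, 8))
pred = gt * 1.1
print(delta1_accuracy(pred, gt))  # -> 1.0
```

Under this reading, a score of 0.929 would mean roughly 93% of pixels land within a factor of 1.25 of the true depth.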
