Google's Vision Banana Model Surpasses Specialized Visual Models

Google's research team has introduced the Vision Banana model, which outperforms specialized visual understanding models in several key areas. By applying lightweight instruction tuning to their image generation model, Nano Banana Pro, the team transformed it into a versatile visual understanding tool. The model uniformly parameterizes the outputs of visual tasks as RGB images, so tasks such as segmentation and depth estimation can be performed through image generation, without task-specific architectures.
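The article does not describe the exact parameterization, but the core idea of expressing dense prediction targets as ordinary RGB images can be illustrated with a minimal sketch. Here a metric depth map is packed into a 24-bit fixed-point code spread across the R, G, B channels, so a model that only emits images can still express per-pixel depth; the names (depth_to_rgb, rgb_to_depth, MAX_DEPTH) and the specific packing scheme are assumptions for illustration, not the paper's method.

```python
import numpy as np

MAX_DEPTH = 100.0  # assumed working range in meters (hypothetical)

def depth_to_rgb(depth: np.ndarray) -> np.ndarray:
    """Quantize depth in [0, MAX_DEPTH] to a 24-bit code split over R/G/B."""
    code = np.round(np.clip(depth / MAX_DEPTH, 0.0, 1.0) * (2**24 - 1)).astype(np.uint32)
    r = (code >> 16) & 0xFF
    g = (code >> 8) & 0xFF
    b = code & 0xFF
    return np.stack([r, g, b], axis=-1).astype(np.uint8)

def rgb_to_depth(rgb: np.ndarray) -> np.ndarray:
    """Invert the channel packing to recover metric depth."""
    r = rgb[..., 0].astype(np.uint32)
    g = rgb[..., 1].astype(np.uint32)
    b = rgb[..., 2].astype(np.uint32)
    code = (r << 16) | (g << 8) | b
    return code.astype(np.float64) / (2**24 - 1) * MAX_DEPTH

# Round-trip check on a synthetic depth map: quantization error is
# bounded by roughly MAX_DEPTH / 2**24 per pixel.
depth = np.random.uniform(0.5, 80.0, size=(4, 4))
recovered = rgb_to_depth(depth_to_rgb(depth))
assert np.allclose(depth, recovered, atol=MAX_DEPTH / 2**24 * 2)
```

The same trick works for segmentation, where each class can be assigned a fixed color in a palette; the appeal is that a single image-generation head covers every task, with task identity carried by the instruction rather than the architecture.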
In evaluations, Vision Banana excelled at semantic segmentation on the Cityscapes dataset, outperforming the SAM 3 model by 4.7 percentage points. It also surpassed SAM 3 at referring expression segmentation, although it lagged in instance segmentation. For 3D tasks, Vision Banana achieved an average accuracy of 0.929 in metric depth estimation, exceeding the Depth Anything V3 model despite being trained only on synthetic data, and it set new benchmarks in surface normal estimation. The researchers argue that image generation pretraining is crucial for developing the internal representations needed for visual understanding, much as text generation pretraining is for language models.
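The article does not define "average accuracy" for depth estimation; a common convention in depth benchmarks is the delta-1 score, the fraction of pixels whose predicted-to-ground-truth ratio stays below 1.25 in either direction. The sketch below assumes that convention; the function name delta1_accuracy is hypothetical.

```python
import numpy as np

def delta1_accuracy(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-6) -> float:
    """Fraction of valid pixels with max(pred/gt, gt/pred) < 1.25."""
    valid = gt > eps  # ignore pixels with no ground-truth depth
    ratio = np.maximum(pred[valid] / gt[valid], gt[valid] / pred[valid])
    return float(np.mean(ratio < 1.25))

# Example: a prediction within 10% of ground truth everywhere scores 1.0.
gt = np.random.uniform(1.0, 50.0, size=(8, 8))
pred = gt * 1.1
print(delta1_accuracy(pred, gt))  # -> 1.0
```

Under this reading, a score of 0.929 would mean roughly 93% of pixels land within a factor of 1.25 of the true depth.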
