Jina AI has released jina-embeddings-v5-omni, an open-source quad-modal embedding model that supports text, image, audio, and video retrieval at minimal parameter cost. Its architecture freezes the text-only backbone and fine-tunes only the connector components that attach the visual and audio encoders, which amount to just 0.35% of the total parameters. Because the text backbone is left untouched, enterprises can upgrade to a multimodal system without re-embedding or rebuilding their existing text indexes; the approach also reduces GPU memory usage by up to 64% and accelerates training by up to 3.9 times.

At roughly 1.57 billion parameters, v5-omni performs comparably to much larger models such as LCO-Embedding-Omni-7B. Video retrieval remains its weakest task, but by leveraging a strong text backbone the model offers enterprises a cost-effective path to extending retrieval across multiple modalities.
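The freeze-the-backbone, train-the-connectors pattern described above can be sketched in a few lines of PyTorch. This is a minimal illustrative toy, not the actual jina-embeddings-v5-omni code: the module names, dimensions, and connector shapes are all assumptions. The key points it demonstrates are that the frozen backbone produces identical text embeddings before and after multimodal training (so existing text indexes stay valid), and that only a small fraction of parameters receives gradients.

```python
import torch
import torch.nn as nn

class OmniEmbedder(nn.Module):
    """Toy sketch of a multimodal embedder built on a frozen text backbone.

    Hypothetical structure for illustration only; dimensions and layer
    choices do not reflect the real jina-embeddings-v5-omni architecture.
    """

    def __init__(self, dim: int = 512):
        super().__init__()
        # Stand-in for the pretrained text backbone; frozen so its outputs
        # (and any vector index built from them) never change.
        self.text_backbone = nn.Sequential(
            nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim)
        )
        for p in self.text_backbone.parameters():
            p.requires_grad = False
        # Small trainable connectors that project image/audio encoder
        # features into the shared text embedding space.
        self.image_connector = nn.Linear(128, dim)
        self.audio_connector = nn.Linear(64, dim)

    def embed_text(self, x: torch.Tensor) -> torch.Tensor:
        # Identical to the original text-only model's output.
        return self.text_backbone(x)

    def embed_image(self, feats: torch.Tensor) -> torch.Tensor:
        return self.image_connector(feats)

    def embed_audio(self, feats: torch.Tensor) -> torch.Tensor:
        return self.audio_connector(feats)

model = OmniEmbedder()
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable parameters: {trainable}/{total} ({trainable / total:.1%})")
```

In this toy the connectors are a larger share of the parameters than the 0.35% reported for v5-omni, simply because the stand-in backbone is tiny; with a billion-parameter backbone the same pattern yields a sub-percent trainable fraction, which is what makes the training so cheap in memory and time.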