Jina AI has released jina-embeddings-v5-omni, an open-source quad-modal vector model that supports retrieval over text, images, audio, and video at minimal parameter cost. Its architecture integrates visual and audio encoders by freezing the text-only backbone and fine-tuning only the connection components, which constitute just 0.35% of the total parameters. This approach lets enterprises upgrade to multi-modal systems without recalculating existing text indexes, while reducing GPU memory usage by up to 64% and accelerating training by up to 3.9 times.
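The freeze-and-connect recipe described above can be sketched in a few lines of PyTorch. This is a minimal, hypothetical illustration: the `OmniEmbedder` class, its module names, and its sizes are assumptions for demonstration, not Jina's actual architecture. The point is the mechanics: the text backbone's parameters are frozen, so only the small connector layers remain trainable.

```python
import torch.nn as nn

# Hypothetical sketch -- layer names and sizes are illustrative,
# not the actual jina-embeddings-v5-omni architecture.
class OmniEmbedder(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        # Pretrained text backbone (stands in for the text-only model).
        self.text_backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
            num_layers=4,
        )
        # Lightweight connectors that project other modalities
        # into the text embedding space.
        self.image_connector = nn.Linear(dim, dim)
        self.audio_connector = nn.Linear(dim, dim)

model = OmniEmbedder()

# Freeze the text backbone; only connector weights stay trainable.
for p in model.text_backbone.parameters():
    p.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable fraction: {trainable / total:.4f}")
```

Because gradients and optimizer state are only kept for the connector parameters, training memory and compute shrink accordingly; at the reported 0.35% trainable fraction on a ~1.57B-parameter model, only roughly 5.5M parameters would be updated.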
The v5-omni model, at approximately 1.57 billion parameters, delivers performance comparable to much larger models such as LCO-Embedding-Omni-7B. While it still lags on video retrieval tasks, it offers enterprises a cost-effective path to extend retrieval across multiple modalities, leveraging a strong text backbone to keep additional costs minimal.
Jina AI Launches v5-omni for Efficient Quad-Modal Retrieval
Disclaimer: The content provided on Phemex News is for informational purposes only. We do not guarantee the quality, accuracy, or completeness of the information sourced from third-party articles. The content on this page does not constitute financial or investment advice. We strongly encourage you to conduct your own research and consult with a qualified financial advisor before making any investment decisions.
