Meituan's LongCat team has open-sourced LongCat-Video-Avatar 1.5, an upgraded framework for audio-driven portrait video generation. The new version replaces the Wav2Vec2 audio encoder with Whisper-large-v3, enhancing identity consistency and style generalization in long-form videos. Inference now uses an 8-step process, which improves efficiency and image fidelity.
The Whisper-large-v3 audio encoder yields better lip synchronization and facial dynamics, and multi-segment rolling inference improves temporal stability. The evaluation used 508 image-audio pairs and feedback from 770 evaluators, and its results indicate advances over competitors such as HeyGen and Kling Avatar 2.0. The framework supports a range of styles, including anime and animal, and is available under the MIT license for academic use only.
Meituan Releases LongCat-Video-Avatar 1.5 Framework with Enhanced Features
Disclaimer: The content provided on Phemex News is for informational purposes only. We do not guarantee the quality, accuracy, or completeness of the information sourced from third-party articles. The content on this page does not constitute financial or investment advice. We strongly encourage you to conduct your own research and consult with a qualified financial advisor before making any investment decisions.
