Xiaomi has released ControlFoley, an open-source framework for generating video sound effects. Unlike traditional AI dubbing models, which infer sounds from the visuals alone, ControlFoley gives creators precise control over audio style: it generates sound from the video content while also accepting a text description or reference audio as guidance. A creator can, for example, turn a knock into a "metal strike" while the audio stays synchronized with the visuals.
ControlFoley utilizes a spatiotemporal audiovisual encoder and a "time-timbre decoupling" strategy, achieving state-of-the-art performance on standard video dubbing benchmarks. It competes closely with commercial systems like Kling-Foley in metrics such as semantic alignment and synchronization, though it slightly underperforms in certain KL divergence metrics. The framework's technical report, code, and demo are now publicly accessible.
Xiaomi Open-Sources ControlFoley for Enhanced Video Sound Generation
